4.1 How does feature analysis work?
All right, we've talked several times previously about feature analysis. What do we mean when we're talking about feature analysis? Well, we're talking about the characteristics that go into the model, and data scientists are the people that do that. Now, someday, you might be able to have the machine figure out what those characteristics are. Someday, Moore's law, as you recall, would tell us that the machine can get so fast that it can just look at every possible feature there is, but we're not there yet. So until we get there, we need those data scientists doing the feature analysis, but we also need you. What's your part in data science here? What's your part in feature analysis? Well, remember when we looked at data science and we said, "Hey, it's an interdisciplinary area." And there are several different disciplines. There's computer science, the second one is math and statistics, and the third one is domain knowledge. And so feature analysis really relies on domain knowledge. And who has that? Well, that would be you, that's not the data scientist. And so it's really important that feature analysis be done collaboratively. It's not just the data scientists doing it, it's you and the data scientists working together, combining your domain knowledge with their understanding of data and computer science, math, and all that other great stuff. It's all those things, but yes, it's also your domain knowledge of what could be important features to solve the problem that you may have. And you remember, in all the things we looked at before, there were features that you could look at. So remember when we were talking about identifying spam tweets? Well, you might've guessed that maybe the length of the tweet would be something to look at. Maybe whether it has a link or not would be something to look at.
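To make those two spam-tweet features concrete, here is a minimal sketch in Python of how a tweet's length and whether it has a link could be turned into values a model can use. The function name `tweet_features`, the link pattern, and the sample tweets are all assumptions made up for illustration; they are not from the course.

```python
import re

# Hypothetical, deliberately simple pattern for spotting a link.
# Real tweets may need a stricter rule than this.
LINK_PATTERN = re.compile(r"https?://\S+")

def tweet_features(text: str) -> dict:
    """Turn one tweet into the two candidate features discussed above."""
    return {
        "length": len(text),                          # number of characters
        "has_link": bool(LINK_PATTERN.search(text)),  # True if a URL appears
    }

# Made-up sample tweets, for illustration only.
examples = [
    "Lunch with the team today",
    "WIN a FREE phone now http://example.com/prize",
]
for tweet in examples:
    print(tweet_features(tweet))
```

Neither value says a tweet is spam on its own. They are only candidate features, and the machine learning decides whether they matter.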
Those were things that, because you understood the problem of a spam tweet, you might've been able to help the data scientists look at as a potential pattern. And so that's really what we want you to think about here: how can you help the data scientists choose the right features to look at? Now understand, it doesn't mean that you know the answer. It doesn't mean that you know that a particular characteristic, a particular feature, is going to prove to be important. You might not know, but you might suspect that it would, and that's all that's required. Because the machine learning will actually tell us whether it is important or not. And it could be that some of the things you suspect are important aren't important, and that's okay. What's really important is for you to come up with the superset, come up with a big list of all of the things that you think could be important, because that's going to be a much smaller list than all of the possible features there are. And it's going to be a bigger list than all of the features that the data scientist will think of on their own. And so that's why you need to be involved in this process, because you're gonna add things to the list that a data scientist wouldn't have thought of, but you're also gonna subtract a whole bunch of things from the list that might be potential features but that you know, boy, I'm really sure those don't have anything to do with the problem we're solving. And so let me give you a weird example. Like, if we think about those spam tweets, you know, a potential feature is what letters are used in the Twitter handle that posted the tweet. Or how long are the Twitter handles? Those are both potential features. You can have a machine learning algorithm that looks at those things, but you know that has nothing to do with whether it's a spam tweet or not.
Spam tweets are not gonna be done by handles that have certain letters in them, or that are longer or shorter. That's not gonna mean anything. And so you know that's not a thing to look at; that would be dumb. And so don't look at that as a feature. And you might say, "Well, I would never suggest that." Of course you wouldn't, but a data scientist might not know that, and so a data scientist might look at that feature and waste all sorts of processing time, and all sorts of time in solving the problem, by looking at a feature that has nothing to do with the problem. So that's why you're so important. You're gonna know both the things that might be relevant to the problem and the things that probably aren't relevant to the problem. That's gonna make it much faster to identify the right feature set that really does solve the problem. So let's look at two different types of artificial intelligence, and look at different types of features. So part of your data science might be looking at data analytics. So those are basically numbers. And another part might look at text analytics. And these are different types of features that both go into machine learning. And text analytics is often called natural language processing. And so you can imagine that text analytics is using words. That's mostly what it's looking at. It's looking at tweets. It's looking at web pages. It's looking at documents. It's looking at emails. It's looking at things that have words in them. And it's using a technique called text analytics, as opposed to data analytics, where data analytics is looking at statistics and numbers. And so you can see a list here of some potential features, but what I want to do is to show you that some of the features in the list are actually data analytics, because they're statistical. You know, higher than average, statistically significant, correlated, changes over time, suddenly changes, these are all things you can do with math and statistics.
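As a rough illustration of two of the statistical features just listed, "higher than average" and "suddenly changes", here is a minimal Python sketch. The function names, the made-up daily counts, and the rule that a sudden change means "more than twice the earlier average" are all assumptions chosen for the example, not definitions from the course.

```python
from statistics import mean

def higher_than_average(value, history):
    """True if `value` is above the average of the earlier values."""
    return value > mean(history)

def suddenly_changes(series, factor=2.0):
    """True if the latest value jumps above `factor` times the earlier average.

    `factor` is a hypothetical threshold; `series` needs at least two values.
    """
    *earlier, latest = series
    return latest > factor * mean(earlier)

# Made-up numbers: posts per day from one account.
daily_posts = [4, 5, 6, 5, 20]
print(higher_than_average(daily_posts[-1], daily_posts[:-1]))
print(suddenly_changes(daily_posts))
```

Both features come from numbers alone, with no text involved, which is what makes them data analytics features.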
So those are data analytics, statistical analysis features. Now on the other hand, some of the other ones we showed in the list use text analysis. So, the document contains a certain word. It's a longer than average document. The document is written at a higher grade level. It contains typos and other errors, or it contains links to other web pages. So these are all examples of text analytics, where there's not a number here; it's using the text to figure out what it is that the feature might show you. And so sometimes you use text analysis and then follow it with statistical analysis. For example, think about that one that says, "The document is longer than average." Well, first you need natural language processing to, for example, count how many words are in this document and count how many words are in all the other documents. So you need something that's looking at text to do that. But then after that, what you're doing is you're using a statistic. You're saying I'm gonna use the number of words as my proxy for how long the document is. And then, I'm gonna compare it using a formula that says, what is the average document, and how long is it in words? And now all of a sudden you've taken text analysis and you've paired it with statistical analysis. So often, your text analysis is followed by a data analysis. And so you can use these in powerful combination with each other. Now a page on a website, as an example of a text document, has thousands of text features. And so here's an example of just a few of them, right? So it could be what document format it's in, or how long the URL is, or whether there are HTML tags found in the document and which ones they were. And like, what grade level is it at? What country was it tagged for? How many noun phrases are in there? Does it use brand names, right? So there's all sorts of things that you could be looking at.
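The "longer than average" example described above, where a text step feeds a statistical step, can be sketched in a few lines of Python. This is only a sketch under stated assumptions: splitting on whitespace is a crude stand-in for real natural language processing, and the function names and sample documents are made up for illustration.

```python
from statistics import mean

def word_count(text: str) -> int:
    """Text step: count words by splitting on whitespace (a crude stand-in for NLP)."""
    return len(text.split())

def longer_than_average(documents):
    """Statistical step: flag each document whose word count is above the average."""
    counts = [word_count(doc) for doc in documents]
    average = mean(counts)
    return [count > average for count in counts]

# Made-up documents, for illustration only.
docs = [
    "short note",
    "a somewhat longer document with many more words in it",
    "medium sized text here",
]
print(longer_than_average(docs))  # [False, True, False]
```

The word count is the proxy for document length, and the comparison to the average is the statistic, so the feature needs both kinds of analysis.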
And these are all things that you're using with natural language processing, with some type of text analytics. Now for NLP, for natural language processing, national, not natural, but national language matters. And so what do we mean by that? Well, we mean the written language it's in. So English, Spanish, French, Chinese, Japanese, all of these things matter a lot because, as you can imagine, natural language processing has to understand the language. And so you have to know which languages you need to support to solve your problem. You might find that there are certain types of NLP approaches that are available in English that may not be available in all the other languages. And so you might have to start by trying to solve your problem in only the languages that support that solution with NLP. Now, for data analytics, it doesn't matter at all what language things are written in. All the data is written in numbers, and so it's very simple. But when you're doing a problem that requires text analytics, you have to really focus on whether your problem requires multiple languages or not. And you have to really think about whether the techniques needed can really handle all the languages that you have.