1.3 What is data science? - Video Tutorials & Practice Problems
Video duration: 14m
So what is data science? I'm sure it's a term you've heard, and it's actually an amalgam of several different fields. So they would call that multi-disciplinary, 'cause it actually brings together several different fields or disciplines. So it brings together computer science or information technology. You might hear them talk about IT, that's what they're talking about, it's really computer science. Math and statistics is the second discipline. And the third discipline is one you might not have thought of, the third discipline is about the domain or business knowledge. So when you're dealing with AI and marketing, it's about marketing. And you might say, marketing is part of data science? Well, it is when you're solving data science problems about marketing. And guess what? When you get a data scientist, which of those fields do you think the data scientist is not an expert in? That's right, marketing. That's where you come in. So the data scientist that you would work with is gonna be an expert in computer science and math and statistics, they'll know a lot about those disciplines, but they're probably not gonna know that much about your discipline, about marketing. And so that's the first thing to think about when you're working with a data scientist: how can you be the expert in your domain, helping the data scientist who understands computer science and who understands math.
And so if you can pull all of this together into data science, what you're gonna be using is two kinds of data. They call it structured data and unstructured data.
So what's structured data? You can think of structured data as being kind of a spreadsheet. So a spreadsheet has rows and columns, and every different field or cell in the spreadsheet has maybe a number in it or a word or some type of data. And by looking at the row and the column you know what the significance of it is. So the rows and the columns actually form the structure.
So the row headings, the column headings, that's actually what the structure is of the data that lets you interpret it. So that is what structured data is.
So a spreadsheet is an example of structured data, a relational database is an example of structured data, but what's an example of unstructured data? Hmm. Well, maybe a document. A document might be an example of unstructured data because it's hard for you to just look at the layout of a document and actually figure anything out. You're actually reading the language, you're reading the paragraphs and you're interpreting them. And that's what gives it the meaning, it's not the structure of the paragraphs that gives it the meaning. It's actually the interpretation that you bring to the data itself. Whereas in the spreadsheet, if you read the cell without knowing what row or column it's in, you wouldn't really understand what it meant, it's the structure that helps give it meaning.
There's also the idea of semi-structured data. So what do we mean by that? Well, you could have what is thought to be unstructured data, but add a little structure to it. So for example, you might have a document, but maybe you're gonna add some HTML tags to it so that you might know what the title is of the document, you might know what a heading is, that adds some structure. You can even add more structure that might have semantic meaning. So what would that be? Well, maybe you would add some kind of schema information. So some of you might be familiar with schema.org. That's something that actually allows you to say, Hey, this webpage is about a product. This is the name of the product. This is the price of the product. So that's giving semantic meaning to the data. It's still a document. It's still a web page, but you're adding some structure to it. And so that's kind of semi-structured data. So these are all different types of data that a data scientist might know how to work with.
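As a rough illustration of the three kinds of data described above, here is a minimal Python sketch. The sample product, the price, and the helper names (`ItempropParser`, `extract_fields`) are made up for this example, and the HTML is a simplified snippet in the style of schema.org markup rather than a complete use of that vocabulary.

```python
import csv
import io
from html.parser import HTMLParser

# Structured: the column headings tell us what each cell means.
structured = "product,price\nCoffee Mug,12.50\nTea Pot,30.00\n"
rows = list(csv.DictReader(io.StringIO(structured)))

# Unstructured: free text. A person has to read it to get the meaning.
unstructured = "We sell a coffee mug for twelve fifty and a tea pot for thirty."

# Semi-structured: still a document, but tags label parts of it.
semi_structured = (
    '<div itemscope itemtype="https://schema.org/Product">'
    '<h1 itemprop="name">Coffee Mug</h1>'
    '<span itemprop="price">12.50</span>'
    "</div>"
)


class ItempropParser(HTMLParser):
    """Collects the text inside any tag that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.current = None  # itemprop name of the tag we are inside, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        self.current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self.current:
            self.fields[self.current] = data.strip()

    def handle_endtag(self, tag):
        self.current = None


def extract_fields(html):
    """Return a dict of labeled fields found in the semi-structured document."""
    parser = ItempropParser()
    parser.feed(html)
    return parser.fields


print(rows[0]["price"])                 # looked up by column heading
print(extract_fields(semi_structured))  # looked up by tag label
```

With the structured and semi-structured inputs, the program can look a value up by its heading or label. With the free-text sentence there is nothing to look up, so some interpretation step would be needed first.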
And so they'll bring your domain knowledge to bear, along with their knowledge of algorithms, systems, other types of math and statistics, other things in computer science, and their understanding of structured, semi-structured and unstructured data, and all of that is gonna go into solving whatever problem you've chosen.
So let's look at some examples of data science in everyday life. So when you use Amazon and you get to that page and it says, Hey, people who looked at this product eventually bought these other products. That's an example of data science. How about Netflix? When you get to the end of binge watching that series, what does it do? It says, Hey, here's three more series you might wanna watch because you liked that one. Asset managers. Do you know that a lot of the transactions that are being made by asset management companies are actually being done by robots these days? How about social media advertising? Every time you see an ad on Facebook, it's data science that's actually picking out which ad they think you're going to respond to. And hey, our favorite, Google Maps, we can't get anywhere without Google Maps nowadays. And so these are all things that you use all the time. If you own a mutual fund, all of these assets are being changed for you all the time. If you're watching Netflix, if you're using Amazon, if you're using Google Maps, if you're on social media, you're using all of these examples of data science every day.
So why are companies pouring all this money into data science and AI? Well, it's because of the impact on their bottom line. So there are several things that you can get out of AI in marketing. One is to acquire new customers. Another is to retain customers who you've already acquired, and another is to grow revenue. And so let's look at an acquisition example. So Intel actually got $20 million in incremental sales from a predictive algorithm that they use with their resellers to understand who was going to buy more of their products.
How about retention? Well, remember that binge watching example for Netflix. Netflix estimates that they save a billion dollars in annual revenue due to decreased churn. What's churn? Churn is people who cancel the service. And what they realized is that if they were suggesting things for you to watch that you might like, many more people are gonna keep using Netflix than if you have to constantly think of new things to look for to watch. And so a billion dollars in annual revenue, that's a pretty good retention story.
But how about growth? Well, let's start with Amazon. Do you realize that 35% of Amazon sales come from those product recommendations we talked about? People who looked at this page eventually bought something else. Mondelez, the company that bought Kraft, they actually changed the way they do in-store configurations, and they found that they had a hundred million dollars in revenue growth, and they eliminated thousands of days of planning that they used to do manually. Cardinal Health, they are a distributor of healthcare products, and what they found out is that personalized marketing campaigns generated a 100% increase in clicks and impressions, also leading to more revenue. These are just a few examples of why companies are spending so much time using data science and AI.
And so let's look at a few different concepts here. So if we look at artificial intelligence and we look at machine learning and we look at deep learning, all of these are examples of ways of using data science. So data science encompasses all of these things, and AI is kind of the umbrella term. So AI says that this is all of the things that computers can do using this type of data. And machine learning is actually a specific technique within AI. It's actually the one that is most in use these days. So if people say that they've implemented AI in something, it's very likely that it's machine learning, even though there are all those other techniques that we talked about as well.
And machine learning actually lets the computer learn without being programmed to do so. So rather than having a set of instructions, it looks at data, it looks at examples of outcomes and it draws conclusions. That's the way machine learning works. Now deep learning is a subset of machine learning. And what deep learning does is it actually generalizes the training data that you use for machine learning.
So why don't we look at an example of how these things come together? So if we were building a self-driving car, one of the things that would be a pretty good idea to do with a self-driving car is to make sure it stops when it gets to a stop sign. I'm pretty sure you agree with me on that. You may not know a lot about self-driving cars, but I think we can all come together on this one. We really would like it to stop when it gets to a stop sign.
And so how do we use machine learning for that? Well, we're gonna create a lot of training data. So what do we mean by training data? We're gonna give it all sorts of examples of stop signs, but we're also gonna give it examples of lots of other things that aren't stop signs, because we need it to be able to tell the difference between a stop sign and not a stop sign, because we don't want it stopping every time it pulls up to, for example, a no parking sign, right, or it pulls up to a billboard, right? So we need to make sure we give it lots of examples of photos that it will see as it's driving down the road and tell it which ones are stop signs and which ones are not. Well, how would we tell it which ones are stop signs? Well, we're actually gonna have people do that. We're gonna have human beings label each photo as to what kind of object it is. Is it a tree? Is it a stop sign? Is it a street light? Is it a billboard? What is it? And so only the ones that are stop signs are the ones that we want the machine to really try to determine what the pattern is for it.
Because what we're gonna do is we're gonna train the machine learning model on that labeled data so that what it starts to do is recognize the pattern of a stop sign. So then it can start to recognize stop signs that it has not seen pictures of, because it's started to internalize what the idea is of a stop sign. What does a stop sign look like? What does a picture of a stop sign look like?
We can also use deep learning. So deep learning lets us generalize the training data so that we don't have to take quite as many photos. So for example, maybe we could use deep learning to simulate different lighting conditions. So night time, whether the street light is shining directly on the sign or not, sunrise, sunset, all of those things are gonna make the photos of the stop signs look a little bit different. And so deep learning would allow us to simulate those lighting conditions on all of our photos without us having to take a photo of every single stop sign in all those different lighting conditions. And that will help our machine learning algorithm recognize more stop signs correctly.
Now how do we use artificial intelligence? Well, when a car identifies a stop sign, it needs to decide when to apply the brakes. How soon does it apply the brakes after it recognizes the stop sign? And that might depend on how far away the stop sign is when it's recognized, it might depend on the road conditions. If it's slippery, it might need to apply the brakes sooner, for example. It depends on the speed of the car, right? All of these things vary, and you can use artificial intelligence to put together the right algorithm that says, Hey, we know when we're going to press the brake. Now it would be annoying to press the brake too soon and slow way down when you don't have to, and it would be dangerous for it to press the brake too late. And so all of these things go into that AI algorithm that's going to decide exactly when to press the brake and with how much force.
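To make the braking decision a little more concrete, here is a minimal Python sketch of how distance, speed and road conditions could combine into a "brake now" decision. It is an assumption-laden illustration, not how a real self-driving car works: it uses a simple physics rule (stopping distance = v² / (2·μ·g)) instead of a learned algorithm, and the friction values, the safety margin, and the function names are all made up for the example.

```python
G = 9.81  # gravitational acceleration in m/s^2

# Illustrative friction coefficients (assumed values, not measurements).
FRICTION = {"dry": 0.7, "wet": 0.4, "icy": 0.1}


def stopping_distance(speed_mps, road):
    """Distance in meters needed to brake to a full stop: v^2 / (2 * mu * g)."""
    return speed_mps ** 2 / (2 * FRICTION[road] * G)


def should_brake(distance_to_sign_m, speed_mps, road, margin_m=5.0):
    """Brake once the sign is within the stopping distance plus a safety margin."""
    return distance_to_sign_m <= stopping_distance(speed_mps, road) + margin_m


# Same car, same speed (15 m/s), same distance to the sign (30 m):
print(should_brake(30, 15, "dry"))  # False: still room, braking now would be too soon
print(should_brake(30, 15, "wet"))  # True: a slippery road needs an earlier start
```

The point of the sketch is only that the same recognized stop sign can lead to different decisions depending on speed and road conditions, which is the trade-off between braking too soon and braking too late described above.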
Now what do data scientists do in this whole process? Well, they are going to test, they're gonna test that self-driving car to make sure that it stops at stop signs and it doesn't stop when it recognizes an object that is not a stop sign. Now suppose when they're doing the tests, they find out that, Hey, sometimes the stop signs are not being recognized, sometimes the cars are going through stop signs, that's really bad.
So the data scientist starts to say, okay, under what conditions is that happening? And so they might find out that, Hey, when the stop sign had a bumper sticker placed over it, it was defaced in some way or it had spray paint on it, or the stop sign was very very old and faded or it was bent at an odd angle, all of a sudden we're not recognizing all of those. And so then one of the things they could do is they could say, okay, let's start taking pictures of many more stop signs that have those conditions associated with them. And that's how we're going to make the algorithm better at recognizing those types of stop signs. Or they might be able to program their deep learning algorithm so it could simulate stop signs being defaced with bumper stickers or spray paint or being bent at odd angles. And then what they do is they retrain the algorithm, and then they run some more tests, and they go through this over and over again until the self-driving car is correctly stopping at stop signs and not stopping when it recognizes something that is not a stop sign. And that's an example of data science at work.
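The test, diagnose, retrain loop described above can be sketched in a few lines of Python. Everything here is a toy assumption: each "photo" is reduced to two made-up numbers (how red it is and how octagonal it is), the model is a simple nearest-centroid classifier standing in for a real machine learning model, and `simulate_fading` is a plain scaling function standing in for the simulation step, not deep learning.

```python
def centroid(points):
    """Average position of a list of (redness, octagon_score) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def train(examples):
    """examples: list of (features, is_stop_sign). Returns the two class centroids."""
    stop = [f for f, label in examples if label]
    other = [f for f, label in examples if not label]
    return centroid(stop), centroid(other)


def predict(model, features):
    """True if the features are closer to the stop-sign centroid than the other one."""
    stop_c, other_c = model

    def dist2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

    return dist2(features, stop_c) < dist2(features, other_c)


def failing_conditions(model, tests):
    """Names of the test conditions where the model's answer is wrong."""
    return sorted({cond for cond, f, label in tests if predict(model, f) != label})


def simulate_fading(features):
    """Toy stand-in for simulating an old, faded sign: less red, less crisp."""
    redness, octagon = features
    return (redness * 0.25, octagon * 0.7)


# Human-labeled training data: clean stop signs and things that are not stop signs.
clean_signs = [(0.9, 0.9), (0.95, 0.85), (0.85, 0.95)]
labeled = [(f, True) for f in clean_signs] + [
    ((0.2, 0.1), False),  # tree
    ((0.8, 0.1), False),  # red billboard
    ((0.1, 0.3), False),  # street light
]

# Held-out tests, grouped by condition.
tests = [
    ("clean", (0.88, 0.92), True),
    ("faded", (0.22, 0.62), True),
    ("billboard", (0.8, 0.15), False),
]

model = train(labeled)
before = failing_conditions(model, tests)  # which conditions go wrong?

# Add simulated faded signs to the training data, retrain, and test again.
labeled += [(simulate_fading(f), True) for f in clean_signs]
model = train(labeled)
after = failing_conditions(model, tests)

print("failing before retraining:", before)
print("failing after retraining:", after)
```

In this toy setup the first round of tests misses the faded sign, and the second round, after retraining with simulated faded examples, does not. A real project would repeat the same loop with real photos and a real model.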