6.1 Look at the ecosystem
Let's start by looking at the ecosystem for doing data analysis in Python. Data analysis in Python has really been on the upswing in the last few years. It used to be that MATLAB and R were very common, and IDL too, which I'd never even heard of, but I'm not in astronomy. But Python is now very frequently used in grad studies, in research, in academia, in scientific analysis, and also in business analytics.

The most popular platform that people use for doing data analysis is called Anaconda. It's actually a distribution of Python. So instead of downloading Python and getting only the standard library that it always comes with, you also get other libraries that are commonly used in data analysis. It includes tools for writing, running, and sharing your code. It includes all of the important data science libraries built in, which means it's a big download, but it also means you don't have to download as many things later on that you're probably gonna use. It makes managing your environment and your libraries easier, so instead of having to use pip or Pipenv you can do it all through Anaconda. And it supports both the Python and R languages.

Here's a little graphic that I found of a bunch of different technologies and libraries used in data analysis. There are quite a few, and there are way more than this. On the outer edge, there are libraries for specific fields, so astronomy, biology, whatnot.

And here are some common libraries that I wanted to tell you about. For data storage, manipulation, and calculations, NumPy is for numerical Python. Pandas is one you'll definitely need to know, and I'll show you a little bit about it in another part of this lesson. It's for creating DataFrames that are similar to the ones in R. So you can create DataFrames and then access just a part of the data, do some calculations on it, create new columns, whatnot. There's also SciPy and StatsModels.
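To make the DataFrame idea a bit more concrete, here's a minimal sketch of those three operations in Pandas: accessing part of the data, creating a new column, and doing a calculation. The column names and the numbers are made up for illustration and are not from the lesson's dataset.

```python
import pandas as pd

# Hypothetical restaurant bills, invented purely for illustration.
bills = pd.DataFrame(
    {
        "total_bill": [20.0, 35.0, 50.0, 12.5],
        "tip": [3.0, 7.0, 10.0, 2.5],
        "day": ["Sat", "Sat", "Sun", "Sun"],
    }
)

# Access just a part of the data: only the rows for Sunday.
sunday = bills[bills["day"] == "Sun"]

# Create a new column from a calculation on existing columns.
bills["tip_pct"] = bills["tip"] / bills["total_bill"] * 100

# Do a calculation on a column: the average tip percentage.
mean_tip_pct = bills["tip_pct"].mean()

print(sunday)
print(bills)
print(mean_tip_pct)
```

The boolean filter `bills["day"] == "Sun"` is one common way to select rows; Pandas also offers `.loc` and `.query` for the same job.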
For visualizations, Matplotlib is the most commonly used one. It is a little bit more involved and there are a lot of options. The one I actually like, and the one we're gonna be looking at today, is called Seaborn, and it has a much simpler interface. And for machine learning, Scikit-learn is very popular, as well as TensorFlow and Keras for making neural nets.

In terms of datasets, if you're interested in this topic, you probably have your own datasets that you're interested in analyzing. But if you just want to get practice, there are a lot of free datasets available online. The Seaborn visualization library has some datasets that are there just to use and practice on. Kaggle.com is a website that runs machine learning competitions. They give you access to a bunch of data and let you create models that predict things, and they give out prizes in competitions if you have the most accurate models.

Then there's also open data from various government organizations. You can Google your city and "open data" to see if your city has open data. My city, Vancouver, does have an open data platform, and you can access it through data.vancouver.ca. And also just Google "free datasets". There are a lot of different ones floating around. The Iris dataset is very popular. And in the next lesson, we'll be using the tipping dataset.
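If you want to poke at a practice dataset right away, here's a small sketch that loads the Iris dataset into a Pandas DataFrame. It's an assumption on my part to load it from Scikit-learn, which ships a copy of Iris, rather than from Seaborn as mentioned above; Seaborn's `load_dataset` normally needs an internet connection the first time, and the bundled copy doesn't.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn to return pandas objects.
iris = load_iris(as_frame=True)

# iris.frame holds the four measurement columns plus a numeric "target" column.
df = iris.frame

# Map the numeric target codes (0, 1, 2) to species names for readability.
code_to_name = {i: str(name) for i, name in enumerate(iris.target_names)}
df["species"] = df["target"].map(code_to_name)

# Count how many flowers there are of each species.
counts = df["species"].value_counts()

print(df.head())
print(counts)
```

From here you can try the same kinds of things as with any DataFrame: filter rows by species, compute averages, or add new columns.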