General Assembly's 2015 Data Science course in Washington, DC


DAT8 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (8/18/15 - 10/29/15).

Instructor: Kevin Markham (Data School blog, email newsletter, YouTube channel)


Tuesday | Thursday
8/18: Introduction to Data Science | 8/20: Command Line, Version Control
8/25: Data Reading and Cleaning | 8/27: Exploratory Data Analysis
9/1: Visualization | 9/3: Machine Learning
9/8: Getting Data | 9/10: K-Nearest Neighbors
9/15: Basic Model Evaluation | 9/17: Linear Regression
9/22: First Project Presentation | 9/24: Logistic Regression
9/29: Advanced Model Evaluation | 10/1: Naive Bayes and Text Data
10/6: Natural Language Processing | 10/8: Kaggle Competition
10/13: Decision Trees | 10/15: Ensembling
10/20: Advanced scikit-learn, Clustering | 10/22: Regularization, Regex
10/27: Course Review | 10/29: Final Project Presentation

Python Resources

Course project

Comparison of machine learning models

Comparison of model evaluation procedures and metrics

Advice for getting better at data science

Additional resources


Class 1: Introduction to Data Science

Homework:

  • Work through GA's friendly command line tutorial using Terminal (Linux/Mac) or Git Bash (Windows).
  • Read through this command line reference, and complete the pre-class exercise at the bottom. (There's nothing you need to submit once you're done.)
  • Watch videos 1 through 8 (21 minutes) of Introduction to Git and GitHub, or read sections 1.1 through 2.2 of Pro Git.
  • If your laptop has any setup issues, please work with us to resolve them by Thursday. If your laptop has not yet been checked, you should come early on Thursday, or just walk through the setup checklist yourself (and let us know you have done so).

Resources:


Class 2: Command Line and Version Control

  • Slack tour
  • Review the command line pre-class exercise (code)
  • Git and GitHub (slides)
  • Intermediate command line

Homework:

Git and Markdown Resources:

  • Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
  • If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
  • If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
  • GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
  • Cracking the Code to GitHub's Growth explains why GitHub is so popular among developers.
  • Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.

Command Line Resources:

  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
  • If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

Class 3: Data Reading and Cleaning

  • Git and GitHub assorted tips (slides)
  • Review command line homework (solution)
  • Python:
    • Spyder interface
    • Looping exercise
    • Lesson on file reading with airline safety data (code, data, article)
    • Data cleaning exercise
    • Walkthrough of Python homework with Chipotle data (code, data, article)

Homework:

  • Complete the Python homework assignment with the Chipotle data, add a commented Python script to your GitHub repo, and submit a link using the homework submission form. You have until Tuesday (9/1) to complete this assignment. (Note: Pandas, which is covered in class 4, should not be used for this assignment.)
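As a rough illustration of the no-Pandas approach this assignment asks for, the sketch below reads tab-separated text with the standard library's `csv` module and totals a column with a plain dictionary. The column names and the three sample rows are made up for illustration and are not the actual Chipotle data, so check the real file's delimiter and columns before adapting it.

```python
import csv
import io

# Made-up sample standing in for a delimited data file (NOT the real homework data).
SAMPLE = "order_id\tquantity\titem_name\n1\t1\tChips\n1\t2\tSoda\n2\t1\tBurrito\n"


def read_rows(file_obj, delimiter="\t"):
    """Return (header, rows), where rows is a list of lists of strings."""
    reader = csv.reader(file_obj, delimiter=delimiter)
    header = next(reader)          # first line holds the column names
    rows = [row for row in reader]  # remaining lines are the data
    return header, rows


header, rows = read_rows(io.StringIO(SAMPLE))

# Total quantity per order, using a plain dictionary instead of Pandas.
order_index = header.index("order_id")
qty_index = header.index("quantity")
totals = {}
for row in rows:
    order_id = row[order_index]
    totals[order_id] = totals.get(order_id, 0) + int(row[qty_index])

print(totals)
```

With a real file, you would pass `open(path, newline="")` to `read_rows` instead of the `io.StringIO` object.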

Resources:


Class 4: Exploratory Data Analysis

Homework:

Resources:


Class 5: Visualization

Homework:

  • Your project question write-up is due on Thursday.
  • Complete the Pandas homework assignment with the IMDb data. You have until Tuesday (9/8) to complete this assignment.
  • If you're not using Anaconda, install the Jupyter Notebook (formerly known as the IPython Notebook) using pip. (The Jupyter or IPython Notebook is included with Anaconda.)

Pandas Resources:

  • To learn more Pandas, read this three-part tutorial, or review these two excellent (but extremely long) notebooks on Pandas: introduction and data wrangling.
  • If you want to go really deep into Pandas (and NumPy), read the book Python for Data Analysis, written by the creator of Pandas.
  • This notebook demonstrates the different types of joins in Pandas, for when you need to figure out how to merge two DataFrames.
  • This is a nice, short tutorial on pivot tables in Pandas.
  • For working with geospatial data in Python, GeoPandas looks promising. This tutorial uses GeoPandas (and scikit-learn) to build a "linguistic street map" of Singapore.
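For a quick feel for the joins and pivot tables mentioned above, here is a minimal sketch using two tiny DataFrames. All of the data and column names are made up for illustration; the linked notebook and tutorial cover these operations in far more depth.

```python
import pandas as pd

# Two tiny made-up DataFrames that share a "key" column.
left = pd.DataFrame({"key": ["a", "b", "c"], "left_val": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "right_val": [20, 30, 40]})

# Inner join: only keys present in both DataFrames.
inner = pd.merge(left, right, on="key", how="inner")

# Left join: every key from the left DataFrame, with NaN where right has no match.
left_join = pd.merge(left, right, on="key", how="left")

# Outer join: keys from either DataFrame.
outer = pd.merge(left, right, on="key", how="outer")

# Pivot table: one row per region, one column per year, summing the amounts.
sales = pd.DataFrame({
    "region": ["east", "east", "west", "west"],
    "year": [2014, 2015, 2014, 2015],
    "amount": [10, 20, 30, 40],
})
pivot = sales.pivot_table(index="region", columns="year", values="amount", aggfunc="sum")

print(inner, left_join, outer, pivot, sep="\n\n")
```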

Visualization Resources:


Class 6: Machine Learning

  • Part 2 of Visualization with Pandas and Matplotlib (notebook)
  • Brief introduction to the Jupyter/IPython Notebook
  • "Human learning" exercise:
  • Introduction to machine learning (slides)

Homework:

  • Optional: Complete the bonus exercise listed in the human learning notebook. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/8).
  • If you're not using Anaconda, install requests and Beautiful Soup 4 using pip. (Both of these packages are included with Anaconda.)

Machine Learning Resources:

IPython Notebook Resources:


Class 7: Getting Data

Homework:

  • Optional: Complete the homework exercise listed in the web scraping code. It will take the place of any one homework you miss, past or future! This is due on Tuesday (9/15).
  • Optional: If you're not using Anaconda, install Seaborn using pip. If you're using Anaconda, install Seaborn by running conda install seaborn at the command line. (Note that some students in past courses have had problems with Anaconda after installing Seaborn.)

API Resources:

  • This Python script to query the U.S. Census API was created by a former DAT student. It's a bit more complicated than the example we used in class, but it's very well commented and may provide a useful framework for writing your own code to query APIs.
  • Mashape and Apigee allow you to explore tons of different APIs. Alternatively, a Python API wrapper is available for many popular APIs.
  • The Data Science Toolkit is a collection of location-based and text-related APIs.
  • API Integration in Python provides a very readable introduction to REST APIs.
  • Microsoft's Face Detection API, which powers How-Old.net, is a great example of how a machine learning API can be leveraged to produce a compelling web application.
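If you want a starting point for your own API code, the sketch below shows the usual shape of a query: build a URL with query parameters, fetch it, and parse the JSON body. It uses only the standard library. The endpoint, the parameter names, and the assumed response shape (a JSON array of objects) are all hypothetical and do not correspond to any of the APIs listed above, so read the documentation for the API you are actually using.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://api.example.com/v1/data"  # hypothetical endpoint


def build_url(base_url, **params):
    """Append URL-encoded query parameters to the base URL."""
    return base_url + "?" + urlencode(params)


def fetch(url, timeout=10):
    """Perform a GET request and return the decoded body (not called in this sketch)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_records(body):
    """Parse a JSON array of objects (an assumed response shape) into a list of dicts."""
    records = json.loads(body)
    if not isinstance(records, list):
        raise ValueError("expected a JSON array")
    return records


url = build_url(BASE_URL, state="DC", year=2015)

# Stand-in for fetch(url), so the sketch runs without network access.
records = parse_records('[{"name": "example", "value": 1}]')
print(url, records)
```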

Web Scraping Resources:


Class 8: K-Nearest Neighbors

Homework:

KNN Resources:

Seaborn Resources:


Class 9: Basic Model Evaluation

Homework:

Model Evaluation Resources:

Reproducibility Resources:


Class 10: Linear Regression

Homework:

  • Your first project presentation is on Tuesday (9/22)! Please submit a link to your project repository (with slides, code, data, and visualizations) by 6pm on Tuesday.
  • Complete the homework assignment with the Yelp data. This is due on Thursday (9/24).

Linear Regression Resources:

Other Resources:


Class 11: First Project Presentation

  • Project presentations!

Homework:


Class 12: Logistic Regression

Homework:

Logistic Regression Resources:

  • To go deeper into logistic regression, read the first three sections of Chapter 4 of An Introduction to Statistical Learning, or watch the first three videos (30 minutes) from that chapter.
  • For a math-ier explanation of logistic regression, watch the first seven videos (71 minutes) from week 3 of Andrew Ng's machine learning course, or read the related lecture notes compiled by a student.
  • For more on interpreting logistic regression coefficients, read this excellent guide by UCLA's IDRE and these lecture notes from the University of New Mexico.
  • The scikit-learn documentation has a nice explanation of what it means for a predicted probability to be calibrated.
  • Supervised learning superstitions cheat sheet is a very nice comparison of four classifiers we cover in the course (logistic regression, decision trees, KNN, Naive Bayes) and one classifier we do not cover (Support Vector Machines).
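As a small worked example of coefficient interpretation, the sketch below converts log-odds to a probability and shows that a one-unit increase in a feature multiplies the odds by `exp(coefficient)`. The intercept and coefficient are made-up values rather than results from any model fit in class.

```python
import math

# Made-up values for a one-feature logistic regression model.
intercept = -1.0
coef = 0.5


def predicted_probability(intercept, coef, x):
    """Convert the linear predictor (log-odds) into a probability."""
    log_odds = intercept + coef * x
    return 1.0 / (1.0 + math.exp(-log_odds))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


p_at_0 = predicted_probability(intercept, coef, 0)
p_at_1 = predicted_probability(intercept, coef, 1)
p_at_2 = predicted_probability(intercept, coef, 2)

# A one-unit increase in x multiplies the odds by exp(coef).
odds_ratio = math.exp(coef)
observed_ratio = odds(p_at_1) / odds(p_at_0)

print(p_at_0, p_at_1, p_at_2, odds_ratio, observed_ratio)
```

Note that the change in probability is not constant across the range of `x`, even though the change in odds is always the same multiplicative factor.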

Confusion Matrix Resources:


Class 13: Advanced Model Evaluation

Homework:

ROC Resources:

Cross-Validation Resources:

Other Resources:


Class 14: Naive Bayes and Text Data

Homework:

  • Complete another homework assignment with the Yelp data. This is due on Tuesday (10/6).
  • Confirm that you have TextBlob installed by running import textblob from within your preferred Python environment. If it's not installed, run pip install textblob at the command line (not from within Python).
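If you would rather check for a package without triggering an `ImportError`, one option is the standard library's `importlib.util.find_spec`. This is only a sketch: the helper name `is_installed` is made up for illustration, and it only handles top-level package names.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


for name in ["textblob", "json"]:
    status = "installed" if is_installed(name) else "missing (try: pip install " + name + ")"
    print(name, "->", status)
```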

Resources:

  • Sebastian Raschka's article on Naive Bayes and Text Classification covers the conceptual material from today's class in much more detail.
  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (15 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is better to use GaussianNB rather than MultinomialNB. This notebook compares their performances on such a dataset. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
  • Yelp has found that Naive Bayes is more effective than Mechanical Turk workers at categorizing businesses.
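To try the GaussianNB versus MultinomialNB comparison yourself, here is a minimal sketch using cross-validated accuracy. The choice of the iris dataset (continuous, non-negative features) is an assumption made for convenience and is not necessarily the dataset used in the notebook linked above, so treat the scores as one data point rather than a general result.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB, MultinomialNB

# Iris has four continuous, non-negative features (MultinomialNB requires non-negative input).
X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for each classifier.
gaussian_scores = cross_val_score(GaussianNB(), X, y, cv=5, scoring="accuracy")
multinomial_scores = cross_val_score(MultinomialNB(), X, y, cv=5, scoring="accuracy")

print("GaussianNB mean accuracy:   ", gaussian_scores.mean())
print("MultinomialNB mean accuracy:", multinomial_scores.mean())
```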

Class 15: Natural Language Processing

  • Yelp review text homework due (solution)
  • Natural language processing (notebook)
  • Introduction to our Kaggle competition
    • Create a Kaggle account, join the competition using the invitation link, download the sample submission, and then submit the sample submission (which will require SMS account verification).

Homework:

  • Your draft paper is due on Thursday (10/8)! Please submit a link to your project repository (with paper, code, data, and visualizations) before class.
  • Watch Kaggle: How it Works (4 minutes) for a brief overview of the Kaggle platform.
  • Download the competition files, move them to the DAT8/data directory, and make sure you can open the CSV files using Pandas. If you have any problems opening the files, you probably need to turn off real-time virus scanning (especially Microsoft Security Essentials).
  • Optional: Come up with some theories about which features might be relevant to predicting the response, and then explore the data to see if those theories appear to be true.
  • Optional: Watch my project presentation video (16 minutes) for a tour of the end-to-end machine learning process for a Kaggle competition, including feature engineering. (Or, just read through the slides.)

NLP Resources:


Class 16: Kaggle Competition

Homework:

  • You will be assigned to review the project drafts of two of your peers. You have until Tuesday 10/20 to provide them with feedback, according to the peer review guidelines.
  • Read A Visual Introduction to Machine Learning for a brief overview of decision trees.
  • Download and install Graphviz, which will allow you to visualize decision trees in scikit-learn.
    • Windows users should also add Graphviz to their path: Go to Control Panel, System, Advanced System Settings, Environment Variables. Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin
  • Optional: Keep working on our Kaggle competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Tuesday 10/27 (class 21).
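To see where Graphviz fits in, the sketch below fits a small decision tree and exports it in Graphviz's DOT format using scikit-learn's `export_graphviz`. The iris dataset and the `max_depth=2` setting are assumptions chosen to keep the example small, not the data or settings used in class.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()

# A shallow tree keeps the resulting diagram readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=1)
tree.fit(iris.data, iris.target)

# With out_file=None, export_graphviz returns the DOT source as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    filled=True,
)

print(dot_source[:200])
```

If you save `dot_source` to a file such as `tree.dot`, the Graphviz command `dot -Tpng tree.dot -o tree.png` should render it as an image, provided Graphviz is installed and on your path.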

Resources:


Class 17: Decision Trees

Homework:

Resources:

  • scikit-learn's documentation on decision trees includes a nice overview of trees as well as tips for proper usage.
  • For a more thorough introduction to decision trees, read section 4.3 (23 pages) of Introduction to Data Mining. (Chapter 4 is available as a free download.)
  • If you want to go deep into the different decision tree algorithms, this slide deck contains A Brief History of Classification and Regression Trees.
  • The Science of Singing Along contains a neat regression tree (page 136) for predicting the percentage of an audience at a music venue that will sing along to a pop song.
  • Decision trees are common in the medical field for differential diagnosis, such as this classification tree for identifying psychosis.

Class 18: Ensembling

Resources:


Class 19: Advanced scikit-learn and Clustering

Homework:

scikit-learn Resources:

Clustering Resources:


Class 20: Regularization and Regular Expressions

Homework:

  • Your final project is due next week!
  • Optional: Make your final submissions to our Kaggle competition! It closes at 6:30pm ET on Tuesday 10/27.
  • Optional: Read this classic paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning.

Regularization Resources:

Regular Expressions Resources:


Class 21: Course Review and Final Project Presentation

Resources:


Class 22: Final Project Presentation


Additional Resources

Tidy Data

Databases and SQL

Recommendation Systems
