🤖 ⚡ scikit-learn tips

Last update: Jan 03, 2023

New tips are posted on LinkedIn, Twitter, and Facebook.

👉 Sign up to receive 2 video tips by email every week! 👈

List of all tips

Click to discuss the tip on LinkedIn, click to view the Jupyter notebook for a tip, or click to watch the tip video on YouTube:

#	Description	Links
1	Use `ColumnTransformer` to apply different preprocessing to different columns
2	Seven ways to select columns using `ColumnTransformer`
3	What is the difference between "fit" and "transform"?
4	Use "fit_transform" on training data, but "transform" (only) on testing/new data
5	Four reasons to use scikit-learn (not pandas) for ML preprocessing
6	Encode categorical features using `OneHotEncoder` or `OrdinalEncoder`
7	Handle unknown categories with `OneHotEncoder` by encoding them as zeros
8	Use `Pipeline` to chain together multiple steps
9	Add a missing indicator to encode "missingness" as a feature
10	Set a "random_state" to make your code reproducible
11	Impute missing values using `KNNImputer` or `IterativeImputer`
12	What is the difference between `Pipeline` and `make_pipeline`?
13	Examine the intermediate steps in a `Pipeline`
14	`HistGradientBoostingClassifier` natively supports missing values
15	Three reasons not to use drop='first' with `OneHotEncoder`
16	Use `cross_val_score` and `GridSearchCV` on a `Pipeline`
17	Try `RandomizedSearchCV` if `GridSearchCV` is taking too long
18	Display `GridSearchCV` or `RandomizedSearchCV` results in a DataFrame
19	Important tuning parameters for `LogisticRegression`
20	Plot a confusion matrix
21	Compare multiple ROC curves in a single plot
22	Use the correct methods for each type of `Pipeline`
23	Display the intercept and coefficients for a linear model
24	Visualize a decision tree two different ways
25	Prune a decision tree to avoid overfitting
26	Use stratified sampling with `train_test_split`
27	Two ways to impute missing values for a categorical feature
28	Save a model or `Pipeline` using joblib
29	Vectorize two text columns in a `ColumnTransformer`
30	Four ways to examine the steps of a `Pipeline`
31	Shuffle your dataset when using `cross_val_score`
32	Use AUC to evaluate multiclass problems
33	Use `FunctionTransformer` to convert functions into transformers
34	Add feature selection to a `Pipeline`
35	Don't use `.values` when passing a pandas object to scikit-learn
36	Most parameters should be passed as keyword arguments
37	Create an interactive diagram of a `Pipeline` in Jupyter
38	Get the feature names output by a `ColumnTransformer`
39	Load a toy dataset into a DataFrame
40	Estimators only print parameters that have been changed
41	Drop the first category from binary features (only) with `OneHotEncoder`
42	Passthrough some columns and drop others in a `ColumnTransformer`
43	Use `OrdinalEncoder` instead of `OneHotEncoder` with tree-based models
44	Speed up `GridSearchCV` using parallel processing
45	Create feature interactions using `PolynomialFeatures`
46	Ensemble multiple models using `VotingClassifier` or `VotingRegressor`
47	Tune the parameters of a `VotingClassifier` or `VotingRegressor`
48	Access part of a `Pipeline` using slicing
49	Tune multiple models simultaneously with `GridSearchCV`
50	Adapt this pattern to solve many Machine Learning problems
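Several of the tips above (1, 4, 7, 8, and 16) fit together into one workflow. The snippet below is only a rough sketch of that combination, not the code from the notebooks: the toy DataFrame, its column names (`age`, `embarked`, `survived`), and the step names are made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, invented for this sketch only
df = pd.DataFrame({
    "age": [22, 38, 26, 35, None, 54, 2, 27, 14, 4, 58, 20],
    "embarked": ["S", "C", "S", "S", "Q", "S", "S", "S", "C", "S", "S", "Q"],
    "survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0],
})
X = df[["age", "embarked"]]
y = df["survived"]

# Tip 1: apply different preprocessing to different columns
# Tip 7: unknown categories are encoded as all zeros instead of raising an error
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
])

# Tip 8: chain preprocessing and the model into a single Pipeline
pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("model", LogisticRegression(max_iter=1000, random_state=0)),
])

# Tip 16: cross-validate the whole Pipeline, so preprocessing is
# re-fit on the training portion of each fold
scores = cross_val_score(pipe, X, y, cv=3, scoring="accuracy")
print(scores.mean())

# Tip 4: fit on training data, then only transform (inside predict) on new data
pipe.fit(X, y)
X_new = pd.DataFrame({"age": [30], "embarked": ["Z"]})  # "Z" was never seen in training
predictions = pipe.predict(X_new)
print(predictions)
```

Because the imputer and encoder live inside the `Pipeline`, calling `predict` on new data reuses what was learned during `fit` rather than re-learning it from the new rows.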

You can interact with all of these notebooks online using Binder.

Note: Some of the tips do not include any code, and can only be viewed on LinkedIn.

Who creates these tips?

Hi! I'm Kevin Markham, the founder of Data School. I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.

How can I get better at scikit-learn?

I teach three courses:

Course 1: Introduction to Machine Learning in Python with scikit-learn (4 hours, free)
Course 2: Building an Effective Machine Learning Workflow with scikit-learn (8 hours, paid)
Course 3: Machine Learning with Text in Python (14 hours, paid)

👉 Find out which course is right for you! 👈

Do you have any other tips?

Yes! In 2019, I posted 100 pandas tricks. I also created a video featuring my top 25 pandas tricks.
