Winning solution for the Galaxy Challenge on Kaggle

Overview

kaggle-galaxies

Winning solution for the Galaxy Challenge on Kaggle (http://www.kaggle.com/c/galaxy-zoo-the-galaxy-challenge).

Documentation about the method and the code is available in doc/documentation.pdf. Information on how to generate the solution file can also be found below.

Generating the solution

Install the dependencies

Instructions for installing Theano and getting it to run on the GPU can be found here. It should be possible to install NumPy, SciPy, scikit-image and pandas using pip or easy_install. To install pylearn2, simply run:

git clone git://github.com/lisa-lab/pylearn2.git

and add the resulting directory to your PYTHONPATH.

The optional dependencies listed in the documentation don't have to be installed to reproduce the winning solution: the generated data files are already provided, so they don't have to be regenerated (but of course you can if you want to). If you want to install them, please refer to their respective documentation.

Download the code

To download the code, run:

git clone git://github.com/benanne/kaggle-galaxies.git

A bunch of data files (extracted sextractor parameters, IDs files, training labels in NumPy format, ...) are also included. I decided to include these since generating them is a bit tedious and requires extra dependencies. It's about 20MB in total, so depending on your connection speed it could take a minute. Cloning the repository should also create the necessary directory structure (see doc/documentation.pdf for more info).

Download the training data

Download the data files from Kaggle. Place and extract the files in the following locations:

  • data/raw/training_solutions_rev1.csv
  • data/raw/images_train_rev1/*.jpg
  • data/raw/images_test_rev1/*.jpg

Note that the zip file with the training images is called images_training_rev1.zip, but they should go in a directory called images_train_rev1. This is just for consistency.

Create data files

This step may be skipped. The necessary data files have been included in the git repository. Nevertheless, if you wish to regenerate them (or make changes to how they are generated), here's how to do it.

  • create data/train_ids.npy by running python create_train_ids_file.py.
  • create data/test_ids.npy by running python create_test_ids_file.py.
  • create data/solutions_train.npy by running python convert_training_labels_to_npy.py.
  • create data/pysex_params_extra_*.npy.gz by running python extract_pysex_params_extra.py.
  • create data/pysex_params_gen2_*.npy.gz by running python extract_pysex_params_gen2.py.

Copy data to RAM

Copy the train and test images to /dev/shm by running:

python copy_data_to_shm.py

If you don't want to do this, you'll need to modify the realtime_augmentation.py file in a few places. Please refer to the documentation for more information.

Train the networks

To train the best single model, run:

python try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.py

On a GeForce GTX 680, this took about 67 hours to run to completion. The prediction file generated by this script, predictions/final/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.csv.gz, should get you a score that's good enough to land in the #1 position (without any model averaging). You can similarly run the other try_*.py scripts to train the other models I used in the winning ensemble.

If you have more than 2GB of GPU memory, I recommend disabling Theano's garbage collector with allow_gc=False in your .theanorc file or in the THEANO_FLAGS environment variable, for a nice speedup. Please refer to the Theano documentation for more information on how to get the most out Theano's GPU support.

Generate augmented predictions

To generate predictions which are averaged across multiple transformations of the input, run:

python predict_augmented_npy_maxout2048_extradense.py

This takes just over 4 hours on a GeForce GTX 680, and will create two files predictions/final/augmented/valid/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz and predictions/final/augmented/test/try_convnet_cc_multirotflip_3x69r45_maxout2048_extradense.npy.gz. You can similarly run the corresponding predict_augmented_npy_*.py files for the other models you trained.

Blend augmented predictions

To generate blended prediction files from all the models for which you generated augmented predictions, run:

python ensemble_predictions_npy.py

The script checks which files are present in predictions/final/augmented/test/ and uses this to determine the models for which predictions are available. It will create three files:

  • predictions/final/blended/blended_predictions_uniform.npy.gz: uniform blend.
  • predictions/final/blended/blended_predictions.npy.gz: weighted linear blend.
  • predictions/final/blended/blended_predictions_separate.npy.gz: weighted linear blend, with separate weights for each question.

Convert prediction file to CSV

Finally, in order to prepare the predictions for submission, the prediction file needs to be converted from .npy.gz format to .csv.gz. Run the following to do so (or similarly for any other prediction file in .npy.gz format):

python create_submission_from_npy.py predictions/final/blended/blended_predictions_uniform.npy.gz

Submit predictions

Submit the file predictions/final/blended/blended_predictions_uniform.csv.gz on Kaggle to get it scored. Note that the process of generating this file involves considerable randomness: the weights of the networks are initialised randomly, the training data for each chunk is randomly selected, ... so I cannot guarantee that you will achieve the same score as I did. I did not use fixed random seeds. This might not have made much of a difference though, since different GPUs and CUDA toolkit versions will also introduce different rounding errors.

Owner
Sander Dieleman
Sander Dieleman
Turning images into '9-pan' palettes using KMeans clustering from sklearn.

img2palette Turning images into '9-pan' palettes using KMeans clustering from sklearn. Requirements We require: Pillow, for opening and processing ima

Samuel Vidovich 2 Jan 01, 2022
Automatically build ARIMA, SARIMAX, VAR, FB Prophet and XGBoost Models on Time Series data sets with a Single Line of Code. Now updated with Dask to handle millions of rows.

Auto_TS: Auto_TimeSeries Automatically build multiple Time Series models using a Single Line of Code. Now updated with Dask. Auto_timeseries is a comp

AutoViz and Auto_ViML 519 Jan 03, 2023
Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas.

Skoot is a lightweight python library of machine learning transformer classes that interact with scikit-learn and pandas. Its objective is to ex

Taylor G Smith 54 Aug 20, 2022
Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared

Feature-Engineering Required for a machine learning pipeline data preprocessing and variable engineering script needs to be prepared. When the dataset

kemalgunay 5 Apr 21, 2022
Adaptive: parallel active learning of mathematical functions

adaptive Adaptive: parallel active learning of mathematical functions. adaptive is an open-source Python library designed to make adaptive parallel fu

741 Dec 27, 2022
A series of Jupyter notebooks that walk you through the fundamentals of Machine Learning and Deep Learning in Python using Scikit-Learn, Keras and TensorFlow 2.

Machine Learning Notebooks, 3rd edition This project aims at teaching you the fundamentals of Machine Learning in python. It contains the example code

Aurélien Geron 1.6k Jan 05, 2023
Greykite: A flexible, intuitive and fast forecasting library

The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.

LinkedIn 1.7k Jan 04, 2023
Distributed deep learning on Hadoop and Spark clusters.

Note: we're lovingly marking this project as Archived since we're no longer supporting it. You are welcome to read the code and fork your own version

Yahoo 1.3k Dec 28, 2022
Iris-Heroku - Putting a Machine Learning Model into Production with Flask and Heroku

Puesta en Producción de un modelo de aprendizaje automático con Flask y Heroku L

Jesùs Guillen 1 Jun 03, 2022
Probabilistic time series modeling in Python

GluonTS - Probabilistic Time Series Modeling in Python GluonTS is a Python toolkit for probabilistic time series modeling, built around Apache MXNet (

Amazon Web Services - Labs 3.3k Jan 03, 2023
Spark development environment for k8s

Local Spark Dev Env with Docker Development environment for k8s. Using the spark-operator image to ensure it will be the same environment. Start conta

Otacilio Filho 18 Jan 04, 2022
A statistical library designed to fill the void in Python's time series analysis capabilities, including the equivalent of R's auto.arima function.

pmdarima Pmdarima (originally pyramid-arima, for the anagram of 'py' + 'arima') is a statistical library designed to fill the void in Python's time se

alkaline-ml 1.3k Dec 22, 2022
Machine Learning Algorithms

Machine-Learning-Algorithms In this project, the dataset was created through a survey opened on Google forms. The purpose of the form is to find the p

Göktuğ Ayar 3 Aug 10, 2022
Magenta: Music and Art Generation with Machine Intelligence

Magenta is a research project exploring the role of machine learning in the process of creating art and music. Primarily this involves developing new

Magenta 18.1k Dec 30, 2022
🤖 ⚡ scikit-learn tips

🤖 ⚡ scikit-learn tips New tips are posted on LinkedIn, Twitter, and Facebook. 👉 Sign up to receive 2 video tips by email every week! 👈 List of all

Kevin Markham 1.6k Jan 03, 2023
Dive into Machine Learning

Dive into Machine Learning Hi there! You might find this guide helpful if: You know Python or you're learning it 🐍 You're new to Machine Learning You

Michael Floering 11.1k Jan 03, 2023
Greykite: A flexible, intuitive and fast forecasting library

The Greykite library provides flexible, intuitive and fast forecasts through its flagship algorithm, Silverkite.

LinkedIn 1.4k Jan 15, 2022
PyHarmonize: Adding harmony lines to recorded melodies in Python

PyHarmonize: Adding harmony lines to recorded melodies in Python About To use this module, the user provides a wav file containing a melody, the key i

Julian Kappler 2 May 20, 2022
Meerkat provides fast and flexible data structures for working with complex machine learning datasets.

Meerkat makes it easier for ML practitioners to interact with high-dimensional, multi-modal data. It provides simple abstractions for data inspection, model evaluation and model training supported by

Robustness Gym 115 Dec 12, 2022
Tools for Optuna, MLflow and the integration of both.

HPOflow - Sphinx DOC Tools for Optuna, MLflow and the integration of both. Detailed documentation with examples can be found here: Sphinx DOC Table of

Telekom Open Source Software 17 Nov 20, 2022