The first machine learning framework that encourages learning ML concepts instead of memorizing class functions.

Last update: Dec 27, 2022

Overview

SeaLion

SeaLion is designed to teach aspiring ML engineers today's popular machine learning concepts in a way that builds both intuition and practical skill. We do this through concise algorithms that get the job done with as little jargon as possible, and through examples that guide you every step of the way.

Quick Demo

SeaLion in Action

General Usage

For most classifiers you can just do (we'll use Logistic Regression as an example here) :

from sealion.regression import LogisticRegression
log_reg = LogisticRegression()

to initialize, and then to train :

log_reg.fit(X_train, y_train)

and for testing :

y_pred = log_reg.predict(X_test) 
evaluation = log_reg.evaluate(X_test, y_test)

For the unsupervised clustering algorithms you may do :

from sealion.unsupervised_clustering import KMeans
kmeans = KMeans(k = 3)

and then to fit and predict :

predictions = kmeans.fit_predict(X)

Neural networks are a bit more complicated, so you may want to check an example here.

The syntax of the APIs was designed to be easy to use and familiar to users of most other ML libraries. This is to make sure both beginners and experts in the field can comfortably use SeaLion. Of course, none of the source code uses other ML frameworks.

Testimonials and Reddit Posts

"Super Expansive Python ML Library"

@Peter Washington, Stanford PhD candidate in Bio-Engineering

r/Python : r/Python Post

r/learnmachinelearning : r/learningmachinelearning Post

Installation

The package is available on PyPI. Install it like so :

pip install sealion

SeaLion only supports Python 3, so please make sure you are on a recent version.

General Information

SeaLion was built by Anish Lakkapragada, a freshman in high school, starting on Thanksgiving of 2020 and continuing into early 2021. The library is meant for beginners to use when working with standard datasets like iris, breast cancer, swiss roll, the moons dataset, MNIST, etc. The source code is smaller than that of most other ML libraries (only 4000 lines), so it's pretty easy to contribute to. He hopes to spread machine learning to other high schoolers through this library.

Documentation

All documentation is currently being put on a website. However useful it may be, I highly recommend you check the examples posted on GitHub here to see how the APIs are used and how they work.

Updates for v4.1 and up!

First things first - thank you for all of the support. The two reddit posts did much better than I expected (1.6k upvotes, about 200 comments) and I got a lot of feedback and advice. Thank you to anyone who participated in r/Python or r/learnmachinelearning.

SeaLion has also taken off with the posts. So far we have had 3 issues (1 closed) and have reached 195 stars and 20 forks. I wasn't expecting this and I am grateful to everyone who has shown their appreciation for this library.

Also some issues have popped up. Most of them can be easily solved by just deleting sealion manually (going into the folder where the source is and just deleting it - not pip uninstall) and then reinstalling the usual way, but feel free to put an issue up anytime.

In versions 4.1+ we are hoping to polish the library more. Currently 4.1 comes with Bernoulli Naive Bayes and we also have added precision, recall, and the f1 metric in the utils module. We are hoping to include Gaussian Mixture Models and Batch Normalization in the future. Code examples for these new algorithms will be created within a day or two after release. Thank you!
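To make the three new metrics concrete, here is a minimal pure-Python sketch of how precision, recall, and F1 are computed for a binary problem. The function name `precision_recall_f1` and its `positive` parameter are hypothetical and chosen for illustration; they are not taken from SeaLion's utils module, whose actual signatures may differ.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for binary labels.

    Illustrative helper only; not SeaLion's API.
    """
    pairs = list(zip(y_true, y_pred))
    # Count true positives, false positives, and false negatives.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With these toy labels there are 2 true positives, 1 false positive, and 1 false negative, so all three metrics come out to 2/3.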

Updates for v3.0.0!

SeaLion v3.0 and up has had a lot of major milestones.

The first thing is that code examples (in Jupyter notebooks) for basically all of the modules in sealion are now in the examples directory. Most of them go over using actual datasets like iris, breast cancer, moons, blobs, MNIST, etc. These were all built using v3.0.8; hopefully that clears up any confusion. I hope you enjoy them.

Perhaps the biggest change in v3.0 is how we have changed the Cython compilation. A quick primer on Cython if you are unfamiliar - you take your python code (in .py files), change it and add some return types and type declarations, put that in a .pyx file, and compile it to a .so file. The .so file is then imported in the python module which you use.

The main bug fixed was that the .so file is actually specific to the architecture of the user. I use macOS and compiled all my files to .so there, so prior to v3.0 I would just give those .so files to everybody else. However, other architectures and OSs like Ubuntu would not be able to recognize those files. Instead, what we do now is store just the .pyx files (universal for all computers) in the source code, and the first time you import sealion all of those .pyx files get compiled into .so files (so they will work for whatever you are using). This means the first import will take about 40 seconds, but after that it will be as quick as any other import.

Machine Learning Algorithms

The machine learning algorithms of SeaLion are listed below. Please note that the structure of the listing isn't meant to resemble that of SeaLion's APIs. Of course, new algorithms are being made right now.

Deep Neural Networks
- Optimizers
  - Gradient Descent (and mini-batch gradient descent)
  - Momentum Optimization w/ Nesterov Accelerated Gradient
  - Stochastic gradient descent (w/ momentum + nesterov)
  - AdaGrad
  - RMSprop
  - Adam
  - Nadam
- Layers
  - Flatten (turn 2D+ data to 2D matrices)
  - Dense (fully-connected layers)
- Regularization
  - Dropout
- Activations
  - ReLU
  - Tanh
  - Sigmoid
  - Softmax
  - Leaky ReLU
  - ELU
  - SELU
  - Swish
- Loss Functions
  - MSE (for regression)
  - CrossEntropy (for classification)
- Transfer Learning
  - Save weights (in a pickle file)
  - reload them and then enter them into the same neural network
  - this is so you don't have to start training from scratch
Regression
- Linear Regression (Normal Equation, closed-form)
- Ridge Regression (L2 regularization, closed-form solution)
- Lasso Regression (L1 regularization)
- Elastic-Net Regression
- Logistic Regression
- Softmax Regression
- Exponential Regression
- Polynomial Regression
Dimensionality Reduction
- Principal Component Analysis (PCA)
- t-distributed Stochastic Neighbor Embedding (tSNE)
Unsupervised Clustering
- KMeans (w/ KMeans++)
- DBSCAN
Naive Bayes
- Multinomial Naive Bayes
- Gaussian Naive Bayes
- Bernoulli Naive Bayes
Trees
- Decision Tree (with max_branches, min_samples regularization + CART training)
Ensemble Learning
- Random Forests
- Ensemble/Voting Classifier
Nearest Neighbors
- k-nearest neighbors
Utils
- one_hot encoder function (one_hot())
- plot confusion matrix function (confusion_matrix())
- revert one hot encoding to 1D Array (revert_one_hot())
- revert softmax predictions to 1D Array (revert_softmax())
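
For readers new to the encoding helpers listed under Utils, the sketch below shows the idea behind one-hot encoding and reverting it. The names `one_hot_sketch` and `argmax_rows` are hypothetical stand-ins written for this illustration; SeaLion's own `one_hot()`, `revert_one_hot()`, and `revert_softmax()` may take different arguments and return NumPy arrays.

```python
def one_hot_sketch(labels, num_classes=None):
    """Turn integer class labels into rows with a single 1 per row."""
    if num_classes is None:
        num_classes = max(labels) + 1  # assume labels are 0..K-1
    return [[1 if k == label else 0 for k in range(num_classes)]
            for label in labels]


def argmax_rows(rows):
    """Return the index of the largest entry in each row.

    Applied to one-hot rows this undoes the encoding; applied to
    softmax probabilities it picks the most likely class.
    """
    return [row.index(max(row)) for row in rows]


encoded = one_hot_sketch([0, 2, 1])
decoded = argmax_rows(encoded)
predicted = argmax_rows([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])
```

Reverting one-hot rows and reverting softmax outputs are the same operation here, which is why a single argmax helper covers both cases in this sketch.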

Algorithms in progress

Some of the algorithms we are working on right now.

Batch Normalization
Gaussian Mixture Models
Barnes Hut t-SNE (please, please contribute for this one)

Contributing

First, install the required libraries:

pip install -r requirements.txt

If you feel you can do something better than how it is right now in SeaLion, please do! Believe me, you will find great joy in simplifying my code (probably using numpy) and speeding it up. The major problem right now is speed: some algorithms like PCA can handle 10000+ data points, whereas tSNE scales poorly because of its O(n^2) time complexity. We have worked on this with Cython + parallel processing (thanks joblib), so algorithms (aside from neural networks) work well with <1000 points. Getting to the next level will need some help.

Most of the modules I use are numpy, pandas, joblib, and tqdm. I prefer using fewer dependencies in the code, so please keep them to a minimum.

Other than that, thanks for contributing!

Acknowledgements

Plenty of articles and people helped me along the way. One of the tougher topics I dealt with was automatic differentiation in neural networks, for which this tutorial helped me. I also got some help on the O(n^2) time complexity problem of the denominator of t-SNE from this article and understood the mathematical derivation for the gradients (the original paper didn't go over it) from here. I also used the PCA method from handsonml, so thanks for that too, Aurélien Géron. Lastly, special thanks to Evan M. Kim and Peter Washington for helping make the normal equation and the Cauchy distribution in tSNE make sense. Also thanks to @Kento Nishi for helping me understand open-source.

Feedback, comments, or questions

If you have any feedback or something you would like to tell me, please do not hesitate to share! Feel free to comment here on github or reach out to me through [email protected]!

©Anish Lakkapragada 2021

Comments

Question on linear regression program execution

When i try to execute, i see it refers to load_boston.

i see titanic_dataset. Should every data set be downloaded and loaded into Anaconda to run those ?
opened by tariqrahiman 38

Possible speedups using Cython

Hello! I saw this project on Reddit and was skeptical about whether compiling code that mainly uses NumPy with Cython provides any speedup.

It seems that it doesn't. I ran a couple of tests using this function: https://github.com/anish-lakkapragada/SeaLion/blob/1439a12880cf71888e6b858228973cbe91bc8293/sealion/cython_knn.pyx#L11-L15

I created two versions: one compiled with Cython (r2_score) and another pasted straight into Python (r2_score_python). Both contain the same exact code. I then ran a simple test in a Jupyter notebook:

y1, y2 = np.random.rand(2, 10_000_000)

%timeit r2_score(y1, y2)  # compiled with Cython
# 902 ms ± 4.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit r2_score_python(y1, y2)  # just pasted into Python
# 897 ms ± 1.91 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Not much of a difference, it seems (902 ms vs 897 ms). Somewhat surprisingly, Python + NumPy (r2_score_python) is actually faster than Cython + NumPy (r2_score).

Another solution: ugly but fast

I quickly hacked up this code:

%%cython -a --verbose
# cython: language_level=3, boundscheck=False, wraparound=False

import numpy as np
cimport numpy as np

cdef arr_sub_arr(double[:] dst, double[:] x, double[:] y):
    for i in range(x.shape[0]):
        dst[i] = x[i] - y[i]
        
cdef arr_sub_float(double[:] dst, double[:] x, double y):
    for i in range(x.shape[0]):
        dst[i] = x[i] - y
        
cdef arr_square(double[:] dst, double[:] src):    
    for i in range(src.shape[0]):
        dst[i] = src[i] * src[i]
        
cdef arr_mean(double[:] src):
    cdef Py_ssize_t n_elem = src.shape[0]
    
    return arr_sum(src) / n_elem
        
cdef arr_sum(double[:] src):
    cdef double _sum = 0.0
    
    for i in range(src.shape[0]):
        _sum += src[i]
        
    return _sum

cdef __r2_score_cython(double[:] y_pred, double[:] y_test):
    arr_sub_arr(y_pred, y_pred, y_test)
    arr_square(y_pred, y_pred)
    
    cdef double num = arr_sum(y_pred)
    cdef double y_test_mean = arr_mean(y_test)
    
    arr_sub_float(y_pred, y_test, y_test_mean)
    arr_square(y_pred, y_pred)
    cdef double denum = arr_sum(y_pred)
    
    return 1 - num / denum
    

cpdef r2_score_cython(np.ndarray y_pred, np.ndarray y_test):
    assert y_pred.ndim == y_test.ndim == 1, "Invalid number of dimensions"
    
    cdef Py_ssize_t sh1 = y_pred.shape[0]
    cdef Py_ssize_t sh2 = y_test.shape[0]
    assert sh1 == sh2, f"Shape mismatch"
    
    return __r2_score_cython(
        np.array(y_pred),  # make a copy!
        y_test
    )

Nothing fancy - just for loops and a copy of y_pred being used to store intermediate computations, so it's also memory efficient and doesn't allocate any new arrays at all.

Here code like double[:] is a memoryview, not a NumPy array. See this Cython tutorial.
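
As a way to sanity-check the loop-based version above on small inputs, a plain-Python reference for the same R² formula might look like the sketch below. The name `r2_score_reference` is hypothetical; it follows the same steps as `__r2_score_cython` (residual sum of squares divided by total sum of squares around the mean of `y_test`) but makes no attempt to be fast.

```python
def r2_score_reference(y_pred, y_test):
    """Slow, readable R^2: 1 - SS_res / SS_tot."""
    if len(y_pred) != len(y_test):
        raise ValueError("Shape mismatch")
    mean_test = sum(y_test) / len(y_test)
    # Residual sum of squares between predictions and targets.
    ss_res = sum((p - t) ** 2 for p, t in zip(y_pred, y_test))
    # Total sum of squares of the targets around their mean.
    ss_tot = sum((t - mean_test) ** 2 for t in y_test)
    return 1 - ss_res / ss_tot


perfect = r2_score_reference([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
mean_only = r2_score_reference([2.5, 2.5, 2.5, 2.5], [1.0, 2.0, 3.0, 4.0])
```

A perfect prediction scores 1, and always predicting the mean of the targets scores 0, which gives two easy fixed points to compare a compiled implementation against.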

The speedup

%timeit r2_score_cython(y1, y2)
# 85.2 ms ± 153 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Your original code ran in 897 ms. The code I quickly wrote with simple loops (but with types and compiled via Cython!) ran in 85 ms, which is 10.5 times faster!

That is to say that you can get way more speed out of Cython than you're currently getting. You get a lot of speed but now have to think about memory allocation and write for loops yourself, and this also works only for floating-point numbers - not integers or any other data types (I think this can be solved using fused types).

So, if you're into Cython, you could probably think of other ways to speed up your code.

opened by ForceBru 12

Fix typo.

Changed "know" to "now" in the file SeaLion/sealion/neural_networks/layers.py, line 69. Full phrase with change:

...every neuron now has a better set of weights.

opened by phuang1024 1
Documentation and Cleanup
Cleanup

Cleanup setup.py, sealion/__init__.py, and PyPI upload workflow (.github/workflows/build.yaml)

Docs

Used sphinx for docs (which allows hosting on readthedocs).

To build and view locally:

pip install sphinx sphinx_rtd_theme
cd docs
make html

and open the file docs/_build/html/index.html in a browser.

I deleted a lot of the Methods sections in the docstrings because those are automatically covered by Sphinx.

I probably missed some stuff so if you notice please alert me.

Testing

Documentation build works. setup.py works. Importing the module works (builds Cython automatically). Not sure if PyPI workflow works because I can't test it.
opened by phuang1024 0
Formatting and Small Bug Fixes
Strip before splitting in SeaLion/setup.py. Used to leave a blank entry at the end of requirements list: ["module1", "module2", ""]
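
The blank entry comes from how `str.split` treats a trailing newline. The snippet below reproduces the behavior with a made-up requirements string; the exact code in setup.py is not shown here, so treat the variable names as illustrative.

```python
# Stand-in for the contents of a requirements file ending in a newline.
raw = "module1\nmodule2\n"

# Splitting first leaves an empty string after the final newline.
without_strip = raw.split("\n")

# Stripping first removes the trailing newline, so no blank entry appears.
with_strip = raw.strip().split("\n")
```

Here `without_strip` is `["module1", "module2", ""]` and `with_strip` is `["module1", "module2"]`.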

Markdown formatting in SeaLion/README.md, such as removing extra spaces.
opened by phuang1024 0
"pip uninstall" not removing all files

SeaLion uses cython code in .pyx files and then compiles that into .so files that are then imported in python .py files that you call. This is for speed benefits.

When you do "pip(3) install sealion" what you are doing is getting all of the files in this directory, which do not include the .so files, just the .pyx. On the first import you compile all the .pyx files into .so files; I don't hand you my .so files as they are OS dependent.

This means that the generated .so (and .c and .o) files do not get deleted by "pip(3) uninstall sealion". This leads to some problems. If you reinstall sealion with a new release, you are still going to have the .so files compiled from the old .pyx files, when what you want is new .so files compiled from the new .pyx files. I think this is how it works, please correct me if I am wrong.

Any solutions to this? Any ideas, questions, solutions, etc. are GREATLY APPRECIATED. Thank you!
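
One possible workaround, offered only as a sketch, is a small cleanup helper that finds generated files sitting next to .pyx sources and deletes them before reinstalling. The function names `find_generated_files` and `remove_generated_files` are hypothetical and not part of SeaLion, and the sketch assumes the generated files share the base name of their .pyx source (for example `cython_knn.c` or `cython_knn.cpython-39-darwin.so`).

```python
from pathlib import Path

# File types the issue above says are left behind after uninstall.
GENERATED_SUFFIXES = {".so", ".c", ".o"}


def _base_name(path):
    """Name up to the first dot, e.g. 'cython_knn' for 'cython_knn.cpython-39-darwin.so'."""
    return path.name.split(".")[0]


def find_generated_files(package_dir):
    """List .so/.c/.o files whose base name matches a .pyx file in package_dir."""
    package_dir = Path(package_dir)
    pyx_names = {_base_name(p) for p in package_dir.rglob("*.pyx")}
    matches = [
        path
        for path in package_dir.rglob("*")
        if path.is_file()
        and path.suffix in GENERATED_SUFFIXES
        and _base_name(path) in pyx_names
    ]
    return sorted(matches)


def remove_generated_files(package_dir, dry_run=True):
    """Delete the generated files; with dry_run=True only report them."""
    targets = find_generated_files(package_dir)
    if not dry_run:
        for path in targets:
            path.unlink()
    return targets
```

Running it with `dry_run=True` first shows what would be deleted. Files that do not match a .pyx source are left alone, so unrelated shared libraries in the same folder are not touched.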
bug help wanted question

opened by anish-lakkapragada 0

Releases(v4.4.5)

v4.4.5(May 8, 2022)

plz i am desperate to run sealion in my browser with pyscript

v4.4.4(May 8, 2022)

new release drop the base boast

v4.4.3(May 8, 2022)

fixed skill issue

v4.4.2(Oct 17, 2021)

uh idk my mans phuang1024 wanted a release

v4.4.1(Jun 4, 2021)

Batch Normalization First Release!

v4.4.0(May 15, 2021)

updating ridge regression, should be = (X^TX + alpha * I)X^Ty not (X^TX + alpha + I)X^Ty

v4.3.9(Apr 9, 2021)

PReLU fix

v4.3.8(Apr 9, 2021)

v4.3.7(Apr 9, 2021)

fixing bug

v4.3.6(Apr 9, 2021)

v4.3.5(Apr 9, 2021)

Dropout fix

v4.3.4(Apr 1, 2021)

adding safe softmax to Softmax Regression (thanks scipy)

v4.3.3(Mar 23, 2021)

v4.3.2(Mar 23, 2021)

PReLU + AdaBelief added

v4.3.1(Mar 19, 2021)

readme edits

v4.2.9(Mar 19, 2021)

v4.2.8(Mar 5, 2021)

fixing inf stability

v4.2.7(Mar 5, 2021)

updating setup.py to make sure gmms work

v4.2.6(Mar 5, 2021)

making sure GMMs can be accessed

v4.2.5(Mar 5, 2021)

4.2.5 comes with Gaussian Mixtures!

v4.2.4(Feb 17, 2021)

fixing perplexity function in tSNE

v4.2.3(Feb 16, 2021)

fixing something, to make sure the License OSI will show up

v4.2.2(Feb 16, 2021)

v4.2.1(Feb 16, 2021)

more license updates (MIT -> Apache)

v4.2(Feb 16, 2021)

license update on setup.cfg file

v4.1.9(Feb 15, 2021)

change to the readme

v4.1.8(Feb 12, 2021)

#5 fix

v4.1.7(Feb 12, 2021)

v4.1.6(Feb 12, 2021)

v4.1.5(Feb 12, 2021)
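
For readers following the v4.4.0 note on ridge regression, the standard textbook closed-form solution is written out below for reference. This is the general formula and is not copied from SeaLion's source; it includes a matrix inverse that the one-line release note does not spell out. Here $X$ is the design matrix, $y$ the target vector, $\alpha$ the regularization strength, and $I$ the identity matrix. Whether the bias term is regularized is an implementation choice not stated in the note.

```latex
\hat{\theta} = \left( X^{\top} X + \alpha I \right)^{-1} X^{\top} y
```

The fix described in the note corresponds to the term $\alpha I$: the regularization strength scales the identity matrix rather than being added to it.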

Owner

Anish

14-year old developer with interest in machine learning and the theory that drives it. Sole author of SeaLion.

GitHub Repository https://pypi.org/project/sealion/
