Enterprise Scale NLP with Hugging Face & SageMaker Workshop series

Overview

Workshop: Enterprise-Scale NLP with Hugging Face & Amazon SageMaker

Earlier this year we announced a strategic collaboration with Amazon to make it easier for companies to use Hugging Face Transformers in Amazon SageMaker, and ship cutting-edge Machine Learning features faster. We introduced new Hugging Face Deep Learning Containers (DLCs) to train and deploy Hugging Face Transformers in Amazon SageMaker.

In addition to the Hugging Face Inference DLCs, we created a Hugging Face Inference Toolkit for SageMaker. This Inference Toolkit leverages the pipelines from the transformers library to allow zero-code deployments of models, without requiring any code for pre-or post-processing.

In October and November, we held a workshop series on “Enterprise-Scale NLP with Hugging Face & Amazon SageMaker”. This workshop series consisted out of 3 parts and covers:

  • Getting Started with Amazon SageMaker: Training your first NLP Transformer model with Hugging Face and deploying it
  • Going Production: Deploying, Scaling & Monitoring Hugging Face Transformer models with Amazon SageMaker
  • MLOps: End-to-End Hugging Face Transformers with the Hub & SageMaker Pipelines

We recorded all of them so you are now able to do the whole workshop series on your own to enhance your Hugging Face Transformers skills with Amazon SageMaker or vice-versa.

Below you can find all the details of each workshop and how to get started.

🧑🏻‍💻 Github Repository: https://github.com/philschmid/huggingface-sagemaker-workshop-series

📺   Youtube Playlist: https://www.youtube.com/playlist?list=PLo2EIpI_JMQtPhGR5Eo2Ab0_Vb89XfhDJ

Note: The Repository contains instructions on how to access a temporary AWS, which was available during the workshops. To be able to do the workshop now you need to use your own or your company AWS Account.

In Addition to the workshop we created a fully dedicated Documentation for Hugging Face and Amazon SageMaker, which includes all the necessary information. If the workshop is not enough for you we also have 15 additional getting samples Notebook Github repository, which cover topics like distributed training or leveraging Spot Instances.

Workshop 1: Getting Started with Amazon SageMaker: Training your first NLP Transformer model with Hugging Face and deploying it

In Workshop 1 you will learn how to use Amazon SageMaker to train a Hugging Face Transformer model and deploy it afterwards.

  • Prepare and upload a test dataset to S3
  • Prepare a fine-tuning script to be used with Amazon SageMaker Training jobs
  • Launch a training job and store the trained model into S3
  • Deploy the model after successful training

🧑🏻‍💻 Code Assets: https://github.com/philschmid/huggingface-sagemaker-workshop-series/tree/main/workshop_1_getting_started_with_amazon_sagemaker

📺  Youtube: https://www.youtube.com/watch?v=pYqjCzoyWyo&list=PLo2EIpI_JMQtPhGR5Eo2Ab0_Vb89XfhDJ&index=6&t=5s&ab_channel=HuggingFace

Workshop 2: Going Production: Deploying, Scaling & Monitoring Hugging Face Transformer models with Amazon SageMaker

In Workshop 2 learn how to use Amazon SageMaker to deploy, scale & monitor your Hugging Face Transformer models for production workloads.

  • Run Batch Prediction on JSON files using a Batch Transform
  • Deploy a model from hf.co/models to Amazon SageMaker and run predictions
  • Configure autoscaling for the deployed model
  • Monitor the model to see avg. request time and set up alarms

🧑🏻‍💻 Code Assets: https://github.com/philschmid/huggingface-sagemaker-workshop-series/tree/main/workshop_2_going_production

📺  Youtube: https://www.youtube.com/watch?v=whwlIEITXoY&list=PLo2EIpI_JMQtPhGR5Eo2Ab0_Vb89XfhDJ&index=6&t=61s

Workshop 3: MLOps: End-to-End Hugging Face Transformers with the Hub & SageMaker Pipelines

In Workshop 3 learn how to build an End-to-End MLOps Pipeline for Hugging Face Transformers from training to production using Amazon SageMaker.

We are going to create an automated SageMaker Pipeline which:

  • processes a dataset and uploads it to s3
  • fine-tunes a Hugging Face Transformer model with the processed dataset
  • evaluates the model against an evaluation set
  • deploys the model if it performed better than a certain threshold

🧑🏻‍💻 Code Assets: https://github.com/philschmid/huggingface-sagemaker-workshop-series/tree/main/workshop_3_mlops

📺  Youtube: https://www.youtube.com/watch?v=XGyt8gGwbY0&list=PLo2EIpI_JMQtPhGR5Eo2Ab0_Vb89XfhDJ&index=7

Access Workshop AWS Account

For this workshop you’ll get access to a temporary AWS Account already pre-configured with Amazon SageMaker Notebook Instances. Follow the steps in this section to login to your AWS Account and download the workshop material.

1. To get started navigate to - https://dashboard.eventengine.run/login

setup1

Click on Accept Terms & Login

2. Click on Email One-Time OTP (Allow for up to 2 mins to receive the passcode)

setup2

3. Provide your email address

setup3

4. Enter your OTP code

setup4

5. Click on AWS Console

setup5

6. Click on Open AWS Console

setup6

7. In the AWS Console click on Amazon SageMaker

setup7

8. Click on Notebook and then on Notebook instances

setup8

9. Create a new Notebook instance

setup9

10. Configure Notebook instances

  • Make sure to increase the Volume Size of the Notebook if you want to work with big models and datasets
  • Add your IAM_Role with permissions to run your SageMaker Training And Inference Jobs
  • Add the Workshop Github Repository to the Notebook to preload the notebooks: https://github.com/philschmid/huggingface-sagemaker-workshop-series.git

setup10

11. Open the Lab and select the right kernel you want to do and have fun!

Open the workshop you want to do (workshop_1_getting_started_with_amazon_sagemaker/) and select the pytorch kernel

setup11

Owner
Philipp Schmid
Machine Learning Engineer & Tech Lead at Hugging Face👨🏻‍💻 🤗 Cloud enthusiast ☁️ AWS ML HERO 🦸🏻‍♂️ Nuremberg 🇩🇪
Philipp Schmid
The code from the whylogs workshop in DataTalks.Club on 29 March 2022

whylogs Workshop The code from the whylogs workshop in DataTalks.Club on 29 March 2022 whylogs - The open source standard for data logging (Don't forg

DataTalksClub 12 Sep 05, 2022
Auto_code_complete is a auto word-completetion program which allows you to customize it on your needs

auto_code_complete is a auto word-completetion program which allows you to customize it on your needs. the model for this program is one of the deep-learning NLP(Natural Language Process) model struc

RUO 2 Feb 22, 2022
Simple Text-To-Speech Bot For Discord

Simple Text-To-Speech Bot For Discord This is a very simple TTS bot for discord made with python. For this bot you need FFMPEG, see installation to se

1 Sep 26, 2022
Visual Automata is a Python 3 library built as a wrapper for Caleb Evans' Automata library to add more visualization features.

Visual Automata Copyright 2021 Lewi Lie Uberg Released under the MIT license Visual Automata is a Python 3 library built as a wrapper for Caleb Evans'

Lewi Uberg 55 Nov 17, 2022
Examples of using sparse attention, as in "Generating Long Sequences with Sparse Transformers"

Status: Archive (code is provided as-is, no updates expected) Update August 2020: For an example repository that achieves state-of-the-art modeling pe

OpenAI 1.3k Dec 28, 2022
Common Voice Dataset explorer

Common Voice Dataset Explorer Common Voice Dataset is by Mozilla Made during huggingface finetuning week Usage pip install -r requirements.txt streaml

Ceyda Cinarel 22 Nov 16, 2022
txtai: Build AI-powered semantic search applications in Go

txtai: Build AI-powered semantic search applications in Go txtai executes machine-learning workflows to transform data and build AI-powered semantic s

NeuML 49 Dec 06, 2022
Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings

Cải thiện Elasticsearch trong bài toán semantic search sử dụng phương pháp Sentence Embeddings Trong bài viết này mình sẽ sử dụng pretrain model SimCS

Vo Van Phuc 18 Nov 25, 2022
100+ Chinese Word Vectors 上百种预训练中文词向量

Chinese Word Vectors 中文词向量 中文 This project provides 100+ Chinese Word Vectors (embeddings) trained with different representations (dense and sparse),

embedding 10.4k Jan 09, 2023
Sequence Modeling with Structured State Spaces

Structured State Spaces for Sequence Modeling This repository provides implementations and experiments for the following papers. S4 Efficiently Modeli

HazyResearch 902 Jan 06, 2023
1 Jun 28, 2022
A simple tool to update bib entries with their official information (e.g., DBLP or the ACL anthology).

Rebiber: A tool for normalizing bibtex with official info. We often cite papers using their arXiv versions without noting that they are already PUBLIS

(Bill) Yuchen Lin 2k Jan 01, 2023
HF's ML for Audio study group

Hugging Face Machine Learning for Audio Study Group Welcome to the ML for Audio Study Group. Through a series of presentations, paper reading and disc

Vaibhav Srivastav 110 Jan 01, 2023
Labelling platform for text using distant supervision

With DataQA, you can label unstructured text documents using rule-based distant supervision.

245 Aug 05, 2022
Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET

Training COMET using seq2seq setting Use AutoModelForSeq2SeqLM in Huggingface Transformers to train COMET. The codes are modified from run_summarizati

tqfang 9 Dec 17, 2022
Translation to python of Chris Sims' optimization function

pycsminwel This is a locol minimization algorithm. Uses a quasi-Newton method with BFGS update of the estimated inverse hessian. It is robust against

Gustavo Amarante 1 Mar 21, 2022
Différents programmes créant une interface graphique a l'aide de Tkinter pour simplifier la vie des étudiants.

GP211-Grand-Projet Ce repertoire contient tout les programmes nécessaires au bon fonctionnement de notre projet-logiciel. Cette interface graphique es

1 Dec 21, 2021
Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module.

Import Subtitles for Blender VSE Addon for adding subtitle files to blender VSE as Text sequences. Using pysub2 python module. Supported formats by py

4 Feb 27, 2022
Coreference resolution for English, German and Polish, optimised for limited training data and easily extensible for further languages

Coreferee Author: Richard Paul Hudson, msg systems ag 1. Introduction 1.1 The basic idea 1.2 Getting started 1.2.1 English 1.2.2 German 1.2.3 Polish 1

msg systems ag 169 Dec 21, 2022
Python code for ICLR 2022 spotlight paper EViT: Expediting Vision Transformers via Token Reorganizations

Expediting Vision Transformers via Token Reorganizations This repository contain

Youwei Liang 101 Dec 26, 2022