AWS Glue ETL Code Samples

Overview

AWS Glue ETL Code Samples

This repository has samples that demonstrate various aspects of the new AWS Glue service, as well as various AWS Glue utilities.

You can find the AWS Glue open-source Python libraries in a separate repository at: awslabs/aws-glue-libs.

Content

  • FAQ and How-to

    Helps you get started using the many ETL capabilities of AWS Glue, and answers some of the more common questions people have.

Examples

You can run these sample job scripts on any of AWS Glue ETL jobs, container, or local environment.

  • Join and Relationalize Data in S3

    This sample ETL script shows you how to use AWS Glue to load, transform, and rewrite data in AWS S3 so that it can easily and efficiently be queried and analyzed.

  • Clean and Process

    This sample ETL script shows you how to take advantage of both Spark and AWS Glue features to clean and transform data for efficient analysis.

  • The resolveChoice Method

    This sample explores all four of the ways you can resolve choice types in a dataset using DynamicFrame's resolveChoice method.

  • Converting character encoding

    This sample ETL script shows you how to use AWS Glue job to convert character encoding.

Utilities

GlueCustomConnectors

AWS Glue provides built-in support for the most commonly used data stores such as Amazon Redshift, MySQL, MongoDB. Powered by Glue ETL Custom Connector, you can subscribe a third-party connector from AWS Marketplace or build your own connector to connect to data stores that are not natively supported.

marketplace

  • Development

    Development guide with examples of connectors with simple, intermediate, and advanced functionalities. These examples demonstrate how to implement Glue Custom Connectors based on Spark Data Source or Amazon Athena Federated Query interfaces and plug them into Glue Spark runtime.

  • Local Validation Tests

    This user guide describes validation tests that you can run locally on your laptop to integrate your connector with Glue Spark runtime.

  • Validation

    This user guide shows how to validate connectors with Glue Spark runtime in a Glue job system before deploying them for your workloads.

  • Glue Spark Script Examples

    Python scripts examples to use Spark, Amazon Athena and JDBC connectors with Glue Spark runtime.

  • Create and Publish Glue Connector to AWS Marketplace

    If you would like to partner or publish your Glue custom connector to AWS Marketplace, please refer to this guide and reach out to us at [email protected] for further details on your connector.

License Summary

This sample code is made available under the MIT-0 license. See the LICENSE file.

Owner
AWS Samples
AWS Samples
Weather analysis with Python, SQLite, SQLAlchemy, and Flask

Surf's Up Weather analysis with Python, SQLite, SQLAlchemy, and Flask Overview The purpose of this analysis was to examine weather trends (precipitati

Art Tucker 1 Sep 05, 2021
Big Data & Cloud Computing for Oceanography

DS2 Class 2022, Big Data & Cloud Computing for Oceanography Home of the 2022 ISblue Big Data & Cloud Computing for Oceanography class (IMT-A, ENSTA, I

Ocean's Big Data Mining 5 Mar 19, 2022
Statistical Rethinking course winter 2022

Statistical Rethinking (2022 Edition) Instructor: Richard McElreath Lectures: Uploaded Playlist and pre-recorded, two per week Discussion: Online, F

Richard McElreath 3.9k Dec 31, 2022
Feature Detection Based Template Matching

Feature Detection Based Template Matching The classification of the photos was made using the OpenCv template Matching method. Installation Use the pa

Muhammet Erem 2 Nov 18, 2021
Ejercicios Panda usando Pandas

Readme Below we add configuration details to locally test your application To co

1 Jan 22, 2022
MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020]

MEAD: A Large-scale Audio-visual Dataset for Emotional Talking-face Generation [ECCV2020] by Kaisiyuan Wang, Qianyi Wu, Linsen Song, Zhuoqian Yang, Wa

112 Dec 28, 2022
PyClustering is a Python, C++ data mining library.

pyclustering is a Python, C++ data mining library (clustering algorithm, oscillatory networks, neural networks). The library provides Python and C++ implementations (C++ pyclustering library) of each

Andrei Novikov 1k Jan 05, 2023
Python script for transferring data between three drives in two separate stages

Waterlock Waterlock is a Python script meant for incrementally transferring data between three folder locations in two separate stages. It performs ha

David Swanlund 13 Nov 10, 2021
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Autopsy Module to analyze Registry Hives based on bookmarks provided by EricZimmerman for his tool RegistryExplorer

Mohammed Hassan 13 Mar 31, 2022
Tkinter Izhikevich Neuron Model With Python

TKINTER IZHIKEVICH NEURON MODEL WITH PYTHON Hodgkin-Huxley Model It is a mathematical model for the generation and transmission of action potentials i

Rabia KOÇ 8 Jul 16, 2022
Employee Turnover Analysis

Employee Turnover Analysis Submission to the DataCamp competition "Can you help reduce employee turnover?"

Jannik Wiedenhaupt 1 Feb 13, 2022
Used for data processing in machine learning, and help us to construct ML model more easily from scratch

Used for data processing in machine learning, and help us to construct ML model more easily from scratch. Can be used in linear model, logistic regression model, and decision tree.

ShawnWang 0 Jul 05, 2022
Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Hue Editor: Open source SQL Query Assistant for Databases/Warehouses

Cloudera 759 Jan 07, 2023
CPSPEC is an astrophysical data reduction software for timing

CPSPEC manual Introduction CPSPEC is an astrophysical data reduction software for timing. Various timing properties, such as power spectra and cross s

Tenyo Kawamura 1 Oct 20, 2021
VevestaX is an open source Python package for ML Engineers and Data Scientists.

VevestaX Track failed and successful experiments as well as features. VevestaX is an open source Python package for ML Engineers and Data Scientists.

Vevesta 24 Dec 14, 2022
A Pythonic introduction to methods for scaling your data science and machine learning work to larger datasets and larger models, using the tools and APIs you know and love from the PyData stack (such as numpy, pandas, and scikit-learn).

This tutorial's purpose is to introduce Pythonistas to methods for scaling their data science and machine learning work to larger datasets and larger models, using the tools and APIs they know and lo

Coiled 102 Nov 10, 2022
University Challenge 2021 With Python

University Challenge 2021 This repository contains: The TeX file of the technical write-up describing the University / HYPER Challenge 2021 under late

2 Nov 27, 2021
Picka: A Python module for data generation and randomization.

Picka: A Python module for data generation and randomization. Author: Anthony Long Version: 1.0.1 - Fixed the broken image stuff. Whoops What is Picka

Anthony 108 Nov 30, 2021
Investigating EV charging data

Investigating EV charging data Introduction: Got an opportunity to work with a home monitoring technology company over the last 6 months whose goal wa

Yash 2 Apr 07, 2022