Using AWS Batch jobs to bulk copy/sync files in S3

Overview

This guide details how to use AWS Batch to perform bulk copy/sync operations on files in S3. Batch lets users run massively scalable computing jobs on AWS by provisioning the optimal compute resources needed, so the user can focus on configuring what the job should do instead of provisioning infrastructure.

Here we use Batch's managed compute environment with the Fargate provisioning model, where Batch runs containers without the user having to manage the underlying EC2 instances. We will package our application in an ECR container image that Fargate uses to launch the environment where the job runs.

With Batch configured, the user uploads a .csv file to an S3 location whose data events are logged in CloudTrail. EventBridge monitors these events and kicks off the Batch job once the appropriate file is uploaded. The .csv file contains a list of S3 source/destination pairs to be copied/synced by the job, as detailed below. For accessing S3 resources in different AWS accounts, be sure to look at the IAM Roles section below.

Architecture

This is an overview of the architecture described above:

[Image: awsBatchS3SyncArch]

ECR Image

The ECR image contains our application logic to sync/copy S3 files based on the CSV input. This is done in the Python script s3CopySyncScript.py. If a file is given as a source/destination pair, the script performs a managed transfer using the copy API; if a prefix is given, it uses the AWS CLI to perform an aws s3 sync.
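For reference, here is a minimal sketch of what that logic can look like. This is an assumed reconstruction for illustration, not the repo's exact script; argument parsing and error handling are simplified.

import csv
import subprocess
import sys

import boto3

s3 = boto3.resource("s3")


def split_s3_uri(uri):
    # Split "s3://bucket/key" into (bucket, key).
    bucket, _, key = uri.replace("s3://", "", 1).partition("/")
    return bucket, key


def copy_or_sync(source, destination, sync_delete):
    if source.endswith("/"):
        # Prefix pair: delegate to the AWS CLI's recursive sync.
        cmd = ["aws", "s3", "sync", source, destination]
        if sync_delete:
            cmd.append("--delete")
        subprocess.run(cmd, check=True)
    else:
        # Single object: managed transfer via the boto3 copy API.
        src_bucket, src_key = split_s3_uri(source)
        dst_bucket, dst_key = split_s3_uri(destination)
        s3.Bucket(dst_bucket).copy({"Bucket": src_bucket, "Key": src_key}, dst_key)


if __name__ == "__main__":
    s3_bucket, s3_key, header, sync_delete = sys.argv[1:5]
    # Download the input CSV that triggered the job, then process each row.
    s3.Bucket(s3_bucket).download_file(s3_key, "/tmp/input.csv")
    with open("/tmp/input.csv", newline="") as f:
        rows = csv.reader(f)
        if header == "True":
            next(rows)  # skip the header row
        for source, destination in rows:
            copy_or_sync(source.strip(), destination.strip(), sync_delete == "True")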

The Dockerfile builds an image based on AL2, installing the AWS CLI, python3, and boto3, and setting other S3 configuration for optimal transfers.
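As an illustration, a Dockerfile along these lines would match that description; this is a sketch under those assumptions, not necessarily the sample's exact file.

FROM amazonlinux:2

# Install python3 (which brings pip3), then the AWS CLI and boto3.
RUN yum install -y python3 && \
    pip3 install awscli boto3

# Example S3 transfer tuning for higher throughput; adjust to your workload.
RUN aws configure set default.s3.max_concurrent_requests 20 && \
    aws configure set default.s3.multipart_chunksize 16MB

COPY s3CopySyncScript.py /
WORKDIR /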

Create an ECR repository and use these commands to build and push the image as latest using the CLI:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <<account_id>>.dkr.ecr.us-east-1.amazonaws.com

docker build -t <<repo_name>> .

docker tag <<repo_name>>:latest <<account_id>>.dkr.ecr.us-east-1.amazonaws.com/<<repo_name>>:latest

docker push <<account_id>>.dkr.ecr.us-east-1.amazonaws.com/<<repo_name>>:latest

S3 Input Location

The .csv file will be uploaded to S3 to kick off the job, so designate a location for it. In this example, I've created a sample bucket and will be uploading files under the prefix input. Notice also that the uploaded CSV is named as such: s3_batch_sync_input-*.csv. Using a naming convention like this can simplify EventBridge matching, as we'll see below.

[Image: s3InputSample]

CSV File Format Details

The .csv file should be formatted as follows:

source,destination
s3://some-bucket/source/,s3://some-bucket/dest/
s3://some-bucket/sourcetwo/,s3://some-bucket/desttwo/
s3://some-bucket/sourceindiv/individualFile.txt,s3://some-bucket/dest/individualFile.txt

The first two rows in this example are prefixes where we want an s3 sync to occur. The last row is a specific object we want to copy directly. The buckets/AWS accounts do not need to be the same, as long as IAM permissions are properly applied as noted below.

Cloudtrail Monitoring

Set up a trail that will monitor the S3 location where the input file will land. When creating the trail you can set these options as you see fit: name, an S3 location for the logs, log encryption using KMS, and log validation. At a minimum, you will need to make sure that S3 data events are enabled for the location where the input file lands:

[Image: cloudTrailS3DataEvents]

AWS Batch

Compute Environment

This will serve as a pool that our Batch jobs can pull resources from. For this example, create a managed environment with the provisioning model set to Fargate. Set the maximum vCPUs to put an upper limit on the Fargate resources used concurrently. Other configuration options for an AWS Batch compute environment are detailed here. Lastly, pick the VPC and subnets your environment will be located in, along with any security groups that need to be attached to instances; if you're using S3 VPC gateway endpoints, this would be key. In our example we're using the default VPC, since we're accessing S3 through the public internet. Once complete, the environment state should be ENABLED.

[Image: batchComputeEnvironment]
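If you'd rather script this step than use the console, a roughly equivalent boto3 call is sketched below; the environment name, subnets, security group, and service role ARN are placeholders.

import boto3

batch = boto3.client("batch")

# Managed Fargate compute environment; maxvCpus caps concurrent Fargate usage.
batch.create_compute_environment(
    computeEnvironmentName="s3-sync-fargate-env",  # hypothetical name
    type="MANAGED",
    state="ENABLED",
    computeResources={
        "type": "FARGATE",
        "maxvCpus": 16,
        "subnets": ["subnet-00000000000000000"],  # placeholder
        "securityGroupIds": ["sg-00000000000000000"],  # placeholder
    },
    serviceRole="arn:aws:iam::111111111111:role/AWSBatchServiceRole",  # placeholder
)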

Job Queue

AWS Batch jobs are submitted to a job queue, where they wait until compute environment resources are available. You can have multiple queues with different priorities pulling from different compute environments; more details are here. For this example, create a queue and attach it to the compute environment made previously.

[Image: batchJobQueue]
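The same step scripted with boto3 might look like this; the queue name is hypothetical, and the compute environment name matches the sketch above.

import boto3

batch = boto3.client("batch")

# Queue pulling from the Fargate compute environment created earlier.
batch.create_job_queue(
    jobQueueName="s3-sync-queue",  # hypothetical name
    state="ENABLED",
    priority=1,
    computeEnvironmentOrder=[
        {"order": 1, "computeEnvironment": "s3-sync-fargate-env"},
    ],
)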

Job Definition

The Job Definition acts as a template from which to launch our individual jobs. Detailed instructions are here for additional configuration.

  • Basics

    • Enter a name for the template and pick Fargate for the platform.
    • Depending on your anticipated use, set a retry strategy and timeout. For this example we set 2 job attempts and a timeout of 120 seconds.
    • We'll also put job logs in the default AWS Batch log group in CloudWatch, but this can be customized as detailed here.
  • Python Script Usage

    • Note the usage of the script described here, then set the container properties in the next step as required.
    • Syntax: python3 s3CopySyncScript.py <s3_bucket> <s3_key> <header> <sync_delete>
      • header indicates whether the input CSV has a header row
      • sync_delete indicates whether the --delete flag is used in case of an aws s3 sync
      • E.g.: python3 s3CopySyncScript.py my-s3-bucket s3_batch_sync_input-my-sample.csv True True
  • Container Properties

    • In the image box, put the URI of the ECR image that was created.
    • The Command is used as the CMD instruction to execute our container. In our case, we want to execute the python script and pass it our input file details.
      • In JSON form we enter: ["python3","s3CopySyncScript.py","Ref::s3_bucket","Ref::s3_key", "True", "True"]
        • In this example, I have a header in the input and am using the --delete flag on an aws s3 sync
    • For vCPUs and memory, we set 1 and 2GB to be conservative for this example. Set it as needed.
    • Job Role and Execution Role are detailed below.
    • We ticked the box to assign a public IP, since we're accessing S3 through the public internet and are using Fargate platform version 1.4.0.
  • Parameters

    • In the python command above, notice the "Ref::s3_bucket","Ref::s3_key". These are parameters to be substituted when a job is invoked through EventBridge.
    • In this section, we could set defaults for them or other parameters. See more details here.

[Image: batchJobDef]
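Pulling the settings above together, a register_job_definition call along these lines would express the same template; the image URI, role ARNs, and names are placeholders.

import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="s3-copy-sync-job-def",  # hypothetical name
    type="container",
    platformCapabilities=["FARGATE"],
    retryStrategy={"attempts": 2},
    timeout={"attemptDurationSeconds": 120},
    # Defaults for the Ref:: parameters; EventBridge overrides these per job.
    parameters={"s3_bucket": "s3-batch-sync-article", "s3_key": "input/default.csv"},
    containerProperties={
        "image": "111111111111.dkr.ecr.us-east-1.amazonaws.com/s3-copy-sync:latest",  # placeholder
        "command": ["python3", "s3CopySyncScript.py",
                    "Ref::s3_bucket", "Ref::s3_key", "True", "True"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "1"},
            {"type": "MEMORY", "value": "2048"},  # MiB
        ],
        "jobRoleArn": "arn:aws:iam::111111111111:role/s3-sync-job-role",  # placeholder
        "executionRoleArn": "arn:aws:iam::111111111111:role/s3-sync-execution-role",  # placeholder
        "networkConfiguration": {"assignPublicIp": "ENABLED"},
        "fargatePlatformConfiguration": {"platformVersion": "1.4.0"},
    },
)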

IAM Roles

Execution Role
The Execution Role is used to set up the individual ECS tasks where the Batch jobs run, and for logging. The role should have a trust relationship with ecs-tasks.amazonaws.com. In our example, the AWS managed policy AmazonECSTaskExecutionRolePolicy is attached, along with an inline policy giving it permission to create log groups if needed.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
                "logs:DescribeLogStreams"
            ],
            "Resource": [
                "arn:aws:logs:*:*:*"
            ]
        }
    ]
}

More details about ECS task execution roles are here.

Job Role
The Job Role is an IAM role used to provide AWS API access to individual running jobs. Here we configure access to the AWS resources the job touches: in our case, the files in S3. In this example we're only accessing resources in one bucket, but be sure to configure this as needed depending on your sources/destinations. Again, the role should have a trust relationship with ecs-tasks.amazonaws.com.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::s3-batch-sync-article/*",
                "arn:aws:s3:::s3-batch-sync-article"
            ]
        }
    ]
}

If you're accessing S3 objects from, or syncing to, destinations in multiple accounts, cross-account S3 resource access would need to be configured as detailed here. The account where the Batch jobs run can be considered Account A; a policy providing access to resources in AWS Account B's buckets is attached to the IAM role. In Account B, the bucket policy would be modified to allow access from Account A's IAM role. More details about these task IAM roles can be found here.
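For illustration, Account B's bucket policy could grant Account A's job role access along these lines; the role ARN and bucket name are placeholders.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<<account_a_id>>:role/s3-sync-job-role"
            },
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:ListBucket",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": [
                "arn:aws:s3:::account-b-bucket",
                "arn:aws:s3:::account-b-bucket/*"
            ]
        }
    ]
}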

EventBridge Rule & Job Invocation

Create an EventBridge rule that will invoke the AWS Batch job.

Here, the S3 uploads are being logged in CloudTrail, and an EventBridge rule will invoke the job on an appropriate upload. Using the naming convention mentioned above, we can use a custom event pattern with content filtering to trigger only on certain uploads:

{
  "source": ["aws.s3"],
  "detail-type": ["AWS API Call via CloudTrail"],
  "detail": {
    "eventSource": ["s3.amazonaws.com"],
    "eventName": ["PutObject", "CompleteMultipartUpload"],
    "requestParameters": {
      "bucketName": ["s3-batch-sync-article"],
      "key": [{
        "prefix": "input/s3_batch_sync_input-"
      }]
    }
  }
}

Here, we'll trigger the target for this rule when a file lands in the appropriate location with the required prefix.

AWS Batch Target

  • Set the target for this rule to a Batch job queue.
  • Give it the job queue and job definition set above. Provide a name for the jobs that will run.
  • Use configure input to pass details about the input file to the job. In our job, the bucket and key are required as arguments to the Python script, which we supply as job parameters.
    • Use the first input path box to extract the bucket and key from the event that triggered the EventBridge rule:
      • {"S3BucketValue":"$.detail.requestParameters.bucketName","S3KeyValue":"$.detail.requestParameters.key"}
    • The input template box lets you pass parameters or other arguments to the job that is to be invoked. Here we pass the s3_bucket and s3_key job parameters:
      • {"Parameters": {"s3_bucket": <S3BucketValue>, "s3_key": <S3KeyValue>}}
    • See more details about AWS Batch Jobs as CloudWatch Events Targets here
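Scripted with boto3, the target configuration might look like the sketch below; the rule name, ARNs, and job names are placeholders consistent with the earlier examples, and the rule itself must already exist with the event pattern shown above.

import boto3

events = boto3.client("events")

events.put_targets(
    Rule="s3-batch-sync-trigger",  # hypothetical rule name
    Targets=[
        {
            "Id": "s3-batch-sync-job-target",
            "Arn": "arn:aws:batch:us-east-1:111111111111:job-queue/s3-sync-queue",  # placeholder
            "RoleArn": "arn:aws:iam::111111111111:role/eventbridge-batch-invoke",  # placeholder
            "BatchParameters": {
                "JobDefinition": "s3-copy-sync-job-def",
                "JobName": "s3-copy-sync-job",
            },
            # Extract bucket/key from the event and pass them as job parameters.
            "InputTransformer": {
                "InputPathsMap": {
                    "S3BucketValue": "$.detail.requestParameters.bucketName",
                    "S3KeyValue": "$.detail.requestParameters.key",
                },
                "InputTemplate": '{"Parameters": {"s3_bucket": <S3BucketValue>, "s3_key": <S3KeyValue>}}',
            },
        },
    ],
)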
Owner
AWS Samples