Fastest Way to Download 3 Million Objects from an S3 Bucket

Okay, I figured out a solution based on @Matt Billenstien's hint. It uses the eventlet library. The first step is the most important one here (monkey-patching the standard I/O libraries).

Run this script in the background with nohup and you're all set.

from eventlet import *
patcher.monkey_patch(all=True)

import os, sys, time
from boto.s3.connection import S3Connection
from boto.s3.bucket import Bucket

import logging

logging.basicConfig(filename="s3_download.log", level=logging.INFO)

def download_file(key_name):
    # It's important to download each key over a fresh connection
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")
    key = bucket.get_key(key_name)

    try:
        key.get_contents_to_filename(key.name)
    except Exception:
        logging.info(key.name + ": FAILED")

if __name__ == "__main__":
    conn = S3Connection("KEY", "SECRET")
    bucket = Bucket(connection=conn, name="BUCKET")

    logging.info("Fetching bucket list")
    bucket_list = bucket.list(prefix="PREFIX")

    logging.info("Creating a pool")
    pool = GreenPool(size=20)

    logging.info("Saving files in bucket...")
    for key in bucket_list:
        pool.spawn_n(download_file, key.key)
    pool.waitall()

How to speed up download of millions of files from AWS S3

My first attempt would be to provision an instance in us-east-1 with an io-type (provisioned IOPS) EBS volume of the required size. From what I see, there is about 14 GB of data from 2018 and 15 GB from 2019, so an instance with 40-50 GB should be enough. Or, as pointed out in the comments, you can have two instances, one for the 2018 files and a second for the 2019 files. This way you can download the two sets in parallel.

Then you attach an IAM role to the instance which allows S3 access. With this in place, you run your aws s3 sync command on the instance. The traffic between S3 and your instance should be much faster than to your local workstation.

Once you have all the files, you zip them and then download the zip file. Zipping should help a lot, as the IRS files are text-based XML. Alternatively, maybe you could just process the files on the instance itself, without the need to download them to your local workstation.

General recommendations on speeding up transfers between S3 and EC2 instances are listed in the AWS Knowledge Center:

  • How can I improve the transfer speeds for copying data between my S3 bucket and EC2 instance?

Download a million files from an S3 bucket

Use the AWS Command-Line Interface (CLI):

aws s3 sync s3://bucket/VER1 [name-of-local-directory]

From my experience, it will download in parallel but it won't necessarily use the full bandwidth because there is a lot of overhead for each object. (It is more efficient for large objects, since there is less overhead.)

It is possible that aws s3 sync might have problems with a large number of files. You'd have to try it to see whether it works.

If you really wanted full performance, you could write your own code that downloads in massive parallel, but the time saving would probably be lost in the time it takes you to write and test such a program.
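If you do go down that route, a minimal sketch of such a downloader (assuming boto3; the bucket name, prefix, and worker count are placeholders you would tune) might look like this:

import concurrent.futures
import os

import boto3

BUCKET = "bucket"      # placeholder, matching the sync example above
PREFIX = "VER1/"       # placeholder prefix

s3 = boto3.client("s3")

def download(key):
    # Mirror the key layout on local disk
    local_path = os.path.join("download", key)
    os.makedirs(os.path.dirname(local_path), exist_ok=True)
    s3.download_file(BUCKET, key, local_path)
    return key

# Collect the keys first (paginated, 1,000 per page), then download in parallel
keys = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    keys.extend(obj["Key"] for obj in page.get("Contents", []))

with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    for done in pool.map(download, keys):
        pass  # progress / error handling would go here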

Another option is to use aws s3 sync to download to an Amazon EC2 instance, then zip the files and simply download the zip file. That would reduce bandwidth requirements.

Faster way to Copy S3 files

The most likely reason that you can only copy 500k objects per day (thus taking about 3-4 months to copy 50M objects, which is absolutely unreasonable) is because you're doing the operations sequentially.

The vast majority of the time your code is running is spent waiting for the S3 Copy Object request to be sent to S3, processed by S3 (i.e., copying the object), and then for the response to be sent back to you. On average, this takes around 170 ms per object (500k/day works out to roughly one object every 170 ms), which is reasonable for sequential requests.

To dramatically improve the performance of your copy operation, you should simply parallelize it: make many threads run the copies concurrently.

Once the Copy commands are not the bottleneck anymore (i.e., after you make them run concurrently), you'll encounter another bottleneck: the List Objects requests. This request runs sequentially, and returns only up to 1k keys per page, so you'll end up having to send around 50k List Object requests sequentially with the straightforward, naive code (here, "naive" == list without any prefix or delimiter, wait for the response, and list again with the provided next continuation token to get the next page).

Two possible solutions for the ListObjects bottleneck:

  • If you know the structure of your bucket pretty well (i.e., the "names of the folders", statistics on the distribution of "files" within those "folders", etc.), you could try to parallelize the ListObjects requests by making each thread list a given prefix, as sketched after this list. Note that this is not a general solution: it requires intimate knowledge of the structure of the bucket, and it usually only works well if the bucket's structure was planned out originally to support this kind of operation.

  • Alternatively, you can ask S3 to generate an inventory of your bucket. You'll have to wait at most 1 day, but you'll end up with CSV files (or ORC, or Parquet) containing information about all the objects in your bucket.
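A rough sketch of the first option (parallel listing by known prefixes), assuming boto3 and a hypothetical set of top-level prefixes:

import concurrent.futures

import boto3

BUCKET = "source-bucket"                 # placeholder
PREFIXES = ["2018/", "2019/", "2020/"]   # hypothetical known "folders"

s3 = boto3.client("s3")

def list_prefix(prefix):
    # Each worker pages through one prefix independently
    keys = []
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=BUCKET, Prefix=prefix):
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    return keys

with concurrent.futures.ThreadPoolExecutor(max_workers=len(PREFIXES)) as pool:
    all_keys = [key for keys in pool.map(list_prefix, PREFIXES) for key in keys]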

Either way, once you have the list of objects, you can have your code read the inventory (e.g., from local storage such as your local disk if you can download and store the files, or even by just sending a series of ListObjects and GetObject requests to S3 to retrieve the inventory), and then spin up a bunch of worker threads and run the S3 Copy Object operation on the objects, after deciding which ones to copy and the new object keys (i.e., your logic).

In short:

  1. grab a list of all the objects first;

  2. then launch many workers to run the copies.
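A minimal sketch of that two-step approach, assuming boto3, placeholder bucket names, and that the key list has already been collected (from an inventory or a saved listing); the renaming logic is left as a stub:

import concurrent.futures
import random

import boto3

SRC_BUCKET = "source-bucket"   # placeholders
DST_BUCKET = "dest-bucket"

s3 = boto3.client("s3")

def copy_one(key):
    # Server-side copy: the object data never passes through this machine.
    # Note: copy_object handles objects up to 5 GB; larger ones need a multipart copy.
    s3.copy_object(
        Bucket=DST_BUCKET,
        Key=key,   # stub: apply your own renaming logic here
        CopySource={"Bucket": SRC_BUCKET, "Key": key},
    )
    return key

# Step 1: load the full key list gathered earlier (inventory, saved listing, ...)
with open("all_keys.txt") as f:
    keys = [line.strip() for line in f if line.strip()]
random.shuffle(keys)   # spread the load across S3 partitions (see the note below)

# Step 2: a bounded pool of workers runs the copies concurrently
with concurrent.futures.ThreadPoolExecutor(max_workers=200) as pool:
    for _ in pool.map(copy_one, keys):
        pass  # progress / retry handling would go here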

One thing to watch out for here is if you launch an absurdly high number of workers and they all end up hitting the exact same partition of S3 for the copies. In such a scenario, you could end up getting some errors from S3. To reduce the likelihood of this happening, here are some things you can do:

  • instead of going sequentially over your list of objects, you could randomize it. E.g., load the inventory, put the items into a queue in a random order, and then have your workers consume from that queue. This will decrease the likelihood of overheating a single S3 partition.

  • keep your workers to not more than a few hundred (a single S3 partition should be able to easily keep up with many hundreds of requests per second).

Final note: there's another thing to consider which is whether or not the bucket may be modified during your copy operation. If it could be modified, then you'll need a strategy to deal with objects that might not be copied because they weren't listed, or with objects that were copied by your code but got deleted from the source.

Amazon S3 Store Millions of Files

You say that you have "100s of millions of files", so I shall assume you have 400 million objects, making 40TB of storage. Please adjust accordingly. I have shown my calculations so that people can help identify my errors.

Initial upload

PUT requests in Amazon S3 are charged at $0.005 per 1,000 requests. Therefore, 400 million PUTs would cost $2000. (.005*400m/1000)

This cost cannot be avoided if you wish to create them all as individual objects.

Future uploads would cost the same, at $5 per million.

Storage

Standard storage costs $0.023 per GB per month, so storing 400 million 100KB objects would cost $920/month. (.023*400m*100/1m)

Storage costs can be reduced by using lower-cost Storage Classes.
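For example, a lifecycle rule can transition objects to a cheaper class after some time. A hedged sketch with boto3 (the bucket name and the 30-day threshold are placeholders; note that Standard-IA has a minimum billable object size and retrieval fees, which matter for small objects like these):

import boto3

s3 = boto3.client("s3")

# Transition everything in the bucket to Standard-IA 30 days after creation
s3.put_bucket_lifecycle_configuration(
    Bucket="my-bucket",                       # placeholder
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "to-infrequent-access",
                "Filter": {"Prefix": ""},     # empty prefix = whole bucket
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                ],
            },
        ]
    },
)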

Access

GET requests are $0.0004 per 1,000 requests, so downloading 1 million objects each month would cost 40c/month. (.0004*1m/1000)

If the data is being transferred to the Internet, Data Transfer costs of $0.09 per GB would apply. The Data Transfer cost of downloading 1 million 100KB objects would be $9/month. (.09*1m*100/1m)
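As a quick check on the arithmetic above (prices as quoted in this answer; verify against current S3 pricing):

objects = 400_000_000            # assumed object count
object_kb = 100                  # assumed object size
total_gb = objects * object_kb / 1_000_000               # 40,000 GB (40 TB)

put_cost     = objects / 1_000 * 0.005                    # one-off PUT cost
storage_cost = total_gb * 0.023                           # per month
get_cost     = 1_000_000 / 1_000 * 0.0004                 # per month, 1M GETs
egress_cost  = 1_000_000 * object_kb / 1_000_000 * 0.09   # per month, 1M downloads

print(put_cost, storage_cost, get_cost, egress_cost)
# roughly: 2000, 920, 0.4, 9 (dollars)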

Analysis

You seem to be most fearful of the initial cost of uploading 100s of millions of objects at a cost of $5 per million objects.

However, storage costs will also be high, at $2.30/month per million objects ($920/month for 400m objects). That ongoing cost is likely to dwarf the cost of the initial uploads.

Some alternatives would be:

  • Store the data on-premises (disk storage is about $100 per 4TB, so 400m files would require roughly $1,000 of disks, though you would want extra drives for redundancy), or
  • Store the data in a database: there are no 'PUT' costs for databases, but you would need to pay for running the database. This might work out to a lower cost, or
  • Combine the data in the files (which you say you do not wish to do), but in a way that can be easily split apart, for example by marking records with an identifier for easy extraction, or
  • Use a different storage service, such as Digital Ocean, which does not appear to have a 'PUT' cost.

Downloading very large number of files from S3

Dealing with a bucket containing millions of files can be very challenging unless there is some sort of 'structure' to your file names. Unfortunately that structure won't help any of the GUI tools, so you're stuck implementing your own solution. For example:

  1. If all your files start with a date, you can use the marker parameter in a Get Bucket (List Objects) request to return only the keys from a certain date onwards.

  2. If your files are arranged in 'virtual' folders, you can use the prefix and delimiter parameters to process each folder separately. (Consider doing this in parallel to speed things up.)

Even if you have no structure, all is not lost. The S3 clients hang because they try to hold the entire 2-million-file listing in memory. Instead, you can fetch the object listing 1,000 keys at a time and save it to a file or database. It'll take a while to work through all 2 million, but once you're done you just loop through your saved list and download as necessary.
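As a sketch of that approach with boto3 (the bucket name and output file are placeholders), you can page through the listing 1,000 keys at a time, append each page to a file, and drive your downloads from that file later:

import boto3

s3 = boto3.client("s3")

# Page through the bucket 1,000 keys at a time and persist the listing,
# instead of holding millions of keys in memory like the GUI clients do.
paginator = s3.get_paginator("list_objects_v2")
with open("keys.txt", "w") as out:
    # Prefix / StartAfter can be added here to exploit any structure in the names
    for page in paginator.paginate(Bucket="my-bucket"):
        for obj in page.get("Contents", []):
            out.write(obj["Key"] + "\n")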

Better yet, if you are able to 'index' your files in a database as they are added to S3, you can just use that to determine which files to download.

How to download files from an S3 bucket based on the files' modified date?

I have a better solution: a function which does this automatically. Just pass in the bucket name and the download path.

from boto3.session import Session
from datetime import date, timedelta
import re

def Download_pdf_specifc_date_subfolder(bucket_name, download_path):
    ACCESS_KEY = 'XYZ'
    SECRET_KEY = 'ABC'

    # Create a session and open the bucket
    session = Session(aws_access_key_id=ACCESS_KEY,
                      aws_secret_access_key=SECRET_KEY)
    s3 = session.resource('s3')
    bucket = s3.Bucket(bucket_name)

    # Get yesterday's date as YYYY-MM-DD
    yesterday = date.today() - timedelta(days=1)
    x = yesterday.strftime('%Y-%m-%d')
    print(x)

    # Collect the keys that were last modified yesterday
    files_to_download = []
    for fileObject in bucket.objects.all():
        file_name = str(fileObject.key)
        last_modified = str(fileObject.last_modified).split()
        if last_modified[0] == x:
            # Replace "Airports" in the regex with your own subfolder name
            # to filter only that particular prefix
            if re.findall(r"Airports/[a-zA-Z]+", file_name):
                files_to_download.append(file_name)

    # Download the collected keys into the given folder
    for file_name in files_to_download:
        d_path = download_path + file_name
        print(d_path)
        bucket.download_file(file_name, d_path)

Download_pdf_specifc_date_subfolder(bucket_name, download_path)

Ultimately, the function will place the files to be downloaded in the specified folder.


