Move Files Between Two AWS S3 Buckets Using Boto3

Move files between folders on Amazon S3 using boto3 ("The specified bucket does not exist" error)

The CopySource parameter is defined as:

The string form is {bucket}/{key} or {bucket}/{key}?versionId={versionId} if you want to copy a specific version. You can also provide this value as a dictionary. The dictionary format is recommended over the string format because it is more explicit. The dictionary format is: {'Bucket': 'bucket', 'Key': 'key', 'VersionId': 'id'}. Note that the VersionId key is optional and may be omitted.

Therefore this line:

CopySource='my-folder/my.txt')

should include the bucket name at the start:

CopySource='dev-files/my-folder/my.txt')
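
For completeness, the same copy can be written with the dictionary form of CopySource, which the documentation quoted above recommends. A minimal sketch, assuming the source bucket is dev-files and a hypothetical destination bucket named dev-files-archive:

import boto3

s3 = boto3.client('s3')

copy_source = {'Bucket': 'dev-files', 'Key': 'my-folder/my.txt'}
# managed copy: boto3 handles multipart copies for large objects automatically
s3.copy(copy_source, 'dev-files-archive', 'my-folder/my.txt')  # destination bucket name is hypothetical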

How to copy files and folders from one S3 bucket to another using Python boto3

S3 does not have any real concept of folders or directories; it uses a flat key structure.
For example, the UI may show two files named file1.txt and file2.txt inside test_folder, but they are actually stored as two objects whose keys are
"test_folder/file1.txt" and "test_folder/file2.txt".
Every object is stored under its full key like this.
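
To see the flat structure in practice, listing the bucket with a Prefix simply filters on the beginning of the key string; there is no separate folder object. A minimal sketch, assuming a bucket named mybucket that contains the two keys above:

import boto3

s3_client = boto3.client('s3')
# 'test_folder/' is just a key prefix, not a real directory
resp = s3_client.list_objects_v2(Bucket='mybucket', Prefix='test_folder/')
for obj in resp.get('Contents', []):
    print(obj['Key'])  # prints test_folder/file1.txt and test_folder/file2.txt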

You can use the code snippet below to copy each key to another bucket.

import boto3

s3_client = boto3.client('s3')
resp = s3_client.list_objects_v2(Bucket='mybucket')  # returns at most 1000 keys per call
keys = []
for obj in resp.get('Contents', []):
    keys.append(obj['Key'])

s3_resource = boto3.resource('s3')
for key in keys:
    copy_source = {
        'Bucket': 'mybucket',
        'Key': key
    }
    bucket = s3_resource.Bucket('otherbucket')
    bucket.copy(copy_source, key)  # keep the same key in the destination bucket

If your source bucket contains many keys and this is a one-time activity, then I suggest you check out this link.
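
Because list_objects_v2 returns at most 1,000 keys per call, a one-time bulk copy over a large bucket is easier with a paginator. A minimal sketch, reusing the mybucket and otherbucket names from the snippet above:

import boto3

s3_client = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = s3_resource.Bucket('otherbucket')

# walk every key in the source bucket, 1,000 keys per page
paginator = s3_client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='mybucket'):
    for obj in page.get('Contents', []):
        copy_source = {'Bucket': 'mybucket', 'Key': obj['Key']}
        dest_bucket.copy(copy_source, obj['Key'])  # keep the same key in the destination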

If this needs to happen for every insert event on your bucket, so that each new object is copied to another bucket, you can check out this approach.
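
A common way to wire that up is an S3 event notification that invokes a Lambda function on each new object. A minimal sketch of such a handler, assuming a hypothetical destination bucket named otherbucket and an execution role allowed to read the source bucket and write the destination:

import urllib.parse
import boto3

s3 = boto3.resource('s3')
DEST_BUCKET = 'otherbucket'  # hypothetical destination bucket

def lambda_handler(event, context):
    # each record describes one object that was just created in the source bucket
    for record in event['Records']:
        src_bucket = record['s3']['bucket']['name']
        # object keys arrive URL-encoded in the event payload
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        s3.Bucket(DEST_BUCKET).copy({'Bucket': src_bucket, 'Key': key}, key)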

How do I move/copy files in s3 using boto3 asynchronously?

I use the following. You can copy it into a Python file and run it from the command line. I have a PC with 8 cores, so it's faster than my little EC2 instance with 1 vCPU.

It uses the multiprocessing library, so you'd want to read up on that if you aren't familiar. It's relatively straightforward. There's a batch delete that I've commented out because you really don't want to accidentally delete the wrong directory. You can use whatever methods you want to list the keys or iterate through the objects, but this works for me.

from multiprocessing import Pool
from itertools import repeat
import boto3
import os
import math

s3sc = boto3.client('s3')
s3sr = boto3.resource('s3')
num_proc = os.cpu_count()

def get_list_of_keys_from_prefix(bucket, prefix):
    """gets list of keys for given bucket and prefix"""
    keys_list = []
    paginator = s3sr.meta.client.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter='/'):
        keys = [content['Key'] for content in page.get('Contents', [])]
        keys_list.extend(keys)
    # drop the "folder marker" key if one exists for the prefix itself
    if prefix in keys_list:
        keys_list.remove(prefix)
    return keys_list

def batch_delete_s3(keys_list, bucket):
    total_keys = len(keys_list)
    chunk_size = 1000  # delete_objects accepts at most 1000 keys per request
    num_batches = math.ceil(total_keys / chunk_size)
    for b in range(0, num_batches):
        batch_to_delete = []
        for k in keys_list[chunk_size*b:chunk_size*b+chunk_size]:
            batch_to_delete.append({'Key': k})
        s3sc.delete_objects(Bucket=bucket, Delete={'Objects': batch_to_delete, 'Quiet': True})

def copy_s3_to_s3(from_bucket, from_key, to_bucket, to_key):
    copy_source = {'Bucket': from_bucket, 'Key': from_key}
    s3sr.meta.client.copy(copy_source, to_bucket, to_key)

def upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=4):
    with Pool(num_proc) as pool:
        r = pool.starmap(copy_s3_to_s3, zip(repeat(from_bucket), keys_list_from, repeat(to_bucket), keys_list_to), 15)
        pool.close()
        pool.join()
    return r

if __name__ == '__main__':
    __spec__ = None

    from_bucket = 'from-bucket'
    from_prefix = 'from/prefix/'
    to_bucket = 'to-bucket'
    to_prefix = 'to/prefix/'

    keys_list_from = get_list_of_keys_from_prefix(from_bucket, from_prefix)
    keys_list_to = [to_prefix + k.rsplit('/')[-1] for k in keys_list_from]

    rs = upload_multiprocess(from_bucket, keys_list_from, to_bucket, keys_list_to, num_proc=num_proc)

    # batch_delete_s3(keys_list_from, from_bucket)

