Retrieving Subfolder Names in an S3 Bucket with Boto3

Retrieving subfolder names in an S3 bucket with boto3

S3 is object storage; it doesn't have a real directory structure. The "/" separator is purely cosmetic.
One reason people want a directory structure is so they can maintain/prune/add a tree within their application. For S3, you treat such a structure as a sort of index or search tag.
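As a minimal sketch of that flat model (assuming a bucket named 'my-bucket' that you can write to): uploading a key that contains slashes creates no folder objects at all; the slashes are simply part of the key string.

import boto3

s3 = boto3.client("s3")

# The key below looks like a nested path, but S3 stores it as one flat
# string; no 'reports/' or 'reports/2023/' folder objects are created.
s3.put_object(Bucket='my-bucket', Key='reports/2023/summary.txt', Body=b'hello')

# Listing the bucket shows exactly one object with that full key.
for obj in s3.list_objects_v2(Bucket='my-bucket')['Contents']:
    print(obj['Key'])  # -> reports/2023/summary.txt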

To manipulate objects in S3, you need boto3.client or boto3.resource, e.g.
To list all objects:

import boto3
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket='bucket-name')

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects
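Note that a single list_objects call returns at most 1,000 keys. If the bucket may hold more, a paginator follows the continuation markers for you; a minimal sketch (the bucket name is a placeholder):

import boto3

s3 = boto3.client("s3")

# Each page holds up to 1,000 keys; the paginator requests the
# following pages automatically.
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket='bucket-name'):
    for obj in page.get('Contents', []):
        print(obj['Key'])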

In fact, if the S3 object names are stored using the '/' separator, the more recent version of list_objects (list_objects_v2) allows you to limit the response to keys that begin with a specified prefix.

To limit the results to items under a certain sub-folder:

import boto3
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100)

Documentation: http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects_v2
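To read the matching keys back out of that response, iterate its 'Contents' list; note that 'Contents' is absent when no key matches the prefix. A sketch, reusing the BUCKET placeholder from above:

import boto3

s3 = boto3.client("s3")
response = s3.list_objects_v2(Bucket=BUCKET, Prefix='DIR1/DIR2', MaxKeys=100)

# 'Contents' is only present when at least one key matched the prefix.
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])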

Another option is to use Python's os.path functions to extract the folder prefix. The problem is that this requires listing objects from undesired directories as well.

import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)   # 'part-00014'
foldername = os.path.dirname(s3_key)  # 'first-level/1456753904534'

# if you are using an unconventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]      # 'part-00014'

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client vs boto3.resource. If you develop an internal shared library, using boto3.resource will give you a black-box layer over the resources used.
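For comparison, here is the same prefix listing done both ways (a sketch; the bucket and prefix names are placeholders):

import boto3

# Low-level client: you work with raw response dictionaries.
client = boto3.client('s3')
response = client.list_objects_v2(Bucket='my-bucket', Prefix='first-level/')
for obj in response.get('Contents', []):
    print(obj['Key'])

# High-level resource: the same listing as iterable ObjectSummary objects.
resource = boto3.resource('s3')
for obj in resource.Bucket('my-bucket').objects.filter(Prefix='first-level/'):
    print(obj.key)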

How to get only the subfolder names from an S3 bucket

You can obtain the folder names by calling ListObjects() with a Delimiter value. The folders are then returned in the response's CommonPrefixes list.

Here's some sample code:

import boto3

s3_client = boto3.client('s3')
objects = s3_client.list_objects_v2(Bucket='my-bucket', Delimiter='/', Prefix='photos/')

for prefix in objects['CommonPrefixes']:
    print(prefix['Prefix'])
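One caveat: a single list_objects_v2 call returns at most 1,000 entries, so CommonPrefixes can be truncated on large buckets. A paginator with the same Delimiter and Prefix collects them all; a sketch reusing the names above:

import boto3

s3_client = boto3.client('s3')

# Paginate so we see every common prefix, not just the first page.
paginator = s3_client.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket='my-bucket', Delimiter='/', Prefix='photos/')

for page in pages:
    for prefix in page.get('CommonPrefixes', []):
        print(prefix['Prefix'])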

How to retrieve file names from an S3 bucket and all of its subfolders

The way I solved this: first, put all of the objects in a list, like so:

# s3 here is a boto3.resource('s3'), not a client
all_objects = list(s3.Bucket('Bucket_Name').objects.all())

Then I looped through all_objects and appended the fields I wanted to separate lists, built a dictionary of those lists (e.g. data = {'x': list1, 'y': list2, ...}; avoid naming it dict, since that shadows the built-in), and finally created a DataFrame with df = pd.DataFrame(data). A runnable sketch of the whole flow is below.
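A runnable sketch of that whole flow (assuming a bucket named 'Bucket_Name' and pandas installed; the key/size columns are just example fields):

import boto3
import pandas as pd

s3 = boto3.resource('s3')

# 1. Pull every object summary in the bucket into a list.
all_objects = list(s3.Bucket('Bucket_Name').objects.all())

# 2. Append the fields of interest to separate lists.
keys, sizes = [], []
for x in all_objects:
    keys.append(x.key)
    sizes.append(x.size)

# 3. Build a dictionary of lists, then a DataFrame from it.
data = {'key': keys, 'size': sizes}
df = pd.DataFrame(data)
print(df.head())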

List all of the folder names from an S3 bucket

If you use the boto3 S3 client, you can get a list of folders:

import boto3
s3 = boto3.client('s3')

result = s3.list_objects(Bucket=BUCKET_NAME, Delimiter='/')
for prefix in result.get('CommonPrefixes', []):
    print(prefix.get('Prefix', ''))

How to get ALL subdirectories, all levels deep, except files, in AWS S3 with Python boto3

import os
import boto3

s3_client = boto3.client('s3')

# list_objects returns a dictionary. The 'Contents' key contains a
# list of full paths, including the file name stored in the bucket,
# for example: data/subdir1/subdir3/subdir4/subdir5/file2.csv
objects = s3_client.list_objects(Bucket='bucket_name')['Contents']

# Iterate over the full paths; os.path.dirname gives the full path
# excluding the file name.
for obj in objects:
    print(os.path.dirname(obj['Key']))

To make this a unique, sorted list of directory "paths", we can sort a set comprehension inline. Set elements are unique, and sorted() converts the set to a list.

See https://docs.python.org/3/tutorial/datastructures.html#sets

import os
paths = sorted({os.path.dirname(obj['Key']) for obj in objects})
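Note that os.path.dirname only yields the immediate parent directory of each file, so a level that contains no files directly appears only as part of deeper paths. If you truly want every level listed, one option is to expand each directory path into all of its ancestors; a sketch (the all_levels helper is an addition for illustration, not part of the original answer):

import os

def all_levels(dirpath):
    # Yield 'data', 'data/subdir1', ... for 'data/subdir1/...'.
    parts = dirpath.split('/')
    for i in range(1, len(parts) + 1):
        yield '/'.join(parts[:i])

dirs = {os.path.dirname(obj['Key']) for obj in objects}
paths = sorted({level for d in dirs if d for level in all_levels(d)})
print(paths)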


