Retrieving subfolders names in S3 bucket from boto3
S3 is an object store; it doesn't have a real directory structure. The "/" in a key is largely cosmetic.
One reason people want a directory structure is that they can maintain, prune, or add to a tree within the application. In S3, you treat such a structure as a sort of index or search tag.
To manipulate objects in S3, you need boto3.client or boto3.resource.
For example, to list all objects:
import boto3
s3 = boto3.client("s3")
all_objects = s3.list_objects(Bucket = 'bucket-name')
http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects
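A single list_objects call returns at most 1,000 keys, so listing "all" objects in a larger bucket needs pagination. The following is a minimal sketch using the client's paginator. The helper name iter_keys is hypothetical, and the client is passed in as a parameter so the function can also be tried against a stub.

```python
def iter_keys(client, bucket, prefix=""):
    """Yield every object key in `bucket` that starts with `prefix`.

    `client` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # 'Contents' is absent from a page that has no matching objects.
        for obj in page.get("Contents", []):
            yield obj["Key"]
```

With a real client this might be called as: for key in iter_keys(boto3.client("s3"), "bucket-name"): print(key)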
In fact, if the S3 object names are stored using the '/' separator, list_objects_v2 (the more recent version of list_objects) allows you to limit the response to keys that begin with a specified prefix.
To limit the results to items under certain sub-folders:
import boto3
s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix='DIR1/DIR2',
    MaxKeys=100,
)
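Note that Prefix is a plain string match against the start of the key, not a path match, so 'DIR1/DIR2' would also match a key such as 'DIR1/DIR2-backup/c.txt'. Here is a small sketch of the difference, using a local list of made-up keys in place of a real bucket. The helper as_folder_prefix is hypothetical.

```python
def as_folder_prefix(prefix, delimiter="/"):
    """Return `prefix` with a trailing delimiter so it only matches keys 'inside' the folder."""
    if prefix and not prefix.endswith(delimiter):
        return prefix + delimiter
    return prefix


# Made-up keys standing in for the contents of a bucket.
keys = [
    "DIR1/DIR2/a.txt",
    "DIR1/DIR2/sub/b.txt",
    "DIR1/DIR2-backup/c.txt",
]

# S3 compares the prefix against the start of each key, like str.startswith.
loose = [k for k in keys if k.startswith("DIR1/DIR2")]
strict = [k for k in keys if k.startswith(as_folder_prefix("DIR1/DIR2"))]
```

Here loose contains all three keys, while strict leaves out the one under DIR1/DIR2-backup/.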
Another option is to use Python's os.path functions to extract the folder prefix from each key. The problem is that this requires listing objects from directories you don't want as well.
import os
s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)
foldername = os.path.dirname(s3_key)
# if you are using an unconventional delimiter such as '#'
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]
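As a possible alternative that handles any delimiter the same way, str.rpartition splits a key at its last delimiter and returns both the folder part and the file name. This is only a sketch, and the helper name split_key is hypothetical.

```python
def split_key(s3_key, delimiter="/"):
    """Split a key into (folder, filename) at the last `delimiter`.

    If the delimiter does not occur, the folder part is an empty string.
    """
    folder, _, filename = s3_key.rpartition(delimiter)
    return folder, filename


folder, filename = split_key("first-level#1456753904534#part-00014", delimiter="#")
```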
A reminder about boto3: boto3.resource is a nice high-level API, and there are pros and cons to using boto3.client versus boto3.resource. If you are developing an internal shared library, boto3.resource gives you a black-box layer over the resources used.
How to get only the subfolder names from an S3 bucket
You can obtain the folder names by using ListObjects() while passing a Delimiter value. It then returns the list of folders as CommonPrefixes.
Here's some sample code:
import boto3
s3_client = boto3.client('s3')
objects = s3_client.list_objects_v2(Bucket='my-bucket', Delimiter='/', Prefix ='photos/')
# 'CommonPrefixes' is missing from the response when there are no subfolders
for prefix in objects.get('CommonPrefixes', []):
    print(prefix['Prefix'])
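The values in CommonPrefixes are full prefixes such as 'photos/2023/'. If only the bare folder names are wanted, and the listing may span more than one page, something like the following sketch could be used. The helper name list_subfolders is hypothetical, and the client is assumed to be a boto3 S3 client.

```python
def list_subfolders(client, bucket, prefix="", delimiter="/"):
    """Return the names of the immediate 'subfolders' under `prefix`."""
    paginator = client.get_paginator("list_objects_v2")
    names = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter=delimiter):
        # 'CommonPrefixes' is absent from pages that contain no subfolders.
        for common in page.get("CommonPrefixes", []):
            name = common["Prefix"][len(prefix):]  # 'photos/2023/' -> '2023/'
            if name.endswith(delimiter):
                name = name[:-len(delimiter)]      # '2023/' -> '2023'
            names.append(name)
    return names
```

A call might look like list_subfolders(boto3.client('s3'), 'my-bucket', 'photos/').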
How to retrieve file names from S3 bucket and all of the subfolders
I solved this by first putting all of the objects in a list (here s3 is boto3.resource('s3')):
all_objects = list(s3.Bucket('Bucket_Name').objects.all())
Then I looped through all of the objects:
for x in all_objects:
    # append the attributes of each object to separate lists
    ...
Then I created a dictionary of lists, like dict = {'x': list1, 'y': list2, ....}, and created a DataFrame from it (df = pd.DataFrame(dict)).
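The steps described above could look roughly like the following sketch. Which attributes to collect is an assumption, since the original answer does not say; key, size, and last_modified are attributes of the object summaries returned by objects.all(). The helper name objects_to_columns is hypothetical.

```python
def objects_to_columns(objects):
    """Collect attributes of S3 ObjectSummary-like items into a dict of lists."""
    # Assumed columns; the original answer does not name them.
    columns = {"key": [], "size": [], "last_modified": []}
    for obj in objects:
        columns["key"].append(obj.key)
        columns["size"].append(obj.size)
        columns["last_modified"].append(obj.last_modified)
    return columns


# Possible usage with real services (not executed here):
#   import boto3
#   import pandas as pd
#   s3 = boto3.resource("s3")
#   df = pd.DataFrame(objects_to_columns(s3.Bucket("Bucket_Name").objects.all()))
```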
List all of the folder names from an S3 bucket
If you use the boto3 S3 client, you can get a list of the top-level folders:
import boto3
s3 = boto3.client('s3')
result = s3.list_objects(Bucket=BUCKET_NAME, Delimiter='/')
for prefix in result.get('CommonPrefixes', []):
    print(prefix.get('Prefix', ''))
How to get ALL subdirectories, all levels deep except files in AWS S3 with python boto3
import os
import boto3

s3_client = boto3.client('s3')
# list_objects returns a dictionary. The 'Contents' key contains a
# list of objects whose 'Key' is the full path, including the file name,
# for example: data/subdir1/subdir3/subdir4/subdir5/file2.csv
objects = s3_client.list_objects(Bucket='bucket_name')['Contents']
# iterate over the full paths; os.path.dirname returns
# each path without the file name
for obj in objects:
    print(os.path.dirname(obj['Key']))
To make this a unique, sorted list of directory "paths", we can apply sorted to a set comprehension inline. A set holds only unique elements, and sorted returns a list.
See https://docs.python.org/3/tutorial/datastructures.html#sets
import os
paths = sorted({os.path.dirname(obj['Key']) for obj in objects})
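One caveat: os.path.dirname returns only the direct parent of each object, so a folder that holds nothing but other folders would not appear in the result. If every level is needed, the intermediate prefixes can be derived from the keys themselves. This is a minimal sketch working on a plain list of key strings, and the helper name all_directories is hypothetical.

```python
def all_directories(keys, delimiter="/"):
    """Return every directory-like prefix implied by `keys`, at all levels, sorted."""
    directories = set()
    for key in keys:
        parts = key.split(delimiter)[:-1]  # drop the file name
        # Add each ancestor: 'data', 'data/subdir1', 'data/subdir1/subdir3', ...
        for depth in range(1, len(parts) + 1):
            directories.add(delimiter.join(parts[:depth]))
    return sorted(directories)
```

With the objects list from above, it could be called as all_directories(obj['Key'] for obj in objects).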