Listing Directories at a Given Level in Amazon S3

Amazon S3 listing directories

When you list keys with a delimiter, S3 groups the keys that share a prefix up to that delimiter and reports each group as a "Common Prefix":

http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/s3/model/ObjectListing.html#getCommonPrefixes%28%29

public List<String> getCommonPrefixes()

Gets the common prefixes included in this object listing. Common
prefixes are only present if a delimiter was specified in the original
request.

Each common prefix represents a set of keys in the S3 bucket that have
been condensed and omitted from the object summary results. This
allows applications to organize and browse their keys hierarchically,
similar to how a file system organizes files into directories.

For example, consider a bucket that contains the following keys:

"foo/bar/baz"

"foo/bar/bash"

"foo/bar/bang"

"foo/boo"

If calling listObjects with the prefix="foo/" and the delimiter="/" on
this bucket, the returned S3ObjectListing will contain one entry in
the common prefixes list ("foo/bar/") and none of the keys beginning
with that common prefix will be included in the object summaries list.

Returns: The list of common prefixes included in this object listing,
which might be an empty list if no common prefixes were found.
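The grouping described in the Javadoc can be reproduced locally, without calling S3 at all. The following is only a sketch of the rule, not the SDK's implementation; the function name list_with_delimiter is made up for this illustration.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Mimic, locally, how a delimited S3 listing splits keys into
    object keys and common prefixes. Illustration only; no AWS call."""
    objects = []
    common_prefixes = []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter) if delimiter else -1
        if cut == -1:
            # No delimiter after the prefix: listed as an ordinary object.
            objects.append(key)
        else:
            # Everything up to and including the first delimiter is rolled
            # up into a single common prefix.
            group = prefix + rest[:cut + len(delimiter)]
            if group not in common_prefixes:
                common_prefixes.append(group)
    return objects, common_prefixes


keys = ["foo/bar/baz", "foo/bar/bash", "foo/bar/bang", "foo/boo"]
objects, common_prefixes = list_with_delimiter(keys, prefix="foo/")
print(objects)          # ['foo/boo']
print(common_prefixes)  # ['foo/bar/']
```

With the four keys from the example, only "foo/boo" is returned as an object and the three keys under "foo/bar/" collapse into one common prefix.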

Retrieving subfolders names in S3 bucket from boto3

S3 is an object store; it has no real directory structure. The "/" is largely cosmetic.
One reason people want a directory structure is that they can maintain, prune, or add a tree within the application. In S3, you treat such a structure as a sort of index or search tag.

To manipulate objects in S3, you need boto3.client or boto3.resource. For example, to list the objects in a bucket:

import boto3

s3 = boto3.client("s3")
# Returns at most 1,000 keys per call; paginate to retrieve the rest.
all_objects = s3.list_objects(Bucket="bucket-name")

http://boto3.readthedocs.org/en/latest/reference/services/s3.html#S3.Client.list_objects

If the object keys use '/' as a separator, you can narrow the listing: list_objects_v2 (the more recent version of list_objects) lets you limit the response to keys that begin with a specified prefix.

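To limit the results to items under a certain sub-folder:

import boto3

BUCKET = "bucket-name"  # placeholder bucket name

s3 = boto3.client("s3")
response = s3.list_objects_v2(
    Bucket=BUCKET,
    # Also matches keys such as 'DIR1/DIR2x'; add a trailing '/' to match only that folder.
    Prefix="DIR1/DIR2",
    MaxKeys=100,
)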

See the list_objects_v2 documentation.
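To get only the sub-folder names rather than every key, the same call can be combined with a delimiter and the CommonPrefixes field of the response. The sketch below assumes a boto3 S3 client is passed in; the helper name list_subfolders is hypothetical, and the paginator is used so that listings longer than one page are covered.

```python
def list_subfolders(client, bucket, prefix=""):
    """Return the immediate 'sub-folder' prefixes under `prefix`.

    `client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    subfolders = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # CommonPrefixes is absent from a page that has no sub-folders.
        for entry in page.get("CommonPrefixes", []):
            subfolders.append(entry["Prefix"])
    return subfolders


# Usage (needs boto3 and AWS credentials; 'bucket-name' is a placeholder):
#   import boto3
#   print(list_subfolders(boto3.client("s3"), "bucket-name", "DIR1/DIR2/"))
```

Each returned prefix ends with the delimiter, so it can be passed back in as `prefix` to descend one more level.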

Another option is to use Python's os.path functions to extract the folder prefix from each key. The drawback is that this requires listing objects from directories you are not interested in.

import os

s3_key = 'first-level/1456753904534/part-00014'
filename = os.path.basename(s3_key)    # 'part-00014'
foldername = os.path.dirname(s3_key)   # 'first-level/1456753904534'

# If the keys use an unconventional delimiter such as '#', split on it instead
s3_key = 'first-level#1456753904534#part-00014'
filename = s3_key.split("#")[-1]       # 'part-00014'
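Applied to a whole listing, the same idea yields the set of distinct folders. This is a rough sketch that works on keys already fetched from S3; parent_folders is a made-up helper name, and posixpath is used instead of os.path so the '/' handling does not depend on the operating system.

```python
import posixpath


def parent_folders(keys):
    """Return the sorted, distinct folder prefixes that the keys live in."""
    folders = set()
    for key in keys:
        parent = posixpath.dirname(key)
        if parent:  # keys at the bucket root have no parent folder
            folders.add(parent + "/")
    return sorted(folders)


keys = [
    "first-level/1456753904534/part-00014",
    "first-level/1456753904534/part-00015",
    "first-level/1456753904999/part-00001",
    "top-level-file",
]
print(parent_folders(keys))
# ['first-level/1456753904534/', 'first-level/1456753904999/']
```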

A reminder about boto3: boto3.resource is a nice high-level API. There are pros and cons to using boto3.client versus boto3.resource. If you develop an internal shared library, boto3.resource gives you a black-box layer over the resources used.

S3: get all files at a specific directory level

Set the delimiter argument to / in your request. See GET Bucket (List Objects) documentation.
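As an illustration in Python, a delimited listing returns in Contents only the keys that sit directly at the requested level; deeper keys are rolled up into common prefixes. The sketch assumes a boto3 S3 client is passed in, and list_files_at_level is a hypothetical helper name.

```python
def list_files_at_level(client, bucket, prefix=""):
    """Return the keys located directly under `prefix`, not in deeper levels.

    `client` is expected to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    paginator = client.get_paginator("list_objects_v2")
    keys = []
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix, Delimiter="/"):
        # With a delimiter set, Contents holds only keys at this level.
        for obj in page.get("Contents", []):
            keys.append(obj["Key"])
    return keys
```

If a zero-byte placeholder object exists for the prefix itself (for example "temp/"), it shows up in this list as well.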

Listing files in a specific folder of a AWS S3 bucket

Everything in S3 is an object. To you, it may be files and folders. But to S3, they're just objects.

Objects that end with the delimiter (/ in most cases) are usually perceived as folders, but that is not always the case. It depends on the application. In your case, you are interpreting it as a folder; S3 is not. To S3, it is just another object.

In your case above, the object users/<user-id>/contacts/<contact-id>/ exists in S3 as a distinct object, but the object users/<user-id>/ does not. That's the difference in your responses. Why they're like that, we cannot tell you, but someone made the object in one case, and didn't in the other. You don't see it in the AWS Management Console because the console is interpreting it as a folder and hiding it from you.

Since S3 just sees these things as objects, it won't "exclude" certain things for you. It's up to the client to deal with the objects as they should be dealt with.

Your Solution

Since you're the one who doesn't want the folder objects, you can exclude them yourself by checking whether the last character of each key is a /. If it is, skip that object in the response.
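In Python, that client-side check could look roughly like this. It assumes the listing entries are dictionaries with a "Key" field, as in the Contents list of a boto3 response; the keys and the helper name without_folder_markers are invented for the example.

```python
def without_folder_markers(contents, delimiter="/"):
    """Drop listing entries whose key ends with the delimiter."""
    return [obj for obj in contents if not obj["Key"].endswith(delimiter)]


# Hypothetical listing entries, shaped like boto3's "Contents" items.
contents = [
    {"Key": "users/42/contacts/7/", "Size": 0},
    {"Key": "users/42/contacts/7/card.json", "Size": 120},
]
print([obj["Key"] for obj in without_folder_markers(contents)])
# ['users/42/contacts/7/card.json']
```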

How to get a list of all folders that list in a specific s3 location using spark in databricks?

Just use dbutils.fs.ls(ls_path).

Amazon S3: How to get a list of folders in the bucket?

For the sake of example, assume I have a bucket in the USEast1 region called MyBucketName, with the following keys:

temp/
temp/foobar.txt
temp/txt/
temp/txt/test1.txt
temp/txt/test2.txt
temp2/

Working with folders can be confusing because S3 does not natively support a hierarchical structure; rather, these are simply keys like any other S3 object. Folders are an abstraction available in the S3 web console to make it easier to navigate a bucket. So when we're working programmatically, we want to find keys matching the characteristics of a 'folder' (ends with the delimiter '/', size = 0), because they will likely be 'folders' as presented to us by the S3 console.

Note for both examples: I'm using the AWSSDK.S3 version 3.1 NuGet package.

Example 1: All folders in a bucket

This code is modified from this basic example in the S3 documentation to list all keys in a bucket. The example below will identify all keys that end with the delimiter character /, and are also empty.

// Requires: using System.Collections.Generic; using System.Linq;
//           using Amazon.S3; using Amazon.S3.Model;
IAmazonS3 client;
using (client = new AmazonS3Client(Amazon.RegionEndpoint.USEast1))
{
    // Build your request to list objects in the bucket
    ListObjectsRequest request = new ListObjectsRequest
    {
        BucketName = "MyBucketName"
    };

    do
    {
        // Build your call out to S3 and store the response
        ListObjectsResponse response = client.ListObjects(request);

        // Filter through the response to find keys that:
        // - end with the delimiter character '/'
        // - are empty.
        IEnumerable<S3Object> folders = response.S3Objects.Where(x =>
            x.Key.EndsWith(@"/") && x.Size == 0);

        // Do something with your output keys. For this example, we write to the console.
        folders.ToList().ForEach(x => System.Console.WriteLine(x.Key));

        // If the response is truncated, we'll make another request
        // and pull the next batch of keys
        if (response.IsTruncated)
        {
            request.Marker = response.NextMarker;
        }
        else
        {
            request = null;
        }
    } while (request != null);
}

Expected output to console:

temp/
temp/txt/
temp2/

Example 2: Folders matching a specified prefix

You could further limit this to only retrieve folders matching a specified Prefix by setting the Prefix property on ListObjectsRequest.

ListObjectsRequest request = new ListObjectsRequest
{
    BucketName = "MyBucketName",
    Prefix = "temp/"
};

When applied to Example 1, we would expect the following output:

temp/
temp/txt/

Further reading:

  • S3 Documentation - Working With Folders
  • .NET SDK Documentation - ListObjects

How do I get the top-level directories of a bucket in S3?

I see there's a CommonPrefixes property on ListObjectsResponse.

using (var client = new AmazonS3Client())
{
    var listObjectsResponse = client.ListObjects(new ListObjectsRequest
    {
        BucketName = bucket,
        Prefix = "2",
        Delimiter = "/",
    });

    // Prints the first common prefix, e.g. "2017/" (the trailing delimiter is included)
    Console.WriteLine(listObjectsResponse.CommonPrefixes[0]);
}

