How Many Objects Are Returned by AWS s3api list-objects

How can I tell how many objects I've stored in an S3 bucket?

There is no way, unless you

  1. list them all in batches of 1000 (which can be slow and bandwidth-hungry; Amazon never seems to compress the XML responses), or

  2. log into your AWS account and go to Account > Usage. It seems the billing dept knows exactly how many objects you have stored!

Simply downloading the list of all your objects will actually take some time and cost some money if you have 50 million objects stored.

Also see this thread about StorageObjectCount - which is in the usage data.

An S3 API to get at least the basics, even if it was hours old, would be great.
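Option 1 above can be sketched in Python as follows. This is only a minimal sketch, not an official recipe: it assumes `client` is a boto3 S3 client (for example `boto3.client("s3")`), and the function name `count_objects` is made up for illustration. It pages through ListObjectsV2 and reports both the object count and how many requests that took.

```python
def count_objects(client, bucket, prefix=""):
    """Count keys by paging through ListObjectsV2 (at most 1000 keys per call).

    `client` is assumed to be a boto3 S3 client; the function name is hypothetical.
    Returns (object_count, request_count).
    """
    total = 0
    requests = 0
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_objects_v2(**kwargs)
        requests += 1
        # "Contents" is absent when a page has no keys
        total += len(page.get("Contents", []))
        if not page.get("IsTruncated"):
            return total, requests
        # Ask for the next page, starting where this one stopped
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

# Usage (needs boto3 and AWS credentials):
#   import boto3
#   total, requests = count_objects(boto3.client("s3"), "my-bucket")
```

Every page is a billed LIST request, so on a very large bucket this is slow for exactly the reasons given above.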

list-objects-v2 with --query and --max-items

The issue is that NextToken occurs once in the result set, while Contents is a list with multiple items.

This command will return both values:

aws s3api list-objects-v2 --bucket my-bucket --max-items 10 --query '[NextToken,Contents[].Key]'

The output is:

[
"eyJDb250aW51YXRpb25Ub2tlbiI6IG51bGwsICJib3RvX3RydW5jYXRlX2Ftb3VudCI6IDEwfQ==",
[
"foo1.docx",
"foo2.jpg",
"2019/06/02/foo3.txt",
"2019/06/02/foo4.js",
"2019/06/02/foo5.py",
"2019/06/02/foo6.html",
"foo7.pdf",
"CreateThumbnail.zip",
"Jbookmarks.html",
"basepart/20191222_1114/foo9.csv"
]
]

The first element is the NextToken; the second is the list of object keys.

AWS S3: Cost of listing all object versions

No, each object/version listed is not treated as a separate list request. You're only paying for the API requests to S3 (at something like $0.005 per 1000 API requests). A single API request will return many (up to 1000) objects/versions that match the indicated prefix. The prefix filtering itself happens server-side in S3.

The way to get a handle on this is to understand that AWS SDK calls ultimately result in API requests to AWS service endpoints e.g. S3 APIs. What you need to do is work out how your SDK client requests map to the underlying API requests to determine what is likely happening.

If your request is a simple 'list objects in my bucket' case, the boto3 SDK is going to make one or more ListObjectsV2 API calls. I say "or more" because the SDK may need to make more than one API request because API requests typically yield a maximum number of results (e.g. 1000 objects in a ListObjectsV2 response). If there are 2500 objects in the bucket, for example, then three ListObjectsV2 requests would need to be made to the S3 API.
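The request arithmetic can be checked with a few lines of Python. This is a rough sketch: the helper names are made up, the page size of 1000 is the ListObjectsV2 maximum mentioned above, and the price is the approximate "$0.005 per 1000 requests" figure quoted earlier, which varies by region and storage class.

```python
import math

PAGE_SIZE = 1000             # maximum keys returned per ListObjectsV2 response
PRICE_PER_1000_LIST = 0.005  # USD, approximate figure quoted above (assumption)

def list_requests_needed(object_count, page_size=PAGE_SIZE):
    """Number of LIST requests needed to enumerate object_count keys."""
    # Even an empty listing costs one request
    return max(1, math.ceil(object_count / page_size))

def estimated_list_cost(object_count):
    """Approximate USD cost of one full listing at the assumed price."""
    return list_requests_needed(object_count) / 1000 * PRICE_PER_1000_LIST

print(list_requests_needed(2500))         # 3, as in the example above
print(list_requests_needed(50_000_000))   # 50000
print(round(estimated_list_cost(50_000_000), 2))
```

At that assumed price, a full listing of 50 million objects works out to about 50,000 requests, or roughly a quarter of a dollar; the wall-clock time of 50,000 sequential requests is usually the bigger cost.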

If your request is 'list objects in my bucket with a given prefix', then you need to know what capabilities are present on the ListObjectsV2 API call. Importantly, prefix is one of the parameters. This is how you know that S3 itself is doing the filtering on your supplied prefix (where you have indicated .filter(Prefix=key) in your code). If this were not a feature of the underlying S3 API, then your SDK (boto3 etc.) would be the one doing the filtering on prefix and that would be a much more expensive and vastly slower operation, because the SDK would have to list all objects, potentially resulting in many more LIST requests, and filter them client-side. Note: the ListObjectVersions API is similar to ListObjectsV2 in this regard and both support prefix.

Also, note that VersionId, Size, and LastModified are all attributes that appear in the ListObjectVersions response, so no further API requests are needed to fetch this information.

So, in your case, assuming that there are fewer than 1000 object versions that match your indicated prefix, I believe that this equates to one S3 API request to ListObjectVersions (and this is considered a LIST request rather than a GET request for billing afaik, even though it is a GET HTTP request to https://mybucket.s3.amazonaws.com/?versions under the covers).
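For illustration, here is a hedged sketch of that versions listing using the boto3 client API (not the resource-style `.filter(Prefix=key)` call from the question). It assumes `client` is a boto3 S3 client; the function name `list_versions` is made up. The prefix is sent to S3 with every request, and each page already carries VersionId, Size, and LastModified.

```python
def list_versions(client, bucket, prefix):
    """Yield (key, version_id, size, last_modified) for each version under prefix.

    `client` is assumed to be a boto3 S3 client; the function name is hypothetical.
    Each loop iteration is one ListObjectVersions (LIST) request.
    """
    kwargs = {"Bucket": bucket, "Prefix": prefix}
    while True:
        page = client.list_object_versions(**kwargs)
        # "Versions" is absent when nothing matches the prefix
        for v in page.get("Versions", []):
            yield v["Key"], v["VersionId"], v["Size"], v["LastModified"]
        if not page.get("IsTruncated"):
            break
        # Continue after the last key/version of this page
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        if page.get("NextVersionIdMarker"):
            kwargs["VersionIdMarker"] = page["NextVersionIdMarker"]

# Usage (needs boto3 and AWS credentials):
#   import boto3
#   for row in list_versions(boto3.client("s3"), "mybucket", "some/prefix/"):
#       print(row)
```

With fewer than 1000 matching versions the loop body runs once, which matches the single-request estimate above. Delete markers come back in a separate `DeleteMarkers` list that this sketch ignores.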

AWS SDK V2 S3 fetch object is not fetching objects more than 1000

As you indicated, AWS will only return up to 1000 of the objects in a bucket:

Returns some or all (up to 1,000) of the objects in a bucket.

Amazon S3 lists objects in lexicographic (UTF-8 binary) order of their keys. You can take advantage of this and pass the last key you received as the marker, so that the next request starts just after it:

import software.amazon.awssdk.services.s3.model.ListObjectsRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsResponse;
import software.amazon.awssdk.services.s3.model.S3Exception;
import software.amazon.awssdk.services.s3.model.S3Object;

// Assumes an S3Client named s3 and a String bucketName are already in scope.
try {
    ListObjectsRequest listObjectsRequest = ListObjectsRequest.builder()
        .bucket(bucketName)
        .build();

    ListObjectsResponse listObjectsResponse = null;
    String lastKey = null;

    do {
        if (listObjectsResponse != null) {
            // Resume just after the last key of the previous page
            listObjectsRequest = listObjectsRequest.toBuilder()
                .marker(lastKey)
                .build();
        }

        listObjectsResponse = s3.listObjects(listObjectsRequest);

        // Iterate over results
        for (S3Object object : listObjectsResponse.contents()) {
            String key = object.key();
            System.out.print("\n The name of the key is " + key);
            // Update the value of the last key processed
            lastKey = key;
        }
    } while (Boolean.TRUE.equals(listObjectsResponse.isTruncated()));
} catch (S3Exception e) {
    System.err.println(e.awsErrorDetails().errorMessage());
    System.exit(1);
}

Something very similar can be achieved with v2 of the list objects API, using the startAfter method of ListObjectsV2Request.

With v2, you can use ListObjectsV2Response and continuation token as well. Something similar to:

import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Response;
import software.amazon.awssdk.services.s3.model.S3Exception;
import software.amazon.awssdk.services.s3.model.S3Object;

// Assumes an S3Client named s3 and a String bucketName are already in scope.
try {
    ListObjectsV2Request listObjectsRequest = ListObjectsV2Request.builder()
        .bucket(bucketName)
        .build();

    ListObjectsV2Response listObjectsResponse = null;
    String nextContinuationToken = null;

    do {
        if (listObjectsResponse != null) {
            // Pass the token from the previous page to fetch the next one
            listObjectsRequest = listObjectsRequest.toBuilder()
                .continuationToken(nextContinuationToken)
                .build();
        }

        listObjectsResponse = s3.listObjectsV2(listObjectsRequest);
        nextContinuationToken = listObjectsResponse.nextContinuationToken();

        // Iterate over results
        for (S3Object object : listObjectsResponse.contents()) {
            String key = object.key();
            System.out.print("\n The name of the key is " + key);
        }
    } while (Boolean.TRUE.equals(listObjectsResponse.isTruncated()));
} catch (S3Exception e) {
    System.err.println(e.awsErrorDetails().errorMessage());
    System.exit(1);
}

Finally, you can use the listObjectsV2Paginator method to iterate over the results, much as listNextBatchOfObjects was used in v1 of the SDK. See for instance this related v1 code and these related SO questions.

All the mappings between operations from v1 and v2 versions of the API are documented here.

Maximum number of CommonPrefixes and MaxKeys in S3 list objects

To answer my own question: the maximum number of keys and CommonPrefixes returned in one response is 1000, which is also the maximum value of MaxKeys.

Caution: the limit applies to both TOGETHER. A single response can contain, for example, 0 keys and at most 1000 CommonPrefixes, or 990 keys and at most 10 CommonPrefixes.
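Because keys and CommonPrefixes share one page budget, a listing that uses a delimiter still has to follow the continuation token to see everything at a given level. A minimal Python sketch, assuming `client` is a boto3 S3 client and using a made-up function name `list_level`:

```python
def list_level(client, bucket, prefix=""):
    """Return (keys, folders) found directly under prefix, using "/" as delimiter.

    `client` is assumed to be a boto3 S3 client; the function name is hypothetical.
    Keys and CommonPrefixes arrive mixed across pages, so both are
    collected on every page until the listing is no longer truncated.
    """
    keys, folders = [], []
    kwargs = {"Bucket": bucket, "Prefix": prefix, "Delimiter": "/"}
    while True:
        page = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
        folders.extend(cp["Prefix"] for cp in page.get("CommonPrefixes", []))
        if not page.get("IsTruncated"):
            return keys, folders
        kwargs["ContinuationToken"] = page["NextContinuationToken"]

# Usage (needs boto3 and AWS credentials):
#   import boto3
#   keys, folders = list_level(boto3.client("s3"), "my-bucket", "2019/06/")
```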

S3 Bucket AWS CLI takes forever to get specific files

2.5m+ objects in an Amazon S3 bucket is indeed a large number of objects!

When listing the contents of an Amazon S3 bucket, the S3 API only returns 1000 objects per API call. Therefore, when the AWS CLI (or CloudBerry, etc) is listing the objects in the S3 bucket it requires 2500+ API calls. This is most probably the reason why the request is taking so long (and possibly failing due to lack of memory to store the results).

You can possibly reduce the time by specifying a Prefix, which reduces the number of objects returned from the API calls. This would help if the objects you want to copy are all in a sub-folder.

Failing that, you could use Amazon S3 Inventory, which can provide a daily or weekly CSV file listing all objects. You could then extract from that CSV file a list of the objects you want to copy (eg use Excel or write a program to parse the file). Then, copy specifically those objects using aws s3 cp or from a programming language. For example, a Python program could parse the CSV file and then use download_file() to download each of the desired objects.

The simple fact is that a flat-structure Amazon S3 bucket with 2.5m+ objects will always be difficult to list. If possible, I would encourage you to use 'folders' to structure the bucket so that you would only need to list portions of the bucket at a time.
