How to extract files in S3 on the fly with boto3?
Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.
However, you could use an AWS Lambda function to retrieve an object from S3, unzip it, then upload the content back again. Note that Lambda's temporary disk space (/tmp) is limited to 512 MB by default (configurable up to 10 GB), so avoid unzipping too much data to disk.
You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:
- Use boto3 (assuming you like Python) to download the new file
- Use the zipfile Python library to extract files
- Use boto3 to upload the resulting file(s)
Sample code
import gzip
from io import BytesIO

import boto3

s3 = boto3.client('s3', use_ssl=False)
s3.upload_fileobj(
    Fileobj=gzip.GzipFile(
        None,
        'rb',
        fileobj=BytesIO(
            s3.get_object(Bucket=bucket, Key=gzip_key)['Body'].read())),
    Bucket=bucket,
    Key=uncompressed_key)
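The bullet list above names the zipfile library, while the sample uses gzip. Here is a minimal sketch of the zip variant, showing only the extraction step (the surrounding download/upload would use the same boto3 calls as the sample; the function name is my own, not from the original answer):

```python
import io
import zipfile

def extract_zip_bytes(zip_bytes):
    """Return {name: bytes} for every regular file in an in-memory zip."""
    out = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.endswith('/'):  # skip directory entries
                out[name] = zf.read(name)
    return out
```

Each extracted value can then be wrapped in io.BytesIO and passed to s3.upload_fileobj, so nothing touches Lambda's /tmp disk.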
How to unzip (or rather untar) in-place on Amazon S3?
With AWS Lambda you can do this! You also avoid the extra data-transfer costs, since the download and upload stay inside the AWS network.
You can follow this blog post, but unzip instead of zipping:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
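The blog's streaming idea can be sketched in Python using tarfile's stream mode ('r|*'), which reads the archive sequentially and never buffers it whole; the Lambda/S3 wiring shown in the comment is an assumption to adapt, not code from the blog:

```python
import io
import tarfile

def iter_tar_members(stream):
    """Stream a (possibly gzipped) tar archive from a file-like object,
    yielding (name, bytes) for each regular file."""
    # mode 'r|*' = sequential stream mode with auto-detected compression
    with tarfile.open(fileobj=stream, mode='r|*') as tar:
        for member in tar:
            if member.isfile():
                yield member.name, tar.extractfile(member).read()

# In a Lambda handler, the stream would be the S3 object's body
# (hypothetical bucket/key names):
#   body = s3.get_object(Bucket=bucket, Key=key)['Body']
#   for name, data in iter_tar_members(body):
#       s3.upload_fileobj(io.BytesIO(data), bucket, f'extracted/{name}')
```

Because the S3 body is itself a stream, only one archive member at a time needs to fit in memory.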
How to read the content from tar file in lambda function in AWS
You can read tar file content with the following approach:
import tarfile
from io import BytesIO

import boto3

s3 = boto3.client('s3')

tar_file_obj = s3.get_object(Bucket='abc-logs', Key=Key)
tar_content = tar_file_obj['Body'].read()

with tarfile.open(fileobj=BytesIO(tar_content)) as tar:
    for tar_resource in tar:
        if tar_resource.isfile():
            inner_file_bytes = tar.extractfile(tar_resource).read()
Handling Streaming TarArchiveEntry to S3 Bucket from a .tar.gz file
That's true: AWS closes an InputStream provided to PutObjectRequest, and I don't know of a way to instruct AWS not to do so. However, you can wrap the TarArchiveInputStream with a CloseShieldInputStream from Commons IO, like this:
InputStream shieldedInput = new CloseShieldInputStream(tarInput);
s3Client.putObject(new PutObjectRequest(bucketName, bucketFolder + fileName, shieldedInput, metadata));
When AWS closes the provided CloseShieldInputStream, the underlying TarArchiveInputStream will remain open.
PS. I don't know what ByteArrayInputStream(tarInput.getCurrentEntry()) does, but it looks very strange; I ignored it for the purposes of this answer.