AWS Lambda: How to Extract a Tgz File in an S3 Bucket and Put It in Another S3 Bucket

How to extract files in S3 on the fly with boto3?

Amazon S3 is a storage service. There is no in-built capability to manipulate the content of files.

However, you could use an AWS Lambda function to retrieve an object from S3, unzip it, then upload the contents back to S3. Note that Lambda provides 512 MB of temporary disk space (/tmp) by default, configurable up to 10,240 MB, so avoid unpacking too much data at once.

You could configure the S3 bucket to trigger the Lambda function when a new file is created in the bucket. The Lambda function would then:

  • Use boto3 (assuming you like Python) to download the new file
  • Use the zipfile Python library to extract files (or tarfile for .tar, .tar.gz and .tgz archives)
  • Use boto3 to upload the resulting file(s)
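
The steps above can be sketched as a Lambda handler for a .tgz archive. This is a minimal sketch, not a definitive implementation: DEST_BUCKET and the extract_tgz helper are hypothetical names, the function is assumed to be triggered by S3 "object created" notifications, and the whole archive is read into memory, so it only suits archives that fit within the function's memory limit.

```python
import io
import tarfile
import urllib.parse

DEST_BUCKET = "my-extracted-bucket"  # hypothetical destination bucket name


def extract_tgz(s3, src_bucket, key, dest_bucket):
    """Download one .tgz object, upload each regular file in it, return the new keys."""
    archive = s3.get_object(Bucket=src_bucket, Key=key)["Body"].read()
    prefix = key.rsplit(".", 1)[0]  # drop the last extension: "in/data.tgz" -> "in/data"
    uploaded = []
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            dest_key = f"{prefix}/{member.name}"
            s3.upload_fileobj(Fileobj=tar.extractfile(member),
                              Bucket=dest_bucket, Key=dest_key)
            uploaded.append(dest_key)
    return uploaded


def lambda_handler(event, context):
    # Imported here so extract_tgz can be exercised locally without boto3 installed
    import boto3

    s3 = boto3.client("s3")
    results = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.extend(extract_tgz(s3, bucket, key, DEST_BUCKET))
    return results
```

Writing to a different bucket than the one that triggers the function avoids the function re-triggering itself on its own uploads.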

Sample code

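import gzip
from io import BytesIO

import boto3

# Placeholder names; replace with your own bucket and keys
bucket = 'my-bucket'
gzip_key = 'path/to/file.gz'
uncompressed_key = 'path/to/file'

s3 = boto3.client('s3')

# Read the gzipped object into memory, then upload the decompressed stream
compressed = s3.get_object(Bucket=bucket, Key=gzip_key)['Body'].read()
s3.upload_fileobj(
    Fileobj=gzip.GzipFile(fileobj=BytesIO(compressed), mode='rb'),
    Bucket=bucket,
    Key=uncompressed_key)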

How to unzip (or rather untar) in-place on Amazon S3?

With AWS Lambda you can do this! If the function runs in the same region as the bucket, you avoid the data transfer costs of downloading the archive out of AWS and uploading the result back in.

Follow this blog post, but extract the archive instead of creating one:
https://www.antstack.io/blog/create-zip-using-lambda-with-files-streamed-from-s3/
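
A rough Python analogue of that streaming idea, in the reverse direction, might look like the sketch below. It is not the blog's code. The helper name stream_untar and its dest_prefix parameter are hypothetical, and s3 is assumed to be a boto3 S3 client. The archive is read as a forward-only stream with tarfile's "r|*" mode, so only one member at a time is held in memory rather than the whole archive.

```python
import tarfile


def stream_untar(s3, src_bucket, src_key, dest_bucket, dest_prefix=""):
    """Untar an S3 object member by member and upload each file; return the new keys."""
    body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
    uploaded = []
    # "r|*" opens a forward-only stream and detects the compression automatically,
    # so members must be consumed in the order they appear in the archive
    with tarfile.open(fileobj=body, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            data = tar.extractfile(member).read()  # one member in memory at a time
            dest_key = dest_prefix + member.name
            s3.put_object(Bucket=dest_bucket, Key=dest_key, Body=data)
            uploaded.append(dest_key)
    return uploaded
```

A single very large member would still need to fit in memory with this approach; the gain is only that the archive as a whole does not.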

How to read the content from tar file in lambda function in AWS

You can read tar file content with the following approach:

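import tarfile
from io import BytesIO

import boto3

s3 = boto3.client('s3')
# Key is the S3 key of the tar file, e.g. taken from the Lambda event
tar_file_obj = s3.get_object(Bucket='abc-logs', Key=Key)
tar_content = tar_file_obj['Body'].read()

with tarfile.open(fileobj=BytesIO(tar_content)) as tar:
    for tar_resource in tar:
        if tar_resource.isfile():
            inner_file_bytes = tar.extractfile(tar_resource).read()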
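
To keep the contents of every member rather than only the last one read, the same loop can collect them into a dictionary. This is a small sketch under assumptions: read_tar_members and its max_bytes cap are hypothetical, and the cap is only an illustrative guard against archives that would not fit in the function's memory.

```python
import io
import tarfile


def read_tar_members(tar_content, max_bytes=100 * 1024 * 1024):
    """Return {member name: bytes} for the regular files in a tar archive."""
    files = {}
    total = 0
    with tarfile.open(fileobj=io.BytesIO(tar_content)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            total += member.size  # uncompressed size recorded in the tar header
            if total > max_bytes:
                raise ValueError("archive contents exceed max_bytes")
            files[member.name] = tar.extractfile(member).read()
    return files
```

Called as read_tar_members(tar_content), it returns a mapping that can be iterated to process or re-upload each inner file.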

Handling Streaming TarArchiveEntry to S3 Bucket from a .tar.gz file

That's true: the AWS SDK closes the InputStream passed to PutObjectRequest, and I don't know of a way to tell it not to.

However, you can wrap the TarArchiveInputStream in a CloseShieldInputStream from Commons IO, like this:

InputStream shieldedInput = new CloseShieldInputStream(tarInput);

s3Client.putObject(new PutObjectRequest(bucketName, bucketFolder + fileName, shieldedInput, metadata));

When the SDK closes the provided CloseShieldInputStream, the underlying TarArchiveInputStream remains open, so the next tar entry can still be read.


PS. I don't know what ByteArrayInputStream(tarInput.getCurrentEntry()) does but it looks very strange. I ignored it for the purpose of this answer.


