Reading Contents of a Gzip File from AWS S3 in Python

Contents of a gzip file from AWS S3 in Python only returning null bytes

That's a tar.gz file, i.e. a tar archive that's been compressed with the gzip algorithm.

If you just read it with gzip.GzipFile(), you still have a binary tar archive you need to interpret.

Use the tarfile module to read it; tar archives, like zips, can contain multiple files, one of which is the .jsonl file you end up seeing.
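As a rough sketch of that approach, the snippet below opens a .tar.gz payload held in memory and collects the contents of every regular file in it. The function name read_tar_gz_members is made up for illustration, and data is assumed to be the bytes you already downloaded from S3 (for example via response['Body'].read()).

```python
import io
import tarfile


def read_tar_gz_members(data: bytes) -> dict:
    """Return {member name: file bytes} for each regular file in a .tar.gz payload.

    `data` is assumed to be the raw bytes of the S3 object.
    """
    contents = {}
    # mode="r:gz" tells tarfile to undo the gzip layer and then parse the tar archive
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile():  # skip directories, links, etc.
                contents[member.name] = archive.extractfile(member).read()
    return contents
```

If the archive holds a single .jsonl file, the returned dictionary will have one entry whose value can be decoded and split into lines.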

Read gzip file from s3 bucket

gzip.open expects a filename or an already opened file object, but you are passing it the downloaded data directly. Try using gzip.decompress instead:

filedata = fileobj['Body'].read()
uncompressed = gzip.decompress(filedata)
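To illustrate the difference, here is a small sketch that runs without S3: raw stands in for the bytes returned by fileobj['Body'].read(). It shows gzip.decompress on the bytes directly, and the alternative of wrapping the bytes in io.BytesIO so that gzip.open receives the file object it expects.

```python
import gzip
import io

# Stand-in for fileobj['Body'].read(); in real code these bytes come from S3
raw = gzip.compress(b"hello from s3\n")

# Option 1: decompress the whole payload in one call
text_a = gzip.decompress(raw).decode("utf-8")

# Option 2: wrap the bytes in a file-like object and let gzip.open read from it
with gzip.open(io.BytesIO(raw), mode="rt", encoding="utf-8") as fh:
    text_b = fh.read()
```

Both options load the full object into memory first, so they suit small to moderately sized files.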

How can I decode a .gz file from S3 using an AWS Lambda function?

You're correct: you can't decode the compressed bytes directly into text, so decompress them first. You'll want something like:

import io
import gzip
import json

import boto3
from urllib.parse import unquote_plus

def handler_name(event, context):
    s3client = boto3.client('s3')
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        # Object keys in S3 event notifications are URL-encoded
        key = unquote_plus(record['s3']['object']['key'])

        response = s3client.get_object(Bucket=bucket, Key=key)
        content = response['Body'].read()
        # Wrap the raw bytes so GzipFile can read them as a file object
        with gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb') as fh:
            yourJson = json.load(fh)

You can then use the yourJson variable to read the JSON.
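Note that json.load expects the decompressed file to hold a single JSON document. If the object is instead in JSON Lines format (one JSON value per line), a variant along these lines may be closer to what you need. This is only a sketch: iter_json_lines is a hypothetical helper name, and content is assumed to be the bytes read from the S3 response body.

```python
import gzip
import io
import json


def iter_json_lines(content: bytes):
    """Yield one parsed JSON value per non-empty line of a gzipped payload."""
    with gzip.GzipFile(fileobj=io.BytesIO(content), mode='rb') as fh:
        for raw_line in fh:
            if raw_line.strip():  # ignore blank lines
                yield json.loads(raw_line)
```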

compress .txt file on s3 location to .gz file

It would appear that you are writing an AWS Lambda function.

A simpler program flow would probably be:

  • Download the file to /tmp/ using s3_client.download_file()
  • Gzip the file
  • Upload the file to S3 using s3_client.upload_file()
  • Delete the files in /tmp/
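The steps above could be sketched roughly as follows. The function name gzip_s3_object, the tmp_dir parameter, and the choice to upload under the original key plus a ".gz" suffix are all assumptions for illustration; s3_client is expected to be a boto3 S3 client, for example boto3.client('s3').

```python
import gzip
import os
import shutil
import tempfile


def gzip_s3_object(s3_client, bucket, key, tmp_dir=None):
    """Download bucket/key, gzip it locally, upload it as key + '.gz', then clean up."""
    # In AWS Lambda the writable temporary directory is /tmp
    tmp_dir = tmp_dir or tempfile.gettempdir()
    local_path = os.path.join(tmp_dir, os.path.basename(key))
    gz_path = local_path + ".gz"
    gz_key = key + ".gz"  # assumed naming scheme for the compressed object
    try:
        s3_client.download_file(bucket, key, local_path)
        # Stream the file through gzip rather than reading it all into memory
        with open(local_path, "rb") as src, gzip.open(gz_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        s3_client.upload_file(gz_path, bucket, gz_key)
    finally:
        # Delete the temporary files even if a step above failed
        for path in (local_path, gz_path):
            if os.path.exists(path):
                os.remove(path)
    return gz_key
```

Cleaning up in a finally block matters in Lambda because /tmp can persist between invocations of a warm container and has limited space.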

Also, please note that the AWS Lambda function might be invoked with multiple objects being passed via the event. However, your code is currently only processing the first record with event['Records'][0]. The program should loop through these records like this:

for record in event['Records']:
    source_bucket = record['s3']['bucket']['name']
    file_key_name = record['s3']['object']['key']
    ...

