How to Read a CSV File from an S3 Bucket Using Pandas in Python

Read a CSV file from AWS S3 using boto3 and pandas

Here is what I have done to successfully read a DataFrame from a CSV on S3.

import pandas as pd
import boto3

bucket = "yourbucket"
file_name = "your_file.csv"

s3 = boto3.client('s3')
# 's3' is the service name; this creates a client using your default AWS credentials and config

obj = s3.get_object(Bucket=bucket, Key=file_name)
# fetch the object (the file) from the bucket by its key

initial_df = pd.read_csv(obj['Body'])  # 'Body' is a file-like streaming object
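
As an aside not covered in the answer above: if the s3fs package is installed, pandas can also read the same object directly from an s3:// URL. A minimal sketch, reusing the placeholder bucket and file names from the code above:

import pandas as pd

# assumes the s3fs package is installed and AWS credentials are configured;
# "yourbucket" and "your_file.csv" are the same placeholders used above
initial_df = pd.read_csv("s3://yourbucket/your_file.csv")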

Python - How to read CSV file retrieved from S3 bucket?

csv.reader does not require a file. It can use anything that iterates through lines, including files and lists.

So you don't need a filename. Just pass the lines from response['Body'] directly into the reader. One way to do that is

import csv

lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
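
If the file is large and you'd rather not hold the whole body in memory, here is a rough sketch of a streaming variant, assuming response is the result of a boto3 get_object call:

import codecs
import csv

# decode the streaming body line by line instead of reading it all at once
reader = csv.reader(codecs.getreader('utf-8')(response['Body']))
for row in reader:
    print(row)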

Reading a subset of csv files from S3 bucket using lambda and boto3

Prefixes cannot contain wildcard characters.

You should use:

prefix = 'folder/data_overview_'

If you need to further limit to only CSV files, then you will need to do that with an if statement within your Python code.
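
As a rough sketch of that approach (the bucket and prefix names are placeholders), list the objects under the prefix and keep only the .csv keys:

import boto3

s3 = boto3.client('s3')

# list everything under the prefix, then filter to CSV files in Python
resp = s3.list_objects_v2(Bucket='yourbucket', Prefix='folder/data_overview_')
for item in resp.get('Contents', []):
    key = item['Key']
    if key.endswith('.csv'):
        print(key)  # read it here with get_object / pandas as needed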

Load CSV file into Pandas from s3 using chunksize

The pandas documentation (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) clearly says:

filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

If you want to pass in a path object, pandas accepts any os.PathLike.

By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

When reading in chunks, pandas returns an iterator object, and you need to iterate through it.
Something like:

for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    ...  # process each df chunk here

And if you suspect the issue is that the chunksize is too large, you can try it on just the first chunk with a small chunksize, like this:

for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
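
If you need to keep results from every chunk, a common pattern is to collect the pieces and concatenate them at the end. A minimal sketch (the column name and filter condition are purely illustrative):

import pandas as pd

chunks = []
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # keep only the rows you need from each chunk ('some_column' is a made-up example)
    chunks.append(df[df['some_column'] > 0])
result = pd.concat(chunks, ignore_index=True)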

Reading a csv from S3 and uploading it back after updates

Was able to do it like this.

import io

csv_buffer = io.StringIO()
for line in existing_data:
    csv_buffer.write(','.join(line) + '\n')
s3.put_object(Bucket=bucket_name, Key=myKey, Body=csv_buffer.getvalue())
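
For the full round trip with pandas (read the CSV, update it, upload it back), a minimal sketch with placeholder bucket and key names might look like this:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')
bucket_name, my_key = 'yourbucket', 'your_file.csv'  # placeholders

# read the existing CSV from S3 into a dataframe
obj = s3.get_object(Bucket=bucket_name, Key=my_key)
df = pd.read_csv(obj['Body'])

# ... apply your updates to df here ...

# serialize the updated dataframe and upload it back to the same key
csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)
s3.put_object(Bucket=bucket_name, Key=my_key, Body=csv_buffer.getvalue())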

How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda

You can use io.BytesIO to get the bytes data into memory and then use pandas' read_csv to turn it into a dataframe. Note that there is a strange SSL download limit that leads to issues when downloading data > 2 GB in one read. That is why I have used chunking in the code below.

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)  # bucketname and csv_filename defined elsewhere
# This should prevent the 2GB download limit from a python ssl internal
chunks = (chunk for chunk in obj["Body"].iter_chunks(chunk_size=1024**3))
data = io.BytesIO(b"".join(chunks))  # This keeps everything fully in memory
df = pd.read_csv(data)  # here you can also provide any necessary args and kwargs

AWS Lambda - read csv and convert to pandas dataframe

I believe that your problem is likely tied to this line - df=pd.DataFrame(list(reader(data))) - in your function. The answer below should allow you to read the csv file into a pandas dataframe for processing.

import boto3
import pandas as pd
from io import BytesIO

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

        ###########################################
        # one of these methods should work for you.
        # Method 1
        df_s3_data = pd.read_csv(resp['Body'], sep=',')
        #
        # Method 2
        # df_s3_data = pd.read_csv(BytesIO(resp['Body'].read()))
        ###########################################
        print(df_s3_data.head())

    except Exception as err:
        print(err)
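
One caveat worth adding: the object key in an S3 event record is URL-encoded, so if your file names can contain spaces or special characters, you may need to decode the key before calling get_object:

from urllib.parse import unquote_plus

# S3 event notifications URL-encode the key (e.g. spaces arrive as '+')
s3_file_name = unquote_plus(event["Records"][0]["s3"]["object"]["key"])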

