Read a CSV file from AWS S3 using boto3 and pandas
Here is what I have done to successfully read the DataFrame from a CSV on S3.
import pandas as pd
import boto3

bucket = "yourbucket"
file_name = "your_file.csv"

# Create a client for S3 using the default configuration
s3 = boto3.client('s3')

# Fetch the object (file) identified by its key from the bucket
obj = s3.get_object(Bucket=bucket, Key=file_name)

# The response's 'Body' is a streaming file-like object pandas can read directly
initial_df = pd.read_csv(obj['Body'])
Python - How to read CSV file retrieved from S3 bucket?
csv.reader does not require a file. It accepts anything that iterates over lines, including file objects and lists.
So you don't need a filename. Decode the bytes from response['Body'] and pass the resulting lines directly into the reader. One way to do that is
import csv

lines = response['Body'].read().decode('utf-8').splitlines(True)
reader = csv.reader(lines)
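To illustrate the point that csv.reader works on any iterable of lines, here is a self-contained sketch using a plain list in place of the decoded S3 body (the sample rows are made up for demonstration):

```python
import csv

# Simulate the decoded lines you would get from response['Body']
lines = ["name,age\n", "alice,30\n", "bob,25\n"]

# csv.reader accepts any iterable of text lines, not just a file handle
reader = csv.reader(lines)
rows = list(reader)
print(rows)  # [['name', 'age'], ['alice', '30'], ['bob', '25']]
```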
Reading a subset of csv files from S3 bucket using lambda and boto3
Prefixes cannot contain wildcard characters.
You should use:
prefix = 'folder/data_overview_'
If you need to further limit to only CSV files, then you will need to do that with an if
statement within your Python code.
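A minimal sketch of that if-statement filter, kept as a pure function so it is easy to test (the bucket, prefix, and keys shown are hypothetical):

```python
def filter_csv_keys(keys):
    """Keep only object keys that end in .csv."""
    return [k for k in keys if k.endswith('.csv')]

# With boto3 you would feed it the keys returned for your prefix, e.g.:
# resp = s3.list_objects_v2(Bucket=bucket, Prefix='folder/data_overview_')
# csv_keys = filter_csv_keys(obj['Key'] for obj in resp.get('Contents', []))

print(filter_csv_keys(['folder/data_overview_1.csv',
                       'folder/data_overview_1.json']))
```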
Load CSV file into Pandas from s3 using chunksize
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html
clearly says that:

filepath_or_buffer : str, path object or file-like object
Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv. If you want to pass in a path object, pandas accepts any os.PathLike. By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

When reading in chunks, pandas returns an iterator object; you need to iterate through it.
Something like:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=100000):
    # process each df chunk here
And if you think it's because the chunksize is large, you can consider trying it for the first chunk only for a small chunksize like this:
for df in pd.read_csv('s3://<<bucket-name>>/<<filename>>', chunksize=1000):
    print(df.head())
    break
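The same iteration pattern works against any file-like source, so here is a self-contained sketch with a small in-memory CSV standing in for the S3 file (the data is made up for demonstration):

```python
import io

import pandas as pd

# Ten rows of hypothetical data: y = 2 * x for x in 0..9
csv_text = "x,y\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
n_chunks = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    total += chunk["y"].sum()  # process each chunk as it arrives
    n_chunks += 1

print(n_chunks, total)  # 3 chunks (4 + 4 + 2 rows), total = 2 * 45 = 90
```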
Reading a csv from S3 and uploading it back after updates
Was able to do it like this:

import io

csv_buffer = io.StringIO()
for line in existing_data:
    csv_buffer.write(','.join(line) + '\n')

s3.put_object(Bucket=bucket_name, Key=myKey, Body=csv_buffer.getvalue())
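If the updates are applied with pandas, the same round trip can be sketched with read_csv and to_csv; the CSV content and column names below are hypothetical, and the final put_object call is commented out because it needs a real bucket and key:

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the object downloaded from S3
df = pd.read_csv(io.StringIO("id,score\n1,10\n2,20"))
df["score"] = df["score"] + 5          # apply the updates

csv_buffer = io.StringIO()
df.to_csv(csv_buffer, index=False)     # serialize back to CSV text

# Upload the updated CSV (requires a real bucket/key):
# s3.put_object(Bucket=bucket_name, Key=myKey, Body=csv_buffer.getvalue())
print(csv_buffer.getvalue())
```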
How to convert S3 bucket content(.csv format) into a dataframe in AWS Lambda
You can use io.BytesIO
to get the bytes data into memory and after that use pandas read_csv
to transform it into a DataFrame. Note that there is a strange SSL download limit that will lead to an issue when downloading data > 2GB in a single read. That is why I have used chunking in the code below.
import io

obj = s3.get_object(Bucket=bucketname, Key=csv_filename)

# Reading in chunks should prevent the 2GB download limit in Python's internal SSL
chunks = obj["Body"].iter_chunks(chunk_size=1024**3)
data = io.BytesIO(b"".join(chunks))  # note: this keeps the whole file in memory

df = pd.read_csv(data)  # pass any additional read_csv args and kwargs here as needed
AWS Lambda - read csv and convert to pandas dataframe
I believe that your problem is likely tied to this line in your function: df = pd.DataFrame(list(reader(data))). The code below should allow you to read the CSV file into a pandas DataFrame for processing.
import boto3
import pandas as pd
from io import StringIO

s3_client = boto3.client('s3')

def lambda_handler(event, context):
    try:
        bucket_name = event["Records"][0]["s3"]["bucket"]["name"]
        s3_file_name = event["Records"][0]["s3"]["object"]["key"]
        resp = s3_client.get_object(Bucket=bucket_name, Key=s3_file_name)

        ###########################################
        # One of these methods should work for you.
        # Method 1: pass the streaming body directly
        df_s3_data = pd.read_csv(resp['Body'], sep=',')

        # Method 2: decode the bytes first, then wrap them in StringIO
        # df_s3_data = pd.read_csv(StringIO(resp['Body'].read().decode('utf-8')))
        ###########################################
        print(df_s3_data.head())
    except Exception as err:
        print(err)