Optimal way to store data from Pandas to Snowflake
The optimal way, as ilja-everila pointed out, is "copy into...". Since Snowflake requires the CSV to be staged in cloud storage before transformation, I was hesitant to do it, but it seems to be the only option, given that performance is otherwise in the 5-10 minute range for 6.5 million records.
Snowflake pandas pd_writer writes out tables with NULLs
Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.
My workaround is to simply do df.columns = map(str.upper, df.columns) before invoking to_sql.
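As a sketch of that workaround (the dataframe, engine, and table name here are made up for illustration):

```python
import pandas as pd

# Example dataframe with all-lowercase column names, which trigger the
# quoting problem described in snowflake-connector-python issue #329.
df = pd.DataFrame({"city": ["Oslo"], "population": [709037]})

# The workaround: upper-case every column name before writing.
df.columns = map(str.upper, df.columns)
print(list(df.columns))  # ['CITY', 'POPULATION']

# Then write as usual (engine and table name are placeholders, assuming a
# SQLAlchemy engine for Snowflake):
# from snowflake.connector.pandas_tools import pd_writer
# df.to_sql("my_table", engine, index=False, method=pd_writer)
```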
Writing data into snowflake using Python
The absolute fastest way to load data into Snowflake is from a file on either an internal or external stage. Period. All connectors can insert data with standard INSERT commands, but this will not perform as well. That said, many of the Snowflake drivers now transparently use PUT/COPY commands to load large data into Snowflake via an internal stage. If this is what you are after, then you can leverage the write_pandas function to load data from a pandas dataframe to Snowflake in a single command. Behind the scenes, it will execute the PUT and COPY INTO for you.
https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-write-pandas
I highly recommend this pattern over INSERT commands in any driver. And I would also recommend transforms be done AFTER loading to Snowflake, not before.
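A minimal sketch of that pattern, assuming the connection parameters are valid and the target table already exists (all names below are placeholders):

```python
import pandas as pd

def load_frame(df: pd.DataFrame, table_name: str, conn_params: dict) -> int:
    """Load a dataframe into Snowflake via write_pandas, which stages the
    data with PUT and runs COPY INTO behind the scenes. Returns the number
    of rows written."""
    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    with snowflake.connector.connect(**conn_params) as conn:
        success, n_chunks, n_rows, _ = write_pandas(conn, df, table_name)
        return n_rows

# Usage (connection parameters are placeholders):
# rows = load_frame(df, "MY_TABLE",
#                   {"account": "...", "user": "...", "password": "..."})
```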
Passing variant data to Snowflake
Well, as a first step, let's check that our base SQL is in an acceptable form:
create table MY_TABLE(MY_VARIANT_DATA variant);
insert into my_table(my_variant_data) values
(parse_json('{"key1":"value1","key2":1234}'));
This fails with:
Invalid expression [PARSE_JSON('{"key1":"value1","key2":1234}')] in VALUES clause
OK, so let's move it into a SELECT, where PARSE_JSON is allowed:
insert into my_table(my_variant_data) select
(parse_json('{"key1":"value1","key2":1234}'));
| number of rows inserted |
| --- |
| 1 |
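From Python, the same pattern works by binding the serialized JSON as a parameter and letting Snowflake parse it server-side. A sketch, assuming an open snowflake.connector cursor named `cursor` (the table name is from the example above):

```python
import json

payload = {"key1": "value1", "key2": 1234}

# PARSE_JSON is not allowed in a VALUES clause, so use INSERT ... SELECT
# and bind the serialized JSON string as a parameter.
sql = "insert into my_table(my_variant_data) select parse_json(%s)"
bound = json.dumps(payload)

# cursor.execute(sql, (bound,))  # assumes an open Snowflake cursor
```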
What is the best way to run a report in notebooks when connected to snowflake connector?
My question is: my data is pulled from an API that returns JSON files, and new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?
I would flatten your data in python or Snowflake - depending on which you feel more comfortable using or how complex the data is. You can just do everything on the straight json, although I would rarely look to design something that way myself (it's going to be the slowest to query.)
As far as aggregating the data, I'd always do that on Snowflake. If you would like to slice and dice the data various ways, you may look to design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good with that, but for additional speed then aggregating it up to months may be a good idea too.
You can probably also mature your process from a local Python script to something like a serverless Lambda, event-driven, with a scheduler.
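To illustrate pushing the aggregation down to Snowflake (the table and column names below are invented for the sketch), the notebook would only pull back the small aggregated result rather than the raw rows:

```python
import pandas as pd

# Monthly rollup computed in Snowflake; only ~12 rows per year come back
# to the notebook, instead of every raw JSON-derived row.
query = """
    select date_trunc('month', event_ts) as month,
           count(*)                      as events,
           sum(amount)                   as total_amount
    from my_events
    group by 1
    order by 1
"""
# df = pd.read_sql(query, conn)  # `conn` is an open Snowflake connection
```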