Optimal Way to Store Data from Pandas to Snowflake

Optimal way to store data from Pandas to Snowflake

The optimal way, as ilja-everila pointed out, is COPY INTO. Since Snowflake requires the CSV to be staged in cloud storage before the transformation, I was hesitant to do it, but it seems like that is the only option, given that performance is 5-10 minutes for 6.5 million records.
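
For reference, here is a minimal sketch of that PUT / COPY INTO pattern using the Python connector; the connection parameters, file path, and table name are placeholders:

import snowflake.connector

# connection parameters below are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# stage the local CSV on the table's internal stage, then bulk load it
cur.execute("PUT file:///tmp/my_data.csv @%MY_TABLE AUTO_COMPRESS=TRUE")
cur.execute("""
    COPY INTO MY_TABLE
    FROM @%MY_TABLE
    FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
""")

cur.close()
conn.close()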

Snowflake pandas pd_writer writes out tables with NULLs

Turns out, the documentation (arguably, Snowflake's weakest point) is out of sync with reality. This is the real issue: https://github.com/snowflakedb/snowflake-connector-python/issues/329. All it needs is a single character in the column name to be upper case and it works perfectly.

My workaround is to simply do: df.columns = map(str.upper, df.columns) before invoking to_sql.
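
Put together, a minimal sketch of that workaround (the connection string and table name are placeholders, and snowflake-sqlalchemy is assumed to be installed):

import pandas as pd
from snowflake.connector.pandas_tools import pd_writer
from sqlalchemy import create_engine

# placeholder connection string
engine = create_engine(
    "snowflake://my_user:my_password@my_account/my_db/my_schema?warehouse=my_wh"
)

df = pd.DataFrame({"key1": ["value1"], "key2": [1234]})

# upper-case the column names so pd_writer's quoted identifiers match the table
df.columns = map(str.upper, df.columns)

df.to_sql("my_table", engine, index=False, method=pd_writer, if_exists="append")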

Writing data into snowflake using Python

The absolute fastest way to load data into Snowflake is from a file on either internal or external stage. Period. All connectors have the ability to insert the data with standard insert commands, but this will not perform as well. That said, many of the Snowflake drivers are now transparently using PUT/COPY commands to load large data to Snowflake via internal stage. If this is what you are after, then you can leverage the pandas write_pandas command to load data from a pandas dataframe to Snowflake in a single command. Behind the scenes, it will execute the PUT and COPY INTO for you.

https://docs.snowflake.com/en/user-guide/python-connector-api.html#label-python-connector-api-write-pandas

I highly recommend this pattern over INSERT commands in any driver. And I would also recommend transforms be done AFTER loading to Snowflake, not before.
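
A minimal sketch of that write_pandas pattern (the connection parameters and table name are placeholders):

import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

# connection parameters below are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)

df = pd.DataFrame({"KEY1": ["value1"], "KEY2": [1234]})

# stages the dataframe as Parquet and runs PUT + COPY INTO behind the scenes
success, nchunks, nrows, _ = write_pandas(conn, df, "MY_TABLE")
print(success, nrows)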

Passing variant data to Snowflake

Well, as a first step, let's check that our base SQL is in an acceptable form:

create table MY_TABLE(MY_VARIANT_DATA variant);
insert into my_table(my_variant_data) values 
(parse_json('{"key1":"value1","key2":1234}'));

This fails with:

Invalid expression [PARSE_JSON('{"key1":"value1","key2":1234}')] in VALUES clause

OK, so let's make that an INSERT ... SELECT with PARSE_JSON:

insert into my_table(my_variant_data) select 
(parse_json('{"key1":"value1","key2":1234}'));

number of rows inserted
1
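
From Python, the same INSERT ... SELECT PARSE_JSON pattern can be driven with a bound parameter; a minimal sketch using the Python connector (the connection parameters are placeholders):

import json
import snowflake.connector

# connection parameters below are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

payload = json.dumps({"key1": "value1", "key2": 1234})

# PARSE_JSON is not allowed in a VALUES clause, so bind the string inside a SELECT
cur.execute(
    "insert into my_table(my_variant_data) select parse_json(%s)",
    (payload,),
)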

What is the best way to run a report in notebooks when connected to snowflake connector?

My question is: my data comes from an API that returns JSON files; new data is no bigger than 75 MB a day across 8 columns, with two aggregate calls on the data done in the SQL call. If I run these visualizations monthly, is it better to aggregate the information in Snowflake, or locally?

I would flatten your data in Python or Snowflake, depending on which you feel more comfortable using or how complex the data is. You can just do everything on the straight JSON, although I would rarely look to design something that way myself (it's going to be the slowest to query).

As far as aggregating the data, I'd always do that in Snowflake. If you would like to slice and dice the data various ways, you may look to design a data mart data model and have your dashboard simply aggregate data on the fly via queries. Snowflake should be pretty good with that, but for additional speed, aggregating it up to months may be a good idea too.
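
For example, a monthly aggregate can be pushed down to Snowflake so only the small result set reaches pandas; a minimal sketch (the table and column names are placeholders):

import snowflake.connector

# connection parameters below are placeholders
conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    warehouse="my_wh", database="my_db", schema="my_schema",
)
cur = conn.cursor()

# aggregate on Snowflake; only the summarised rows come back to the client
cur.execute("""
    select date_trunc('month', event_date) as month,
           sum(amount)                     as total_amount,
           count(*)                        as row_count
    from my_table
    group by 1
    order by 1
""")
monthly_df = cur.fetch_pandas_all()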

You can probably mature your process from being driven by a local Python script to something like a serverless Lambda, event-driven with a scheduler, as well.


