Speed Up CSV Import

Speed up CSV import

I don't think it will get much faster.

That said, some testing shows that a significant part of the time is spent on transcoding (about 15% for my test case). So if you can skip that (e.g. by creating the CSV in UTF-8 in the first place), you will see some improvement.
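
For example, you could transcode the file once up front instead of transcoding every row during the import. A quick sketch in Python, purely as an illustration (any tool such as iconv does the same job; the output file name is made up):

# one-off transcoding: read the file as ISO-8859-15 and write it back out
# as UTF-8, so the CSV importer no longer has to transcode anything
with open('shate.csv', encoding='iso-8859-15') as src, \
     open('shate_utf8.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)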

Besides, according to ruby-doc.org, the "primary" interface for reading CSVs is CSV.foreach, so it should be preferred:

require 'csv'

def csv_import
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
              encoding: 'ISO-8859-15:UTF-8', col_sep: ';',
              row_sep: :auto, headers: :first_row) do |row|
    # use row here...
  end
end

Update

You could also try splitting the parsing into several threads. I saw some performance increase experimenting with this code (handling of the header row left out):

N = 10000

def csv_import
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines
  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []
  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do |plines|
    # add an array object for the current chunk to parts
    parts << result = []
    # create a thread for parsing the current chunk, hand it over the chunk
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do |tsrc, tresult|
      # parse the chunk
      parsed = CSV.parse(tsrc, encoding: 'ISO-8859-15:UTF-8',
                               col_sep: ';', row_sep: :auto)
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end
  # wait for all threads to finish
  threads.each(&:join)
  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do |row|
    # use row (Array)
  end
end

This splits the input into chunks of 10000 lines and creates a parsing thread for each chunk. Each thread gets handed a sub-array of the array parts for storing its result. When all threads are finished (after threads.each(&:join)), the results of all chunks in parts are joined, and that's it.

How to speed up Python CSV Read to MySQL Write

I created a pseudo-random CSV file where each row has the form "111.222.333.444,555.666.777.888,A continent". The file contains 33 million rows. The following code was able to insert all rows into a MySQL database table in ~3 minutes:

import mysql.connector
import time
import concurrent.futures
import csv
import itertools

CSVFILE = '/Users/Andy/iplist.csv'
CHUNK = 10_000

def doBulkInsert(rows):
    with mysql.connector.connect(user='andy', password='monster', host='localhost', database='andy') as connection:
        connection.cursor().executemany('INSERT INTO ips (ip_start, ip_end, continent) VALUES (%s, %s, %s)', rows)
        connection.commit()

def main():
    with open(CSVFILE) as csvfile:
        csvdata = csv.reader(csvfile)
        _s = time.perf_counter()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            while (data := list(itertools.islice(csvdata, CHUNK))):
                executor.submit(doBulkInsert, data)
            executor.shutdown(wait=True)
    print(f'Duration = {time.perf_counter()-_s}')

if __name__ == '__main__':
    main()

How to improve the speed of inserting CSV data into a database in PHP?

Instead of inserting data into the database one row at a time, try inserting in batches.

You can do a bulk insert that takes n entries (say 1000) and inserts them into the table in a single statement.

https://www.mysqltutorial.org/mysql-insert-multiple-rows/

This reduces the number of DB calls and thereby the overall time.

Also, with 80k entries there is a chance that you will exceed the memory limit.

You can overcome that by using generators in PHP.
https://medium.com/@aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b

Although the example is in Laravel, the code that reads from the CSV (the part that uses a generator) is framework-independent, and the same logic can be used here.
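
The question is about PHP, but the batching idea itself is language-independent. Here is a minimal sketch of the batch-insert-plus-generator approach in Python with mysql.connector, purely as an illustration (table name, columns and credentials are made up):

import csv
import itertools

import mysql.connector

BATCH = 1000  # rows per INSERT statement

def read_rows(path):
    # generator: yields one CSV row at a time, so the whole file
    # never has to be held in memory
    with open(path, newline='') as f:
        yield from csv.reader(f)

def import_csv(path):
    connection = mysql.connector.connect(user='user', password='secret',
                                         host='localhost', database='mydb')
    cursor = connection.cursor()
    rows = read_rows(path)
    while True:
        batch = list(itertools.islice(rows, BATCH))
        if not batch:
            break
        # one multi-row INSERT per batch instead of one INSERT per row
        cursor.executemany(
            'INSERT INTO my_table (col_a, col_b, col_c) VALUES (%s, %s, %s)',
            batch)
    connection.commit()
    connection.close()

import_csv('data.csv')

The same pattern carries over to PHP: pull rows from the generator, collect them into an array of 1000, and execute one multi-row INSERT per batch inside a transaction.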

Speed up the process of importing multiple CSVs into a Python dataframe

You may try the following: read only those columns that you really need, use a list comprehension, and call pd.concat([ ... ], ignore_index=True) just once, because it's pretty slow:

# there is no point in reading columns that you don't need
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols,
                 parse_dates=date_cols)
     for f in allFiles],
    ignore_index=True
)

This should work if you have enough memory to store the two resulting DFs...

How to speed up creating MySQL table from large CSV file?

You can import the CSV into MySQL with a query:

LOAD DATA INFILE 'data.csv' 
INTO TABLE my_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
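
If you want to run this from a script rather than from the mysql client, here is a minimal sketch using Python's mysql.connector, assuming the file lives on the client machine (so the LOCAL variant is used; local_infile must be enabled on the server and allow_local_infile on the client; credentials and names are made up):

import mysql.connector

# allow_local_infile is needed for LOAD DATA LOCAL INFILE;
# the server must also have local_infile enabled
connection = mysql.connector.connect(user='user', password='secret',
                                     host='localhost', database='mydb',
                                     allow_local_infile=True)
cursor = connection.cursor()
cursor.execute(r"""
    LOAD DATA LOCAL INFILE 'data.csv'
    INTO TABLE my_table
    FIELDS TERMINATED BY ','
    ENCLOSED BY '"'
    LINES TERMINATED BY '\n'
    IGNORE 1 ROWS
""")
connection.commit()
connection.close()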

