Speed up csv import
I don't think it will get much faster.
That said, some testing shows that a significant part of the time is spent on transcoding (about 15% in my test case). So if you can skip that (e.g. by creating the CSV in UTF-8 in the first place), you will see some improvement.
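One way to skip the per-row transcoding is to re-encode the file to UTF-8 once, up front, so every later import reads it natively. A minimal sketch in Python (the source encoding and paths are assumptions for illustration):

```python
from pathlib import Path

def convert_to_utf8(src, dst, src_encoding='iso-8859-15'):
    # Read with the legacy encoding once, write back as UTF-8 so that
    # subsequent imports can skip transcoding entirely.
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding='utf-8')
```

This trades one extra pass over the file for transcoding-free parsing on every import afterwards.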
Besides, according to ruby-doc.org, the primary interface for reading CSV files is foreach, so this should be preferred:
require 'csv'

def csv_import
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do |row|
    # use row here...
  end
end
Update: You could also try splitting the parsing across several threads. I saw some performance improvement experimenting with this code (handling of the header row left out):
N = 10000

def csv_import
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines
  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []
  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do |plines|
    # add an array object for the current chunk to parts
    parts << result = []
    # create a thread for parsing the current chunk, hand it over the chunk
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do |tsrc, tresult|
      # parse the chunk
      parsed = CSV.parse(tsrc, {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto})
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end
  # wait for all threads to finish
  threads.each(&:join)
  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do |row|
    # use row (Array)
  end
end
This splits the input into chunks of 10000 lines and creates a parsing thread for each chunk. Each thread is handed a sub-array of the array parts for storing its result. When all threads have finished (after threads.each(&:join)), the results of all chunks in parts are joined, and that's it.
How to speed up Python CSV Read to MySQL Write
I created a pseudo-random CSV file where each row is of the style "111.222.333.444,555.666.777.888,A continent". The file contains 33 million rows. The following code was able to insert all rows into a MySQL database table in about 3 minutes:
import mysql.connector
import time
import concurrent.futures
import csv
import itertools

CSVFILE = '/Users/Andy/iplist.csv'
CHUNK = 10_000

def doBulkInsert(rows):
    with mysql.connector.connect(user='andy', password='monster', host='localhost', database='andy') as connection:
        connection.cursor().executemany('INSERT INTO ips (ip_start, ip_end, continent) VALUES (%s, %s, %s)', rows)
        connection.commit()

def main():
    with open(CSVFILE) as csvfile:
        csvdata = csv.reader(csvfile)
        _s = time.perf_counter()
        with concurrent.futures.ThreadPoolExecutor() as executor:
            while (data := list(itertools.islice(csvdata, CHUNK))):
                executor.submit(doBulkInsert, data)
            executor.shutdown(wait=True)
        print(f'Duration = {time.perf_counter() - _s}')

if __name__ == '__main__':
    main()
How to improve the speed of insertion of the csv data in a database in php?
Instead of inserting data into the database one row at a time, try inserting in batches.
You can always do a bulk insert that takes n entries (use e.g. 1000) and inserts them into the table in a single statement:
https://www.mysqltutorial.org/mysql-insert-multiple-rows/
This reduces the number of DB calls, thereby reducing the overall time.
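To illustrate the multi-row form, here is a sketch (in Python, since the idea is language-independent; the table and column names are hypothetical) that builds one INSERT statement for a whole batch of rows:

```python
def multi_row_insert(table, cols, batch):
    # One placeholder group "(%s, %s, ...)" per row, all in a single
    # statement, so a batch of 1000 rows costs one DB round trip
    # instead of 1000.
    row_ph = '(' + ', '.join(['%s'] * len(cols)) + ')'
    sql = (f"INSERT INTO {table} ({', '.join(cols)}) "
           f"VALUES {', '.join([row_ph] * len(batch))}")
    params = [value for row in batch for value in row]
    return sql, params
```

The returned sql/params pair can then be handed to the driver's usual parameterized execute().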
And for 80k entries there is a possibility that you might exceed the memory limit too.
You can overcome that using generators in PHP:
https://medium.com/@aashish.gaba097/database-seeding-with-large-files-in-laravel-be5b2aceaa0b
Although that article is about Laravel, the code that reads from the CSV (the part using a generator) is framework-independent, and the same logic can be used here.
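The generator idea, sketched here in Python (the same pattern works with PHP's yield): read the CSV lazily and hand out fixed-size batches, so the whole file never sits in memory at once. The path and batch size are placeholders:

```python
import csv

def csv_batches(path, size=1000):
    # Yield lists of up to `size` rows; only one batch is held
    # in memory at a time, regardless of the file's total size.
    with open(path, newline='') as f:
        reader = csv.reader(f)
        batch = []
        for row in reader:
            batch.append(row)
            if len(batch) == size:
                yield batch
                batch = []
        if batch:
            yield batch  # trailing partial batch
```

Each yielded batch can then be passed to a bulk-insert call, combining the low memory footprint with the reduced number of DB round trips.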
Speed up the process of importing multiple CSVs into a Python dataframe
You may try the following: read only the columns you really need, use a list comprehension, and call pd.concat([ ... ], ignore_index=True) just once, because repeated concatenation is pretty slow:
# there is no point in reading columns you don't need;
# specify the column list (EXCLUDING: 'aPaye','MethodePaiement','ArgentPercu')
cols = ['col1', 'col2', 'etc.']
date_cols = ['HeurePrevue','HeureDebutTrajet','HeureArriveeSurSite','HeureEffective']

df = pd.concat(
    [pd.read_csv(f, sep=';', dayfirst=True, usecols=cols,
                 parse_dates=date_cols)
     for f in allFiles],
    ignore_index=True
)
This should work if you have enough memory to store the data roughly twice (the per-file DataFrames plus the concatenated result).
How to speed up creating MySQL table from large CSV file?
You can import the CSV into MySQL with a single query:
LOAD DATA INFILE 'data.csv'
INTO TABLE my_table
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;