Simple SQLite table/import question
.import does not support reshaping the input (except for setting the separator). You need to import the CSV file into a temporary table and then insert that into the real table. Here is an example session:
$ cat a.csv
1,2
3,4
5,6
$ sqlite3 a.db
SQLite version 3.6.23.1
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table foo(id integer primary key,x,y);
sqlite> create temp table footmp(x,y);
sqlite> .separator ,
sqlite> .import a.csv footmp
sqlite> select * from footmp;
1,2
3,4
5,6
sqlite> insert into foo(x,y) select * from footmp;
sqlite> select * from foo;
1,1,2
2,3,4
3,5,6
sqlite> drop table footmp;
You can see that id is counted up automatically. This is because a column with type INTEGER PRIMARY KEY is treated as an alias for the internal ROWID, which is always a unique, ascending integer.
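The same rowid-aliasing can be seen from Python's sqlite3 module; here is a minimal sketch using the same foo table (an in-memory database for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, x, y)")
# no id supplied: SQLite assigns the next ROWID automatically
cur.executemany("INSERT INTO foo (x, y) VALUES (?, ?)", [(1, 2), (3, 4), (5, 6)])
print(cur.execute("SELECT * FROM foo").fetchall())
# -> [(1, 1, 2), (2, 3, 4), (3, 5, 6)]
```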
Importing a CSV file into a sqlite3 database table using Python
import csv, sqlite3

con = sqlite3.connect(":memory:")  # change to 'your_filename.db' to write to a file on disk
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2);")  # use your column names here

with open('data.csv', 'r') as fin:  # `with` statement available in 2.5+
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(fin)  # comma is default delimiter
    to_db = [(i['col1'], i['col2']) for i in dr]

cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
con.close()
SQLite 3 CSV Import to Table
SQLite3's column types basically boil down to:
- TEXT
- NUMERIC (REAL, FLOAT)
- INTEGER (the various lengths of integer; but INT will normally do)
- BLOB (binary objects)
Generally in a CSV file you will encounter strings (TEXT), decimal numbers (FLOAT), and integers (INT). If performance isn't critical, those are pretty much the only three column types you need. (CHAR(80) is smaller on disk than TEXT, but for a few thousand rows it's not much of an issue.)
As far as putting data into the columns is concerned, SQLite3 uses type coercion to convert the input data type to the column type wherever the conversion makes sense. So all you have to do is specify the correct column type, and SQLite will take care of storing it in the correct way.
For example, the number -1230.00, the string "-1230.00", and the string "-1.23e3" will all coerce to the number -1230 when stored in a FLOAT column.
Note that if SQLite3 can't apply a meaningful type conversion, it will just store the original data without attempting to convert it at all. SQLite3 is quite happy to insert "Hello World!" into a FLOAT column. This is usually a Bad Thing.
See the SQLite3 documentation on column types and conversion for gems such as:
Type Affinity
In order to maximize compatibility between SQLite and other database
engines, SQLite supports the concept of "type affinity" on columns.
The type affinity of a column is the recommended type for data stored
in that column. The important idea here is that the type is
recommended, not required. Any column can still store any type of
data. It is just that some columns, given the choice, will prefer to
use one storage class over another. The preferred storage class for a
column is called its "affinity".
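The coercion rules quoted above are easy to check from Python with typeof(); a small sketch (in-memory database, illustrative table name):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (val FLOAT)")  # FLOAT gives the column REAL affinity
cur.executemany("INSERT INTO t (val) VALUES (?)",
                [(-1230.00,), ("-1230.00",), ("-1.23e3",), ("Hello World!",)])
for val, typ in cur.execute("SELECT val, typeof(val) FROM t"):
    print(val, typ)
# the three numeric forms are all stored as the REAL -1230.0;
# "Hello World!" is kept as text, unconverted
```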
sqlite3 import with quotes
The web page you reference is old (note the cvstrac portion of the URL, which is the giveaway; sqlite uses fossil now, not cvs). The newer version of that web page is here.
Since SQLite is public domain software, one solution to your problem is to fix sqlite's shell.c to handle your file format correctly. The problem is around line 1861, which does
if( c=='"' ) inQuote = !inQuote;
If you don't want quote delimiting, just comment out this line. The purpose of the line is so you can embed delimiters in your columns by quoting the column.
Another approach is to use a Database Manager that supports SQLite; there are many of them, and most claim to support file import/export.
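A third option, if patching shell.c isn't appealing: do the import from Python, whose csv module strips quotes and handles embedded delimiters out of the box. A minimal sketch (the file and table names here are made up for illustration):

```python
import csv
import sqlite3

# write a sample file with quoted fields, one containing the delimiter
with open("quoted.csv", "w", newline="") as f:
    f.write('"hello, world","a"\n"plain",b\n')

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2)")
with open("quoted.csv", newline="") as f:
    cur.executemany("INSERT INTO t VALUES (?, ?)", csv.reader(f))
con.commit()
print(cur.execute("SELECT * FROM t").fetchall())
# -> [('hello, world', 'a'), ('plain', 'b')]
```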
importing CSV file into sqlite table
I would guess that there really ARE double-quotes around the data. If you use Windows, a CSV will automatically open in Excel and it looks like there are no quotes because Excel interprets the file properly. However, I bet if you open the file in Notepad there will be quotes around the strings.
Then, as pointed out in the discussion above, truncate your sqlite table and reimport the data, indicating that the fields are enclosed by double quotes.
How to import a tsv file with SQLite3
You should create the table, set a separator, and import the data (see the sqlite wiki).
Example for TSV:
data.tsv
(tab as a separator):
Bob 30 1000
Wendy 20 900
1) Create a table and set TAB as a separator:
sqlite> create table people (name text, param1 int, param2 int);
sqlite> .separator "\t"
2) Import data:
sqlite> .import data.tsv people
And the result is:
sqlite> select * from people;
Bob 30 1000
Wendy 20 900
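The same import can also be done from Python by passing delimiter="\t" to csv.reader; a minimal sketch that first recreates the sample data.tsv above:

```python
import csv
import sqlite3

# recreate the sample file (tab-separated)
with open("data.tsv", "w", newline="") as f:
    f.write("Bob\t30\t1000\nWendy\t20\t900\n")

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE people (name TEXT, param1 INT, param2 INT)")
with open("data.tsv", newline="") as f:
    cur.executemany("INSERT INTO people VALUES (?, ?, ?)", csv.reader(f, delimiter="\t"))
con.commit()
print(cur.execute("SELECT * FROM people").fetchall())
# -> [('Bob', 30, 1000), ('Wendy', 20, 900)]
```

The INT columns coerce the text fields "30" and "1000" to integers on insert, just as the .import approach does.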
Efficiently import csv with 2,000,000 rows in an empty SQLite table
I'm going to stick with Python, since your script does extra work beyond just reading the CSV directly into the database, like setting the creation time and populating multiple fields with "not_available".
The short answer is to batch your writes and use sqlite3.Cursor.executemany, because the cost of a transaction is quite expensive when you're making one for every write.
That's been answered before - but I'll show how much faster batched writes are compared to transaction-per-row; I got speedups of 300x+ by batching.
Test Setup
All tests were run on a MacBook Pro (M1 Max / 32 GB RAM / 1 TB SSD).
For these tests, I generate two-column CSV files of 1 million lines using this command:
hexdump -v -e '4/1 "%02x""\n"' /dev/urandom | \
paste -d "," - - | \
head -n 1000000 > test.csv
And the files look like this:
> head -n 10 test.csv
d224d9ab,76e8eb72
358a23af,6fa66351
a24c330b,d0a15fea
e587bd05,c812d51b
71e23afc,5d4654a7
fac87ef8,0576ee3f
baf53d50,a82bfd43
193cbf62,553cd6bb
ba1f82b5,2200ad0d
2387b69f,6100ae3b
Perf Test - executemany with different batch sizes
#!/usr/bin/env python3
# execute_many.py
import csv
import datetime
import itertools
import sqlite3
import time

# This script creates and populates four databases from scratch
# using 4 different CSV files

# modified from grouper() recipe at:
# https://docs.python.org/3/library/itertools.html#itertools-recipes
def file_chunker(fp, batch_size):
    args = [iter(fp)] * batch_size
    return itertools.zip_longest(*args, fillvalue=None)

if __name__ == "__main__":
    # here I created test files called test-1000.csv,
    # test-10000.csv, and so on
    batch_sizes = [1000, 10000, 100000, 1000000]
    for batch_size in batch_sizes:
        db = sqlite3.connect(f"test-{batch_size}.db")
        cur = db.cursor()
        cur.execute("""CREATE TABLE hostname_table (
            name TEXT,
            hostname TEXT,
            scheme TEXT,
            description TEXT,
            created_date INTEGER
        )""")
        db.commit()
        with open(f"test-{batch_size}.csv", "r") as f:
            reader = csv.reader(f)
            chunker = file_chunker(reader, batch_size)
            start = time.perf_counter()
            for batch in chunker:
                data = [
                    (row[0], row[1], "not available", "not available", datetime.datetime.now())
                    for row in batch if row is not None
                ]
                try:
                    cur.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
                    db.commit()
                except Exception:
                    db.rollback()
            end = time.perf_counter()
            print(f"Elapsed time (batch size: {batch_size}): {end - start}")
        cur.close()
Running this, I get:
❯ python execute_many.py
Elapsed time (batch size: 1000): 1.9760890419711359
Elapsed time (batch size: 10000): 1.7937561669969
Elapsed time (batch size: 100000): 2.017556666978635
Elapsed time (batch size: 1000000): 2.7237499579787254
Reading these files in chunks of 1K/10K/100K/1M lines at a time, you can see that the larger batches don't really buy you any additional time savings, but they're all very fast to run: basically 2 seconds to populate a million-row database.
Perf Test - execute with transaction-per-row
Here's a reduced version of your code that deals only with the sqlite3 portion; it creates a new database and populates it with my 1M-line file, using a transaction per insert like your current code does.
#!/usr/bin/env python3
# process_data.py
import csv
import datetime
import sqlite3
import time

if __name__ == "__main__":
    db = sqlite3.connect("test.db")
    cur = db.cursor()
    cur.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db.commit()
    with open("test.csv", "r") as f:
        reader = csv.reader(f)
        start = time.perf_counter()
        for row in reader:
            db_row = (row[0], row[1], "not available", "not available", datetime.datetime.now())
            try:
                cur.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", db_row)
                db.commit()
            except Exception:
                db.rollback()
        end = time.perf_counter()
        print(f"Elapsed time: {end - start}")
    cur.close()
Running this, I get as an elapsed time...
> time python process_data.py
Elapsed time: 390.3731578750303
python process_data.py 10.82s user 173.25s system 47% cpu 6:30.41 total
So 6.5 minutes on my test machine - which shows you how much slower doing a transaction per row is.
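Most of that gap is the per-row commit, not the per-row execute. Even keeping execute per row, wrapping all the inserts in a single transaction recovers most of the batched speed; a rough sketch with synthetic rows and an in-memory database:

```python
import sqlite3
import time

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (a, b)")

rows = [(i, i * 2) for i in range(100_000)]
start = time.perf_counter()
for row in rows:
    cur.execute("INSERT INTO t VALUES (?, ?)", row)
con.commit()  # one commit for all rows instead of one per row
print(f"{time.perf_counter() - start:.3f}s for {len(rows)} rows")
```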
Perf Test - Insertion Time
For fun, let's look at how much time it takes to insert a single row vs. a batch insert (of 1000).
Here's a script that creates two databases; one is written to using execute while the other is written to using executemany with a batch of 1000.
#!/usr/bin/env python3
import datetime
import sqlite3
import random
import string
import timeit

def insert_single(db, cursor, data):
    cursor.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def insert_many(db, cursor, data):
    cursor.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def row_generator(count=1):
    return [
        (''.join(random.choices(string.ascii_letters, k=10)),
         ''.join(random.choices(string.ascii_letters, k=10)),
         'not_available',
         'not_available',
         datetime.datetime.now())
        for _ in range(count)
    ]

if __name__ == "__main__":
    db_single = sqlite3.connect("timeit_test_single.db")
    db_multi = sqlite3.connect("timeit_test_multi.db")

    cur_single = db_single.cursor()
    cur_single.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_single.commit()

    cur_multi = db_multi.cursor()
    cur_multi.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_multi.commit()

    single_data = row_generator()[0]
    multi_data = row_generator(1000)

    print(timeit.timeit(lambda: insert_single(db_single, cur_single, single_data), number=10000))
    print(timeit.timeit(lambda: insert_many(db_multi, cur_multi, multi_data), number=10000))
I ran the timeit test in ipython, and here's what I get...
In [16]: %timeit insert_single(db_single, cur_single, single_data)
271 µs ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [17]: %timeit insert_many(db_multi, cur_multi, multi_data)
1.44 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
So you can see that a single execute statement with a commit takes roughly 200-300 µs (0.2-0.3 ms), whereas writing a batch of 1000 with executemany takes roughly 5 times longer at 1.4+ ms. Only 5x longer to write 1000x more rows!