Simple SQLite table/import question

.import does not support reshaping the input (except for setting the separator). You need to import the CSV file into a temporary table and then insert that into the real table. Here is an example session:

$ cat a.csv 
1,2
3,4
5,6
$ sqlite3 a.db
SQLite version 3.6.23.1
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table foo(id integer primary key,x,y);
sqlite> create temp table footmp(x,y);
sqlite> .separator ,
sqlite> .import a.csv footmp
sqlite> select * from footmp;
1,2
3,4
5,6
sqlite> insert into foo(x,y) select * from footmp;
sqlite> select * from foo;
1,1,2
2,3,4
3,5,6
sqlite> drop table footmp;

You can see that the id column is counted up automatically. This is because a column with type INTEGER PRIMARY KEY is treated as an alias for the internal ROWID, which is always a unique, ascending integer.
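
The same behaviour is easy to check from Python; here is a minimal sketch (in-memory database, made-up rows) showing that the id column and the rowid are one and the same:

import sqlite3

# Sketch: INTEGER PRIMARY KEY aliases the rowid, so omitting it on insert
# fills in an ascending integer automatically.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE foo (id INTEGER PRIMARY KEY, x, y)")
con.executemany("INSERT INTO foo (x, y) VALUES (?, ?)", [(1, 2), (3, 4), (5, 6)])
for rowid, id_, x, y in con.execute("SELECT rowid, id, x, y FROM foo"):
    print(rowid, id_, x, y)   # rowid and id are identical: 1, 2, 3, ...
con.close()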

Importing a CSV file into a sqlite3 database table using Python

import csv, sqlite3

con = sqlite3.connect(":memory:")  # change to a file path like "your_filename.db" for an on-disk database
cur = con.cursor()
cur.execute("CREATE TABLE t (col1, col2);")  # use your column names here

with open('data.csv', 'r') as fin:
    # csv.DictReader uses the first line in the file for the column headings by default
    dr = csv.DictReader(fin)  # comma is the default delimiter
    to_db = [(i['col1'], i['col2']) for i in dr]

cur.executemany("INSERT INTO t (col1, col2) VALUES (?, ?);", to_db)
con.commit()
con.close()
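
If the CSV has no header row (like a.csv in the first answer), plain csv.reader does the job; a minimal sketch, assuming the same foo table as above:

import csv, sqlite3

con = sqlite3.connect("a.db")
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS foo (id INTEGER PRIMARY KEY, x, y)")

with open('a.csv', 'r') as fin:
    # no header line, so columns are addressed by position
    to_db = [(row[0], row[1]) for row in csv.reader(fin)]

cur.executemany("INSERT INTO foo (x, y) VALUES (?, ?);", to_db)
con.commit()
con.close()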

SQLite 3 CSV Import to Table

SQLite3's column types basically boil down to:

  • TEXT
  • NUMERIC (REAL, FLOAT)
  • INTEGER (the various lengths of integer; but INT will normally do)
  • BLOB (binary objects)

Generally in a CSV file you will encounter strings (TEXT), decimal numbers (FLOAT/REAL), and integers (INT). If performance isn't critical, those are pretty much the only three column types you need. (In SQLite the declared type does not change how values are stored, so a declaration like CHAR(80) takes no less space on disk than TEXT.)
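
As a concrete example, a table for a CSV holding a name, a measurement, and a count could be declared like this from Python (the table and column names are hypothetical):

import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical schema: one TEXT, one FLOAT (REAL) and one INT column, which
# covers the usual contents of a CSV file.
con.execute("""CREATE TABLE readings (
    sensor_name TEXT,
    value       FLOAT,
    sample_id   INT
)""")
con.close()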

As far as putting data into the columns is concerned, SQLite3 uses type coercion to convert the input data type to the column type wherever the conversion makes sense. So all you have to do is specify the correct column type, and SQLite will take care of storing it in the correct way.

For example the number -1230.00, the string "-1230.00", and the string "-1.23e3" will all coerce to the number -1230 when stored in a FLOAT column.

Note that if SQLite3 can't apply a meaningful type conversion, it will just store the original data without attempting to convert it at all. SQLite3 is quite happy to insert "Hello World!" into a FLOAT column. This is usually a Bad Thing.
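
You can watch the coercion happen with typeof(); a minimal sketch against a throwaway in-memory table:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (val FLOAT)")   # FLOAT gives the column REAL affinity
con.executemany("INSERT INTO demo VALUES (?)",
                [(-1230.00,), ("-1230.00",), ("-1.23e3",), ("Hello World!",)])
for val, storage in con.execute("SELECT val, typeof(val) FROM demo"):
    print(val, storage)
# The three numeric-looking values come back as -1230.0 with storage class 'real';
# "Hello World!" is stored untouched with storage class 'text'.
con.close()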

See the SQLite3 documentation on column types and conversion for gems such as:

Type Affinity

In order to maximize compatibility between SQLite and other database
engines, SQLite supports the concept of "type affinity" on columns.
The type affinity of a column is the recommended type for data stored
in that column. The important idea here is that the type is
recommended, not required. Any column can still store any type of
data. It is just that some columns, given the choice, will prefer to
use one storage class over another. The preferred storage class for a
column is called its "affinity".

sqlite3 import with quotes

The web page you reference is old (note the cvstrac portion of the URL, which is the giveaway; sqlite uses fossil now, not cvs). The newer version of that web page is here.

Since SQLite is public domain software, one solution to your problem is to fix sqlite's shell.c to handle your file format correctly. The relevant code is around line 1861, which does

if( c=='"' ) inQuote = !inQuote;

If you don't want quote delimiting, just comment out this line. Its purpose is to let you embed delimiters in a column by quoting the column.

Another approach is to use a Database Manager that supports SQLite; there are many of them, and most claim to support file import/export.
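
A third option, if a little scripting is acceptable, is to skip the shell's .import entirely and let Python's csv module deal with the quoting; a rough sketch, with hypothetical file, table, and column names:

import csv, sqlite3

con = sqlite3.connect("mydata.db")                           # hypothetical filename
con.execute("CREATE TABLE IF NOT EXISTS mytable (a, b, c)")  # hypothetical schema

with open("mydata.csv", newline="") as f:
    # csv.reader understands quoted fields, so embedded commas and quotes
    # are handled without touching shell.c
    con.executemany("INSERT INTO mytable VALUES (?, ?, ?)", csv.reader(f))

con.commit()
con.close()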

importing CSV file into sqlite table

I would guess that there really ARE double-quotes around the data. If you use Windows, a CSV will automatically open in Excel and it looks like there are no quotes because Excel interprets the file properly. However, I bet if you open the file in Notepad there will be quotes around the strings.

Then, as pointed out in the discussion above, truncate your sqlite table and re-import the data, indicating that the fields are enclosed by double quotes.

How to import a tsv file with SQLite3

You should create the table, set a separator, and import the data, as described in the sqlite documentation.

Example for TSV:

data.tsv (tab as a separator):

Bob 30  1000
Wendy 20 900

1) Create a table and set TAB as a separator:

sqlite> create table people (name text, param1 int, param2 int);
sqlite> .separator "\t"

2) Import data:

sqlite> .import data.tsv people

And the result is:

sqlite> select * from people;
Bob 30 1000
Wendy 20 900
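
The same import can be done from Python by telling csv.reader to use a tab delimiter; a minimal sketch, assuming the people table and data.tsv from above (the database filename is made up):

import csv, sqlite3

con = sqlite3.connect("people.db")   # hypothetical filename
con.execute("CREATE TABLE IF NOT EXISTS people (name TEXT, param1 INT, param2 INT)")

with open("data.tsv", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    con.executemany("INSERT INTO people VALUES (?, ?, ?)", reader)

con.commit()
con.close()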

Efficiently import csv with 2,000,000 rows in an empty SQLite table

I'm going to stick with Python, since your script does some extra work rather than just reading the CSV straight into the database, like setting the creation time and populating several fields with "not_available".

The short answer is to batch your writes and use sqlite3.Cursor.executemany, because committing a separate transaction for every single row is expensive.

That's been answered before, but I'll show you how much faster batched writes are compared to a transaction per row; I saw speedups of 300x+ by batching.



Test Setup

All tests were run on a MacBook Pro (M1 Max, 32 GB RAM, 1 TB SSD).

For these tests, I generate two-column CSV files of 1 million lines using this command:

hexdump -v -e '4/1 "%02x""\n"' /dev/urandom | \
paste -d "," - - | \
head -n 1000000 > test.csv

And the files look like this:

> head -n 10 test.csv
d224d9ab,76e8eb72
358a23af,6fa66351
a24c330b,d0a15fea
e587bd05,c812d51b
71e23afc,5d4654a7
fac87ef8,0576ee3f
baf53d50,a82bfd43
193cbf62,553cd6bb
ba1f82b5,2200ad0d
2387b69f,6100ae3b
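
If hexdump and paste aren't available (on Windows, say), an equivalent test file can be generated from Python; a small sketch:

import os

# Sketch: write 1,000,000 lines of two random 8-hex-character columns,
# matching the hexdump/paste pipeline above.
with open("test.csv", "w") as f:
    for _ in range(1_000_000):
        f.write(f"{os.urandom(4).hex()},{os.urandom(4).hex()}\n")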


Perf Test - executemany with different batch sizes

#!/usr/bin/env python3

# execute_many.py

import csv
import datetime
import itertools
import sqlite3
import time

# This script creates and populates four databases from scratch
# using 4 different CSV files

# modified from grouper() recipe at:
# https://docs.python.org/3/library/itertools.html#itertools-recipes
def file_chunker(fp, batch_size):
    args = [iter(fp)] * batch_size
    return itertools.zip_longest(*args, fillvalue=None)

if __name__ == "__main__":
    # here I created test files called test-1000.csv,
    # test-10000.csv, and so on
    batch_sizes = [1000, 10000, 100000, 1000000]

    for batch_size in batch_sizes:
        db = sqlite3.connect(f"test-{batch_size}.db")
        cur = db.cursor()

        cur.execute("""CREATE TABLE hostname_table (
            name TEXT,
            hostname TEXT,
            scheme TEXT,
            description TEXT,
            created_date INTEGER
        )""")
        db.commit()
        with open(f"test-{batch_size}.csv", "r") as f:
            reader = csv.reader(f)
            chunker = file_chunker(reader, batch_size)
            start = time.perf_counter()
            for batch in chunker:
                data = [
                    (row[0], row[1], "not available", "not available", datetime.datetime.now())
                    for row in batch if row is not None
                ]
                try:
                    cur.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
                    db.commit()
                except Exception:
                    db.rollback()
            end = time.perf_counter()
            print(f"Elapsed time (batch size: {batch_size}): {end - start}")

        cur.close()

Running this, I get:

❯ python execute_many.py
Elapsed time (batch size: 1000): 1.9760890419711359
Elapsed time (batch size: 10000): 1.7937561669969
Elapsed time (batch size: 100000): 2.017556666978635
Elapsed time (batch size: 1000000): 2.7237499579787254

By reading these files in chunks of 1K/10K/100K/1M lines at a time, you can see that the larger batches don't really buy you any savings in time, but they're all very fast to run: basically 2 seconds to populate a million-row database.



Perf Test - execute with transaction-per-row

Here's a reduced version of your code that only deals with the sqlite3 portion; it creates a new database and populates it with my 1M-line file, using a transaction per insert like your current code does.

#!/usr/bin/env python3

# process_data.py

import csv
import datetime
import sqlite3
import time

if __name__ == "__main__":
    db = sqlite3.connect("test.db")
    cur = db.cursor()

    cur.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db.commit()
    with open("test.csv", "r") as f:
        reader = csv.reader(f)
        start = time.perf_counter()
        for row in reader:
            db_row = (row[0], row[1], "not available", "not available", datetime.datetime.now())
            try:
                cur.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", db_row)
                db.commit()
            except Exception:
                db.rollback()
        end = time.perf_counter()
        print(f"Elapsed time: {end - start}")

    cur.close()

Running this, I get the following elapsed time...

> time python process_data.py
Elapsed time: 390.3731578750303
python process_data.py 10.82s user 173.25s system 47% cpu 6:30.41 total

So that's 6.5 minutes on my test machine, which shows how much slower a transaction per row is.



Perf Test - Insertion Time

For fun, let's look at how much time it takes to insert a single row vs. a batch insert of 1000.

Here's a script that creates two databases; one is written to using execute while the other is written to using executemany with batches of 1000.

#!/usr/bin/env python3

import datetime
import sqlite3
import random
import string
import timeit

def insert_single(db, cursor, data):
    cursor.execute("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def insert_many(db, cursor, data):
    cursor.executemany("INSERT INTO hostname_table VALUES (?, ?, ?, ?, ?)", data)
    db.commit()

def row_generator(count=1):
    return [
        (''.join(random.choices(string.ascii_letters, k=10)),
         ''.join(random.choices(string.ascii_letters, k=10)),
         'not_available',
         'not_available',
         datetime.datetime.now()
        )
        for _ in range(count)
    ]

if __name__ == "__main__":
    db_single = sqlite3.connect("timeit_test_single.db")
    db_multi = sqlite3.connect("timeit_test_multi.db")

    cur_single = db_single.cursor()
    cur_single.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_single.commit()

    cur_multi = db_multi.cursor()
    cur_multi.execute("""CREATE TABLE hostname_table (
        name TEXT,
        hostname TEXT,
        scheme TEXT,
        description TEXT,
        created_date INTEGER
    )""")
    db_multi.commit()

    single_data = row_generator()[0]
    multi_data = row_generator(1000)

    print(timeit.timeit(lambda: insert_single(db_single, cur_single, single_data), number=10000))
    print(timeit.timeit(lambda: insert_many(db_multi, cur_multi, multi_data), number=10000))

I run the timeit test in IPython, and here's what I get...

In [16]: %timeit insert_single(db_single, cur_single, single_data)
271 µs ± 71.7 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

In [17]: %timeit insert_many(db_multi, cur_multi, multi_data)
1.44 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

So you can see that a single execute statement with a commit takes roughly 200-300 µs (0.2-0.3 ms), whereas writing a batch of 1000 with executemany takes roughly 5 times longer at 1.4+ ms. Only 5x longer to write 1000x more rows!
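
If restructuring the loop into batches is awkward, another way to avoid paying for a commit on every row is to keep the per-row execute calls but wrap them all in a single transaction; a quick sketch (not from the original question, and your timings will vary):

import sqlite3, random, string

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (a TEXT, b TEXT)")

rows = [(''.join(random.choices(string.ascii_letters, k=10)),
         ''.join(random.choices(string.ascii_letters, k=10)))
        for _ in range(100_000)]

# One enclosing transaction: per-row execute calls, but only one commit.
with con:   # commits on success, rolls back on error
    for row in rows:
        con.execute("INSERT INTO t VALUES (?, ?)", row)
con.close()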


