Edit Huge SQL Data File

How to open a huge .sql file

A SQL editor can only open a file of around 500 MB if the machine has very good specs, so something like this is bound to go wrong. If you want to move the data from one database to another, use the Import/Export wizard, an SSIS package, or a command-line utility; scripting out the table and its INSERT statements is not a good approach.
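
One command-line route is bcp. A minimal sketch, where the server, database, and table names are placeholders:

REM export the table from the source server in native format, using a trusted connection
bcp SourceDb.dbo.MyTable out MyTable.dat -n -T -S SOURCESERVER

REM load the file into the destination server
bcp TargetDb.dbo.MyTable in MyTable.dat -n -T -S TARGETSERVER

This moves the data directly between servers without ever opening a giant script in an editor.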

How to open and work with a very large .SQL file that was generated in a dump?

2 things that might be helpful:

  1. Use pv to see how much of the .sql file has already been read. This gives you a progress bar which at least tells you it's not stuck (see the sketch after this list).
  2. Log into MySQL and use SHOW PROCESSLIST to see what MySQL is currently executing. If it's still running, just let it run to completion.
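
A minimal sketch of the pv approach, assuming the dump is called dump.sql and the target database is mydatabase (both placeholder names):

$ pv dump.sql | mysql -u root -p mydatabase

and, from a second session:

mysql> SHOW PROCESSLIST;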

If binary logging is turned on, it can really help to turn it off for the duration of the restore (a sketch follows below). Another thing that may or may not be helpful: if you have the choice, use the fastest disks available. You may have that kind of option if you're running on a host like Amazon; you're really going to feel the pain if you're doing this on, for example, a standard EC2 host.
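
A minimal sketch of disabling the binlog for the restore session only (the dump path is a placeholder, and changing sql_log_bin typically needs the SUPER privilege):

mysql> SET sql_log_bin = 0;
mysql> SOURCE /path/to/dump.sql;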

Edit very large sql dump/text file (on linux)

Rather than removing the first few lines, try editing them to be whitespace.

The hexedit program can do this: it reads files in chunks, so to it, opening a 10GB file is no different from opening a 100KB file.

$ hexedit largefile.sql.dump
tab (switch to ASCII side)
space (repeat as needed until your header is gone)
F2 (save)/Ctrl-X (save and exit)/Ctrl-C (exit without saving)
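
If you'd rather not do this interactively, here is a rough sketch of the same idea with standard tools; it assumes the header you want to blank out is the first 3 lines (adjust to your dump):

$ HDR_BYTES=$(head -n 3 largefile.sql.dump | wc -c)
$ head -c "$HDR_BYTES" /dev/zero | tr '\0' ' ' | dd of=largefile.sql.dump conv=notrunc

conv=notrunc overwrites those bytes in place with spaces without rewriting or truncating the rest of the file; the header's newlines become spaces too, which is harmless whitespace in a .sql file.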

How do you import a large MS SQL .sql file?

From the command prompt, start up sqlcmd:

sqlcmd -S <server> -i C:\<your file here>.sql 

Just replace <server> with the location of your SQL box and <your file here> with the name of your script. Don't forget that if you're using a named SQL instance, the syntax is:

sqlcmd -S <server>\<instance>
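
For a very large script it can also help to send the output to a log file instead of the console; a sketch with placeholder server, database, and path names:

sqlcmd -S MYSERVER\SQLEXPRESS -d MyDatabase -i C:\scripts\bigdump.sql -o C:\scripts\bigdump.log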

Here is the list of all arguments you can pass sqlcmd:

Sqlcmd            [-U login id]          [-P password]
[-S server] [-H hostname] [-E trusted connection]
[-d use database name] [-l login timeout] [-t query timeout]
[-h headers] [-s colseparator] [-w screen width]
[-a packetsize] [-e echo input] [-I Enable Quoted Identifiers]
[-c cmdend] [-L[c] list servers[clean output]]
[-q "cmdline query"] [-Q "cmdline query" and exit]
[-m errorlevel] [-V severitylevel] [-W remove trailing spaces]
[-u unicode output] [-r[0|1] msgs to stderr]
[-i inputfile] [-o outputfile] [-z new password]
[-f <codepage> | i:<codepage>[,o:<codepage>]] [-Z new password and exit]
[-k[1|2] remove[replace] control characters]
[-y variable length type display width]
[-Y fixed length type display width]
[-p[1] print statistics[colon format]]
[-R use client regional setting]
[-b On error batch abort]
[-v var = "value"...] [-A dedicated admin connection]
[-X[1] disable commands, startup script, environment variables [and exit]]
[-x disable variable substitution]
[-? show syntax summary]

Manipulation of Large Files in R

Like @Dominic Comtois, I would also recommend using SQL.

R can handle quite big data (there is a nice benchmark of 2 billion rows where it beats Python), but because R works mostly in memory you need a good machine to make it work. Still, your case doesn't require loading more than a 4.5GB file at once, so it should be well doable on a personal computer; see the second approach below for a fast non-database solution.

You can use R to load the data into an SQL database and later query it from that database.
If you don't know SQL, you may want to use some simple database. The simplest way from R is to use RSQLite (unfortunately, since v1.1 it is not so lite any more). You don't need to install or manage any external dependency, because the RSQLite package ships with the database engine embedded.

library(RSQLite)
library(data.table)
conn <- dbConnect(dbDriver("SQLite"), dbname = "mydbfile.db")
monthfiles <- c("month1", "month2") # ...
# write data
for (monthfile in monthfiles) {
  dbWriteTable(conn, "mytablename", fread(monthfile), append = TRUE)
  cat("data for", monthfile, "loaded to db\n")
}
# query data
df <- dbGetQuery(conn, "select * from mytablename where customerid = 1")
# when working with bigger sets of data, convert the result to data.table
setDT(df)
dbDisconnect(conn)

That's all. You get to use SQL without much of the overhead usually involved in working with databases.

If you prefer to go with the approach from your post, I think you can dramatically speed it up by doing write.csv by groups while aggregating in data.table.

library(data.table)
monthfiles <- c("month1","month2") # ...
# write data
for (monthfile in monthfiles) {
  # note: write.csv ignores append=TRUE (with a warning), so with several month
  # files per customer prefer fwrite(.SD, ..., append=TRUE) from the update below
  fread(monthfile)[, write.csv(.SD, file = paste0(CustomerID, ".csv"), append = TRUE), by = CustomerID]
  cat("data for", monthfile, "written to csv\n")
}

So you utilize the fast unique/grouping from data.table and perform the subsetting per group, which is also ultra fast. Below is a working example of the approach.

library(data.table)
data.table(a = 1:4, b = 5:6)[, write.csv(.SD, file = paste0(b, ".csv")), by = b]

Update 2016-12-05:
Starting with data.table 1.9.8+, you can replace write.csv with fwrite; see the example in this answer.
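
A minimal sketch of that replacement, mirroring the toy example above (fwrite also honours append=TRUE, unlike write.csv):

library(data.table)
data.table(a = 1:4, b = 5:6)[, fwrite(.SD, file = paste0(b, ".csv")), by = b]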


