Fast concatenate multiple files on Linux

Even if there were such a tool, this could only work if all the files except the
last were guaranteed to have a size that is a multiple of the filesystem's block
size.

If you control how the data is written into the temporary files, and you know
how large each one will be, you can instead do the following:

  1. Before starting the multiprocessing, create the final output file, and grow
    it to the final size by fseek()ing to the end and writing a byte there (or
    calling ftruncate()); this creates a sparse file.

  2. Start multiprocessing, handing each process the FD and the offset into its
    particular slice of the file.

This way, the processes will collaboratively fill the single output file,
removing the need to cat them together later.
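
A minimal shell sketch of that idea, assuming three workers whose slices are
known to be exactly 100 MiB each. The worker names (produce_part1 and so on),
the output file name, and all sizes are placeholders, and dd's seek= stands in
for the fseek()-to-offset step described above:

# Pre-allocate the final size up front; the new file starts out sparse.
truncate -s 300M combined.out

# Each worker writes its slice at its own offset. seek= is counted in
# bs-sized blocks, so seek=100 with bs=1M means a 100 MiB offset, and
# conv=notrunc keeps dd from truncating the shared output file.
produce_part1 | dd of=combined.out bs=1M seek=0   conv=notrunc &
produce_part2 | dd of=combined.out bs=1M seek=100 conv=notrunc &
produce_part3 | dd of=combined.out bs=1M seek=200 conv=notrunc &
wait    # all three slices are now in place, no cat step needed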

EDIT

If you can't predict the size of the individual files, but the consumer of the
final file can work with sequential (as opposed to random-access) input, you can
feed cat tmpfile1 ... tmpfileN to the consumer, either on stdin

cat tmpfile1 ... tmpfileN | consumer

or via named pipes (using bash's Process Substitution):

consumer <(cat tmpfile1 ... tmpfileN)

Faster way to concatenate files of the same name in different directories

In case this is useful to anyone else searching one day, I learned a much faster way to do this using xargs and ls:

while read -r name
do
    ls */*/*/*/"$name".txt | xargs cat > "$name".combine
done < List_to_combine.txt
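
If any of the directory or file names can contain spaces or glob characters, a
find-based variant of the same loop is safer. The -mindepth/-maxdepth numbers
simply mirror the four directory levels of the */*/*/*/ glob above, and the
-print0 / xargs -0 pairing is a GNU/BSD extension rather than plain POSIX:

while read -r name
do
    find . -mindepth 5 -maxdepth 5 -name "$name.txt" -print0 | xargs -0 cat > "$name.combine"
done < List_to_combine.txt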

Large Number of file concatenation

cat itself is not slow. But every time you expand a shell wildcard (? and *), the shell has to read and search through all the file names in that directory, which is very slow.

Also, the kernel takes time to find the file when you open it by name, which you cannot avoid. How much depends on the file system in use (unspecified in the question): some file systems handle huge directories more intelligently than others.

To sort this out you might benefit from taking a file listing once:

ls > /tmp/filelist

...and then using grep or similar for selecting the files out of that list:

cat `grep foo /tmp/filelist` > /out/bar
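
If the selection is very long, the backtick expansion above can run into the
kernel's argument-length limit; piping the list through xargs sidesteps that.
The -d '\n' option is GNU xargs, and foo and /out/bar are the same placeholders
as above:

grep foo /tmp/filelist | xargs -d '\n' cat > /out/bar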

After you have sorted this mess out, make sure to structure your storage/application in such a way that this does not ever happen again. :) Also make sure to rmdir the existing directory after you have gotten your files out of it; on many filesystems a directory that has grown this large never shrinks again, so reusing it for any purpose will not be efficient even if there is just a single file left in it.

How to concatenate a huge number of files

If your directory structure is shallow (there are no subdirectories) then you can simply do:

find . -type f -exec cat {} \; > newFile

If you have subdirectories, you can limit the find to the top level, or you might consider moving some of the files into subdirectories so you don't have this problem!

This is not particularly efficient, and some versions of find allow you to do:

find . -type f -exec cat {} \+ > newFile

for greater efficiency. (Note the backslash before the + is not necessary, but I find it nice for symmetry with the previous example.)
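
For the "limit the find to the top level" option mentioned above, GNU and BSD
find accept -maxdepth; this is just the previous command with the depth limit
added:

find . -maxdepth 1 -type f -exec cat {} + > newFile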

Faster way to merge multiple files with unequal number of rows by column in bash

You could put it all in one awk script:

awk -F'|' '{if (NR==FNR) a[$1]=$2; else print $1 "|" a[$1] " " $2}' a.txt b.txt 
001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long
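
Spread out with comments, the same one-liner looks like this; a.txt is assumed
to hold ID|first-name pairs and b.txt ID|surname pairs, as the sample output
above suggests:

awk -F'|' '
NR == FNR {                    # true only while reading the first file (a.txt)
    a[$1] = $2                 # remember the name, keyed by the ID in column 1
    next
}
{                              # every line of the second file (b.txt)
    print $1 "|" a[$1] " " $2  # ID | name from a.txt, then the field from b.txt
}
' a.txt b.txt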

