Fast concatenate multiple files on Linux
Even if there was such a tool, this could only work if the files except the last
were guaranteed to have a size that is a multiple of the filesystem's block
size.
If you control how the data is written into the temporary files, and you know
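how large each one will be, you can instead do the following:
1. Before starting the multiprocessing, create the final output file and grow it to its final size, for example by fseek()ing to the last byte and writing a single byte there, or by calling ftruncate(). On filesystems that support it, this creates a sparse file.
2. Start multiprocessing, handing each process the FD and the offset of its particular slice of the file. (An inherited FD shares a single file position between processes, so each process should either write with pwrite() or open its own descriptor.)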
This way, the processes will collaboratively fill the single output file,
removing the need to cat them together later.
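As a rough illustration of the preallocate-then-fill idea, here is a minimal shell sketch. It assumes GNU coreutils (truncate, and dd with status=none). The payload strings, the file name final.bin and the helper write_slice are made up for the example, and dd with conv=notrunc only stands in for a real worker that seeks to its offset before writing.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
out="$workdir/final.bin"

# Hypothetical worker payloads; in practice only the sizes must be known up front.
p1=AAAAA
p2=BBB
p3=CCCC
total=$(( ${#p1} + ${#p2} + ${#p3} ))

# Create the output file at its final size (typically as a sparse file).
truncate -s "$total" "$out"

# write_slice OFFSET DATA: write DATA at byte OFFSET without truncating the file.
# Each dd opens the file itself, so no file position is shared between workers.
write_slice() {
    printf '%s' "$2" | dd of="$out" bs=1 seek="$1" conv=notrunc status=none
}

# Each background job fills only its own slice, so the writes can run in parallel.
write_slice 0 "$p1" &
write_slice "${#p1}" "$p2" &
write_slice $(( ${#p1} + ${#p2} )) "$p3" &
wait

cat "$out"; echo
```

The slices land in offset order regardless of which job finishes first, which is the whole point: no concatenation step is needed afterwards.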
EDIT
If you can't predict the size of the individual files, but the consumer of the
final file can work with sequential (as opposed to random-access) input, you can
feed the output of cat tmpfile1 ... tmpfileN to the consumer, either on stdin:
cat tmpfile1 ... tmpfileN | consumer
or via named pipes (using bash's Process Substitution):
consumer <(cat tmpfile1 ... tmpfileN)
Faster way to concatenate files of same name in different directories
In case this is useful to anyone else searching one day, I learned a much faster way to do this using xargs and ls:
while IFS= read -r name
do
    # Quote "$name" so the shell does not word-split or glob-expand list entries.
    ls */*/*/*/"$name".txt | xargs cat > "$name.combine"
done < List_to_combine.txt
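One caveat with piping ls into xargs is that paths containing spaces get split. A possible variant, sketched below, swaps ls for the shell's builtin printf and uses NUL-separated names with xargs -0 (available in GNU and BSD xargs). The directory layout and the sample name are invented for the demonstration; only List_to_combine.txt and the */*/*/* depth come from the loop above.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)

# Hypothetical layout: two matching files in directories whose names contain a space.
mkdir -p "$workdir/a/b/c/run 1" "$workdir/a/b/c/run 2"
printf 'first\n'  > "$workdir/a/b/c/run 1/sample.txt"
printf 'second\n' > "$workdir/a/b/c/run 2/sample.txt"
printf 'sample\n' > "$workdir/List_to_combine.txt"

(
    cd "$workdir"
    while IFS= read -r name
    do
        # printf is a shell builtin, so the expanded glob is not subject to the
        # kernel's argument-length limit; xargs -0 splits only on NUL bytes.
        printf '%s\0' */*/*/*/"$name".txt | xargs -0 cat > "$name.combine"
    done < List_to_combine.txt
)

cat "$workdir/sample.combine"
```

If a name in the list matches no file, the unexpanded pattern is passed to cat, which then reports an error, so you may want to check for that case.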
Large Number of file concatenation
cat itself is not slow. But every time you expand a shell wildcard (? and *), the shell reads and searches through all the file names in that directory, which is very slow for a huge directory.
Also the kernel will take time finding the file when you open it by name, which you cannot avoid. This depends on the file system in use (unspecified in the question): some file systems are more intelligent with huge directories than others.
To sort this out you might benefit from taking a file listing once:
ls > /tmp/filelist
...and then using grep or similar to select the files from that list:
cat `grep foo /tmp/filelist` > /out/bar
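If the selection is itself very large, the backtick form can fail with "Argument list too long". A possible workaround, sketched here under the assumption that no file name contains a newline, is to hand the list to xargs instead. The directory big, the file names and the output name bar are placeholders for the demonstration.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
mkdir "$workdir/big"
printf 'A\n' > "$workdir/big/foo_1"
printf 'B\n' > "$workdir/big/foo_2"
printf 'C\n' > "$workdir/big/other"

(
    cd "$workdir/big"
    # Read the directory once and keep the listing outside of it.
    ls > "$workdir/filelist"
    # Convert newlines to NULs so xargs -0 passes each name intact, and let
    # xargs split the names across several cat invocations if there are too many.
    grep foo "$workdir/filelist" | tr '\n' '\0' | xargs -0 cat > "$workdir/bar"
)

cat "$workdir/bar"
```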
After you have sorted this mess out, make sure to structure your storage/application in such a way that this does not ever happen again. :) Also make sure to rmdir the existing directory after you have gotten your files out of it. Using it again for any purpose will not be effective even if there is just a single file in it, because on many file systems a directory does not shrink after its entries are removed.
How to concatenate huge number of files
If your directory structure is shallow (there are no subdirectories) then you can simply do:
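find . -type f ! -name newFile -exec cat {} \; > newFile
(The ! -name newFile test keeps find from feeding the output file, which the shell creates in the same directory before find starts, back into cat.)
If you have subdirectories, you can limit the find to the top level, or you might consider putting some of the files in the sub-directories so you don't have this problem!
This is not particularly efficient, because it starts a separate cat process for every file. Some versions of find allow you to do:
find . -type f ! -name newFile -exec cat {} \+ > newFile
for greater efficiency, since + passes many file names to each cat invocation. (Note the backslash before the + is not necessary, but I find it nice for symmetry with the previous example.)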
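For the "limit the find to the top level" case, something like the following might work. It is only a sketch: -maxdepth is an extension found in GNU and BSD find rather than part of POSIX, and the directory data and its files are invented for the example. Writing the result outside the searched directory is another way to keep find from picking up its own output.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)
mkdir -p "$workdir/data/sub"
printf 'top1\n'   > "$workdir/data/one"
printf 'top2\n'   > "$workdir/data/two"
printf 'nested\n' > "$workdir/data/sub/three"

# -maxdepth 1 keeps find out of sub/; the output file lives outside data/,
# so find can never pick it up as an input.
find "$workdir/data" -maxdepth 1 -type f -exec cat {} + > "$workdir/newFile"

# find does not sort its results, so sort here only to make the output stable.
sort "$workdir/newFile"
```

Note that find returns files in directory order, so if the order of concatenation matters you would need to sort the names first.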
Faster way to merge multiple files with unequal number of rows by column in bash
You could put it all in one awk script:
awk -F'|' '{if (NR==FNR) a[$1]=$2; else print $1 "|" a[$1] " " $2}' a.txt b.txt
001|johan chu
001|johan stewart
002|mike lewis
002|mike jordan
003|adam lambert
003|adam johnson
003|adam smith
003|adam long
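To try this end to end, here is a small self-contained sketch. The contents of a.txt and b.txt are not shown in the answer, so the two files below are assumptions reconstructed from the output above (one name per key in a.txt, several rows per key in b.txt). The temporary directory and merged.txt are also just for the demonstration.

```shell
#!/bin/sh
set -e
workdir=$(mktemp -d)

# Assumed input: one name per key.
cat > "$workdir/a.txt" <<'EOF'
001|johan
002|mike
003|adam
EOF

# Assumed input: any number of rows per key.
cat > "$workdir/b.txt" <<'EOF'
001|chu
001|stewart
002|lewis
002|jordan
003|lambert
003|johnson
003|smith
003|long
EOF

# First file (NR==FNR): remember the name for each key.
# Second file: print the key, the remembered name, and the row's own second field.
awk -F'|' '{if (NR==FNR) a[$1]=$2; else print $1 "|" a[$1] " " $2}' \
    "$workdir/a.txt" "$workdir/b.txt" > "$workdir/merged.txt"

cat "$workdir/merged.txt"
```

Because the lookup table is built from the first file only, a.txt is read once and b.txt is streamed, so the row counts of the two files do not need to match.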