C streams: Copy data from one stream to another directly, without using a buffer
Two possible Linux-only solutions are splice() and sendfile(). Both copy data without it ever leaving kernel space, which can be a significant performance optimization.
Note that both have limitations:

sendfile() requires a socket as its output on Linux kernels before 2.6.33; from 2.6.33 onward, any file can be the output. It also requires the input to support mmap() operations, meaning the input can't be stdin or a pipe.

splice() requires at least one of the input and output streams to be a pipe. Also, for kernel versions 2.6.30.10 and older, the file system backing the non-pipe stream must support splicing.
Most efficient way to copy a file in Linux
Unfortunately, you cannot use sendfile() here because the destination is not a socket. (The name sendfile() comes from send() + "file".)
For zero-copy, you can use splice() as suggested by @Dave. (Except it will not be zero-copy; it will be "one copy" from the source file's page cache to the destination file's page cache.) However... (a) splice() is Linux-specific; and (b) you can almost certainly do just as well using portable interfaces, provided you use them correctly.
In short, use open() + read() + write() with a small temporary buffer. I suggest 8K. So your code would look something like this:
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

int in_fd = open("source", O_RDONLY);
assert(in_fd >= 0);
int out_fd = open("dest", O_WRONLY | O_CREAT | O_TRUNC, 0644); /* create dest if missing */
assert(out_fd >= 0);

char buf[8192];
while (1) {
    ssize_t read_result = read(in_fd, &buf[0], sizeof(buf));
    if (!read_result) break;
    assert(read_result > 0);
    ssize_t write_result = write(out_fd, &buf[0], read_result);
    assert(write_result == read_result);
}
With this loop, you will be copying 8K from the in_fd page cache into the CPU L1 cache, then writing it from the L1 cache into the out_fd page cache. Then you will overwrite that part of the L1 cache with the next 8K chunk from the file, and so on. The net result is that the data in buf will never actually be stored in main memory at all (except maybe once at the end); from the system RAM's point of view, this is just as good as using "zero-copy" splice(). Plus it is perfectly portable to any POSIX system.
Note that the small buffer is key here. Typical modern CPUs have 32K or so for the L1 data cache, so if you make the buffer too big, this approach will be slower. Possibly much, much slower. So keep the buffer in the "few kilobytes" range.
Of course, unless your disk subsystem is very, very fast, memory bandwidth is probably not your limiting factor. So I would recommend using posix_fadvise() to let the kernel know what you are up to:
posix_fadvise(in_fd, 0, 0, POSIX_FADV_SEQUENTIAL);
This will give a hint to the Linux kernel that its read-ahead machinery should be very aggressive.
I would also suggest using posix_fallocate() to preallocate the storage for the destination file. This will tell you up front whether you will run out of disk space. And on a modern kernel with a modern file system (like XFS), it will help to reduce fragmentation in the destination file.
The last thing I would recommend is mmap(). It is usually the slowest approach of all, thanks to TLB thrashing. (Very recent kernels with "transparent hugepages" might mitigate this; I have not tried recently. But it certainly used to be very bad. So I would only bother testing mmap() if you have lots of time to benchmark and a very recent kernel.)
[Update]
There is some question in the comments about whether splice() from one file to another is zero-copy. The Linux kernel developers call this "page stealing". Both the man page for splice() and the comments in the kernel source say that the SPLICE_F_MOVE flag should provide this functionality.
Unfortunately, support for SPLICE_F_MOVE was yanked in 2.6.21 (back in 2007) and never replaced. (The comments in the kernel sources never got updated.) If you search the kernel sources, you will find SPLICE_F_MOVE is not actually referenced anywhere. The last message I can find (from 2008) says it is "waiting for a replacement".
The bottom line is that splice() from one file to another calls memcpy() to move the data; it is not zero-copy. This is not much better than you can do in userspace using read()/write() with small buffers, so you might as well stick to the standard, portable interfaces.
If "page stealing" is ever added back into the Linux kernel, then the benefits of splice() would be much greater. (And even today, when the destination is a socket, you get true zero-copy, making splice() more attractive.) But for the purpose of this question, splice() does not buy you very much.
Concatenating multiple csv files into a single csv with the same header
If you don't need the CSV in memory, and are just copying input to output, it'll be a lot cheaper to avoid parsing at all and copy without building anything up in memory:
import shutil
import glob

# Import CSV files from folder
path = r'data/US/market/merged_data'
allFiles = glob.glob(path + "/*.csv")
allFiles.sort()  # glob lacks reliable ordering, so impose your own if output order matters
with open('someoutputfile.csv', 'wb') as outfile:
    for i, fname in enumerate(allFiles):
        with open(fname, 'rb') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
That's it; shutil.copyfileobj handles copying the data efficiently, dramatically reducing the Python-level work to parse and reserialize. Don't omit the allFiles.sort()!†
This assumes all the CSV files have the same format, encoding, line endings, etc.; that the encoding encodes newlines as a single byte equivalent to ASCII \n, with that byte being the last byte in each character (so ASCII and all ASCII-superset encodings work, as do UTF-16-BE and UTF-32-BE, but not UTF-16-LE and UTF-32-LE); and that the header doesn't contain embedded newlines. If that's the case, it's a lot faster than the alternatives.
For the cases where the encoding's version of a newline doesn't look enough like an ASCII newline, or where the input files are in one encoding and the output file should be in a different encoding, you can add the work of encoding and decoding without adding CSV parsing/serializing work (adding a from io import open if on Python 2, to get Python 3-like efficient encoding-aware file objects, and defining known_input_encoding as some string representing the known encoding for input files, e.g. known_input_encoding = 'utf-16-le', and optionally a different encoding for the output file):
# Other imports and setup code prior to first with unchanged from before

# Perform encoding to chosen output encoding, disabling line-ending
# translation to avoid conflicting with CSV dialect, matching raw binary behavior
with open('someoutputfile.csv', 'w', encoding=output_encoding, newline='') as outfile:
    for i, fname in enumerate(allFiles):
        # Decode with known encoding, disabling line-ending translation
        # for same reasons as above
        with open(fname, encoding=known_input_encoding, newline='') as infile:
            if i != 0:
                infile.readline()  # Throw away header on all but first file
            # Block copy rest of file from input to output without parsing,
            # just letting the file object decode from input and encode to output
            shutil.copyfileobj(infile, outfile)
        print(fname + " has been imported.")
This is still much faster than involving the csv module, especially in modern Python (where the io module has undergone greater and greater optimization, to the point where the cost of decoding and reencoding is pretty minor, especially next to the cost of performing I/O in the first place). It's also a good validity check for self-checking encodings (e.g. the UTF family) even if the encoding is not supposed to change; if the data doesn't match the assumed self-checking encoding, it's highly unlikely to decode validly, so you'll get an exception rather than silent misbehavior.
Because some of the duplicates linked here are looking for an even faster solution than copyfileobj, some options:

1. The only succinct, reasonably portable option is to continue using copyfileobj and explicitly pass a non-default length parameter, e.g. shutil.copyfileobj(infile, outfile, 1 << 20) (1 << 20 is 1 MiB, a number which shutil has switched to for plain shutil.copyfile calls on Windows due to superior performance).

2. Still portable, but only working for binary files and not succinct, would be to copy the underlying code copyfile uses on Windows, which uses a reusable bytearray buffer with a larger size than copyfileobj's default (1 MiB, rather than 64 KiB), removing some allocation overhead that copyfileobj can't fully avoid for large buffers. You'd replace shutil.copyfileobj(infile, outfile) with (using 3.8+'s walrus operator, :=, for brevity) the following code adapted from CPython 3.10's implementation of shutil._copyfileobj_readinto (which you could always use directly if you don't mind using non-public APIs):

    buf_length = 1 << 20  # 1 MiB buffer; tweak to preference
    # Using a memoryview gets zero-copy performance when short reads occur
    with memoryview(bytearray(buf_length)) as mv:
        while n := infile.readinto(mv):
            if n < buf_length:
                with mv[:n] as smv:
                    outfile.write(smv)
            else:
                outfile.write(mv)

3. Non-portably, if you can (in any way you feel like) determine the precise length of the header, and you know it will not change by even a byte in any other file, you can write the header directly, then use OS-specific calls similar to what shutil.copyfile uses under the hood to copy the non-header portion of each file, using OS-specific APIs that can do the work with a single system call (regardless of file size) and avoid extra data copies (by pushing all the work to in-kernel, or even within-file-system, operations, removing copies to and from user space), e.g.:

a. On Linux kernel 2.6.33 and higher (and any other OS that allows the sendfile(2) system call to work between open files), you can replace the .readline() and copyfileobj calls with:

    filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
    os.sendfile(outfile.fileno(), infile.fileno(), header_len_bytes, filesize - header_len_bytes)

To make it signal resilient, it may be necessary to check the return value from sendfile, and track the number of bytes sent plus skipped and the number remaining, looping until you've copied them all (these are low-level system calls; they can be interrupted).

b. On any system with Python 3.8+ built against glibc >= 2.27 (or on Linux kernel 4.5+), where the files are all on the same filesystem, you can replace sendfile with copy_file_range:

    filesize = os.fstat(infile.fileno()).st_size  # Get underlying file's size
    os.copy_file_range(infile.fileno(), outfile.fileno(), filesize - header_len_bytes, header_len_bytes)

with similar caveats about checking for copying fewer bytes than expected and retrying.

c. On OSX/macOS, you can use the completely undocumented, and therefore even less portable/stable, API shutil.copyfile uses, posix._fcopyfile, for a similar purpose, with something like (completely untested, and really, don't do this; it's likely to break across even minor Python releases):

    infile.seek(header_len_bytes)  # Skip past header
    posix._fcopyfile(infile.fileno(), outfile.fileno(), posix._COPYFILE_DATA)

which assumes fcopyfile pays attention to the seek position (docs aren't 100% on this) and, as noted, is not only macOS-specific, but uses undocumented CPython internals that could change in any release.
† An aside on sorting the results of glob: That allFiles.sort() call should not be omitted; glob imposes no ordering on its results, and for reproducible results you'll want to impose some ordering (it wouldn't be great if the same files, with the same names and data, produced an output file in a different order simply because, in between runs, a file got moved out of the directory, then back in, changing the native iteration order). Without the sort call, this code (and all other Python+glob module answers) will not reliably read a directory containing a.csv and b.csv in alphabetical (or any other useful) order; it'll vary by OS, file system, and often the entire history of file creation/deletion in the directory in question. This has broken stuff before in the real world; see details at "A Code Glitch May Have Caused Errors In More Than 100 Published Studies".
Reproduce the Unix cat command in Python
The easiest way might be simply to forget about the lines, and just read in the entire file, then write it to the output:
with open('command.fort.13', 'wb') as outFile:
    with open('command.info', 'rb') as com, open('fort.13', 'rb') as fort13:
        outFile.write(com.read())
        outFile.write(fort13.read())
As pointed out in a comment, this can cause high memory usage if either of the inputs is large (as it copies the entire file into memory first). If this might be an issue, the following will work just as well (by copying the input files in chunks):
import shutil

with open('command.fort.13', 'wb') as outFile:
    with open('command.info', 'rb') as com, open('fort.13', 'rb') as fort13:
        shutil.copyfileobj(com, outFile)
        shutil.copyfileobj(fort13, outFile)