Seeking and Reading Large Files in a Linux C++ Application

The 64-bit seek function provided by glibc is actually fseeko64 (there is no plain fseek64). To get 64-bit offsets transparently, define _FILE_OFFSET_BITS=64 before including the system headers; that more or less redefines fseeko to be fseeko64 and makes off_t 64 bits wide. Or do it in the compiler arguments, e.g.
gcc -D_FILE_OFFSET_BITS=64 ....

http://www.suse.de/~aj/linux_lfs.html has a great overview of large file support on Linux (a short usage sketch follows the list):

  • Compile your programs with "gcc -D_FILE_OFFSET_BITS=64". This forces all file access calls to use the 64 bit variants. Several types change also, e.g. off_t becomes off64_t. It's therefore important to always use the correct types and to not use e.g. int instead of off_t. For portability with other platforms you should use getconf LFS_CFLAGS which will return -D_FILE_OFFSET_BITS=64 on Linux platforms but might return something else on e.g. Solaris. For linking, you should use the link flags that are reported via getconf LFS_LDFLAGS. On Linux systems, you do not need special link flags.
  • Define _LARGEFILE_SOURCE and _LARGEFILE64_SOURCE. With these defines you can use the LFS functions like open64 directly.
  • Use the O_LARGEFILE flag with open to operate on large files.
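
As a hedged sketch (the file names are placeholders), the following prints the size of a file larger than 2 GiB, provided it is compiled with the flags reported by getconf LFS_CFLAGS (i.e. -D_FILE_OFFSET_BITS=64 on Linux):

// Build: g++ $(getconf LFS_CFLAGS) lfs_size.cpp -o lfs_size
#include <stdio.h>      // fseeko(), ftello()
#include <sys/types.h>  // off_t (64 bits wide with _FILE_OFFSET_BITS=64)

int main()
{
    FILE *f = fopen("bigfile.bin", "rb");   // placeholder file name
    if (!f) {
        perror("fopen");
        return 1;
    }

    // With _FILE_OFFSET_BITS=64, off_t is 64 bits wide, so the size of a
    // file larger than 2 GiB is representable.
    if (fseeko(f, 0, SEEK_END) != 0) {
        perror("fseeko");
        fclose(f);
        return 1;
    }
    off_t size = ftello(f);
    printf("file size: %lld bytes\n", (long long)size);

    fclose(f);
    return 0;
}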

Can I seek a position beyond 2GB in C using the standard library?

There is no portable way.

On Linux there is the fseeko()/ftello() pair (they need some feature-test defines; check the ftello() man page).

On Windows, I believe you have to use _fseeki64() and _ftelli64().

#ifdef is your friend
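
A minimal sketch of that #ifdef approach; the wrapper names portable_seek/portable_tell and the file name are made up for illustration, only the underlying calls (_fseeki64/_ftelli64 on Windows, fseeko/ftello on POSIX) are real:

#include <stdio.h>

#ifdef _WIN32
// Windows CRT: 64-bit seek/tell live behind _fseeki64() / _ftelli64().
static int portable_seek(FILE *f, long long offset, int whence)
{
    return _fseeki64(f, offset, whence);
}
static long long portable_tell(FILE *f)
{
    return _ftelli64(f);
}
#else
// POSIX: fseeko()/ftello(); compile with -D_FILE_OFFSET_BITS=64 so off_t is 64 bits wide.
static int portable_seek(FILE *f, long long offset, int whence)
{
    return fseeko(f, (off_t)offset, whence);
}
static long long portable_tell(FILE *f)
{
    return (long long)ftello(f);
}
#endif

int main()
{
    FILE *f = fopen("bigfile.bin", "rb");   // placeholder file name
    if (!f) {
        perror("fopen");
        return 1;
    }
    // Seek past the 2 GB barrier, then report where we ended up.
    if (portable_seek(f, 3LL * 1024 * 1024 * 1024, SEEK_SET) == 0)
        printf("now at %lld\n", portable_tell(f));
    fclose(f);
    return 0;
}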

Parsing Large File in C

I think the most likely cause here is (ironically enough) a stack overflow. Your numbersToSort array is allocated on the stack, and the stack has a fixed size (varies by compiler and operating system, but 1 MB is a typical number). You should dynamically allocate numbersToSort on the heap (which has much more available space) using malloc():

uint32_t *numbersToSort = malloc(sizeof(uint32_t) * numNumbers);

Don't forget to deallocate it later:

free(numbersToSort);

I would also point out that your first-pass loop, which is intended to count the number of lines, will fail if there are any blank lines. This is because on a blank line, the first character is '\n', and fgetc() will consume it; the next call to fgets() will then read the following line, and the blank one will be missing from your count.
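
The original loops aren't reproduced here, so the following is only a sketch under assumptions (the file name and the one-number-per-line format are guesses): counting '\n' characters with fgetc() makes the first pass immune to blank lines, and the array lands on the heap as suggested above.

#include <cstdio>
#include <cstdint>
#include <cstdlib>

int main()
{
    std::FILE *fp = std::fopen("numbers.txt", "r");   // placeholder file name
    if (!fp) {
        std::perror("fopen");
        return 1;
    }

    // First pass: count newline characters. Blank lines are counted too, so
    // the allocation below can only be generous, never too small.
    std::size_t lineCount = 0;
    for (int c; (c = std::fgetc(fp)) != EOF; ) {
        if (c == '\n')
            ++lineCount;
    }
    std::rewind(fp);

    // Allocate on the heap rather than the stack (same idea as the malloc()
    // call shown above); the +1 covers a last line without a trailing '\n'.
    std::uint32_t *numbersToSort = static_cast<std::uint32_t *>(
        std::malloc(sizeof(std::uint32_t) * (lineCount + 1)));
    if (!numbersToSort) {
        std::fclose(fp);
        return 1;
    }

    // ... second pass: parse one number per non-blank line into numbersToSort ...

    std::free(numbersToSort);
    std::fclose(fp);
    return 0;
}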

How to read large files in segments?

The problem is that you don't overwrite the buffer's content. Here's what your code does:

  • It reads the beginning of the file
  • When it reaches the final 'YZ', it reads those two characters but only overwrites the buffer's first two positions ('U' and 'V'), because it has hit the end of the file; the rest of the buffer still holds the previous chunk.

One easy fix is to clear the buffer before each file read:

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << fileSize << " Bytes" << '\n';

    bigFile.seekg(0);

    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;

    while (bigFile)
    {
        // Clear the buffer so a partial read doesn't leave stale characters behind
        buffer.fill('\0');
        bigFile.read(buffer.data(), bufferSize);
        // Print the buffer data
        std::cout.write(buffer.data(), bufferSize) << '\n';
    }
}

I also changed:

  • The std::unique_ptr<char[]> to a std::array, since we don't need dynamic allocation here and std::arrays are safer than C-style arrays
  • The printing instruction to std::cout.write, because the previous one caused undefined behavior (see @paddy's comment). std::cout << prints a null-terminated string (a sequence of characters terminated by a '\0' character), whereas std::cout.write prints a fixed number of characters
  • The second file opening to a call to the std::istream::seekg method (see @rustyx's answer).

Another (and most likely more efficient) way of doing this is to read the file character by character, put the characters into the buffer, and print the buffer whenever it is full. After the main for loop, we print the buffer one last time if it wasn't printed inside the loop.

#include <iostream>
#include <fstream>
#include <array>

int main()
{
    std::ifstream bigFile("bigfile.txt", std::ios::binary | std::ios::ate);
    int fileSize = bigFile.tellg();
    std::cout << fileSize << " Bytes" << '\n';

    bigFile.seekg(0);

    constexpr size_t bufferSize = 4;
    std::array<char, bufferSize> buffer;

    int bufferIndex = -1;   // initialized so the loops below are well defined even for an empty file
    for (int i(0); i < fileSize; ++i)
    {
        // Add one character to the buffer
        bufferIndex = i % bufferSize;
        buffer[bufferIndex] = bigFile.get();
        // Print the buffer data whenever the buffer is full
        if (bufferIndex == bufferSize - 1)
            std::cout.write(buffer.data(), bufferSize) << '\n';
    }
    // Override the characters which haven't been already (in this case 'W' and 'X')
    for (++bufferIndex; bufferIndex < bufferSize; ++bufferIndex)
        buffer[bufferIndex] = '\0';
    // Print the buffer for the last time if it hasn't been already
    if (fileSize % bufferSize /* != 0 */)
        std::cout.write(buffer.data(), bufferSize) << '\n';
}

C++: will this disk seek take a very large performance hit?

On POSIX systems (notably Linux, and probably macOS), the C++ streams are built on lower-level primitives (often system calls) such as read(2) and write(2). The implementation buffers the data (the standard C++ library typically calls read(2) on buffers of several kilobytes), and the kernel generally keeps recently accessed pages in its page cache. Hence, practically speaking, most files that are not too big (e.g. a few hundred megabytes on a laptop with several gigabytes of RAM) stay in RAM for a while once they have been read or written. See also sync(2).

As commented by Hans Passant, reading in the middle of a textual file can be error-prone (in particular because a UTF-8 character may span several bytes) if not done very carefully.

Notice that from a C (fopen) or C++ point of view, textual files and binary files differ notably in how they handle end-of-line characters.

If performance matters a lot to you, you could use low-level system calls like read(2), write(2) and lseek(2) directly, but then be careful to use wide enough buffers (typically several kilobytes, e.g. 4 KB to 512 KB, or even several megabytes). Don't forget to use the returned read or written byte count (some I/O operations can be partial, fail, etc.). Avoid, if possible (for performance reasons), repeatedly calling read(2) for only a dozen bytes. You could instead memory-map the file (or a segment of it) using mmap(2); before mmap-ing, use stat(2) to get metadata, notably the file size. And you can give advice to the kernel using posix_fadvise(2) or, for files mapped into virtual memory, madvise(2). Performance details are heavily system-dependent (file system, hardware, system load; SSDs and hard disks behave very differently).
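
As an illustration only (the file name and buffer size are arbitrary choices for this sketch), a low-level read loop with a reasonably large buffer, always honouring the returned byte count, could look like this:

#include <fcntl.h>
#include <unistd.h>
#include <sys/stat.h>
#include <cstdio>
#include <vector>

int main()
{
    int fd = open("genome.dat", O_RDONLY);   // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Hint to the kernel that the file will be read sequentially.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    std::vector<char> buf(256 * 1024);   // 256 KiB buffer, not a dozen bytes
    off_t offset = 0;
    while (offset < st.st_size)
    {
        ssize_t n = pread(fd, buf.data(), buf.size(), offset);
        if (n < 0) { perror("pread"); break; }   // error
        if (n == 0) break;                       // unexpected end of file
        // ... process buf[0 .. n-1]; n may be smaller than buf.size() ...
        offset += n;
    }

    close(fd);
    return 0;
}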

Finally, you could consider using a higher-level library for binary files, such as indexed files à la GDBM or the SQLite library, or consider using a real database such as PostgreSQL, MongoDB, etc.

Apparently, your files contain genomics information. You probably don't care about end-of-line processing and could open them as binary streams (or directly as low-level Unix file descriptors). Perhaps free software libraries to parse them already exist. Otherwise, you might consider a two-pass approach: a first pass reads the entire file sequentially and remembers (in C++ containers like std::map) the interesting parts and their offsets; a second pass then uses direct access. You might even have a preprocessor convert your genomics files into SQLite or GDBM files, and have your application work on those. You should probably avoid opening these files as text (open them as binary instead), because end-of-line processing is useless to you.
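
A rough sketch of that two-pass idea; the file name, the record format and the isInteresting() predicate are all hypothetical:

#include <fstream>
#include <iostream>
#include <map>
#include <string>

// Hypothetical predicate: decides whether a line starts an "interesting part".
static bool isInteresting(const std::string &line)
{
    return !line.empty() && line[0] == '>';   // e.g. FASTA-style header lines
}

int main()
{
    // Open as a binary stream so no end-of-line translation interferes
    // with the recorded offsets.
    std::ifstream in("genome.dat", std::ios::binary);
    if (!in)
        return 1;

    // First pass: read sequentially and remember the offset of every
    // interesting record in a std::map.
    std::map<std::string, std::streampos> index;
    std::string line;
    std::streampos pos = in.tellg();
    while (std::getline(in, line))
    {
        if (isInteresting(line))
            index[line] = pos;
        pos = in.tellg();
    }

    // Second pass: direct access to a record through its stored offset.
    in.clear();   // clear the EOF state left by the first pass
    for (const auto &entry : index)
    {
        in.seekg(entry.second);
        std::getline(in, line);
        std::cout << "record at offset " << entry.second << ": " << line << '\n';
        break;    // demonstrate a single direct access
    }
    return 0;
}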

On a 64-bit system, if you handle only a few files (not thousands of them at once) of several dozen gigabytes each, memory-mapping them with mmap should make sense; then use madvise (but on a 32-bit system, you won't be able to mmap such a large file in its entirety).
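
A hedged sketch of that mmap/madvise approach, with minimal error handling and a placeholder file name:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("genome.dat", O_RDONLY);   // placeholder file name
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); close(fd); return 1; }

    // Map the whole file read-only; on a 64-bit system the address space is
    // large enough even for files of several dozen gigabytes.
    void *addr = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    // Advise the kernel about the expected access pattern.
    madvise(addr, st.st_size, MADV_RANDOM);   // or MADV_SEQUENTIAL

    const char *data = static_cast<const char *>(addr);
    // ... random access: data[someOffset] faults the page in on demand ...
    (void)data;

    munmap(addr, st.st_size);
    close(fd);
    return 0;
}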


