Fast Textfile Reading in C++


Updates: Be sure to check the (surprising) updates below the initial answer


Memory-mapped files have served me well[1]:

#include <boost/iostreams/device/mapped_file.hpp> // for mmap
#include <iostream>                               // for std::cout
#include <cstring>                                // for memchr
#include <cstdint>                                // for uintmax_t

int main()
{
    boost::iostreams::mapped_file mmap("input.txt", boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

This should be rather quick.
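
If all you need is the line count, a (hedged) alternative over the same mapped range is to let the standard library do the scan, though in my experience memchr tends to win:

    // drop-in replacement for the counting loop above; f and l as before
    uintmax_t m_numLines = std::count(f, l, '\n'); // needs <algorithm>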

Update

In case it helps you test this approach, here's a version using mmap directly instead of using Boost: see it live on Coliru

#include <iostream> // for std::cout
#include <cstring>  // for memchr
#include <cstdint>  // for uintmax_t
#include <cstdio>   // for perror
#include <cstdlib>  // for exit

// for mmap:
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>

const char* map_file(const char* fname, size_t& length);

int main()
{
    size_t length;
    auto f = map_file("test.cpp", length);
    auto l = f + length;

    uintmax_t m_numLines = 0;
    while (f && f != l)
        if ((f = static_cast<const char*>(memchr(f, '\n', l - f))))
            m_numLines++, f++;

    std::cout << "m_numLines = " << m_numLines << "\n";
}

void handle_error(const char* msg) {
    perror(msg);
    exit(255);
}

const char* map_file(const char* fname, size_t& length)
{
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    // obtain file size
    struct stat sb;
    if (fstat(fd, &sb) == -1)
        handle_error("fstat");

    length = sb.st_size;

    const char* addr = static_cast<const char*>(mmap(NULL, length, PROT_READ, MAP_PRIVATE, fd, 0u));
    if (addr == MAP_FAILED)
        handle_error("mmap");

    // TODO close fd at some point in time, call munmap(...)
    return addr;
}
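
To discharge that TODO, here's a sketch (my addition to the answer): the file descriptor can be closed as soon as mmap succeeds, because the mapping holds its own reference to the file, and the mapping itself is released with munmap:

    #include <unistd.h> // for close

    void unmap_file(const char* addr, size_t length)
    {
        if (munmap(const_cast<char*>(addr), length) == -1)
            handle_error("munmap");
    }

    // in map_file, right after the MAP_FAILED check, you could add:
    //     close(fd); // the mapping stays valid without the fd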

Update

The last bit of performance I could squeeze out of this I found by looking at the source of GNU coreutils wc. To my surprise using the following (greatly simplified) code adapted from wc runs in about 84% of the time taken with the memory mapped file above:

// needs <fcntl.h> (open, posix_fadvise), <unistd.h> (read), <cstdint> (uintmax_t)
static uintmax_t wc(char const *fname)
{
    static const auto BUFFER_SIZE = 16*1024;
    int fd = open(fname, O_RDONLY);
    if (fd == -1)
        handle_error("open");

    /* Advise the kernel of our access pattern. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[BUFFER_SIZE + 1];
    uintmax_t lines = 0;

    while (size_t bytes_read = read(fd, buf, BUFFER_SIZE))
    {
        if (bytes_read == (size_t)-1)
            handle_error("read failed");
        if (!bytes_read)
            break;

        for (char *p = buf; (p = (char*) memchr(p, '\n', (buf + bytes_read) - p)); ++p)
            ++lines;
    }

    return lines;
}
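
A minimal driver for this (my addition, reusing handle_error from above):

    #include <iostream>

    int main(int argc, char** argv)
    {
        std::cout << "lines: " << wc(argc > 1 ? argv[1] : "input.txt") << "\n";
    }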

[1] see e.g. the benchmark here: How to parse space-separated floats in C++ quickly?

How do I read and parse a text file with numbers, fast (in C)?

Some suggestions:

a) Consider converting (or pre-processing) the file into a binary format, with the aim of minimising the file size and drastically reducing the cost of parsing. I don't know the ranges of your values, but various techniques (e.g. using one bit to tell whether the number is small or large and storing the number as either a 7-bit or a 31-bit integer, as sketched after this list) could halve the file IO (and double the speed of reading the file from disk) and cut parsing costs to almost nothing. Note: for maximum effect you'd modify whatever software created the file in the first place.

b) Reading the entire file into memory before you parse it is a mistake. It doubles the amount of RAM required (and the cost of allocating/freeing) and has disadvantages for CPU caches. Instead read a small amount of the file (e.g. 16 KiB) and process it, then read the next piece and process it, and so on; so that you're constantly reusing the same small buffer memory.

c) Use parallelism for file IO. It shouldn't be hard to read the next piece of the file while you're processing the previous piece of the file (either by using 2 threads or by using asynchronous IO).

d) Pre-allocate memory for the "neighbour" structures and remove most/all malloc() calls from your loop. The best possible case is to use a statically allocated array as a pool, e.g. Neighbor myPool[MAX_NEIGHBORS];, where malloc() can be replaced with &myPool[nextEntry++]; (see the second sketch after this list). This reduces/removes the overhead of malloc() while also improving cache locality for the data itself.

e) Use parallelism for storing values. For example, you could have multiple threads where the first thread handles all the cases where root_level % NUM_THREADS == 0, the second thread handles all cases where root_level % NUM_THREADS == 1, etc.

With all of the above (assuming a modern 4-core CPU), I think you can get the total time (for reading and storing) down to less than 15 seconds.
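
To make suggestion (a) concrete, here is a hedged sketch of one such variable-width encoding; the tag bit and the 7-bit/31-bit split are illustrative, not the only option:

    #include <stdio.h>
    #include <stdint.h>

    /* Values below 128 are stored in one byte with the top bit clear;
       anything else is stored in four bytes with the first byte's top bit set. */
    static void write_number(FILE* out, uint32_t v)
    {
        if (v < 128) {
            fputc((int)v, out);
        } else {
            fputc(0x80 | (int)((v >> 24) & 0x7F), out); /* 31-bit value max */
            fputc((int)((v >> 16) & 0xFF), out);
            fputc((int)((v >> 8) & 0xFF), out);
            fputc((int)(v & 0xFF), out);
        }
    }

    static uint32_t read_number(FILE* in)
    {
        uint32_t b = (uint32_t)fgetc(in);
        if (!(b & 0x80))
            return b;                        /* small: the byte is the value */
        uint32_t v = (b & 0x7F) << 24;       /* large: three more bytes follow */
        v |= (uint32_t)fgetc(in) << 16;
        v |= (uint32_t)fgetc(in) << 8;
        v |= (uint32_t)fgetc(in);
        return v;
    }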
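
And a sketch of the pool allocator from suggestion (d); Neighbor's fields and MAX_NEIGHBORS are placeholders for whatever the real structures look like:

    #include <stddef.h>

    #define MAX_NEIGHBORS 1000000

    typedef struct Neighbor { int id; double weight; } Neighbor; /* placeholder fields */

    static Neighbor myPool[MAX_NEIGHBORS];
    static size_t nextEntry = 0;

    /* replaces malloc(sizeof(Neighbor)): no per-allocation book-keeping,
       and entries end up contiguous in memory (good for the cache) */
    static Neighbor* alloc_neighbor(void)
    {
        return &myPool[nextEntry++]; /* a real version should bounds-check */
    }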

Fast Processing of Text Files in C

My guess is that you are looking for the basic file-reading functions, and that reading characters one by one is not the direction you want to take.

There are many functions for reading and handling strings in C. stdio.h contains some functions that could help you:

  • char * fgets ( char * str, int num, FILE * stream ): reads characters until end of line or end of file (if num is large enough) and stores them in str.
  • int sscanf ( const char * s, const char * format, ... ): reads formatted input from a string. For instance sscanf(line, "%d", &nb); reads an integer from line and places it into nb. Note that sscanf always scans from the beginning of the string, so calling it repeatedly on the same string will not advance through it; you can work around this either with the %n conversion, which reports how many characters were consumed (see the sketch just below), or by splitting the string with strtok() from string.h using the space " " as a separator.
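
For instance, a hedged sketch of the %n approach (my addition; line is assumed to hold whitespace-separated integers):

    const char* s = line;
    int value, consumed;
    while (sscanf(s, "%d%n", &value, &consumed) == 1) {
        /* use value */
        s += consumed; /* resume right after the digits just read */
    }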

Here is some sample code doing the job:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_LINE_SIZE 1000
#define MAX_SEQ_SIZE 100

int main()
{
    FILE* pFile;
    char line[MAX_LINE_SIZE];
    int nbvar, nbclauses;
    int* array = NULL;
    int i = 0;
    int j;

    pFile = fopen("example.txt", "r");
    if (pFile == NULL) {
        perror("Error opening file");
    } else {
        while (fgets(line, MAX_LINE_SIZE, pFile) != NULL) {
            printf("%s", line);
            fflush(stdout);
            // parsing the line
            if (line[0] != 'c' && line[0] != '\0') {
                if (line[0] == 'p') {
                    sscanf(line, "%*s%*s%d%d", &nbvar, &nbclauses);
                    array = malloc(MAX_SEQ_SIZE * nbclauses * sizeof(int));
                } else {
                    char* temp;
                    char stop = 0;
                    j = 0;
                    // strtok splits the line into tokens
                    temp = strtok(line, " ");
                    while (stop == 0) {
                        sscanf(temp, "%d", &array[i * MAX_SEQ_SIZE + j]);
                        temp = strtok(NULL, " ");
                        if (array[i * MAX_SEQ_SIZE + j] == 0) { stop = 1; }
                        j++;
                        printf("j %d\n", j);
                        fflush(stdout);
                    }
                    i++;
                }
            }
        }
        fclose(pFile);
    }

    if (array != NULL) {
        for (i = 0; i < nbclauses; i++) {
            j = 0;
            while (array[i * MAX_SEQ_SIZE + j] != 0) {
                printf("line %d seq item %d worths %d\n", i, j, array[i * MAX_SEQ_SIZE + j]);
                j++;
            }
        }
        free(array);
    }
    return 0;
}

Fastest file reading in C

If you are willing to go beyond the C spec into OS-specific code, memory mapping is generally considered the most efficient way.

For POSIX, check out mmap; for Windows, check out CreateFileMapping and MapViewOfFile (OpenFileMapping is for opening an existing named mapping).
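
For the Windows side, a minimal sketch (my addition; error handling omitted for brevity):

    #include <windows.h>
    #include <stdio.h>

    int main(void)
    {
        /* open the file, create a read-only mapping object, then map a view of it */
        HANDLE file = CreateFileA("input.txt", GENERIC_READ, FILE_SHARE_READ,
                                  NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        HANDLE mapping = CreateFileMappingA(file, NULL, PAGE_READONLY, 0, 0, NULL);
        const char* data = (const char*)MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);
        /* ... scan data[0 .. size.QuadPart) here ... */

        UnmapViewOfFile(data);
        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }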

C++ Fast way to load large txt file in vector<string>

Loading a large text file into std::vector<std::string> is rather inefficient and wasteful because it allocates heap memory for each std::string and re-allocates the vector multiple times. Each of these heap allocations requires heap book-keeping information under the hood (normally 8 bytes per allocation on a 64-bit system), and each line requires an std::string object (8-32 bytes depending on the standard library), so that a file loaded this way takes a lot more space in RAM than on disk.

One fast way is to map the file into memory and implement iterators to walk over lines in it. This sidesteps the issues mentioned above.

Working example:

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/iterator/iterator_facade.hpp>
#include <boost/range/iterator_range_core.hpp>

#include <algorithm> // for std::find
#include <iostream>

class LineIterator
    : public boost::iterator_facade<
          LineIterator,
          boost::iterator_range<char const*>,
          boost::iterators::forward_traversal_tag,
          boost::iterator_range<char const*>
      >
{
    char const *p_, *q_;

    boost::iterator_range<char const*> dereference() const { return {p_, this->next()}; }
    bool equal(LineIterator b) const { return p_ == b.p_; }
    void increment() { p_ = this->next(); }
    char const* next() const { auto p = std::find(p_, q_, '\n'); return p + (p != q_); }

    friend class boost::iterator_core_access;

public:
    LineIterator(char const* begin, char const* end) : p_(begin), q_(end) {}
};

inline boost::iterator_range<LineIterator> crange(boost::interprocess::mapped_region const& r) {
    auto p = static_cast<char const*>(r.get_address());
    auto q = p + r.get_size();
    return {LineIterator{p, q}, LineIterator{q, q}};
}

inline std::ostream& operator<<(std::ostream& s, boost::iterator_range<char const*> const& line) {
    return s.write(line.begin(), line.size());
}

int main() {
    boost::interprocess::file_mapping file("/usr/include/gnu-versions.h", boost::interprocess::read_only);
    boost::interprocess::mapped_region memory(file, boost::interprocess::read_only);

    unsigned n = 0;
    for (auto line : crange(memory))
        std::cout << n++ << ' ' << line;
}

Fastest way to read numerical values from text file in C++ (double in this case)

atof is probably much faster; it doesn't have to process a format string.

If you don't need to support all 1001 recognized input formats (with and without exponents, etc) then a custom function may be faster yet. If atof is still too slow for you, say so and I can clean up the code I use (it's not really suitable for public posting at the moment).


I just remembered the problem with atof: it doesn't tell you where the number ended, so reading several numbers in sequence is difficult. strtod is better in that regard.
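
For example, a sketch of sequential parsing with strtod (my addition); the end pointer it writes back is exactly what atof lacks:

    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        const char* s = "3.14 2.71 -1.5e3";
        char* end;
        for (double d = strtod(s, &end); s != end; d = strtod(s, &end)) {
            s = end;            /* resume right after the last number parsed */
            printf("%f\n", d);
        }
        return 0;
    }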

how to read extreme long lines from text file fast and safe in C++?

As I commented, on Linux & POSIX systems you could consider using getline(3); I guess the following could compile both as C and as C++ (assuming you do have some valid fopen-ed FILE* fil; ...)

char* linbuf = NULL; /// or nullptr in C++
size_t linsiz = 0;
ssize_t linlen = 0;

while ((linlen = getline(&linbuf, &linsiz, fil)) >= 0) {
    // do something useful with linbuf; but no C++ exceptions
}
free(linbuf);
linsiz = 0;

I guess this might work (or be easily adapted) for C++. But then beware of C++ exceptions: they should not propagate out of the while loop (or you should ensure that an appropriate destructor or catch clause does the free(linbuf);).

Also, getline could fail (e.g. if an internal malloc fails), and you might need to handle that failure sensibly.
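
For reference, a complete, self-contained version of the above (my sketch; the file name is illustrative):

    #define _GNU_SOURCE /* for getline on glibc */
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        FILE* fil = fopen("input.txt", "r");
        if (!fil) { perror("fopen"); return 1; }

        char* linbuf = NULL;
        size_t linsiz = 0;
        ssize_t linlen;
        unsigned long lines = 0;

        while ((linlen = getline(&linbuf, &linsiz, fil)) >= 0)
            ++lines; /* linbuf holds the line, linlen its length */

        free(linbuf); /* getline's buffer is ours to free */
        fclose(fil);
        printf("%lu lines\n", lines);
        return 0;
    }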


