Efficiently Reading a Very Large Text File in C++

How to read a large text file in C

I changed the function according to your answers; now NewWord just prints each word into a second file, skipping unwanted words according to the filter functions step1() and step2().

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SIZE 67

char letters[SIZE] = {
    'A','B','C','D','E','F','G','H','I','J','K','L','M',
    'N','O','P','Q','R','S','T','U','V','W','X','Y','Z',
    'a','b','c','d','e','f','g','h','i','j','k','l','m',
    'n','o','p','q','r','s','t','u','v','w','x','y','z',
    '.','_','1','2','3','4','5','6','7','8','9','0','!','@','$'
};

struct Word
{
    char word[13];
};

/* step1: reject a word if any allowed character occurs more than 3 times. */
_Bool step1(char *word)
{
    size_t len = strlen(word); /* compute once instead of on every iteration */

    for (int i = 0; i < SIZE; i++)
    {
        int c = 0;
        for (size_t j = 0; j < len; j++)
        {
            if (word[j] == letters[i] && ++c > 3)
                return 1;
        }
    }
    return 0;
}

/* step2: reject a word containing the same character 3 times in a row. */
_Bool step2(char *word)
{
    size_t len = strlen(word); /* compute once instead of on every iteration */

    for (int i = 0; i < SIZE; i++)
    {
        /* j + 2 < len keeps all three indices inside the string */
        for (size_t j = 0; j + 2 < len; j++)
        {
            if (word[j] == letters[i] && word[j + 1] == letters[i] && word[j + 2] == letters[i])
                return 1;
        }
    }
    return 0;
}

/* Write the word to the output file unless one of the filters rejects it. */
void NewWord(FILE *f, struct Word s)
{
    if (step1(s.word) || step2(s.word))
        return;

    fprintf(f, "%s\n", s.word);
}

void LoadList(void)
{
    FILE *f1;
    FILE *f2;
    struct Word s;
    char *buffer = malloc(sizeof(struct Word));

    if (!buffer)
        exit(1);

    if (!(f1 = fopen("wordlist.txt", "r")))
    {
        exit(1); /* do not fclose a NULL pointer */
    }

    if (!(f2 = fopen("bb.txt", "w")))
    {
        fclose(f1); /* close the stream that did open */
        exit(1);
    }

    while (fgets(buffer, sizeof(struct Word), f1))
    {
        if (sscanf(buffer, "%12s", s.word) == 1) /* width matches char word[13] */
        {
            NewWord(f2, s);
        }
    }

    fclose(f1);
    fclose(f2);
    free(buffer);
}

int main(void)
{
    LoadList();
    return 0;
}

Efficiently reading a very large text file in C++

I'd redesign this to work in a streaming fashion instead of operating on one big block.

A simpler approach would be:

#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

std::ifstream ifs("input.txt");
std::vector<uint64_t> parsed(std::istream_iterator<uint64_t>(ifs), {});

If you know roughly how many values to expect, calling std::vector::reserve up front can avoid repeated reallocations and speed this up further.
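For instance, a minimal sketch (the element-count estimate and file name are placeholders; derive the estimate from your own data, e.g. file size divided by average token length):

#include <cstdint>
#include <fstream>
#include <iterator>
#include <vector>

int main() {
    std::ifstream ifs("input.txt");
    std::vector<uint64_t> parsed;
    parsed.reserve(50000000); // hypothetical estimate; avoids repeated reallocation
    parsed.assign(std::istream_iterator<uint64_t>(ifs), {});
}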


Alternatively, you can use a memory-mapped file and iterate over the character sequence.

  • How to parse space-separated floats in C++ quickly? shows these approaches with benchmarks for floats.
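As a minimal sketch of the memory-mapped variant (the file name is a placeholder; this just counts newlines to show iteration over the mapped bytes):

#include <boost/iostreams/device/mapped_file.hpp>
#include <algorithm>
#include <iostream>

int main() {
    boost::iostreams::mapped_file_source src("input.txt");
    // the mapping exposes the file as a contiguous, read-only char range
    auto newlines = std::count(src.data(), src.data() + src.size(), '\n');
    std::cout << newlines << " lines\n";
}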

Update: I modified the above program to parse uint32_ts into a vector.

When using a sample input file of 4.5 GiB[1], the program runs in 9 seconds[2]:

sehe@desktop:/tmp$ make -B && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test smaller.txt
g++ -std=c++0x -Wall -pedantic -g -O2 -march=native test.cpp -o test -lboost_system -lboost_iostreams -ltcmalloc
parse success
trailing unparsed: '
'
data.size(): 402653184
0:08.96 elapsed, 6 context switches

Of course it allocates at least 402653184 * 4 bytes ≈ 1.5 GiB. So when you read a 45 GiB file, you will need an estimated 15 GiB of RAM just to store the vector (assuming no fragmentation on reallocation). The 45 GiB parse completes in 10 min 45 s:

make && sudo chrt -f 99 /usr/bin/time -f "%E elapsed, %c context switches" ./test 45gib_uint32s.txt 
make: Nothing to be done for `all'.
tcmalloc: large alloc 17570324480 bytes == 0x2cb6000 @ 0x7ffe6b81dd9c 0x7ffe6b83dae9 0x401320 0x7ffe6af4cec5 0x40176f (nil)
Parse success
Trailing unparsed: 1 characters
Data.size(): 4026531840
Time taken by parsing: 644.64s
10:45.96 elapsed, 42 context switches

By comparison, just running wc -l 45gib_uint32s.txt took ~12 minutes (without realtime priority scheduling, though), and wc is blazingly fast.

Full Code Used For Benchmark

#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

namespace qi = boost::spirit::qi;

typedef std::vector<uint32_t> data_t;

using hrclock = std::chrono::high_resolution_clock;

int main(int argc, char** argv) {
    if (argc < 2) return 255;
    data_t data;
    data.reserve(4392580288); // for the 45 GiB file benchmark
    // data.reserve(402653184); // for the 4.5 GiB file benchmark

    // map the input read-only; no copy of the file contents is made
    boost::iostreams::mapped_file mmap(argv[1], boost::iostreams::mapped_file::readonly);
    auto f = mmap.const_data();
    auto l = f + mmap.size();

    using namespace qi;

    auto start_parse = hrclock::now();
    // one unsigned decimal integer per line, with blanks skipped
    bool ok = phrase_parse(f, l, int_parser<uint32_t, 10>() % eol, blank, data);
    auto stop_time = hrclock::now();

    if (ok)
        std::cout << "Parse success\n";
    else
        std::cerr << "Parse failed at #" << std::distance(mmap.const_data(), f)
                  << " around '" << std::string(f, f + 50) << "'\n";

    if (f != l)
        std::cerr << "Trailing unparsed: " << std::distance(f, l) << " characters\n";

    std::cout << "Data.size(): " << data.size() << "\n";
    std::cout << "Time taken by parsing: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(stop_time - start_parse).count() / 1000.0
              << "s\n";
}

[1] generated with od -t u4 /dev/urandom -A none -v -w4 | pv | dd bs=1M count=$((9*1024/2)) iflag=fullblock > smaller.txt

[2] obviously, this was with the file cached in the buffer cache on Linux; the large file doesn't have this benefit

How to load a large text file into a buffer in an optimized way -- C program

Use a local buffer and read blocks of data using fread() in binary mode. Parse your text data and continue with the next block.

Tune your buffer size properly, perhaps 64 KiB or 1 MiB; it depends on your application.

#include <stdio.h>

#define BUFFER_SIZE 1024 /* tune this; 64 KiB or 1 MiB may work better */

int main(void)
{
    unsigned char buffer[BUFFER_SIZE];
    FILE *source;
    size_t n;
    size_t count = 0;

    source = fopen("myfile", "rb");

    if (source)
    {
        /* loop on the fread() return value, not on feof() */
        while ((n = fread(buffer, 1, BUFFER_SIZE, source)) > 0)
        {
            count += n;
            /* parse the block in buffer[0..n-1] here */
        }
        fclose(source); /* only close a stream that was actually opened */
    }

    return 0;
}

Read large txt file in C++

Reading a file line by line:

#include <fstream>
#include <string>

std::ifstream fin("file.txt");
std::string myStr;

// Always put the read in the while condition. Then you only enter the
// loop if there is data to process; otherwise you would need to read and
// then test whether the read was OK.
while (std::getline(fin, myStr))
{
    // use myStr data
}

// Note: the last line read will read up to (but not past) the end of
// file, so when there is no data left in the file its state is still OK.
// It is not until you try to explicitly read past the end of file that
// the EOF flag is set.

For a reason not to explicitly call close, see:

https://codereview.stackexchange.com/questions/540/my-c-code-involving-an-fstream-failed-review/544#544

If efficiency is your major goal (it probably isn't), then read the whole file into memory and parse from there: see Thomas below: Read large txt file in C++
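A minimal sketch of that whole-file read, assuming the file fits comfortably in RAM (the function name is illustrative):

#include <fstream>
#include <sstream>
#include <string>

std::string slurp(const std::string& path)
{
    std::ifstream fin(path, std::ios::binary);
    std::ostringstream ss;
    ss << fin.rdbuf(); // single bulk transfer from the file's stream buffer
    return ss.str();   // parse the returned string in memory
}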

What is the best way to read a large file (2GB) (a text file containing Ethernet data) and access the data randomly by different parameters?

Here is the solution I found:

  1. Used circular buffers (Boost lock-free buffers) to parse the file and to save the structured form of each line.
  2. Used separate threads (a sketch of this producer/consumer setup follows the list):

    • One continuously parses the file and pushes lines to a lock-free queue.
    • One continuously reads from the buffer, processes each line, forms a structure, and pushes it to another queue.
    • Whenever the user needs random data, based on time, I move the file pointer to the particular line and read only that line.

  3. Both threads have mutex wait mechanisms to stop parsing once the predefined buffer limit is reached.
  4. The user can get data at any time, and there is no need to store the complete file contents. As each frame is read, I delete it from the queue, so file size doesn't matter. The parallel threads that fill the buffers mean no time is spent reading the file on every request.
  5. If I want to move to another line: move the file pointer, wipe the existing data, and start the threads again.
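A rough sketch of that producer/consumer setup, assuming Boost's single-producer/single-consumer queue (Line, the queue capacity, and the file name are illustrative placeholders, not the poster's actual code):

#include <boost/lockfree/spsc_queue.hpp>
#include <atomic>
#include <fstream>
#include <string>
#include <thread>

struct Line { std::string text; };

boost::lockfree::spsc_queue<Line*, boost::lockfree::capacity<1024>> queue;
std::atomic<bool> done{false};

void producer(const char* path)
{
    std::ifstream fin(path);
    std::string s;
    while (std::getline(fin, s)) {
        Line* item = new Line{s};
        while (!queue.push(item))       // buffer full: wait for the consumer
            std::this_thread::yield();
    }
    done = true;
}

void consumer()
{
    Line* item;
    for (;;) {
        if (queue.pop(item)) {
            // process item->text into a structured frame here
            delete item;
        } else if (done) {
            while (queue.pop(item)) delete item; // drain anything pushed last
            break;
        } else {
            std::this_thread::yield();  // buffer empty: wait for the producer
        }
    }
}

int main()
{
    std::thread p(producer, "ethernet.log"), c(consumer); // placeholder file name
    p.join();
    c.join();
}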

Note: the only remaining issue is moving the file pointer to a particular line; I currently have to parse line by line until I reach that point.

If there is any way to move the file pointer directly to the required line, it would be helpful. A binary search or any other efficient search algorithm could then be used to get what I want.

I'd appreciate it if anybody could give a solution for this new issue!

Effective methods for reading and writing large files in C

25,000 lines * 100 characters = 2.5 MB; that's not really a huge file. The fastest approach will probably be to read the whole file into memory, write your results to a new file, and replace the original with it.
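A sketch of that read-everything/replace pattern ("data.txt" and the transform step are placeholders; writing to a temporary file first keeps the original intact if anything fails midway):

#include <cstdio>
#include <fstream>
#include <sstream>
#include <string>

int main()
{
    std::ifstream in("data.txt", std::ios::binary);
    std::ostringstream buf;
    buf << in.rdbuf();               // read the whole file in one go
    std::string content = buf.str();
    in.close();

    // ... transform `content` here ...

    std::ofstream out("data.txt.tmp", std::ios::binary);
    out << content;
    out.close();
    std::rename("data.txt.tmp", "data.txt"); // replace the original
}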

Fastest way to read very large text file in C#

If you can do this line by line, then the answer is simple:

  1. Read a line.
  2. Process the line.
  3. Write the line.

If you want it to go a bit faster, run those steps in three tasks connected by BlockingCollections with a specified upper bound of something like 10, so a slower step never leaves a faster step waiting. If you can, output to a different physical disc (if output is to disc).
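The answer refers to C#'s BlockingCollection; in C++ terms, a bounded queue built on a mutex and condition variables gives the same back-pressure. A compact sketch (capacity 10 as suggested; the file names and BoundedQueue are illustrative):

#include <condition_variable>
#include <fstream>
#include <mutex>
#include <optional>
#include <queue>
#include <string>
#include <thread>

// Bounded queue: push blocks when full, pop blocks when empty,
// so neither pipeline stage outruns the others.
template <typename T>
class BoundedQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
    size_t cap_;
    bool closed_ = false;
public:
    explicit BoundedQueue(size_t cap) : cap_(cap) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&]{ return q_.size() < cap_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
    std::optional<T> pop() {
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&]{ return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt; // closed and fully drained
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
};

int main()
{
    BoundedQueue<std::string> read_q(10), write_q(10); // upper bound of ~10

    std::thread reader([&]{
        std::ifstream fin("input.txt");                // placeholder name
        std::string line;
        while (std::getline(fin, line)) read_q.push(line);
        read_q.close();
    });
    std::thread processor([&]{
        while (auto line = read_q.pop()) {
            // process *line here
            write_q.push(*line);
        }
        write_q.close();
    });
    std::thread writer([&]{
        std::ofstream fout("output.txt");              // ideally another disc
        while (auto line = write_q.pop()) fout << *line << '\n';
    });

    reader.join(); processor.join(); writer.join();
}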

The OP changed the rules even after being asked (twice) whether the process was line by line.

  1. Read line(s) to generate a unit of work (open to close tags).
  2. Process the unit of work.
  3. Write the unit of work.

