Reading Files Larger Than 4GB Using C++ STL

Reading files larger than 4GB using C++ STL

Apparently it depends on how off_t is implemented by the library.

#include <ios>     // std::streamsize
#include <limits>  // std::numeric_limits

std::streamsize temp = std::numeric_limits<std::streamsize>::max();

gives you the current maximum stream size.

STLport supports larger files.
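As a quick sanity check before relying on large-file support, you can inspect the offset type your implementation uses and try a seek past the 4GB mark (a minimal sketch; the file name is an illustrative assumption):

#include <fstream>
#include <ios>
#include <iostream>
#include <limits>

int main()
{
    // If streamoff is 64 bits, the stream can address offsets past 4GB.
    std::cout << "sizeof(std::streamoff) = " << sizeof(std::streamoff) << '\n';
    std::cout << "max streamsize        = "
              << std::numeric_limits<std::streamsize>::max() << '\n';

    std::ifstream f("huge.bin", std::ios::binary);     // illustrative file name
    f.seekg(std::streamoff(5) * 1024 * 1024 * 1024);   // seek to the 5GB mark
    std::cout << "seek past 4GB " << (f ? "succeeded" : "failed") << '\n';
    return 0;
}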

Determining the size of a file larger than 4GB

If you're on Windows, you want GetFileSizeEx (MSDN). The file size comes back through a 64-bit LARGE_INTEGER out parameter.

On Linux, stat64 (man page) is correct. Use fstat64 on the descriptor (obtained via fileno) if you're working with a FILE*.
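A minimal sketch of both approaches (the file path "big.dat" is an illustrative assumption; on Linux the _FILE_OFFSET_BITS define gives plain stat the same 64-bit st_size as stat64):

#ifdef _WIN32
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE h = ::CreateFileA("big.dat", GENERIC_READ, FILE_SHARE_READ,
                             NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE)
        return 1;

    LARGE_INTEGER size;                 // 64-bit size comes back through this out parameter
    if (::GetFileSizeEx(h, &size))
        std::printf("%lld bytes\n", static_cast<long long>(size.QuadPart));

    ::CloseHandle(h);
    return 0;
}
#else
#define _FILE_OFFSET_BITS 64            // make off_t/st_size 64-bit even on 32-bit builds
#include <sys/stat.h>
#include <cstdio>

int main()
{
    struct stat st;
    if (stat("big.dat", &st) != 0)
        return 1;

    std::printf("%lld bytes\n", static_cast<long long>(st.st_size));
    return 0;
}
#endif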

MSVC9, iostream and 2GB/4GB-plus files

I ended up using STLport. The biggest difference with STLport is that some unit tests which previously crashed during double-precision multiplications now run and pass. There are some other differences in relative precision popping up, but those seem minor.

File size is larger than it should be, extra newlines are added

The \n character has special meaning to STL character streams. It represents a newline, which gets translated to the platform-specific line break upon output. This is discussed here:

Binary and text modes

A text stream is an ordered sequence of characters composed into lines (zero or more characters plus a terminating '\n'). Whether the last line requires a terminating '\n' is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to the conventions for representing text in the OS (in particular, C streams on Windows OS convert \n to \r\n on output, and convert \r\n to \n on input).

So it is likely that std::cout outputs \r\n when it is given \n, even if a preceding \r was also given, so an input of \r\n could become \r\r\n on output. How individual apps on Windows handle bare CR characters is not standardized; they might be ignored, or they might be treated as line breaks. In your case, it sounds like the latter.

There is no standard way to use std::cout in binary mode so \n is output as \n instead of as \r\n. However, see How to make cout behave as in binary mode? for some possible ways that you might be able to make std::cout output in binary mode on Windows, depending on your compiler and STL implementation. Or, you could try using std::cout.rdbuf() to substitute in your own std::basic_streambuf object that performs binary output to the console.
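On MSVC, for example, one commonly suggested route is to switch the underlying stdout descriptor to binary mode with _setmode; a sketch of that approach (it relies on std::cout being layered on top of the C stdout stream, which is the case for Microsoft's runtime):

#include <io.h>      // _setmode
#include <fcntl.h>   // _O_BINARY
#include <cstdio>    // _fileno, stdout
#include <iostream>

int main()
{
    // Put the stdout descriptor into binary mode so '\n' is no longer expanded to "\r\n".
    _setmode(_fileno(stdout), _O_BINARY);

    std::cout << "line1\nline2\n";   // written as bare LF bytes, no CR inserted
    return 0;
}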

That being said, the way your code is handling the data buffer is a little off. It should look more like this instead (not accounting for the above info):

#include <iostream>
#include <Windows.h>

int main()
{
    HANDLE hFile = ::CreateFileA("C:\\123.txt",
                                 GENERIC_READ,
                                 FILE_SHARE_READ |
                                 FILE_SHARE_WRITE |
                                 FILE_SHARE_DELETE, // why??
                                 NULL,
                                 OPEN_EXISTING,
                                 FILE_ATTRIBUTE_NORMAL,
                                 NULL);

    if (INVALID_HANDLE_VALUE == hFile)
        return ::GetLastError();

    char buffer[256];
    DWORD bytesRead, bytesWritten, err;

    //======== so WriteFile outputs to console, not needed for cout version
    HANDLE hStandardOutput = ::GetStdHandle(STD_OUTPUT_HANDLE);

    if (INVALID_HANDLE_VALUE == hStandardOutput)
    {
        err = ::GetLastError();
        std::cout << "GetStdHandle error code = " << err << std::endl;
        ::CloseHandle(hFile);
        return err;
    }

    //============================
    do
    {
        if (!::ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, NULL))
        {
            err = ::GetLastError();
            std::cout << "ReadFile error code = " << err << std::endl;
            ::CloseHandle(hFile);
            return err;
        }

        if (bytesRead == 0) // EOF reached
            break;

        /*============= Works fine
        if (!::WriteFile(hStandardOutput, buffer, bytesRead, &bytesWritten, NULL))
        {
            err = ::GetLastError();
            std::cout << "WriteFile error code = " << err << std::endl;
            ::CloseHandle(hFile);
            return err;
        }
        */

        //------------- comment out when testing WriteFile
        std::cout.write(buffer, bytesRead);
        //----------------------------------------
    }
    while (true);

    ::CloseHandle(hFile);
    return 0;
}

Remove \r\n while reading a file into an STL vector

Assuming your input file was generated on the same platform you are reading it on, you can convert the line terminator sequence (in this case it looks like '\r\n') to a '\n' simply by opening the file in text mode:

std::ifstream testFile(inFileName);

You can remove specific characters by using the remove_copy algorithm:

#include <algorithm>
#include <iterator>
#include <vector>

std::vector<char> fileContents;

// Copy all elements that are not '\n'
std::remove_copy(std::istreambuf_iterator<char>(testFile), // src begin
                 std::istreambuf_iterator<char>(),         // src end
                 std::back_inserter(fileContents),         // dst begin
                 '\n');                                     // element to remove

If you need to remove more than one type of character, you need to create a functor and use the remove_copy_if algorithm:

struct DelNLorCR
{
    bool operator()(char x) const { return x == '\n' || x == '\r'; }
};

std::remove_copy_if(std::istreambuf_iterator<char>(testFile), // src begin
                    std::istreambuf_iterator<char>(),         // src end
                    std::back_inserter(fileContents),         // dst begin
                    DelNLorCR());                             // functor describing bad characters
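If C++11 is available, a lambda can serve as the predicate instead of a hand-written functor (a small sketch, equivalent to the code above):

std::remove_copy_if(std::istreambuf_iterator<char>(testFile),
                    std::istreambuf_iterator<char>(),
                    std::back_inserter(fileContents),
                    [](char x) { return x == '\n' || x == '\r'; }); // drop CR and LF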

How to read a huge file in C++

There are a couple of things that you can do.

First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is load the whole file into memory at once. The best thing would be for you to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream.read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, rinse and repeat:

#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr size_t bufferSize = 1024 * 1024;
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    std::streamsize bytesRead = bigFile.gcount(); // may be less than bufferSize on the last chunk
    // process bytesRead bytes of data in buffer
}

Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.

However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.

Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated from another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging in and out memory may not behave in a way that advantages you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.

On Linux/OS X, you would use mmap for this. On Windows, you would open the file and then use CreateFileMapping followed by MapViewOfFile.
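A minimal POSIX sketch of the mapping approach (the file name and the newline-counting loop are illustrative assumptions):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fd = open("mybigfile.dat", O_RDONLY);   // illustrative file name
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0)
        return 1;

    // The OS pages the file in on demand; no explicit read() calls are needed.
    void* data = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED)
        return 1;

    const char* bytes = static_cast<const char*>(data);
    size_t newlines = 0;
    for (off_t i = 0; i < st.st_size; ++i)      // touching a page faults it in from disk
        if (bytes[i] == '\n')
            ++newlines;
    std::printf("%zu newlines\n", newlines);

    munmap(data, static_cast<size_t>(st.st_size));
    close(fd);
    return 0;
}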

C++: Improving ifstream binary file reading speed

It appears that you can reach maximum SSD reading speed even with ifstream.

To do so, you need to set the internal ifstream read buffer to roughly 2MB, which is where peak SSD read speed occurs while still fitting nicely in the CPU's L2 cache. Then you need to read out the data in chunks smaller than the internal buffer. I got the best results reading data in 8-16KB chunks, but that was only about 1% faster than reading in 1MB chunks.

Setting ifstream internal buffer:

static char iobuf[2 * 1024 * 1024];               // ~2MB buffer, must outlive the stream
std::ifstream datafile("base.txt", std::ios::binary);
datafile.rdbuf()->pubsetbuf(iobuf, sizeof iobuf); // set before any reads take place

With all these tweaks I got a 495 MB/s read speed, which is close to the theoretical maximum of an M500 480GB SSD. During execution the CPU load was about 5%, which means the process was not really limited by ifstream implementation overhead.

I found no observable speed difference between ifstream and std::basic_filebuf.
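Putting the two tweaks together, a minimal sketch of the read loop might look like this (the file name, exact buffer sizes, and the checksum stand in for the real workload):

#include <cstdio>
#include <fstream>

int main()
{
    static char iobuf[2 * 1024 * 1024];                   // ~2MB internal stream buffer
    char chunk[16 * 1024];                                // read in 16KB chunks

    std::ifstream datafile("base.txt", std::ios::binary);
    datafile.rdbuf()->pubsetbuf(iobuf, sizeof iobuf);     // enlarge the filebuf before reading

    unsigned long long sum = 0;
    while (datafile)
    {
        datafile.read(chunk, sizeof chunk);
        std::streamsize n = datafile.gcount();            // bytes actually read this pass
        for (std::streamsize i = 0; i < n; ++i)
            sum += static_cast<unsigned char>(chunk[i]);  // stand-in for real processing
    }
    std::printf("checksum: %llu\n", sum);
    return 0;
}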


