Reading files larger than 4GB using c++ stl
Apparently it depends on how off_t is implemented by the standard library.
#include <ios>     // std::streamsize
#include <limits>  // std::numeric_limits
std::streamsize maxSize = std::numeric_limits<std::streamsize>::max();
gives you what the current maximum is.
STLport supports larger files.
Determining the size of a file larger than 4GB
If you're on Windows, you want GetFileSizeEx (MSDN). The return value is a 64-bit integer.
On Linux, stat64 (see the manpage) is correct; use fstat (together with fileno) if you're working with a FILE*.
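As a minimal sketch of the POSIX side (the helper name is mine): on a 64-bit Linux system, plain stat() already reports a 64-bit st_size, and on 32-bit systems you can compile with -D_FILE_OFFSET_BITS=64 (or call stat64 directly) to handle files larger than 4 GB.

```cpp
#include <sys/stat.h>
#include <cstdint>

// Hypothetical helper: returns the size of `path` in bytes, or -1 on error.
// st_size is 64-bit on 64-bit Linux; on 32-bit builds, define
// _FILE_OFFSET_BITS=64 so stat() transparently uses the 64-bit variant.
int64_t file_size_bytes(const char* path)
{
    struct stat st;
    if (::stat(path, &st) != 0)
        return -1;
    return static_cast<int64_t>(st.st_size);
}
```

For a FILE*, the equivalent is fstat(fileno(fp), &st).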
msvc9, iostream and 2g/4g plus files
I ended up using STLport. The biggest difference with STLport is that some unit tests which previously crashed during multiplications of double-precision numbers now pass. There are some other differences in relative precision popping up, but those seem to be minor.
File size is larger than it should, extra new lines are added
The \n character has special meaning to STL character streams. It represents a newline, which gets translated to the platform-specific line break upon output. This is discussed here:
Binary and text modes
A text stream is an ordered sequence of characters composed into lines (zero or more characters plus a terminating '\n'). Whether the last line requires a terminating '\n' is implementation-defined. Characters may have to be added, altered, or deleted on input and output to conform to the conventions for representing text in the OS (in particular, C streams on Windows OS convert \n to \r\n on output, and convert \r\n to \n on input).
So it is likely that std::cout outputs \r\n when it is given \n, even if a preceding \r was also given; thus an input of \r\n could become \r\r\n on output. How individual apps on Windows handle bare CR characters is not standardized behavior: they might be ignored, or they might be treated as line breaks. In your case, it sounds like the latter.
There is no standard way to use std::cout in binary mode so that \n is output as \n instead of as \r\n. However, see How to make cout behave as in binary mode? for some possible ways that you might be able to make std::cout output in binary mode on Windows, depending on your compiler and STL implementation. Or, you could use std::cout.rdbuf() to substitute in your own std::basic_streambuf object that performs binary output to the console.
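To illustrate why the open mode matters for files (names here are my own), a stream opened with std::ios::binary never performs the \n-to-\r\n translation, on any platform:

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: writes `data` verbatim to `path`.
// Because the stream is opened with std::ios::binary, '\n' bytes are
// written as-is; without the flag, a Windows C++ runtime would expand
// each '\n' to "\r\n" on output.
bool write_binary(const std::string& path, const std::string& data)
{
    std::ofstream out(path, std::ios::binary);
    out.write(data.data(), static_cast<std::streamsize>(data.size()));
    return static_cast<bool>(out);
}
```

With binary mode, writing "a\nb" always produces exactly three bytes on disk.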
That being said, the way your code is handling the data buffer is a little off; it should look more like this instead (not accounting for the above info):
#include <iostream>
#include <Windows.h>

int main()
{
    HANDLE hFile = ::CreateFile("C:\\123.txt",
        GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE, // why??
        NULL,
        OPEN_EXISTING,
        FILE_ATTRIBUTE_NORMAL,
        NULL);
    if (INVALID_HANDLE_VALUE == hFile)
        return ::GetLastError();

    char buffer[256];
    DWORD bytesRead, bytesWritten, err;

    //======== so WriteFile outputs to console, not needed for cout version
    HANDLE hStandardOutput = ::GetStdHandle(STD_OUTPUT_HANDLE);
    if (INVALID_HANDLE_VALUE == hStandardOutput)
    {
        err = ::GetLastError();
        std::cout << "GetStdHandle error code = " << err << std::endl;
        ::CloseHandle(hFile);
        return err;
    }
    //============================

    do
    {
        if (!::ReadFile(hFile, buffer, sizeof(buffer), &bytesRead, NULL))
        {
            err = ::GetLastError();
            std::cout << "ReadFile error code = " << err << std::endl;
            ::CloseHandle(hFile);
            return err;
        }
        if (bytesRead == 0) // EOF reached
            break;

        /*============= Works fine
        if (!::WriteFile(hStandardOutput, buffer, bytesRead, &bytesWritten, NULL))
        {
            err = ::GetLastError();
            std::cout << "WriteFile error code = " << err << std::endl;
            ::CloseHandle(hFile);
            return err;
        }
        */

        //------------- comment out when testing WriteFile
        std::cout.write(buffer, bytesRead);
        //----------------------------------------
    }
    while (true);

    ::CloseHandle(hFile);
    return 0;
}
remove \r\n while reading a file into stl vector
Assuming your input file was generated on the same platform you are reading it on, you can convert the line terminator sequence (in this case it looks like '\r\n') to a '\n' simply by opening the file in text mode:
std::ifstream testFile(inFileName);
You can remove specific characters by using the remove_copy algorithm:
std::vector<char> fileContents;
// Copy all elements that are not '\n'
std::remove_copy(std::istreambuf_iterator<char>(testFile), // src begin
std::istreambuf_iterator<char>(), // src end
std::back_inserter(fileContents), // dst begin
'\n'); // element to remove
If you need to remove more than one type of character, you need to create a functor and use the remove_copy_if algorithm:
struct DelNLorCR
{
bool operator()(char x) const {return x=='\n' || x=='\r';}
};
std::remove_copy_if(std::istreambuf_iterator<char>(testFile), // src begin
std::istreambuf_iterator<char>(), // src end
std::back_inserter(fileContents), // dst begin
DelNLorCR()); // functor describing bad characters
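Putting the pieces together, a minimal self-contained sketch (the function name is mine, and a lambda stands in for the functor above) might look like:

```cpp
#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads `fileName` and returns its contents with all '\n' and '\r'
// characters stripped, via remove_copy_if over istreambuf_iterators.
std::vector<char> readWithoutLineBreaks(const std::string& fileName)
{
    std::ifstream testFile(fileName);
    std::vector<char> fileContents;
    std::remove_copy_if(std::istreambuf_iterator<char>(testFile), // src begin
                        std::istreambuf_iterator<char>(),         // src end
                        std::back_inserter(fileContents),         // dst begin
                        [](char x) { return x == '\n' || x == '\r'; });
    return fileContents;
}
```

A lambda is equivalent to the DelNLorCR functor shown above; use whichever fits your codebase.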
How to read huge file in c++
There are a couple of things that you can do.
First, there's no problem opening a file that is larger than the amount of RAM you have. What you won't be able to do is copy the whole file into memory at once. The best approach is to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream::read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, rinse and repeat:
#include <fstream>
#include <memory>

std::ifstream bigFile("mybigfile.dat", std::ios::binary);
constexpr size_t bufferSize = 1024 * 1024;
std::unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    std::streamsize bytesRead = bigFile.gcount(); // last chunk may be short
    // process bytesRead bytes of data in buffer
}
Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on-demand: when your program needs a specific page, the OS will read it from the file into your process's memory and swap out a page that hasn't been used in a while.
However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically use. This isn't an issue with a 1TB file in a 64-bit process, but it wouldn't work in a 32-bit process.
Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated from another program, your program is likely to crash. If you modify the data, it's possible that you will run out of memory if you can't save back to the disk. Also, your operating system's algorithm for paging in and out memory may not behave in a way that advantages you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.
On Linux/OS X, you would use mmap for it. On Windows, you would open the file and then use CreateFileMapping followed by MapViewOfFile.
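A minimal POSIX sketch of the mapping approach (the helper name and error handling are my own, not a definitive implementation):

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstddef>

// Hypothetical helper: maps an entire file read-only into memory.
// On success, returns the mapping address and stores the file length
// in `lengthOut`; returns nullptr on failure. The caller must release
// the mapping with munmap(addr, length) when done.
const char* mapFileReadOnly(const char* path, size_t& lengthOut)
{
    int fd = ::open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;

    struct stat st;
    if (::fstat(fd, &st) != 0) { ::close(fd); return nullptr; }
    lengthOut = static_cast<size_t>(st.st_size);

    void* addr = ::mmap(nullptr, lengthOut, PROT_READ, MAP_PRIVATE, fd, 0);
    ::close(fd); // the mapping holds its own reference to the file
    return addr == MAP_FAILED ? nullptr : static_cast<const char*>(addr);
}
```

Note that the caveats above still apply: if another process truncates the file, touching the mapped pages can crash your program with SIGBUS.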
C++: Improving ifstream binary file reading speed
It turns out that you can reach maximum SSD read speed even with ifstream.
To do so, you need to set the ifstream's internal read buffer to ~2 MB, which is where peak SSD read speed happens while still fitting nicely in the CPU's L2 cache. Then you read the data in chunks smaller than the internal buffer. I got the best results reading data in 8-16 KB chunks, but that was only about 1% faster than reading in 1 MB chunks.
Setting the ifstream internal buffer:
char iobuf[2 * 1024 * 1024]; // ~2 MB; must outlive the stream
std::ifstream datafile;
datafile.rdbuf()->pubsetbuf(iobuf, sizeof iobuf); // on some implementations this must precede open()
datafile.open("base.txt", std::ios::binary);
With all these tweaks I got a 495 MB/s read speed, which is close to the theoretical maximum of an M500 480 GB SSD. During execution the CPU load was 5%, which means the run was not really limited by ifstream implementation overhead.
I found no observable speed difference between ifstream and std::basic_filebuf.
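As a concrete end-to-end sketch of the tweaks above (the function name, file name, and exact sizes are my choices), combining a large stream buffer with small read chunks:

```cpp
#include <fstream>
#include <ios>
#include <string>

// Reads `path` in 16 KB chunks through an ifstream whose internal
// buffer has been enlarged to ~2 MB, returning the total bytes read.
// The buffer is static because it must outlive the stream and is too
// large to place comfortably on the stack.
std::streamsize readAllBuffered(const std::string& path)
{
    static char iobuf[2 * 1024 * 1024];
    std::ifstream datafile;
    datafile.rdbuf()->pubsetbuf(iobuf, sizeof iobuf); // before open() for portability
    datafile.open(path, std::ios::binary);

    char chunk[16 * 1024];
    std::streamsize total = 0;
    while (datafile)
    {
        datafile.read(chunk, sizeof chunk);
        total += datafile.gcount(); // last chunk may be short
    }
    return total;
}
```

Using gcount() after each read() handles the short final chunk correctly.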