Efficient Way of Reading a File into an std::vector&lt;char&gt;

Efficient way of reading a file into an std::vector&lt;char&gt;?

The canonical form is this:

#include &lt;fstream&gt;
#include &lt;iterator&gt;
#include &lt;vector&gt;
// ...

std::ifstream testFile("testfile", std::ios::binary);
std::vector&lt;char&gt; fileContents((std::istreambuf_iterator&lt;char&gt;(testFile)),
                               std::istreambuf_iterator&lt;char&gt;());

If you are worried about reallocations then reserve space in the vector:

#include &lt;fstream&gt;
#include &lt;iterator&gt;
#include &lt;vector&gt;
// ...

std::ifstream testFile("testfile", std::ios::binary);
std::vector&lt;char&gt; fileContents;
fileContents.reserve(fileSize);  // fileSize must be known beforehand
fileContents.assign(std::istreambuf_iterator&lt;char&gt;(testFile),
                    std::istreambuf_iterator&lt;char&gt;());

How to read a file into a vector elegantly and efficiently?

I've checked your code on my side, using MinGW 4.8.2.
Out of curiosity I've added an additional function f3 with the following implementation:

inline vector&lt;char&gt; f3()
{
    ifstream fin{ filepath, ios::binary };
    fin.seekg(0, fin.end);
    size_t len = fin.tellg();
    fin.seekg(0, fin.beg);

    vector&lt;char&gt; coll(len);
    fin.read(coll.data(), len);
    return coll;
}

I've tested using a file ~90 MB in size. On my platform the results were a bit different from yours.

  • f1() ~850ms
  • f2() ~600ms
  • f3() ~70ms

The results were calculated as mean of 10 consecutive file reads.

The f3 function takes the least time because at vector&lt;char&gt; coll(len); it has all the required memory allocated up front, so no further reallocations need to be done. As for back_inserter, it requires the type to have a push_back member function, which for vector reallocates when the capacity is exceeded. As described in the docs:

push_back

This effectively increases the container size by one, which causes an
automatic reallocation of the allocated storage space if -and only if-
the new vector size surpasses the current vector capacity.

Of the f1 and f2 implementations, the latter is slightly faster, although both use back_inserter. f2 is probably faster because it reads the file in chunks, which allows some buffering to take place.

Efficient way of reading part of a file into an std::vector&lt;char&gt;?

auto old_end = buffer.size();
buffer.resize( old_end + blocksize );

//...

file.read( &buffer[old_end], blocksize );
auto actual_size = file.gcount();
if (actual_size < blocksize) buffer.resize(old_end + actual_size);

Loading a file into a vector&lt;char&gt;

Another approach, using rdbuf() to read the whole file into a std::ostringstream first:

#include <fstream>
#include <sstream>
#include <vector>
#include <string>

// for check:
#include <algorithm>
#include <iterator>
#include <iostream>

int main() {
    std::ifstream file("test.cc");
    std::ostringstream ss;
    ss &lt;&lt; file.rdbuf();
    const std::string&amp; s = ss.str();
    std::vector&lt;char&gt; vec(s.begin(), s.end());

    // check:
    std::copy(vec.begin(), vec.end(), std::ostream_iterator&lt;char&gt;(std::cout));
}

Load data into std::vector&lt;char&gt; efficiently

Is there any way to tell the v2 vector that its internal memory buffer is loaded with data?

No.

The behaviour of your second example is undefined.

This would be useful when you want to use a std::vector to hold data that is sourced from a file stream.

You can read a file into vector like this:

std::vector<char> v3(count);
ifs.read(v3.data(), count);

Or like this:

using It = std::istreambuf_iterator<char>;
std::vector<char> v4(It{ifs}, It{});


Fastest way to read a vector&lt;double&gt; from file

Assuming both systems, Windows and Android, are little-endian, which is common on ARM and x86/x64 CPUs, you can do the following.

First: Pick a type with a specific size: double (64-bit), float (32-bit), or uint64_t/uint32_t/uint16_t or int64_t/int32_t/int16_t. Do NOT use types like int or long, whose sizes vary between platforms, to determine your data type.

Second: Use the following method to write binary data:

std::vector<uint64_t> myVec;
std::ofstream f("outputFile.bin", std::ios::binary);
f.write(reinterpret_cast<char*>(myVec.data()), myVec.size()*sizeof(uint64_t));
f.close();

Here you're taking the raw data and writing its binary representation to a file.

Now on the other machine, make sure the data type you use has the same size and the same endianness. If both match, you can do this:

std::vector<uint64_t> myVec(sizeOfTheData);
std::ifstream f("outputFile.bin", std::ios::binary);
f.read(reinterpret_cast<char*>(&myVec.front()), myVec.size()*sizeof(uint64_t));
f.close();

Notice that you have to know the size of the data before reading it.

Note: This code is off my head. I haven't tested it, but it should work.

Now if the target system doesn't have the same endianness, you have to read the data in batches, flip the endianness, then put it in your vector. How to flip endianness was extensively discussed here.

To determine the endianness of your system, this was discussed here.

The performance penalty is proportional to how different the two systems are. If both have the same endianness and you choose the same data type and size, you're good and you have optimum performance. Otherwise, you'll pay some penalty depending on how many conversions you have to do. This is the fastest you can ever get.

Note from comments: if you're transferring doubles or floats, make sure both systems use the IEEE 754 standard. It's even more widely shared than endianness, but check just to be sure.

Now if these solutions don't suit you, then you have to use a proper serialization library to standardize the format for you. There are libraries that can do that, such as protobuf.

How do I read a text file into a 2D vector?

You can use getline to read the file line by line into strings. For each string you read, iterate over its characters to build row vectors to populate your forest:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::string line;
    std::ifstream infile("file.txt");
    std::vector&lt;std::vector&lt;char&gt; &gt; forest;

    while (std::getline(infile, line)) {
        std::vector&lt;char&gt; row;

        for (char &amp;c : line) {
            if (c != ',') {
                row.push_back(c);
            }
        }

        forest.push_back(row);
    }

    for (std::vector&lt;char&gt; &amp;row : forest) {
        for (char &amp;c : row) {
            std::cout &lt;&lt; c &lt;&lt; ' ';
        }

        std::cout &lt;&lt; '\n';
    }

    return 0;
}

Output:

T T T F T
T T T T T
T T T T T
T T T T T
T T T T T
T T T T T
T T T T T

Faster way of loading a (big) std::vector&lt;std::vector&lt;float&gt;&gt; from file

This is an implementation of Alan Birtles' comment: when reading, read an inner vector with one single FILE.read call instead of many individual ones. This reduces the time dramatically on my system.

These are the results for a 2GB file:

Writing    took 2283 ms
Reading v1 took 7429 ms
Reading v2 took 644 ms

Here is the code that produces this output:

#include <vector>
#include <iostream>
#include <string>
#include <chrono>
#include <random>
#include <fstream>

using RV = std::vector<std::vector<float>>;

void saveData(std::string path, const RV&amp; RecordData)
{
    std::ofstream FILE(path, std::ios::out | std::ofstream::binary);

    // Store size of the outer vector
    int s1 = RecordData.size();
    FILE.write(reinterpret_cast&lt;const char*&gt;(&amp;s1), sizeof(s1));

    // Now write each vector one by one
    for (auto&amp; v : RecordData) {
        // Store its size
        int size = v.size();
        FILE.write(reinterpret_cast&lt;const char*&gt;(&amp;size), sizeof(size));

        // Store its contents
        FILE.write(reinterpret_cast&lt;const char*&gt;(&amp;v[0]), v.size() * sizeof(float));
    }
    FILE.close();
}

//original version for comparison
void loadData1(std::string path, RV&amp; RecordData)
{
    std::ifstream FILE(path, std::ios::in | std::ifstream::binary);

    if (RecordData.size() &gt; 0) // Clear data
    {
        for (int n = 0; n &lt; RecordData.size(); n++)
            RecordData[n].clear();
        RecordData.clear();
    }

    int size = 0;
    FILE.read(reinterpret_cast&lt;char*&gt;(&amp;size), sizeof(size));
    RecordData.resize(size);
    for (int n = 0; n &lt; size; ++n) {
        int size2 = 0;
        FILE.read(reinterpret_cast&lt;char*&gt;(&amp;size2), sizeof(size2));
        float f;
        //RecordData[n].resize(size2); // This doesn't make a difference in speed
        for (int k = 0; k &lt; size2; ++k) {
            FILE.read(reinterpret_cast&lt;char*&gt;(&amp;f), sizeof(f));
            RecordData[n].push_back(f);
        }
    }
}

//my version
void loadData2(std::string path, RV&amp; RecordData)
{
    std::ifstream FILE(path, std::ios::in | std::ifstream::binary);

    if (RecordData.size() &gt; 0) // Clear data
    {
        for (int n = 0; n &lt; RecordData.size(); n++)
            RecordData[n].clear();
        RecordData.clear();
    }

    int size = 0;
    FILE.read(reinterpret_cast&lt;char*&gt;(&amp;size), sizeof(size));
    RecordData.resize(size);
    for (auto&amp; v : RecordData) {
        // load its size
        int size2 = 0;
        FILE.read(reinterpret_cast&lt;char*&gt;(&amp;size2), sizeof(size2));
        v.resize(size2);

        // load its contents
        FILE.read(reinterpret_cast&lt;char*&gt;(&amp;v[0]), v.size() * sizeof(float));
    }
}

int main()
{
    using namespace std::chrono;
    const std::string filepath = "./vecdata";
    const std::size_t sizeOuter = 16000;
    const std::size_t sizeInner = 32000;
    RV vecSource;
    RV vecLoad1;
    RV vecLoad2;

    const auto tGen1 = steady_clock::now();
    std::cout &lt;&lt; "generating random numbers..." &lt;&lt; std::flush;
    std::random_device dev;
    std::mt19937 rng(dev());
    std::uniform_real_distribution&lt;float&gt; dis;
    for (int i = 0; i &lt; sizeOuter; ++i)
    {
        RV::value_type inner;
        for (int k = 0; k &lt; sizeInner; ++k)
        {
            inner.push_back(dis(rng));
        }
        vecSource.push_back(inner);
    }
    const auto tGen2 = steady_clock::now();

    std::cout &lt;&lt; "done\nSaving..." &lt;&lt; std::flush;
    const auto tSave1 = steady_clock::now();
    saveData(filepath, vecSource);
    const auto tSave2 = steady_clock::now();

    std::cout &lt;&lt; "done\nReading v1..." &lt;&lt; std::flush;
    const auto tLoadA1 = steady_clock::now();
    loadData1(filepath, vecLoad1);
    const auto tLoadA2 = steady_clock::now();
    std::cout &lt;&lt; "verifying..." &lt;&lt; std::flush;
    if (vecSource != vecLoad1) std::cout &lt;&lt; "FAILED! ...";

    std::cout &lt;&lt; "done\nReading v2..." &lt;&lt; std::flush;
    const auto tLoadB1 = steady_clock::now();
    loadData2(filepath, vecLoad2);
    const auto tLoadB2 = steady_clock::now();
    std::cout &lt;&lt; "verifying..." &lt;&lt; std::flush;
    if (vecSource != vecLoad2) std::cout &lt;&lt; "FAILED! ...";

    std::cout &lt;&lt; "done\nResults:\n" &lt;&lt;
        "Generating took " &lt;&lt; duration_cast&lt;milliseconds&gt;(tGen2 - tGen1).count() &lt;&lt; " ms\n" &lt;&lt;
        "Writing took " &lt;&lt; duration_cast&lt;milliseconds&gt;(tSave2 - tSave1).count() &lt;&lt; " ms\n" &lt;&lt;
        "Reading v1 took " &lt;&lt; duration_cast&lt;milliseconds&gt;(tLoadA2 - tLoadA1).count() &lt;&lt; " ms\n" &lt;&lt;
        "Reading v2 took " &lt;&lt; duration_cast&lt;milliseconds&gt;(tLoadB2 - tLoadB1).count() &lt;&lt; " ms\n" &lt;&lt;
        std::flush;
}

