Are There Binary Memory Streams in C++

Streaming binary files in C++

What strikes me as particularly weird is the read() function, which seems to me to be completely unusable seeing as how it doesn't tell how many bytes it actually put into the supplied buffer.

read() does not return until either (1) the requested number of characters has been read, (2) EOF is reached, or (3) an error occurs.

After read() returns, if the read succeeded, you can call gcount() to find out how many characters were read into the buffer. If EOF was reached during the read, the stream's eofbit is set and gcount() returns fewer characters than you requested.

If the read fails, the stream's failbit and/or badbit is set.

std::ifstream ifs(...);
if (ifs) {
    // stream opened...
    //...
    ifs.read(buffer, sizeof(buffer));
    if (ifs) {
        // read succeeded, EOF may have been reached...
        std::streamsize numInBuf = ifs.gcount();
        //...
    } else {
        // read failed...
    }
    //...
} else {
    // stream not opened...
}

If you use the stream's exceptions() method to enable error reporting via exceptions, an std::ios_base::failure exception may be thrown if the failure matches the error bits you have enabled exceptions for.

std::ifstream ifs;
ifs.exceptions(std::ifstream::badbit | std::ifstream::failbit);
try {
    ifs.open(...);
    // stream opened...
    //...
    ifs.read(buffer, sizeof(buffer));
    // read succeeded, EOF may have been reached...
    std::streamsize numInBuf = ifs.gcount();
    //...
} catch (const std::ios_base::failure &e) {
    // stream failure...
}

So how is one actually supposed to stream a file/pipe/socket with non-text data in C++? Is there some better facility than ifstream, perhaps?

std::ifstream is designed for file-based streams. In the case of pipes, if your platform can access pipes via the standard file APIs, std::ifstream should work. For sockets, though, you need a more appropriate std::basic_istream derived class, or at least a standard std::istream with a custom std::streambuf derived class attached to it; a minimal sketch of that approach follows.
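Here is one way the streambuf approach can look, assuming a POSIX socket descriptor; SocketStreambuf is a made-up name, and a production version would also need to deal with buffering policy, timeouts, and output:

#include <istream>
#include <streambuf>
#include <unistd.h>   // POSIX read()

class SocketStreambuf : public std::streambuf {
public:
    explicit SocketStreambuf(int fd) : fd_(fd) {}

protected:
    // called when the get area is empty: refill it from the socket
    virtual int_type underflow() {
        ssize_t n = ::read(fd_, buffer_, sizeof(buffer_));
        if (n <= 0)
            return traits_type::eof();   // connection closed, or error
        setg(buffer_, buffer_, buffer_ + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    int fd_;
    char buffer_[4096];
};

// usage with an already-connected socket:
// SocketStreambuf sbuf(sockfd);
// std::istream in(&sbuf);
// in.read(dest, sizeof(dest));
// std::streamsize got = in.gcount();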

Access Binary form of Text saved in memory using an Array

I have been told that I/O functions like fread and fgets will read a string from disk into memory without conversion.

This is true if the file has been opened in binary mode, i.e., with "rb". Such streams do not undergo any translation when read into memory, and all stream functions, getc() included, read the contents exactly as stored on disk. On unix-based systems there is no difference from "r", but on legacy systems there can be substantial differences: text mode, which is the default, may imply end-of-line conversion, code page translation, or special end-of-file handling (such as treating a Ctrl-Z byte as EOF). If you want the actual file contents, always use binary mode ("rb").

You should also avoid the plain char type when dealing with binary data, because char is signed by default on many architectures and hence inappropriate for byte values, which are usually considered positive. Use unsigned char to prevent this issue.(*)
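A tiny demonstration of the pitfall; the first value printed depends on whether plain char is signed on your platform:

#include <stdio.h>

int main(void) {
    char c = (char)0xE9;           /* the byte 0xE9, e.g. 'é' in Latin-1 */
    unsigned char u = 0xE9;
    /* both promote to int when passed to printf;
       a signed char yields a negative value */
    printf("%d %d\n", c, u);       /* where char is signed, prints: -23 233 */
    return 0;
}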

The most common way to display binary contents is using hexadecimal representation, where each byte is output as exactly 2 hex digits.

If you want to output binary representation, there is no standard printf conversion for base-2 numbers (C23 adds a %b conversion, but it is not yet widely supported), but you can write a loop to convert each byte to its bit values.


(*) among other historical issues, such as non-two's-complement signed value representations

Here is a modified version:

#include <stdio.h>

int main() {
    FILE *fp = fopen("a.txt", "rb");   /* binary mode: read the bytes as stored */
    if (fp == NULL) {
        perror("a.txt");
        return 1;
    }
    unsigned char buffer[100];
    unsigned char bits[100 * 8];
    int r = fread(buffer, 1, sizeof(buffer), fp);
    if (r <= 0) {
        fprintf(stderr, "empty file\n");
        fclose(fp);
        return 1;
    }

    printf("As a string: %.*s\n\n", r, (char *)buffer);
    int pos;
    pos = printf("As 8-bit integers:");
    for (int i = 0; i < r; i++) {
        if (pos > 72) {
            printf("\n");
            pos = 0;
        }
        pos += printf(" %d", buffer[i]);
    }
    printf("\n\n");

    pos = printf("As hex bytes:");
    for (int i = 0; i < r; i++) {
        if (pos > 72) {
            printf("\n");
            pos = 0;
        }
        pos += printf(" %02X", buffer[i]);
    }
    printf("\n\n");

    pos = printf("Converting to a bit array:");
    for (int i = 0; i < r; i++) {
        for (int j = 8; j-- > 0;) {
            bits[i * 8 + 7 - j] = (buffer[i] >> j) & 1;   /* MSB first */
        }
    }
    /* output the bit array */
    for (int i = 0; i < r * 8; i++) {
        if (pos > 72) {
            printf("\n    ");
            pos = 4;
        }
        pos += printf("%d", bits[i]);
    }
    printf("\n");
    fclose(fp);
    return 0;
}

.net: efficient way to read a binary file into memory then access

To get the initial MemoryStream from reading the file, the following works:

byte[] bytes;
try
{
    // File.ReadAllBytes opens a FileStream and then ensures it is closed
    bytes = File.ReadAllBytes(_fi.FullName);
    _ms = new MemoryStream(bytes, 0, bytes.Length, false, true);
}
catch (IOException)
{
    throw;   // rethrow without resetting the stack trace
}

File.ReadAllBytes() copies the file content into memory. Internally it wraps the file stream in a using statement, which guarantees the file gets closed; no finally block is needed.

I can read individual values from the MemoryStream using MemoryStream.Read. These calls involve copies of those values, which is fine.

In one situation, I needed to read a table out of the file, change a value, and then calculate a checksum of the entire file with that change in place. Instead of copying the entire file so that I could edit one part, I was able to calculate the checksum in progressive steps: first over the initial, unchanged segment of the file, then over the changed middle segment, then over the remainder.

For this I could process the first and final segments directly from the MemoryStream. This involved lots of reads, each of which copies, but those copies were transient locals, so there was no significant working-set increase.
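The code in this answer is C#, but the progressive-checksum idea itself is language-neutral. Here is a minimal C++ sketch of it, assuming a simple additive 32-bit checksum; update_checksum and checksum_with_patch are made-up names:

#include <stdint.h>
#include <stddef.h>

/* accumulate bytes into a running 32-bit sum */
static uint32_t update_checksum(uint32_t sum, const unsigned char *p, size_t n) {
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* file = the original bytes; patched = the modified copy of the middle segment */
uint32_t checksum_with_patch(const unsigned char *file, size_t fileLen,
                             const unsigned char *patched,
                             size_t offset, size_t segLen) {
    uint32_t sum = 0;
    sum = update_checksum(sum, file, offset);                 /* unchanged head */
    sum = update_checksum(sum, patched, segLen);              /* modified middle */
    sum = update_checksum(sum, file + offset + segLen,
                          fileLen - offset - segLen);         /* unchanged tail */
    return sum;
}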

The middle segment needed to be copied, since it had to be changed (while the original version needed to be kept intact). The following worked:

// get ref (not copy!) to the byte array underlying the MemoryStream
byte[] fileData = _ms.GetBuffer();

// determine the required length
int length = _tableRecord.Length;

// create array to hold the copy
byte[] segmentCopy = new byte[length];

// get the copy
Array.ConstrainedCopy(fileData, _tableRecord.Offset, segmentCopy, 0, length);

After modifying values in segmentCopy, I then needed to pass this to my static method for calculating checksums, which expected a MemoryStream (for sequential reading). This worked:

// new MemoryStream will hold a ref to the segmentCopy array (no new copy!)
MemoryStream ms = new MemoryStream(segmentCopy, 0, segmentCopy.Length);

What I haven't needed to do yet, but will want to do, is to get a slice of the MemoryStream that doesn't involve copying. This works:

MemoryStream sliceFromMS = new MemoryStream(fileData, offset, length);

From above, fileData was a ref to the array underlying the original MemoryStream. Now sliceFromMS will have a ref to a segment within that same array.

How to import and read large binary file data in c#?

This will very much depend on what format the file is in. Each byte in the file might represent different things, or it might just represent values from a large array, or some mix of the two.

You need to know what the format looks like to be able to read it, since binary files are not self-descriptive. Reading a simple object might look like

var authorName = binReader.ReadString();
var publishDate = DateTime.FromBinary(binReader.ReadInt64());
...

If you have a list of items it is common to use a length prefix. Something like

var numItems = binReader.ReadInt32();
for (int i = 0; i < numItems; i++) {
    var title = binReader.ReadString();
    ...
}

You would then typically create one or more objects from the data that can be used in the rest of the application. For example:

new Bibliography(authorName, publishDate , books);

If this is a format you do not control, I hope you have a detailed specification. Otherwise it is kind of a lost cause for anything but the kludgiest solutions.

If there is more data than can fit in memory, you need some kind of streaming mechanism, i.e., read one item, do some processing of it, save the result, read the next item, and so on.

If you do control the format, I would suggest alternatives that are easier to manage. I have used protobuf-net and find it quite easy to use, but there are other alternatives. The common way to use these kinds of libraries is to create a class for the data and add attributes to the fields that should be stored. The library can manage serialization/deserialization automatically, and usually handles things like inheritance and changes to the format in an easy way.

Simpler way to create a C++ memorystream from (char*, size_t), without copying the data?

I'm assuming that your input data is binary (not text), and that you want to extract chunks of binary data from it. All without making a copy of your input data.

You can combine boost::iostreams::basic_array_source and boost::iostreams::stream_buffer (from Boost.Iostreams) with boost::archive::binary_iarchive (from Boost.Serialization) to get convenient >> extraction operators for reading chunks of binary data.

#include <stdint.h>
#include <iostream>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>
#include <boost/archive/binary_iarchive.hpp>

int main()
{
    uint16_t data[] = {1234, 5678};
    char* dataPtr = (char*)&data;

    typedef boost::iostreams::basic_array_source<char> Device;
    boost::iostreams::stream_buffer<Device> buffer(dataPtr, sizeof(data));
    boost::archive::binary_iarchive archive(buffer, boost::archive::no_header);

    uint16_t word1, word2;
    archive >> word1 >> word2;
    std::cout << word1 << "," << word2 << std::endl;
    return 0;
}

With GCC 4.4.1 on AMD64, it outputs:

1234,5678

Boost.Serialization is very powerful and knows how to serialize all basic types, strings, and even STL containers. You can easily make your own types serializable; see the documentation. Hidden somewhere in the Boost.Serialization sources is an example of a portable binary archive that performs the proper swapping for your machine's endianness. That might be useful to you as well.
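For instance, making your own type serializable can be as simple as adding the intrusive serialize() member template; Packet here is a made-up type for illustration:

#include <stdint.h>
#include <iostream>
#include <sstream>
#include <string>
#include <boost/archive/binary_iarchive.hpp>
#include <boost/archive/binary_oarchive.hpp>
#include <boost/serialization/string.hpp>

struct Packet {
    uint32_t id;
    std::string payload;

    // one member template handles both saving and loading
    template <class Archive>
    void serialize(Archive &ar, const unsigned int /*version*/) {
        ar & id;
        ar & payload;
    }
};

int main()
{
    std::stringstream ss;
    {
        boost::archive::binary_oarchive oa(ss);
        Packet p = {42, "hello"};
        oa << p;
    }   // the archive flushes when it goes out of scope
    {
        boost::archive::binary_iarchive ia(ss);
        Packet q;
        ia >> q;
        std::cout << q.id << "," << q.payload << std::endl;   // prints: 42,hello
    }
    return 0;
}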

If you don't need the fanciness of Boost.Serialization and are happy to read the binary data in an fread()-type fashion, you can use basic_array_source in a simpler way:

#include <stdint.h>
#include <iostream>
#include <boost/iostreams/device/array.hpp>
#include <boost/iostreams/stream.hpp>

int main()
{
    uint16_t data[] = {1234, 5678};
    char* dataPtr = (char*)&data;

    typedef boost::iostreams::basic_array_source<char> Device;
    boost::iostreams::stream<Device> stream(dataPtr, sizeof(data));

    uint16_t word1, word2;
    stream.read((char*)&word1, sizeof(word1));
    stream.read((char*)&word2, sizeof(word2));
    std::cout << word1 << "," << word2 << std::endl;

    return 0;
}

I get the same output with this program.

How to read a binary file quickly in c#? (ReadOnlySpan vs MemoryStream)

I did some measurements of your code on my computer (Intel Q9400, 8 GiB RAM, SSD disk, Win10 x64 Home, .NET Framework 4.7.2, tested with a 15 MB (when unpacked) file), with these results:

No-Span version: 520 ms

Span version: 720 ms

So the Span version is actually slower! Why? Because new ReadOnlySpan<byte>(m.ToArray()) performs an additional copy of the whole file, and ReadUInt32() performs many slicings of the Span (slicing is cheap, but not free). Since you performed more work, you can't expect performance to be any better just because you used Span.

So can we do better? Yes. It turns out that the slowest part of your code is actually the garbage collection caused by repeatedly allocating the 4-byte arrays created by the .ToArray() calls in the ReadUInt32() method. You can avoid that by implementing ReadUInt32() yourself. It's pretty easy, and it also eliminates the need for Span slicing. You can also replace new ReadOnlySpan<byte>(m.ToArray()) with new ReadOnlySpan<byte>(m.GetBuffer()).Slice(0, (int)m.Length), which performs cheap slicing instead of a copy of the whole file. So now the code looks like this:

public static void Read(FileInfo path)
{
    using (FileStream filestream = path.OpenRead())
    {
        using (var d = new GZipStream(filestream, CompressionMode.Decompress))
        {
            using (MemoryStream m = new MemoryStream())
            {
                d.CopyTo(m);
                int position = 0;

                ReadOnlySpan<byte> stream = new ReadOnlySpan<byte>(m.GetBuffer()).Slice(0, (int)m.Length);

                while (position != stream.Length)
                {
                    UInt32 value = stream.ReadUInt32(position);
                    position += 4;
                }
            }
        }
    }
}

public static class BinaryReaderBigEndian
{
    public static UInt32 ReadUInt32(this ReadOnlySpan<byte> stream, int start)
    {
        UInt32 res = 0;
        for (int i = 0; i < 4; i++)
        {
            res = (res << 8) | (((UInt32)stream[start + i]) & 0xff);
        }
        return res;
    }
}

With these changes, I went from 720 ms down to 165 ms (4x faster). Sounds great, doesn't it? But we can do even better: we can avoid the MemoryStream copy entirely, and inline and further optimize ReadUInt32():

public static void Read(FileInfo path)
{
    using (FileStream filestream = path.OpenRead())
    {
        using (var d = new GZipStream(filestream, CompressionMode.Decompress))
        {
            var buffer = new byte[64 * 1024];

            do
            {
                int bufferDataLength = FillBuffer(d, buffer);

                if (bufferDataLength % 4 != 0)
                    throw new Exception("Stream length not divisible by 4");

                if (bufferDataLength == 0)
                    break;

                for (int i = 0; i < bufferDataLength; i += 4)
                {
                    uint value = unchecked(
                        (((uint)buffer[i]) << 24)
                        | (((uint)buffer[i + 1]) << 16)
                        | (((uint)buffer[i + 2]) << 8)
                        | (((uint)buffer[i + 3]) << 0));
                }

            } while (true);
        }
    }
}

private static int FillBuffer(Stream stream, byte[] buffer)
{
    int read = 0;
    int totalRead = 0;
    do
    {
        read = stream.Read(buffer, totalRead, buffer.Length - totalRead);
        totalRead += read;

    } while (read > 0 && totalRead < buffer.Length);

    return totalRead;
}

And now it takes less than 90 ms (8x faster than the original!). And without Span! Span is great in situations where it lets you slice and avoid an array copy, but it won't improve performance just because you use it blindly. After all, Span is designed to have performance characteristics on par with Array, but not better (and only on runtimes that have special support for it, such as .NET Core 2.1).


