Creating an Input Stream from Constant Memory

The way to do this is to create a suitable stream buffer. This can be done, for example, like this:

#include <cstddef>
#include <istream>
#include <streambuf>

// Stream buffer that reads directly from an existing, constant block of memory.
struct membuf: std::streambuf {
    membuf(char const* base, std::size_t size) {
        char* p(const_cast<char*>(base));
        this->setg(p, p, p + size); // set the get area: begin, current, end
    }
};

// The virtual base ensures membuf is constructed before std::istream
// takes its address as the stream buffer.
struct imemstream: virtual membuf, std::istream {
    imemstream(char const* base, std::size_t size)
        : membuf(base, size)
        , std::istream(static_cast<std::streambuf*>(this)) {
    }
};

The only somewhat awkward thing is the const_cast<char*>() in the stream buffer: the stream buffer won't change the data, but the interface still requires char* to be used, mainly to make it easier to change the buffer in "normal" stream buffers. With this, you can use imemstream as a normal input stream:

imemstream in(data, size);
in >> value;

How is InputStream managed in memory?

InputStream and OutputStream implementations do not generally use a lot of memory. In fact, the word "Stream" in these type names signals that they do not need to hold the data, because it is accessed in a sequential manner -- in the same way that a stream can transfer water between a lake and the ocean without holding much water itself.

But "stream" is not the best word to describe this. It's more like a pipe, because when you transfer data from a server to a client, every stage transfers back-pressure from the client that controls the rate at which data gets sent. This is similar to how your faucet controls the rate of flow through your pipes all the way to the city reservoir:

  1. As the client reads data, its InputStream only requests more data from the OS when its internal (small) buffers are empty. Each request allows only a limited amount of data to be transferred.
  2. As data is handed to the client, the OS's own internal buffer empties, and it notifies the server about how much space is available for new data. The server may send only that much (this is called 'flow control' in TCP: https://en.wikipedia.org/wiki/Transmission_Control_Protocol#Resource_usage).
  3. On the server side, the server-side OS sends out data from its own internal buffer when the client has space to receive it. As that buffer empties, it allows the writing process to re-fill it with more data.
  4. As the server-side process write()s to its OutputStream, the OutputStream tries to hand the data to the OS. When the OS buffer is full, the server process is made to wait until the server-side buffer can accept new data.

Notice that a slow client can therefore make the server process take a very long time. If you're writing a server and you don't control the clients, it's important to consider this and to ensure that a slow transfer doesn't keep a lot of server-side resources tied up for its whole duration.
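
To make the pull-based flow in the steps above concrete, here is a minimal sketch (the class and method names are invented for this example) of a copy loop: no new chunk is requested from the input until the previous one has been written out, which is exactly how a slow consumer slows the producer.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class Pump {
    // Copies data in small, fixed-size chunks. Each read() asks the stream for
    // at most one chunk, and the next chunk is only requested after this one
    // has been consumed -- back-pressure propagates through this loop.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] chunk = new byte[8192]; // small internal buffer
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n); // a slow 'out' stalls the loop, and thus 'in'
        }
    }
}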

In Java, how can I create an InputStream for a specific part of a file?

I wrote a utility class that you can use like this:

try (FileChannel channel = FileChannel.open(file, READ);
     InputStream input = new PartialChannelInputStream(channel, start, start + size)) {
    thirdPartyMethod(input);
}

It reads the content of the file using a ByteBuffer, so you control the memory footprint.

import java.io.IOException;
import java.io.InputStream;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class PartialChannelInputStream extends InputStream {

    private static final int DEFAULT_BUFFER_CAPACITY = 2048;

    private final FileChannel channel;
    private final ByteBuffer buffer;
    private long position;
    private final long end;

    public PartialChannelInputStream(FileChannel channel, long start, long end)
            throws IOException {
        this(channel, start, end, DEFAULT_BUFFER_CAPACITY);
    }

    public PartialChannelInputStream(FileChannel channel, long start, long end, int bufferCapacity)
            throws IOException {
        if (start > end) {
            throw new IllegalArgumentException("start(" + start + ") > end(" + end + ")");
        }

        this.channel = channel;
        this.position = start;
        this.end = end;
        this.buffer = ByteBuffer.allocateDirect(bufferCapacity);
        fillBuffer(end - start);
    }

    private void fillBuffer(long stillToRead) throws IOException {
        // Clamp the buffer so we never read past the requested range.
        if (stillToRead < buffer.limit()) {
            buffer.limit((int) stillToRead);
        }
        // Absolute-position read: the channel's own position is left untouched,
        // which is what makes sharing one channel between streams safe.
        channel.read(buffer, position);
        buffer.flip();
    }

    @Override
    public int read() throws IOException {
        long stillToRead = end - position;
        if (stillToRead <= 0) {
            return -1;
        }

        if (!buffer.hasRemaining()) {
            buffer.clear(); // reset position and limit before refilling
            fillBuffer(stillToRead);
        }

        try {
            position++;
            // Mask to 0..255: InputStream.read() must never return a negative byte.
            return buffer.get() & 0xFF;
        } catch (BufferUnderflowException e) {
            // Encountered EOF before reaching 'end'
            position = end;
            return -1;
        }
    }
}

The implementation above allows you to create multiple PartialChannelInputStream instances reading from the same FileChannel and to use them concurrently.
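
For example (a hypothetical snippet; file, start, and size are assumed to be in scope), two slices of the same file can be consumed side by side, because the absolute-position channel.read(buffer, position) never moves the channel's shared position:

try (FileChannel channel = FileChannel.open(file, READ);
     InputStream first = new PartialChannelInputStream(channel, start, start + size);
     InputStream second = new PartialChannelInputStream(channel, start + size, start + 2 * size)) {
    // Each stream tracks its own position, so the two can be read
    // independently, even from different threads.
}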

If that's not necessary, the simplified code below takes a Path directly.

import static java.nio.file.StandardOpenOption.READ;

import java.io.IOException;
import java.io.InputStream;
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;

public class PartialFileInputStream extends InputStream {

    private static final int DEFAULT_BUFFER_CAPACITY = 2048;

    private final FileChannel channel;
    private final ByteBuffer buffer;
    private long stillToRead;

    public PartialFileInputStream(Path file, long start, long end)
            throws IOException {
        this(file, start, end, DEFAULT_BUFFER_CAPACITY);
    }

    public PartialFileInputStream(Path file, long start, long end, int bufferCapacity)
            throws IOException {
        if (start > end) {
            throw new IllegalArgumentException("start(" + start + ") > end(" + end + ")");
        }

        // This stream owns the channel, so close() below closes it as well.
        this.channel = FileChannel.open(file, READ).position(start);
        this.buffer = ByteBuffer.allocateDirect(bufferCapacity);
        this.stillToRead = end - start;
        fillBuffer();
    }

    private void fillBuffer() throws IOException {
        // Clamp the buffer so we never read past the requested range.
        if (stillToRead < buffer.limit()) {
            buffer.limit((int) stillToRead);
        }
        channel.read(buffer);
        buffer.flip();
    }

    @Override
    public int read() throws IOException {
        if (stillToRead <= 0) {
            return -1;
        }

        if (!buffer.hasRemaining()) {
            buffer.clear(); // reset position and limit before refilling
            fillBuffer();
        }

        try {
            stillToRead--;
            // Mask to 0..255: InputStream.read() must never return a negative byte.
            return buffer.get() & 0xFF;
        } catch (BufferUnderflowException e) {
            // Encountered EOF before reaching 'end'
            stillToRead = 0;
            return -1;
        }
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }
}
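
Usage is the same as before, except this stream owns and closes the channel itself (the file name and byte range are just for illustration):

try (InputStream in = new PartialFileInputStream(Path.of("data.bin"), 100, 300)) {
    thirdPartyMethod(in); // sees only bytes 100..299 of the file
}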

How to make istringstream more efficient?

You can simply set the buffer used internally by the istringstream:

std::istringstream stream;
stream.rdbuf()->pubsetbuf(p, std::strlen(p));

This does not copy the string. Note that pubsetbuf() wants char*, not const char*; it doesn't actually modify the buffer, so you can const_cast your C string pointer before passing it. One caveat: the standard leaves the effect of pubsetbuf() on a string buffer implementation-defined, so while this avoids the copy with libstdc++, other standard library implementations are allowed to ignore the call.

Can a java InputStream continuously read data from a method?

Sure. But you have a bit of an issue: the code that generates the endless stream of dynamic data can't simply live, all by itself, in the method that 'returns the inputstream'; that's what your realisation is about.

You have two major options:

Threads

Instead, you can fire off a thread that continually generates data. Note that whatever it 'generates' needs to be cached somewhere: this is not a good fit if, say, you want to dynamically generate an inputstream that just serves up an endless amount of 0 bytes. It is a good fit if the data comes from, say, a USB-connected arduino that from time to time sends information from a temperature sensor it's connected to. The thread needs to store the data it receives someplace, and then you have an inputstream that 'pulls' from that queue of data (a sketch follows below; for the general subclassing technique, see the next section). As this involves threads, use something from java.util.concurrent, such as ArrayBlockingQueue - this has the double benefit that you won't get infinite buffers either, since the act of putting something into a full buffer will block.
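
As a rough sketch of that thread-plus-queue shape (the class name is invented here, and real code would also need some way to signal end-of-stream, such as a sentinel value):

import java.io.IOException;
import java.io.InputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueInputStream extends InputStream {
    // Bounded queue: when it is full, the producer thread blocks, which caps
    // the amount of memory the stream can tie up.
    private final BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(1024);

    // Called by the producer thread (e.g. the one listening to the arduino).
    public void put(int b) throws InterruptedException {
        queue.put(b & 0xFF);
    }

    @Override
    public int read() throws IOException {
        try {
            return queue.take(); // blocks until the producer supplies a byte
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("interrupted while waiting for data", e);
        }
    }
}

The JDK's PipedInputStream/PipedOutputStream pair packages up essentially this idea, if you'd rather not write it yourself.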

Subclassing

What you can also do is take the code that generates new values and put it in an envelope: a thing you can pass around. You want to write the code now, but not run it; it should run later, when the thing you hand the inputstream to calls .read().

One easy way to do that is to extend InputStream and override its read() method. It looks something like this:

class InfiniteZeroesInputStream extends InputStream {
    @Override
    public int read() {
        return 0;
    }
}

It's that simple. Given:

try (InputStream in = new InfiniteZeroesInputStream()) {
    in.read();  // returns 0, and always will.
    byte[] b = new byte[65536];
    in.read(b); // fills the whole array with zeroes.
}

Does a Java InputStream help or hurt memory usage with large files?

So you are saying that an InputStream would typically help?

It entirely depends on how the application (or library) *uses* the InputStream.

With what kind of follow up code? Could you offer an example of memory efficient Java?

For example:

// Efficient use of memory
try (InputStream is = new FileInputStream(largeFileName);
     BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    String line;
    while ((line = br.readLine()) != null) {
        // process one line
    }
}

// Inefficient use of memory
try (InputStream is = new FileInputStream(largeFileName);
     BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    StringBuilder sb = new StringBuilder();
    String line;
    while ((line = br.readLine()) != null) {
        sb.append(line).append("\n");
    }
    String everything = sb.toString();
    // process the entire string
}

// Very inefficient use of memory
try (InputStream is = new FileInputStream(largeFileName);
     BufferedReader br = new BufferedReader(new InputStreamReader(is))) {
    String everything = "";
    String line;
    while ((line = br.readLine()) != null) {
        everything += line + "\n"; // each += copies the whole accumulated string
    }
    // process the entire string
}

(Note that there are more efficient ways of reading a file into memory. The above examples are purely to illustrate the principles.)
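
For instance, the efficient line-by-line pattern is often written with java.nio.file these days, which streams lines lazily (a sketch, assuming a Path to the same large file):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class LineByLine {
    static void process(Path largeFile) throws IOException {
        // Files.lines is lazy: only one line needs to be held in memory at a time.
        try (Stream<String> lines = Files.lines(largeFile)) {
            lines.forEach(line -> {
                // process one line
            });
        }
    }
}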

The general principles here are:

  • avoid holding the entire file in memory, all at the same time
  • if you have to hold the entire file in memory, then be careful about how you "accumulate" the characters.

The posts that you linked to above:

  • The first one is not really about memory efficiency. Rather, it is talking about a limitation of the AWS client-side library. Apparently, the API doesn't provide an easy way to stream an object while reading it. You have to save the object to a file, then open the file as a stream. Whether that is memory efficient or not depends on what the application does with the stream; see above.

  • The second one is specific to the POI APIs. Apparently, the POI library itself reads the stream contents into memory if you use a stream. That would be an implementation limitation of that particular library.


