How Does BufferedReader Read Files from S3

How does BufferedReader read files from S3?

One answer to your question is already given in the documentation you linked:

Your network connection remains open until you read all of the data or close the input stream.

A BufferedReader doesn't know where the data it reads comes from, because you pass another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before handing out data to calls of read() or read(char[] buf).

The Reader you pass to the BufferedReader is, by the way, using another buffer of its own to convert the byte-based stream into a char-based reader. It works the same way as the BufferedReader: the internal buffer is filled by reading from the wrapped InputStream, which here is the InputStream returned by your S3 client.
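
As a rough sketch of that layering (someInputStream stands in for any byte stream, e.g. the one returned by the S3 client; all names are placeholders):

// The InputStreamReader decodes bytes to chars through its own internal buffer ...
Reader decoder = new InputStreamReader(someInputStream, StandardCharsets.UTF_8);
// ... and the BufferedReader keeps a separate buffer of already-decoded chars (here 4096 of them).
BufferedReader buffered = new BufferedReader(decoder, 4096);
// readLine() is served from that buffer; the underlying stream is only hit when the buffer runs empty.
String line = buffered.readLine();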

What exactly happens inside this client when you read from the stream is implementation dependent. One approach is to keep a single network connection open so that you can read from it as you wish; another is to close the connection after a chunk of data has been read and open a new one when you try to get the next chunk.

The documentation quoted above seems to say that we've got the former situation here, so: no, calls to readLine do not each lead to a separate network call.

And to answer your other question: no, the BufferedReader, the InputStreamReader and most likely the InputStream returned by the S3 client are not loading the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could otherwise simply return a byte[][] instead (to get around the limit of 2^31 - 1 bytes per byte array).

Edit: There is an exception to the last paragraph. If a gigabytes-large document has no line breaks at all, calling readLine will in fact read the whole data into memory (and most likely end in an OutOfMemoryError). I assumed a "regular" text document while answering your question.
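
If you cannot rely on line breaks being present, a sketch that keeps memory bounded is to read fixed-size chunks of characters instead of lines (reader is the BufferedReader from above, and handleChunk is a placeholder for your own processing):

char[] chunk = new char[8192];
int read;
while ((read = reader.read(chunk, 0, chunk.length)) != -1) {
    // at most 8192 chars are held at once, no matter how long the "lines" are
    handleChunk(chunk, read);
}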

How can I read an AWS S3 File with Java?

The 'File' class from Java doesn't understand that S3 exists. Here's an example from the AWS documentation of reading a file:

AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());        
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();
// Process the objectData stream.
objectData.close();
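
To fill in the "process the objectData stream" comment, one hedged possibility is to read the object line by line; processLine is a placeholder, and closing the reader also closes the underlying S3 stream:

try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(objectData, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        processLine(line); // placeholder for your own handling
    }
}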

How to make sure all the lines/records are read in S3Object

I wrote a method to read information from an S3 object.

It looks fine to me.¹

There are multiple records in the S3Object; what's the best way to read all the lines?

Your code should read all of the lines.

Does it only read the first line of the object?

No. It should read all of the lines.² That while loop reads until readLine() returns null, and that only happens when you reach the end of the stream.

How to make sure all the lines are read?

If you are getting fewer lines than you expect, EITHER the S3 object contains fewer lines than you think, OR something is causing the object stream to close prematurely.

For the former, count the lines as you read them and compare that with the expected line count.
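
For example (reader, expectedLineCount and the per-record processing stand in for whatever your method already has):

long count = 0;
String line;
while ((line = reader.readLine()) != null) {
    count++;
    // ... existing per-record processing ...
}
System.out.println("Read " + count + " lines, expected " + expectedLineCount);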

The latter could possibly be due to a timeout when reading a very large file. See How to read file chunk by chunk from S3 using aws-java-sdk for some ideas on how to deal with that problem.


1 - Actually, it would be better if you used try-with-resources to ensure that the S3 stream is always closed. But that won't cause you to "lose" lines.

2 - This assumes that the S3 service doesn't time out the connection, and that you are not requesting a part (chunk) or a range in the URI request parameters; see https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html .
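
For contrast, this is roughly what a deliberate range request looks like with the v1 SDK (the byte offsets are arbitrary); a plain getObject without withRange returns the whole object:

GetObjectRequest rangedRequest = new GetObjectRequest(bucketName, key)
        .withRange(0, 1023); // only the first 1024 bytes of the object
S3Object firstChunk = s3Client.getObject(rangedRequest);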

How to return io.BufferedReader through AWS Lambda (Python)?

f.read() returns bytes, and JSON does not support binary data. Also, "file": f is incorrect; it should probably be a filename. Anyway, you would usually return binary data in JSON as base64:

import base64

import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = "document.as.a.service.test"
    body = []
    for record in event['uuid_filepath']:
        # Download the object to local storage, then reopen it for reading
        with open('/tmp/2021-10-11T06:23:29:691472.pdf', 'wb') as f:
            s3.download_fileobj(bucket, "123TfZ/2021-10-11T06:23:29:691472.pdf", f)
        f = open("/tmp/2021-10-11T06:23:29:691472.pdf", "rb")
        body.append(f)

    return {
        "statusCode": 200,
        "file": "2021-10-11T06:23:29:691472.pdf",
        # b64encode returns bytes, so decode to a str that is JSON serializable
        "content": base64.b64encode(f.read()).decode('utf-8')
    }

Then on the client side, you have to decode base64 to binary.
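
For instance, a minimal client-side sketch in Java, assuming the "content" field has already been extracted from the JSON response into a String named contentBase64 (a placeholder):

byte[] pdfBytes = java.util.Base64.getDecoder().decode(contentBase64);
java.nio.file.Files.write(java.nio.file.Paths.get("downloaded.pdf"), pdfBytes);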

How do I read the content of a file in Amazon S3

First, get the object's InputStream:

S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();

Pass the InputStream, the file name and the path to the method below to save the stream to disk.

public void saveFile(String fileName, String path, InputStream objectData) throws Exception {
    OutputStream out = null;
    try {
        File newDirectory = new File(path);
        if (!newDirectory.exists()) {
            newDirectory.mkdirs();
        }

        File downloadedFile = new File(path, fileName);
        out = new FileOutputStream(downloadedFile);

        // Copy the stream in fixed-size chunks. Do not size a single buffer with
        // available(): it only reports what can be read without blocking, not the
        // full object size.
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = objectData.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
        }
    } catch (IOException io) {
        io.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (out != null) {
                out.close();
            }
            objectData.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

After you download the object, read the file, convert it to JSON and write it to a .txt file; after that you can upload the .txt file to the desired bucket in S3.
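
A sketch of that final upload step with the v1 SDK (the target bucket name, key and local file path are placeholders):

File resultFile = new File("/tmp/result.txt"); // the .txt file produced in the previous step
s3Client.putObject(targetBucketName, "results/result.txt", resultFile);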

OOM when trying to process an S3 file

This is a theory, but I can't think of any other reasons why your example would OOM.

Suppose that the uncompressed file consists of a very long line; e.g. something like 650 million ASCII bytes.

Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.

Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very long line, then the StringBuffer is going to get very large.

  • Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.

  • Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.

  • So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N characters of storage will be in use.

  • So 650MB could easily turn into a heap demand of > 6 x 650M bytes, since each char takes 2 bytes and up to 3 x N chars are live during the copy.

The other thing to note is that the 2 x N array has to be a single contiguous block of heap.

Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
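
Putting rough numbers on that theory (the 650 million characters are the assumed line length from above; each char occupies 2 bytes):

long n = 650_000_000L;                                // chars in the single long line
long currentBuffer = 2 * n;                           // ~1.3 GB for the existing char[]
long doubledBuffer = 2 * (2 * n);                     // ~2.6 GB for the char[] allocated by Arrays.copyOf
long peakDuringCopy = currentBuffer + doubledBuffer;  // ~3.9 GB live at once, i.e. 6 x N bytes, over the 3.1 GB max heap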


So what is the solution?

It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.

public static String unzip(InputStream in)
        throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    try (GZIPInputStream gzis = new GZIPInputStream(in);
         InputStreamReader reader = new InputStreamReader(gzis);
         BufferedReader br = new BufferedReader(reader)) {
        int ch;
        long i = 0;
        // Reading character by character means no single huge line is ever
        // accumulated in memory, unlike readLine().
        while ((ch = br.read()) >= 0) {
            i++;
            if (i % (100 * 1024 * 1024) == 0) {
                System.out.println(i / (1024 * 1024));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    }
    return null; // this version only counts characters, so there is no content to return
}

