How does BufferedReader read files from S3?
Part of the answer to your question is already given in the documentation you linked:
Your network connection remains open until you read all of the data or close the input stream.
A BufferedReader doesn't know where the data it reads comes from, because you pass another Reader to it. A BufferedReader creates a buffer of a certain size (e.g. 4096 characters) and fills this buffer by reading from the underlying Reader before it starts handing out data to calls of read() or read(char[] buf).
The Reader you pass to the BufferedReader is, by the way, using another buffer of its own to do the conversion from a byte-based stream to a char-based reader. It works the same way as with BufferedReader, so its internal buffer is filled by reading from the passed InputStream, which is the InputStream returned by your S3 client.
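The wrapping chain described above can be sketched as follows; a ByteArrayInputStream stands in for the stream returned by the S3 client, since the wrapping works the same way regardless of the source:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class S3ReaderChain {
    // Wraps a byte stream (e.g. the one returned by
    // s3Client.getObject(...).getObjectContent()) the same way the question's
    // code does: bytes -> chars (InputStreamReader) -> buffered lines.
    static BufferedReader asLineReader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8), 4096);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the S3 object stream
        InputStream source = new ByteArrayInputStream(
                "line one\nline two\n".getBytes(StandardCharsets.UTF_8));
        try (BufferedReader br = asLineReader(source)) {
            System.out.println(br.readLine());
        }
    }
}
```

Each readLine() call is served from the 4096-character buffer; the underlying stream is only touched when the buffer runs dry.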
What exactly happens inside this client when you read from the stream is implementation dependent. It could keep one network connection open that you read from as you wish, or it could close the connection after a chunk of data has been read and open a new one when you request the next chunk.
The documentation quoted above seems to say that we've got the former situation here, so: no, calls to readLine do not each trigger a separate network call.
And to answer your other question: no, the BufferedReader, the InputStreamReader, and most likely the InputStream returned by the S3 client do not load the whole document into memory. That would contradict the whole purpose of using streams in the first place, and the S3 client could simply return a byte[][] instead (to get around the limit of 2^31 - 1 bytes per byte array).
Edit: There is an exception to the last paragraph. If the whole multi-gigabyte document contains no line breaks, calling readLine will actually read all of the data into memory (and most likely cause an OutOfMemoryError). I assumed a "regular" text document while answering your question.
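If line-less input is a possibility, one way to stay safe is to read bounded chunks with read(char[]) instead of readLine(). A minimal sketch, where countChars is a hypothetical helper (not part of any S3 API):

```java
import java.io.IOException;
import java.io.Reader;

public class ChunkedRead {
    // Processes the input in fixed-size chunks, so memory use stays bounded
    // even if the input contains no line breaks at all.
    static long countChars(Reader reader) throws IOException {
        char[] buf = new char[8192];
        long total = 0;
        int n;
        while ((n = reader.read(buf)) > 0) {
            total += n; // process buf[0..n) here instead of accumulating it
        }
        return total;
    }
}
```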
How can I read an AWS S3 File with Java?
The 'File' class from Java doesn't understand that S3 exists. Here's an example of reading a file from the AWS documentation:
AmazonS3 s3Client = new AmazonS3Client(new ProfileCredentialsProvider());
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();
// Process the objectData stream.
objectData.close();
How to make sure all the lines/records are read in S3Object
I wrote a method to read information from S3 object.
It looks fine to me¹.
There are multiple records in S3Object, what's the best way to read all the lines.
Your code should read all of the lines.
Does it only read the first line of the object?
No. It should read all of the lines². That while loop reads until readLine() returns null, and that only happens when you reach the end of the stream.
How to make sure all the lines are read?
If you are getting fewer lines than you expect, EITHER the S3 object contains fewer lines than you think, OR something is causing the object stream to close prematurely.
For the former, count the lines as you read them and compare that with the expected line count.
The latter could possibly be due to a timeout when reading a very large file. See How to read file chunk by chunk from S3 using aws-java-sdk for some ideas on how to deal with that problem.
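The line-counting check suggested above can be sketched like this (countLines is a hypothetical helper):

```java
import java.io.BufferedReader;
import java.io.IOException;

public class LineCheck {
    // Counts lines the same way the question's loop reads them, so the result
    // can be compared against the expected line count of the S3 object.
    static long countLines(BufferedReader br) throws IOException {
        long count = 0;
        while (br.readLine() != null) {
            count++;
        }
        return count;
    }
}
```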
1 - Actually, it would be better if you used try-with-resources to ensure that the S3 stream is always closed. But that won't cause you to "lose" lines.
2 - This assumes that the S3 service doesn't time out the connection, and that you are not requesting a part (chunk) or a range in the URI request parameters; see https://docs.aws.amazon.com/AmazonS3/latest/API/API_GetObject.html .
How to return io.BufferedReader through AWS Lambda(python)?
f.read() returns bytes, and JSON does not support binary data. Also, "file": f is incorrect; presumably it should be a filename. Anyway, you would usually return binary data in JSON as base64:
import base64
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    bucket = "document.as.a.service.test"
    for record in event['uuid_filepath']:
        with open('/tmp/2021-10-11T06:23:29:691472.pdf', 'wb') as f:
            s3.download_fileobj(bucket, "123TfZ/2021-10-11T06:23:29:691472.pdf", f)
    with open('/tmp/2021-10-11T06:23:29:691472.pdf', 'rb') as f:
        content = f.read()
    return {
        "statusCode": 200,
        "file": "2021-10-11T06:23:29:691472.pdf",
        # b64encode returns bytes; decode so the value is JSON-serializable
        "content": base64.b64encode(content).decode('ascii')
    }
Then, on the client side, you have to decode the base64 back to binary.
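For instance, a Java client receiving the Lambda response could decode the "content" field like this (the field name is taken from the response above; the sample payload is just an illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PayloadDecoder {
    // Reverses the base64.b64encode(...) done in the Lambda handler.
    static byte[] decodePayload(String base64Content) {
        return Base64.getDecoder().decode(base64Content);
    }

    public static void main(String[] args) {
        byte[] raw = decodePayload("aGVsbG8=");
        System.out.println(new String(raw, StandardCharsets.UTF_8));
    }
}
```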
How do I read the content of a file in Amazon S3
First, get the object's InputStream:
S3Object object = s3Client.getObject(new GetObjectRequest(bucketName, key));
InputStream objectData = object.getObjectContent();
Pass the InputStream, the file name, and the path to the method below to save the stream to a file.
public void saveFile(String fileName, String path, InputStream objectData) throws Exception {
    OutputStream out = null;
    try {
        File newDirectory = new File(path);
        if (!newDirectory.exists()) {
            newDirectory.mkdirs();
        }
        File uploadedFile = new File(path, fileName);
        out = new FileOutputStream(uploadedFile);
        // Copy in fixed-size chunks. Note: InputStream.available() must not be
        // used to size a single buffer -- it only reports how many bytes can be
        // read without blocking, not the total size of the object.
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = objectData.read(buffer)) != -1) {
            out.write(buffer, 0, bytesRead);
        }
    } catch (IOException io) {
        io.printStackTrace();
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        try {
            if (out != null) {
                out.close();
            }
            objectData.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
After you download your object, read the file, convert it to JSON, and write it to a .txt file; after that you can upload the .txt file to the desired bucket in S3.
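The write-to-.txt step can be sketched as below; writeJsonAsTxt is a hypothetical helper, and the JSON string is assumed to be already built. The upload itself would then be a putObject call on the same S3 client:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class JsonToTxt {
    // Writes the JSON string to a .txt file and returns its path, ready to be
    // handed to e.g. s3Client.putObject(bucketName, key, out.toFile()).
    static Path writeJsonAsTxt(String json, Path dir, String baseName) throws IOException {
        Path out = dir.resolve(baseName + ".txt");
        Files.write(out, json.getBytes(StandardCharsets.UTF_8));
        return out;
    }
}
```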
OOM when trying to process s3 file
This is a theory, but I can't think of any other reasons why your example would OOM.
Suppose that the uncompressed file contains a very long line; e.g. something like 650 million ASCII bytes.
Your application seems to just read the file a line at a time and (try to) display a running total of the megabytes that have been read.
Internally, the readLine() method reads characters one at a time and appends them to a StringBuffer. (You can see the append call in the stack trace.) If the file consists of one very large line, then the StringBuffer is going to get very large.
- Each text character in the uncompressed string becomes a char in the char[] that is the buffer part of the StringBuffer.
- Each time the buffer fills up, StringBuffer will grow the buffer by (I think) doubling its size. This entails allocating a new char[] and copying the characters to it.
- So if the buffer fills when there are N characters, Arrays.copyOf will allocate a char[] to hold 2 x N characters. And while the data is being copied, a total of 3 x N of character storage will be in use.
- So 650MB could easily turn into a heap demand of > 6 x 650M bytes.
The other thing to note is that the 2 x N array has to be a single contiguous heap node.
Looking at the heap graphs, it looks like the heap got to ~1GB in use. If my theory is correct, the next allocation would have been for a ~2GB node. But 1GB + 2GB is right on the limit for your 3.1GB heap max. And when we take the contiguity requirement into account, the allocation cannot be done.
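The arithmetic behind that theory can be checked directly (the 2-bytes-per-char figure is the JVM's in-heap char size):

```java
public class HeapMath {
    // Peak heap demand while StringBuffer doubles its buffer: the old N-char
    // array and the new 2N-char array are both live during Arrays.copyOf,
    // and each char occupies 2 bytes.
    static long peakBytesDuringGrowth(long chars) {
        return 3 * chars * 2;
    }

    public static void main(String[] args) {
        System.out.println(peakBytesDuringGrowth(650_000_000L)); // ~3.9 GB
    }
}
```

That 3.9 GB exceeds the 3.1 GB heap max even before the contiguity requirement is taken into account.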
So what is the solution?
It is simple really: don't use readLine() if it is possible for lines to be unreasonably long.
public static String unzip(InputStream in)
        throws IOException, CompressorException, ArchiveException {
    System.out.println("Unzipping.............");
    try (GZIPInputStream gzis = new GZIPInputStream(in);
         InputStreamReader reader = new InputStreamReader(gzis);
         BufferedReader br = new BufferedReader(reader)) {
        int ch;
        long i = 0;
        while ((ch = br.read()) >= 0) {
            i++;
            if (i % (100 * 1024 * 1024) == 0) {
                System.out.println(i / (1024 * 1024));
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
        LOG.error("Invoked AWSUtils getS3Content : json ", e);
    }
    return null; // the snippet only counts characters, so there is nothing to return
}