Reading Rather Large JSON Files

Reading rather large JSON files

The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.

The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.

The best option appears to be using something like ijson - a module that will work with JSON as a stream, rather than as a block file.
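For example, if the file's top level is one large array, ijson can yield its elements one at a time so that only the current element lives in memory. A minimal sketch (the file name and the process() handler below are placeholders):

import ijson

# Iterate over the elements of a top-level JSON array one at a time;
# "item" is ijson's prefix for elements of the root array.
with open("big_file.json", "rb") as f:
    for item in ijson.items(f, "item"):
        process(item)  # hypothetical per-element handler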

Edit: Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.

How to handle huge JSON files?

A large JSON file takes too many resources to handle all at once. As mentioned in @cizario's link, some streaming logic should be used that accesses JSON objects without keeping the entire contents of the file in memory.

One library that works in a streaming fashion is stream-json: https://www.npmjs.com/package/stream-json
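To illustrate the same streaming idea in Python (the language several of the questions here concern), an event-based parser such as ijson reports one token at a time, so memory use stays flat regardless of file size. This is only a sketch; "huge.json" and the ".price" suffix are made-up examples:

import ijson

# Walk the file event by event; nothing beyond the current token is kept
# in memory. "huge.json" and ".price" are placeholder names.
with open("huge.json", "rb") as f:
    for prefix, event, value in ijson.parse(f):
        if event == "number" and prefix.endswith(".price"):
            print(prefix, value)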

Is there a memory efficient and fast way to load big JSON files?

Update

See the other answers for advice.

Original answer from 2010, now outdated

Short answer: no.

Properly dividing a json file would take intimate knowledge of the json object graph to get right.

However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.

For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.

You would have to do some string content parsing to get the chunking of the json file right.
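As a rough sketch of that string-level chunking (assuming the top level is a single JSON array; the iter_json_array helper and its buffering are simplified illustrations, not a hardened implementation):

import json

def iter_json_array(path, chunk_size=65536):
    # Sketch: pull objects out of a file whose top level is one big JSON
    # array, using json.JSONDecoder.raw_decode on a growing buffer so only
    # a small window of the file is held in memory at a time.
    decoder = json.JSONDecoder()
    with open(path, "r", encoding="utf-8") as f:
        buf, pos, in_array = "", 0, False
        while True:
            # Skip whitespace and separators, refilling the buffer as needed.
            while True:
                while pos < len(buf) and buf[pos] in " \t\r\n,":
                    pos += 1
                if pos < len(buf):
                    break
                chunk = f.read(chunk_size)
                if not chunk:
                    return
                buf, pos = buf[pos:] + chunk, 0
            if not in_array:
                if buf[pos] != "[":
                    raise ValueError("expected a top-level JSON array")
                in_array, pos = True, pos + 1
                continue
            if buf[pos] == "]":
                return  # end of the array
            try:
                # Decode one complete value starting at pos.
                obj, pos = decoder.raw_decode(buf, pos)
                yield obj
            except json.JSONDecodeError:
                # The value is still truncated; read more of the file.
                chunk = f.read(chunk_size)
                if not chunk:
                    raise
                buf, pos = buf[pos:] + chunk, 0

Whether hand-rolling this is worth it over a library like ijson depends on how much control you need; the point is simply that it is possible once you know the shape of the object graph.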

I don't know what generates your json content. If possible, I would consider generating a number of manageable files instead of one huge file.

Parse very large JSON files with dynamic data

If you are going to collect all the items into a list anyway (instead of processing them one by one as they arrive), there is not much point in using the streaming API. It can be done much more simply:

val response = Klaxon().parseJsonObject(StringReader(testJson))
val result = response["result"]
val items = response.array<JsonObject>("items") ?: JsonArray()
...

Streaming processing is a bit more involved. First of all, you want to make sure that the server response is not read entirely into memory before processing starts (i.e. the parser input should not be a string, but rather an input stream; the details depend on the HTTP client library of your choice). Secondly, you need to provide some kind of callback to process the items as they arrive, e.g.:

fun parse(input: Reader, onResult: (String) -> Unit, onItem: (JsonObject) -> Unit) {
    JsonReader(input).use { reader ->
        reader.beginObject {
            while (reader.hasNext()) {
                when (reader.nextName()) {
                    "result" -> onResult(reader.nextString())
                    "items" -> reader.beginArray {
                        while (reader.hasNext()) {
                            val item = Parser(passedLexer = reader.lexer, streaming = true).parse(reader) as JsonObject
                            onItem(item)
                        }
                    }
                }
            }
        }
    }
}

fun main(args: Array<String>) {
    // "input" simulates the server response
    val input = ByteArrayInputStream(testJson.encodeToByteArray())

    InputStreamReader(input).use {
        parse(it,
            onResult = { println("""Result: $it""") },
            onItem = { println(it.asIterable().joinToString(", ")) }
        )
    }
}

Better yet would be to integrate Klaxon with Kotlin Flow or Sequence, but I found that difficult due to the beginObject and beginArray wrappers, which do not play well with suspend functions.

Reading big arrays from big json file in php

JSON is a great format and a far better alternative to XML.
In the end, JSON is almost one-to-one convertible to XML and back.

Big files can get bigger, so we don't want to read the whole thing into memory and we don't want to parse the whole file. I had the same issue with XXL-size JSON files.

I think the issue lies not in a specific programming language, but in the implementation and specifics of the formats.

I have 3 solutions for you:

  1. Native PHP implementation (preferred)

There is a library, https://github.com/pcrov/JsonReader, that is almost as fast as the streamed XMLReader. Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();

This library will not read the whole file into memory or parse all of it up front. It traverses the tree of the JSON object step by step, on demand.


  2. Convert to a different format (cons: multiple conversions)

Preprocess the file into a different format such as XML or CSV.
There are very lightweight Node.js libraries such as https://www.npmjs.com/package/json2csv for converting JSON to CSV (a streaming sketch of this idea follows the list below).


  3. Use some NoSQL DB (cons: additional complex software to install and maintain)

For example Redis or CouchDB (see the question "import json file to couch db").
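As promised under option 2, here is a rough sketch of the conversion idea in Python rather than Node.js, streaming a top-level JSON array straight into CSV with ijson. The json_array_to_csv helper, the file names, and the field names are all made up for illustration:

import csv
import ijson  # a streaming JSON parser; one possible choice

def json_array_to_csv(json_path, csv_path, fieldnames):
    # Stream a top-level JSON array of flat objects into a CSV file
    # without ever loading the whole array into memory.
    with open(json_path, "rb") as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        for obj in ijson.items(src, "item"):  # "item" = elements of the root array
            writer.writerow(obj)

# Example usage (placeholder names):
# json_array_to_csv("data.json", "data.csv", fieldnames=["id", "type", "name"])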


