Is There a Memory Efficient and Fast Way to Load Big Json Files

Is there a memory efficient and fast way to load big JSON files?

Update

See the other answers for advice.

Original answer from 2010, now outdated

Short answer: no.

Properly dividing a json file would take intimate knowledge of the json object graph to get right.

However, if you have this knowledge, then you could implement a file-like object that wraps the json file and spits out proper chunks.

For instance, if you know that your json file is a single array of objects, you could create a generator that wraps the json file and returns chunks of the array.

You would have to do some string content parsing to get the chunking of the json file right.
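
As a rough sketch of that idea (not from the original answer), assuming the file is a single top-level array of objects and that no string values contain unbalanced braces, a generator could walk the file character by character and yield one array element at a time by tracking brace depth:

import json

def iter_array_items(path):
    # Yield one top-level array element at a time by tracking brace depth.
    # Assumes a single JSON array of objects; does not handle braces inside
    # string values, which a robust version would need to account for.
    with open(path, 'r', encoding='utf-8') as f:
        depth = 0
        buf = []
        for ch in iter(lambda: f.read(1), ''):
            if ch == '{':
                depth += 1
            if depth > 0:
                buf.append(ch)
            if ch == '}':
                depth -= 1
                if depth == 0:
                    yield json.loads(''.join(buf))
                    buf = []

for item in iter_array_items('huge.json'):
    process(item)  # hypothetical per-item handler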

I don't know what generates your json content. If possible, I would consider generating a number of manageable files instead of one huge file.

Fast & Efficient Way To Read Large JSON Files Line By Line in Java

You can use the JSON Processing API (JSR 353) to process your data in a streaming fashion:

import java.io.FileReader;
import java.util.ArrayList;
import java.util.List;

import javax.json.Json;
import javax.json.stream.JsonParser;

...

String dataPath = "data.json";

try (JsonParser parser = Json.createParser(new FileReader(dataPath))) {
    List<String> row = new ArrayList<>();

    while (parser.hasNext()) {
        JsonParser.Event event = parser.next();
        switch (event) {
            case START_ARRAY:
                continue;
            case VALUE_STRING:
                row.add(parser.getString());
                break;
            case END_ARRAY:
                if (!row.isEmpty()) {
                    // Do something with the current row of data
                    System.out.println(row);

                    // Reset it (prepare for the new row)
                    row.clear();
                }
                break;
            default:
                throw new IllegalStateException("Unexpected JSON event: " + event);
        }
    }
}

Bypass memory error to read large JSON file in Python

In general, if you want to use ijson to reduce memory overheads, you'll need to be careful that the rest of your code doesn't introduce overheads as well. The best-case scenario would be that you translate a single item of your JSON object into a single line in your resulting CSV file, and that you do this iteratively. This would mean moving away from list comprehensions (which act on all the data at once) and not using a DataFrame (which again will hold all your contents at once).

Regarding ijson usage: a cheap solution would be to use ijson.items to iterate over each object in your JSON document. In the best-case scenario I described above, you would then remove the unnecessary fields and convert that object into a CSV line. Something like:

import ijson

with open(path, 'rb') as fin:
    for obj in ijson.items(fin, 'item'):
        filter_object_and_turn_it_into_a_csv_line(obj)

If for some reason you still really need to keep using a DataFrame, you can at least do the data cleaning with generator expressions before passing the data to the DataFrame, to avoid additional copies (but remember that you end up loading most of the data into memory anyway):

import ijson
import pandas

with open(path, 'rb') as fin:
    json_list = ijson.items(fin, 'item')
    key_list = ['created', 'emails', 'identities']
    json_list = ({k: d[k] for k in key_list} for d in json_list)  # this was a list comprehension in the original code
    flattened = (flatten(d, '.') for d in json_list)  # `flatten` is the nested-dict flattening helper from the question's code
    df = pandas.DataFrame(flattened)

Reading rather large JSON files

The issue here is that JSON, as a format, is generally parsed in full and then handled in-memory, which for such a large amount of data is clearly problematic.

The solution to this is to work with the data as a stream - reading part of the file, working with it, and then repeating.

The best option appears to be using something like ijson - a module that works with JSON as a stream, rather than as a single in-memory block.
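
As a minimal sketch (not from the answer, and assuming a hypothetical big.json whose top level is an array of objects with a "name" field), ijson's low-level parser emits events one at a time, so only the current event needs to be held in memory:

import ijson

with open('big.json', 'rb') as f:
    # parse() yields (prefix, event, value) tuples as it streams the file.
    for prefix, event, value in ijson.parse(f):
        if prefix == 'item.name' and event == 'string':
            print(value)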

Edit: Also worth a look - kashif's comment about json-streamer and Henrik Heino's comment about bigjson.

Reading big arrays from big json file in php

JSON is a great format and a much better alternative to XML.
In the end, JSON is almost one-to-one convertible to XML and back.

Big files can get bigger, so we don't want to read everything into memory or parse the whole file at once. I had the same issue with XXL-size JSON files.

I think the issue lies not in a specific programming language, but in the implementation details and specifics of the formats.

I have 3 solutions for you:

  1. Native PHP implementation (preferred)

There is a library, https://github.com/pcrov/JsonReader, that is almost as fast as the streamed XMLReader. Example:

use pcrov\JsonReader\JsonReader;

$reader = new JsonReader();
$reader->open("data.json");

while ($reader->read("type")) {
    echo $reader->value(), "\n";
}
$reader->close();

This library will not read the whole file into memory or parse all the lines. It traverses the tree of the JSON object step by step, on demand.


  2. Convert to a different format (cons: multiple conversions)

Preprocess the file into a different format like XML or CSV.
There are very lightweight Node.js libraries like https://www.npmjs.com/package/json2csv that convert JSON to CSV (a rough Python sketch of the same streaming idea follows this list).


  3. Use a NoSQL database (cons: additional complex software to install and maintain)

For example, Redis or CouchDB (you can import a JSON file into CouchDB).
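
To illustrate option 2 in Python rather than with the json2csv Node.js package the answer links to, the following sketch streams a hypothetical top-level array of objects into a CSV file one record at a time; the file paths and field names are made up for the example:

import csv
import ijson

# Hypothetical field names and paths, for illustration only.
fields = ['id', 'name', 'email']

with open('big.json', 'rb') as fin, open('out.csv', 'w', newline='') as fout:
    writer = csv.DictWriter(fout, fieldnames=fields, extrasaction='ignore')
    writer.writeheader()
    for obj in ijson.items(fin, 'item'):
        # Each object is written immediately, so memory use stays flat.
        writer.writerow(obj)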


