How Do I Lazily Read Multiple JSON Values from a File/Stream in Python

How can I lazily read multiple JSON values from a file/stream in Python?

Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is that the file must be seekable.

import json

def stream_read_json(fn):
    """Lazily yield successive JSON values from a seekable file."""
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                # The rest of the file was a single valid JSON value.
                yield obj
                return
            except json.JSONDecodeError as e:
                # e.pos is the offset (relative to start_pos) where the
                # extra data begins; re-read exactly the first value.
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj
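
A quick usage sketch (the file name and its contents here are made up for illustration):

# stream.json contains several concatenated values: {"a": 1}{"b": 2}[1, 2, 3]
for obj in stream_read_json('stream.json'):
    print(obj)
# prints {'a': 1}, then {'b': 2}, then [1, 2, 3]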

Edit: just noticed that this will only work for Python >= 3.5. On earlier versions, failures raise a plain ValueError, and you have to parse the position out of the exception message, e.g.

import json
import re

def stream_read_json(fn):
    """Python < 3.5 variant: parse the error position out of the message."""
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                # Pull the character offset out of the "Extra data" message.
                end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                       e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj

How can I lazily read multiple JSON values from a file/stream in Rust?

This was a pain when I wanted to do it in Python, but fortunately in Rust this is a directly-supported feature of the de-facto-standard serde_json crate! It isn't exposed as a single convenience function, but we just need to create a serde_json::Deserializer reading from our file/reader, then use its .into_iter() method to get a StreamDeserializer iterator yielding Results containing serde_json::Value JSON values.

use serde_json; // 1.0.39

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = std::io::stdin();
    let stdin = stdin.lock();

    let deserializer = serde_json::Deserializer::from_reader(stdin);
    let iterator = deserializer.into_iter::<serde_json::Value>();
    for item in iterator {
        println!("Got {:?}", item?);
    }

    Ok(())
}

One thing to be aware of: if a syntax error is encountered, the iterator will start producing an infinite sequence of error results and never move on. You need to make sure you handle the errors inside the loop, or the loop will never end. In the snippet above, we do this by using the ? operator to break the loop and return the first serde_json::Result::Err from our function.

Is there any way to update values in a JSON file after a certain interval of time?

You can do this with time.sleep(), which delays your code for the number of seconds you specify (decimals are allowed). If you run it in a separate thread, other code can execute while the updates happen. Below is the threaded version; to remove the threading, just call updatedict() directly.

import json
from time import sleep
import threading

values = {
    "google": 5,
    "apple": 4,
    "msft": 3,
    "amazon": 6
}

def updatedict():
    global values
    while True:
        # Incrementing values
        for value in values:
            values[value] += 1
        # Writing file
        with open("values.json", "w+") as f:
            f.write(json.dumps(values))
        # Sleeping 1 minute
        sleep(60)

# Starting new thread
threading.Thread(target=updatedict).start()
# Any code you wish to run at the same time as the above function
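
One caveat, not from the original answer but worth noting: a plain thread running an infinite loop keeps the process alive indefinitely. If the main script should be able to exit while the updater is sleeping, you can mark the thread as a daemon:

# Daemon threads are killed when the main program exits, so the
# infinite updatedict() loop won't keep the process running.
threading.Thread(target=updatedict, daemon=True).start()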

If you intend to run the script multiple times, with each run incrementing the values already on disk, replace the existing values assignment with

try:
    with open("values.json") as f:
        values = json.load(f)
# If the file is not found, regenerate the starting values
except FileNotFoundError:
    values = {
        "google": 5,
        "apple": 4,
        "msft": 3,
        "amazon": 6
    }

Reading multiple JSON records into a Pandas dataframe

Note: line-separated json is now supported in read_json (since pandas 0.19.0):

In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
   a  b
0  1  2
1  3  4

or with a file/filepath rather than a json string:

pd.read_json(json_file, lines=True)
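
For instance, using io.StringIO as a stand-in for a real file so the snippet is self-contained (the data is the same two-line sample as above):

import io
import pandas as pd

# Any path or file-like object works; StringIO is just a stand-in here.
json_file = io.StringIO('{"a":1,"b":2}\n{"a":3,"b":4}')
df = pd.read_json(json_file, lines=True)
print(df)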

It's going to depend on the size of your DataFrames which is faster, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid json) into valid json, and then use read_json:

In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
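
Put together as a runnable snippet (test is a hypothetical two-line sample; the original answer doesn't show its definition):

import pandas as pd

# Hypothetical sample data standing in for the answer's undefined `test`.
test = '{"a":1,"b":2}\n{"a":3,"b":4}'
df = pd.read_json('[%s]' % ','.join(test.splitlines()))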

For this tiny example this is slower; at around 100 lines the two are similar; and there are significant gains if it's larger...

In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop

In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop

In [23]: test_100 = '\n'.join([test] * 100)

In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop

In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop

In [26]: test_1000 = '\n'.join([test] * 1000)

In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop

In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop

Note: of that time, the join itself is surprisingly fast.


