How can I lazily read multiple JSON values from a file/stream in Python?
Here's a much, much simpler solution. The secret is to try, fail, and use the information in the exception to parse correctly. The only limitation is the file must be seekable.
def stream_read_json(fn):
    import json
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                # e.pos is where the "extra data" begins, i.e. the end
                # of the first complete JSON value. Re-read just that
                # much, parse it, and continue from there.
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj
Edit: just noticed that this will only work for Python >= 3.5. For earlier versions, failures raise a ValueError, and you have to parse the position out of the error message, e.g.
def stream_read_json(fn):
    import json
    import re
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except ValueError as e:
                f.seek(start_pos)
                # Extract the "extra data" offset from the error message.
                end_pos = int(re.match(r'Extra data: line \d+ column \d+ .*\(char (\d+).*\)',
                                       e.args[0]).groups()[0])
                json_str = f.read(end_pos)
                obj = json.loads(json_str)
                start_pos += end_pos
                yield obj
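For example, given a file holding several concatenated JSON values, either version yields them one at a time. A small self-contained demo of the Python >= 3.5 version (the temp-file path and sample data are arbitrary):

```python
import json
import os
import tempfile

def stream_read_json(fn):
    # Lazily yield each JSON value from a file containing several
    # concatenated JSON documents (Python >= 3.5 version).
    start_pos = 0
    with open(fn, 'r') as f:
        while True:
            try:
                obj = json.load(f)
                yield obj
                return
            except json.JSONDecodeError as e:
                f.seek(start_pos)
                json_str = f.read(e.pos)
                obj = json.loads(json_str)
                start_pos += e.pos
                yield obj

# Write a demo file with two concatenated JSON values.
path = os.path.join(tempfile.gettempdir(), 'stream_demo.json')
with open(path, 'w') as f:
    f.write('{"a": 1}\n[1, 2, 3]')

values = list(stream_read_json(path))
print(values)  # [{'a': 1}, [1, 2, 3]]
```

Because the generator only parses one value per iteration, you can also stop early (e.g. with next()) without reading the rest of the file.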
How can I lazily read multiple JSON values from a file/stream in Rust?
This was a pain when I wanted to do it in Python, but fortunately in Rust this is a directly-supported feature of the de-facto-standard serde_json crate! It isn't exposed as a single convenience function, but we just need to create a serde_json::Deserializer reading from our file/reader, then use its .into_iter() method to get a StreamDeserializer iterator yielding Results containing serde_json::Value JSON values.
use serde_json; // 1.0.39

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let stdin = std::io::stdin();
    let stdin = stdin.lock();
    let deserializer = serde_json::Deserializer::from_reader(stdin);
    let iterator = deserializer.into_iter::<serde_json::Value>();
    for item in iterator {
        println!("Got {:?}", item?);
    }
    Ok(())
}
One thing to be aware of: if a syntax error is encountered, the iterator will start to produce an infinite sequence of error results and never move on. You need to make sure you handle the errors inside the loop, or the loop will never end. In the snippet above, we do this by using the ? question mark operator to break the loop and return the first serde_json::Result::Err from our function.
Is there any way to update values in a JSON file after a certain interval of time?
You can do this with time.sleep(), which delays your code for the number of seconds you specify (decimals are allowed). If you put it in a different thread, you can execute other code while this is happening. Here is the threaded version; you can remove the threading by just calling updatedict() directly.
import json
from time import sleep
import threading

values = {
    "google": 5,
    "apple": 4,
    "msft": 3,
    "amazon": 6
}

def updatedict():
    global values
    while True:
        # Incrementing values
        for value in values:
            values[value] += 1
        # Writing file
        with open("values.json", "w+") as f:
            f.write(json.dumps(values))
        # Sleeping 1 minute
        sleep(60)

# Starting new thread
threading.Thread(target=updatedict).start()
# Any code you wish to run at the same time as the above function
If you intend to run the script multiple times, each run incrementing onto what is already there, replace the existing values variable assignment with:
try:
    with open("values.json") as f:
        values = json.load(f)
# If the file is not found, regenerate the default values
except FileNotFoundError:
    values = {
        "google": 5,
        "apple": 4,
        "msft": 3,
        "amazon": 6
    }
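As a non-threaded sketch of a single update cycle using that load-or-default pattern (the file path and starting values here are just for illustration):

```python
import json
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'values_demo.json')

def load_values(path, defaults):
    # Reload existing values if present, otherwise start from the defaults.
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return dict(defaults)

def update_once(path, defaults):
    # One iteration of the update loop: increment every value and persist.
    values = load_values(path, defaults)
    for key in values:
        values[key] += 1
    with open(path, 'w') as f:
        json.dump(values, f)
    return values

defaults = {"google": 5, "apple": 4}
if os.path.exists(path):
    os.remove(path)  # start fresh for the demo

first = update_once(path, defaults)   # {'google': 6, 'apple': 5}
second = update_once(path, defaults)  # {'google': 7, 'apple': 6}
```

Each run picks up where the previous one left off, which is exactly what the try/except above buys you across script restarts.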
Reading multiple JSON records into a Pandas dataframe
Note: line-separated JSON is now supported in read_json (since pandas 0.19.0):
In [31]: pd.read_json('{"a":1,"b":2}\n{"a":3,"b":4}', lines=True)
Out[31]:
a b
0 1 2
1 3 4
or with a file/filepath rather than a json string:
pd.read_json(json_file, lines=True)
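For large files, read_json can also read lazily: passing chunksize together with lines=True (supported since pandas 0.21) returns an iterator of DataFrames instead of one big frame. A minimal sketch, using io.StringIO in place of a real file:

```python
import io
import pandas as pd

# Three line-delimited JSON records, read two at a time.
json_lines = io.StringIO('{"a":1,"b":2}\n{"a":3,"b":4}\n{"a":5,"b":6}\n')
chunks = list(pd.read_json(json_lines, lines=True, chunksize=2))

print(len(chunks))      # 2
print(chunks[0].shape)  # (2, 2)
```

In real code you would iterate over the reader directly rather than collecting all chunks into a list, so only one chunk is in memory at a time.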
Which is faster is going to depend on the size of your DataFrames, but another option is to use str.join to smash your multi-line "JSON" (note: it's not valid JSON) into valid JSON and use read_json:
In [11]: '[%s]' % ','.join(test.splitlines())
Out[11]: '[{"a":1,"b":2},{"a":3,"b":4}]'
For this tiny example this is slower; at around 100 records they're similar, and there are significant gains if it's larger...
In [21]: %timeit pd.read_json('[%s]' % ','.join(test.splitlines()))
1000 loops, best of 3: 977 µs per loop
In [22]: %timeit l=[ json.loads(l) for l in test.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 282 µs per loop
In [23]: test_100 = '\n'.join([test] * 100)
In [24]: %timeit pd.read_json('[%s]' % ','.join(test_100.splitlines()))
1000 loops, best of 3: 1.25 ms per loop
In [25]: %timeit l = [json.loads(l) for l in test_100.splitlines()]; df = pd.DataFrame(l)
1000 loops, best of 3: 1.25 ms per loop
In [26]: test_1000 = '\n'.join([test] * 1000)
In [27]: %timeit l = [json.loads(l) for l in test_1000.splitlines()]; df = pd.DataFrame(l)
100 loops, best of 3: 9.78 ms per loop
In [28]: %timeit pd.read_json('[%s]' % ','.join(test_1000.splitlines()))
100 loops, best of 3: 3.36 ms per loop
Note: of that time, the join itself is surprisingly fast.
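The comparison can be reproduced outside IPython with timeit. This sketch wraps the joined string in io.StringIO, which also sidesteps newer pandas versions deprecating literal JSON strings as input:

```python
import io
import json
import timeit
import pandas as pd

test = '{"a":1,"b":2}\n{"a":3,"b":4}'
test_1000 = '\n'.join([test] * 1000)

def via_read_json(s):
    # Join the lines into a single valid JSON array, then parse once.
    return pd.read_json(io.StringIO('[%s]' % ','.join(s.splitlines())))

def via_loads(s):
    # Parse each line separately, then build the DataFrame.
    return pd.DataFrame([json.loads(line) for line in s.splitlines()])

# Both approaches produce the same frame.
assert via_read_json(test_1000).equals(via_loads(test_1000))

t_read_json = timeit.timeit(lambda: via_read_json(test_1000), number=10)
t_loads = timeit.timeit(lambda: via_loads(test_1000), number=10)
print(t_read_json, t_loads)
```

Absolute numbers will differ from the timings above depending on hardware and pandas version; the crossover point is what matters.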