Split a Large Json File into Multiple Smaller Files


Answering the question of whether Python or Node would be better for this task would be a matter of opinion, and we are not allowed to voice opinions on Stack Overflow. You have to decide for yourself what you have more experience with and what you want to work with - Python or Node.

If you go with Node, there are modules that do streaming JSON parsing and can help you with this task, e.g.:

  • https://www.npmjs.com/package/JSONStream
  • https://www.npmjs.com/package/stream-json
  • https://www.npmjs.com/package/json-stream

If you go with Python, there are streaming JSON parsers here as well:

  • https://github.com/kashifrazzaqui/json-streamer
  • https://github.com/danielyule/naya
  • http://www.enricozini.org/blog/2011/tips/python-stream-json/

How to split one big JSON file into multiple smaller files?

import json

# json.load() takes a file object, not a filename
with open('inputfile.json') as src:
    dct = json.load(src)

for subdct in dct['templates']:
    if "textGroup" in subdct:
        fname = "processConfig-textGroup-{}.json".format(subdct["textGroup"])
    elif "clientId" in subdct:
        fname = "processConfig-client-{}.json".format(subdct["clientId"])
    else:
        continue  # skip entries with neither key
    with open(fname, 'w') as f:
        json.dump({'templates': [subdct]}, f)

Of course, if your file is really huge (too big to fit into memory) this won't work.
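In that case, the same split can be done with a streaming parser. Below is a minimal sketch using the ijson package (not one of the parsers listed above, but a commonly used streaming JSON parser); it assumes the same inputfile.json layout with a top-level 'templates' array.

import json
import ijson  # third-party streaming parser: pip install ijson

with open('inputfile.json', 'rb') as src:
    # ijson yields one element of the "templates" array at a time,
    # so the whole document never has to fit into memory.
    for subdct in ijson.items(src, 'templates.item'):
        if "textGroup" in subdct:
            fname = "processConfig-textGroup-{}.json".format(subdct["textGroup"])
        elif "clientId" in subdct:
            fname = "processConfig-client-{}.json".format(subdct["clientId"])
        else:
            continue  # skip entries with neither key
        with open(fname, 'w') as f:
            json.dump({'templates': [subdct]}, f)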

How to split a large JSON file based on an array property which is deeply nested?

Here is another approach you can use to split your large JSON file, using Cinchoo ETL - an open source library.

This assumes the input JSON file contains Job.Steps[*].InstrumentData[*] nodes, which requires two levels of parsing to split the file by InstrumentData.

First, break the input file by each Steps[*] node (call these the StepsFiles), then take each StepsFile and break it by each InstrumentData[*] node.

Due to the complexity of the code, a sample fiddle has been drafted for review.

Sample fiddle: https://dotnetfiddle.net/j3Y03m

How can I split a large JSON file into multiple small chunk files using JavaScript to display in Google Maps?

You can use https://pinetools.com/split-files to split the data into multiple files, and then load each one with map.data.loadGeoJson().
Also, add the defer attribute to the script tag so the browser does not freeze while loading.
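If you would rather do the splitting yourself instead of using the online tool, a small script is enough. Here is a rough Python sketch, assuming a standard GeoJSON FeatureCollection (a top-level "features" array) and hypothetical file names; adjust CHUNK_SIZE to whatever your map handles comfortably.

import json

CHUNK_SIZE = 1000  # features per output file (assumption; tune as needed)

with open('data.geojson') as src:  # hypothetical input file name
    collection = json.load(src)

features = collection['features']
for i in range(0, len(features), CHUNK_SIZE):
    # each chunk is itself a valid FeatureCollection
    chunk = {"type": "FeatureCollection", "features": features[i:i + CHUNK_SIZE]}
    out_name = 'chunk_{}.geojson'.format(i // CHUNK_SIZE)
    with open(out_name, 'w') as out:
        json.dump(chunk, out)

Each resulting chunk_N.geojson can then be served and passed to map.data.loadGeoJson().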

Splitting a big JSON file into smaller JSON files on AWS Lambda and saving them to S3

My solution - probably not the best one, but it works.

import json
import boto3
import os
import time
import math


# Variable definition.
SESSION_STORAGE = os.environ['JSON_BUCKET']
SESSION = boto3.session.Session()
CURRENT_REGION = SESSION.region_name
S3_CLIENT = boto3.client("s3")
MIN_SIZE = 16000000  # 16 MB


def handler(event, context):

    # Instantiate start time.
    start_time = time.time()

    # Bucket name where the file was uploaded.
    #bucket = event['Records'][0]['s3']['bucket']['name']
    bucket = 'json-upload-bucket'  # => testing bucket.

    #file_key_name = event['Records'][0]['s3']['object']['key'] # Use this to make it dynamic.
    file_key_name = 'XXXXXXXXX-users.json'  # This is for testing only.
    print("File Key Name", file_key_name)

    response = S3_CLIENT.get_object(Bucket=bucket, Key=file_key_name)
    #print("Response : ", response)

    # JSON size in bytes.
    json_size = response['ContentLength']
    print("json_size : ", json_size)

    # Reading content.
    content = response['Body']
    jsonObject = json.loads(content.read())
    data = jsonObject['users']
    data_len = len(jsonObject)
    #print('Length of JSON : ', data_len)
    #print('Order array length : ', len(data))

    if isinstance(data, list):
        data_len = len(data)
        print('Valid JSON file found')

        if json_size <= MIN_SIZE:
            print('File meets the minimum size.')
        else:
            # Determine the number of files necessary.
            split_into_files = math.ceil(json_size / MIN_SIZE)
            print(f'File will be split into {split_into_files} equal parts')

            # Initialize 2D array.
            split_data = [[] for i in range(0, split_into_files)]

            # Determine indices of cutoffs in the array.
            starts = [math.floor(i * data_len / split_into_files) for i in range(0, split_into_files)]
            starts.append(data_len)

            # Loop through the 2D array.
            for i in range(0, split_into_files):
                # Loop through each range in the array.
                for n in range(starts[i], starts[i + 1]):
                    split_data[i].append(data[n])

                print(file_key_name.split('.')[0] + '_' + str(i + 1) + '.json')
                name = os.path.basename(file_key_name).split('.')[0] + '_' + str(i + 1) + '.json'
                print('Name : ', name)
                folder = '/tmp/' + name
                with open(folder, 'w') as outfile:
                    # Restructure the JSON back to its original state.
                    generated_json = {
                        list(jsonObject.keys())[0]: list(jsonObject.values())[0],
                        list(jsonObject.keys())[1]: split_data[i]}
                    json.dump(generated_json, outfile, indent=4)

                S3_CLIENT.upload_file(folder, bucket, name)

                print('Part', str(i + 1), '... completed')

    else:
        print("JSON is not an Array of Objects")

    return {
        'statusCode': 200,
        'body': json.dumps('JSON split completed, check S3.')
    }

Splitting JSON file into smaller parts

The original file is not valid JSON, while json.dump creates a file containing valid JSON. My suggestion would be to convert the line items to JSON one at a time when writing to the file.

Replace this:

for i in range(total+1):
    json.dump(ll[i * size_of_the_split:(i + 1) * size_of_the_split], open(
        json_file + "\\split50k" + str(i+1) + ".json", 'w',
        encoding='utf8'), ensure_ascii=False, indent=True)

with this:

for i in range(len(ll)):
    if i % size_of_the_split == 0:
        if i != 0:
            file.close()
        file = open(json_file + "\\split50k" + str(i+1) + ".json", 'w')
    file.write(str(ll[i]))
file.close()
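If the intent is to serialize each item as JSON one at a time (i.e. one JSON document per line), a variation of that loop could look like the sketch below; it assumes ll holds already-parsed objects rather than raw strings.

import json

file = None
for i in range(len(ll)):
    # start a new output file every size_of_the_split items
    if i % size_of_the_split == 0:
        if file is not None:
            file.close()
        file = open(json_file + "\\split50k" + str(i + 1) + ".json", 'w', encoding='utf8')
    # write one JSON document per line
    file.write(json.dumps(ll[i], ensure_ascii=False) + "\n")
if file is not None:
    file.close()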

Using jq how can I split a very large JSON file into multiple files, each a specific quantity of objects?

[EDIT: This answer has been revised in accordance with the revision to the question.]

The key to using jq to solve the problem is the -c command-line option, which produces output in JSON-Lines format (i.e., in the present case, one object per line). You can then use a tool such as awk or split to distribute those lines amongst several files.

If the file is not too big, then the simplest would be to start the pipeline with:

jq -c '.[]' INPUTFILE

If the file is too big to fit comfortably in memory, then you could use jq's streaming parser, like so:

jq -cn --stream 'fromstream(1|truncate_stream(inputs))'

For further discussion about the streaming parser, see e.g. the relevant section in the jq FAQ: https://github.com/stedolan/jq/wiki/FAQ#streaming-json-parser

Partitioning

For different approaches to partitioning the output produced in the first step, see for example How to split a large text file into smaller files with equal number of lines?

If it is required that each of the output files be an array of objects, then I'd probably use awk to perform both the partitioning and the re-constitution in one step, but there are many other reasonable approaches.
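For illustration, here is a rough Python sketch of that re-constitution step (as an alternative to awk), assuming the jq -c output has been redirected to a file named items.jsonl and that each output file should contain N objects.

import json

N = 1000  # objects per output file (assumption; adjust as required)

with open('items.jsonl') as src:
    batch, part = [], 0
    for line in src:
        batch.append(json.loads(line))
        if len(batch) == N:
            with open('part_{}.json'.format(part), 'w') as out:
                json.dump(batch, out)  # each output file is a JSON array
            batch, part = [], part + 1
    if batch:  # write the final, possibly smaller, batch
        with open('part_{}.json'.format(part), 'w') as out:
            json.dump(batch, out)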

If the input is a sequence of JSON objects

For reference, if the original file consists of a stream or sequence of JSON objects, then the appropriate invocation would be:

jq -n -c inputs INPUTFILE

Using inputs in this manner allows arbitrarily many objects to be processed efficiently.


