Parse Large JSON Hash with Ruby-Yajl

Parse large JSON hash with ruby-yajl?

I ended up solving this with JSON::Stream, which provides callbacks for start_document, start_object, and so on.

I gave my parser a to_enum method which emits the 'Resource' objects as they're parsed. Note that ResourceCollectionNode is never really used unless you completely parse the JSON stream, and ResourceNode is a subclass of ObjectNode for naming purposes only, though I might just get rid of it:

require 'json/stream'

class ResourceParser
  METHODS = %w[start_document end_document start_object end_object start_array end_array key value]

  attr_reader :result

  def initialize(io, chunk_size = 1024)
    @io = io
    @chunk_size = chunk_size
    @parser = JSON::Stream::Parser.new

    # register each callback method on the parser
    METHODS.each do |name|
      @parser.send(name, &method(name))
    end
  end

  def to_enum
    Enumerator.new do |yielder|
      @yielder = yielder
      begin
        until @io.eof?
          # puts "READING CHUNK"
          chunk = @io.read(@chunk_size)
          @parser << chunk
        end
      ensure
        @yielder = nil
      end
    end
  end

  def start_document
    @stack = []
    @result = nil
  end

  def end_document
    # @result = @stack.pop.obj
  end

  def start_object
    if @stack.size == 0
      @stack.push(ResourceCollectionNode.new)
    elsif @stack.size == 1
      @stack.push(ResourceNode.new)
    else
      @stack.push(ObjectNode.new)
    end
  end

  def end_object
    if @stack.size == 2
      node = @stack.pop
      # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj

      # puts "Parsed complete resource: #{node.obj}"
      @yielder << node.obj
    elsif @stack.size == 1
      # puts "Parsed all resources"
      @result = @stack.pop.obj
    else
      node = @stack.pop
      # puts "Stack depth: #{@stack.size}. Node: #{node.class}"
      @stack[-1] << node.obj
    end
  end

  def end_array
    node = @stack.pop
    @stack[-1] << node.obj
  end

  def start_array
    @stack.push(ArrayNode.new)
  end

  def key(key)
    # puts "Stack depth: #{@stack.size} KEY: #{key}"
    @stack[-1] << key
  end

  def value(value)
    node = @stack[-1]
    node << value
  end

  # Generic JSON object node - accumulates alternating keys and values into a Hash
  class ObjectNode
    attr_reader :obj

    def initialize
      @obj, @key = {}, nil
    end

    def <<(node)
      if @key
        @obj[@key] = node
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  # Subclass of ObjectNode used only to give second-level objects a meaningful name
  class ResourceNode < ObjectNode
  end

  # Node that contains all the resources - a Hash keyed by url
  class ResourceCollectionNode < ObjectNode
    def <<(node)
      if @key
        @obj[@key] = node
        # puts "Completed Resource: #{@key} => #{node}"
        @key = nil
      else
        @key = node
      end
      self
    end
  end

  # Generic JSON array node - accumulates values into an Array
  class ArrayNode
    attr_reader :obj

    def initialize
      @obj = []
    end

    def <<(node)
      @obj << node
      self
    end
  end
end

and an example in use:

require 'stringio'
require 'pp'

def json
  <<-EOJ
  {
    "1": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "2": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "3": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "4": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    },
    "5": {
      "url": "url_1",
      "title": "title_1",
      "http_req": {
        "status": 200,
        "time": 10
      }
    },
    "6": {
      "url": "url_2",
      "title": "title_2",
      "http_req": {
        "status": 404,
        "time": -1
      }
    }
  }
  EOJ
end

io = StringIO.new(json)
resource_parser = ResourceParser.new(io, 100)

count = 0
resource_parser.to_enum.each do |resource|
  count += 1
  puts "READ: #{count}"
  pp resource
  break
end

io.close

Output:

READ: 1
{"url"=>"url_1", "title"=>"title_1", "http_req"=>{"status"=>200, "time"=>10}}

How can I process huge JSON files as streams in Ruby, without consuming all memory?

Both @CodeGnome's and @A. Rager's answers helped me understand the solution.

I ended up creating the gem json-streamer, which offers a generic approach and spares you from defining callbacks by hand for every scenario.
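
For reference, here's a minimal sketch of how the gem is typically used (the file name and chunk size are made up for illustration):

require 'json/streamer'

file = File.open('huge.json', 'r')
streamer = Json::Streamer.parser(file_io: file, chunk_size: 1024)

# Yields each object found one level below the JSON root, one at a time,
# so only the current object has to fit in memory.
streamer.get(nesting_level: 1) do |object|
  p object
end

file.close

The gem can also select objects by key rather than nesting level; see its README for the full set of options.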

Stream-based parsing and writing of JSON

Why can't you retrieve a single record at a time from the database, process it as necessary, convert it to JSON, then emit it with a trailing/delimiting comma?

If you start with a file containing only [, append each row's JSON string followed by a comma, leave the comma off the final entry, and finish with a closing ], you'd have a JSON array of hashes and would only have to process one row's worth at a time.

It'd be a tiny bit slower (maybe) but wouldn't impact your system. And DB I/O can be very fast if you use blocking/paging to retrieve a reasonable number of records at a time.

For instance, here's a combination of some Sequel example code, and code to extract the rows as JSON and build a larger JSON structure:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :name
  Float :price
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:name => 'abc', :price => rand * 100)
items.insert(:name => 'def', :price => rand * 100)
items.insert(:name => 'ghi', :price => rand * 100)

add_comma = false

puts '['
items.order(:price).each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"

Which outputs:

[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]

Notice the order is now by "price".

Validation is easy:

require 'json'
require 'pp'

pp JSON[<<EOT]
[
{"id":2,"name":"def","price":3.714714089426208},
{"id":3,"name":"ghi","price":27.0179624376119},
{"id":1,"name":"abc","price":52.51248221170203}
]
EOT

Which results in:

[{"id"=>2, "name"=>"def", "price"=>3.714714089426208},
{"id"=>3, "name"=>"ghi", "price"=>27.0179624376119},
{"id"=>1, "name"=>"abc", "price"=>52.51248221170203}]

This validates the JSON and demonstrates that the original data is recoverable. Each row retrieved from the database should be a minimal "bitesized" piece of the overall JSON structure you want to build.

Building upon that, here's how to read JSON already stored in the database, manipulate it, then emit it as a JSON file:

require 'json'
require 'sequel'

DB = Sequel.sqlite # memory database

DB.create_table :items do
  primary_key :id
  String :json
end

items = DB[:items] # Create a dataset

# Populate the table
items.insert(:json => JSON[:name => 'abc', :price => rand * 100])
items.insert(:json => JSON[:name => 'def', :price => rand * 100])
items.insert(:json => JSON[:name => 'ghi', :price => rand * 100])
items.insert(:json => JSON[:name => 'jkl', :price => rand * 100])
items.insert(:json => JSON[:name => 'mno', :price => rand * 100])
items.insert(:json => JSON[:name => 'pqr', :price => rand * 100])
items.insert(:json => JSON[:name => 'stu', :price => rand * 100])
items.insert(:json => JSON[:name => 'vwx', :price => rand * 100])
items.insert(:json => JSON[:name => 'yz_', :price => rand * 100])

add_comma = false

puts '['
items.each do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[
    JSON[
      item[:json]
    ].merge('foo' => 'bar', 'time' => Time.now.to_f)
  ]
end
puts "\n]"

Which generates:

[
{"name":"abc","price":3.268814929005337,"foo":"bar","time":1379688093.124606},
{"name":"def","price":13.871147312377719,"foo":"bar","time":1379688093.124664},
{"name":"ghi","price":52.720984131655676,"foo":"bar","time":1379688093.124702},
{"name":"jkl","price":53.21477190840114,"foo":"bar","time":1379688093.124732},
{"name":"mno","price":40.99364022416619,"foo":"bar","time":1379688093.124758},
{"name":"pqr","price":5.918738444452265,"foo":"bar","time":1379688093.124803},
{"name":"stu","price":45.09391752439902,"foo":"bar","time":1379688093.124831},
{"name":"vwx","price":63.08947792357426,"foo":"bar","time":1379688093.124862},
{"name":"yz_","price":94.04921035056373,"foo":"bar","time":1379688093.124894}
]

I added the timestamp so you can see that each row is processed individually, AND to give you an idea how fast the rows are being processed. Granted, this is a tiny, in-memory database with no network I/O to contend with, but a normal network connection through a switch to a database on a reasonable DB host should be pretty fast too. Telling the ORM to read the DB in chunks can speed up the processing, because the DBM can return larger blocks and fill the packets more efficiently. You'll have to experiment to determine what chunk size you need, because it will vary based on your network, your hosts, and the size of your records.
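
If you're using Sequel, its paged_each dataset method is one way to do that chunked reading. Here's a minimal sketch of the emit loop above rewritten to fetch in batches (the batch size is arbitrary):

puts '['
add_comma = false

# paged_each requires an ordered dataset and fetches rows in batches
# (here 1000 at a time) instead of loading the whole table into memory.
items.order(:id).paged_each(:rows_per_fetch => 1000) do |item|
  puts ',' if add_comma
  add_comma ||= true
  print JSON[item]
end
puts "\n]"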

Your original design isn't good when dealing with enterprise-sized databases, especially when your hardware resources are limited. Over the years we've learned how to parse BIG databases, which make 20,000-row tables look minuscule. VM slices are common these days and we use them for crunching, so they're often the PCs of yesteryear: a single CPU with a small memory footprint and dinky drives. We can't beat them up or they'll become bottlenecks, so we have to break the data into the smallest atomic pieces we can.

Harping about DB design: Storing JSON in a database is a questionable practice. DBMs these days can spew JSON, YAML and XML representations of rows, but forcing the DBM to search inside stored JSON, YAML or XML strings is a major hit in processing speed, so avoid it at all costs unless you also have the equivalent lookup data indexed in separate fields so your searches are at the highest possible speed. If the data is available in separate fields, then doing good ol' database queries, tweaking in the DBM or your scripting language of choice, and emitting the massaged data becomes a lot easier.
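
For example (a sketch only; the documents table and its columns are made up here, not part of the examples above), keep the searchable attributes in indexed columns and treat the stored JSON as an opaque payload:

DB.create_table :documents do
  primary_key :id
  String :name,  :index => true   # indexed lookup field
  Float  :price, :index => true   # indexed lookup field
  String :json                    # full JSON payload, never searched directly
end

docs = DB[:documents]
docs.insert(:name => 'abc', :price => 12.34,
            :json => JSON[:name => 'abc', :price => 12.34])

# Query against the indexed columns, then emit the stored JSON untouched
docs.where { price < 50 }.select(:json).each { |row| puts row[:json] }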

Yajl::ParseError: lexical error: invalid char in json text

Your text isn't valid JSON. The token => should not be present; JSON uses a colon : instead:

{
  "id" : 2126244,
  "name" : "bootstrap",
  ...
}
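
That usually means the string being parsed was produced by Ruby's Hash#inspect (or copied from an IRB session) rather than by a JSON encoder. A small sketch of the difference using yajl-ruby (the hash contents are taken from the snippet above):

require 'yajl'

hash = { 'id' => 2126244, 'name' => 'bootstrap' }

puts hash.inspect                 # contains => tokens: Ruby literal syntax, not JSON
puts Yajl::Encoder.encode(hash)   # {"id":2126244,"name":"bootstrap"} -- valid JSON

begin
  Yajl::Parser.parse(hash.inspect)                # triggers the lexical error
rescue Yajl::ParseError => e
  puts "rejected: #{e.message}"
end

p Yajl::Parser.parse(Yajl::Encoder.encode(hash))  # round-trips cleanly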

How to read Watson Visual Recognition JSON Response with YAJL Library?

Here's the code that works now:

images = YAJL_OBJECT_FIND(docNode: 'images'); 
i = 0;

dow yajl_array_loop(images: i: node);

classifiers = YAJL_OBJECT_FIND(node: 'classifiers');

k = 0;
dow yajl_array_loop(classifiers: k: node);

classes = YAJL_OBJECT_FIND(node: 'classes');

j = 0;
dow yajl_array_loop(classes: j: node);

val = YAJL_object_find(node:'class');
imageClasses.classes(j).class = yajl_get_string(val);

val = YAJL_object_find(node:'score');
imageClasses.classes(j).score = yajl_get_number(val);

val = YAJL_object_find(node:'type_hierarchy');
imageClasses.classes(j).typeHierarchy = yajl_get_string(val);

enddo;

enddo;

enddo;

