Query Combinations with Nested Array of Records in JSON Datatype

Given this table (which you should have provided in a form like this):

CREATE TABLE reports (rep_id int primary key, data json);
INSERT INTO reports (rep_id, data)
VALUES
(1, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barB.png", "pos": "top"}], "background":"background.png"}')
, (2, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barC.png", "pos": "top"}], "background":"bacakground.png"}')
, (3, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "middle"},{"album": 2, "src":"barB.png", "pos": "middle"}],"background":"background.png"}')
, (4, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 3, "src":"barB.png", "pos": "top"}], "background":"backgroundA.png"}')
;

JSON records of well-known, translatable type

Use json_populate_recordset() for unnesting the recordset "objects". The function requires a registered row type to define the names and data types of the resulting columns. For the purpose of this demo, or generally for ad-hoc queries, a temp table modeled after "objects" provides the same:

CREATE TEMP TABLE obj(album int, src text, pos text);
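
As a quick check (a minimal sketch against the sample data above), the function can be called on its own to see how each array is unnested into typed rows:

SELECT r.rep_id, o.*
FROM reports r
, json_populate_recordset(null::obj, r.data->'objects') o;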

To find the top 3 most common combinations ... of entries that have the same album, src, and background:

SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM reports r
, json_populate_recordset(null::obj, r.data->'objects') o
GROUP BY r.data->>'background'
, o.album
, o.src
ORDER BY count(*) DESC
LIMIT 3;

Each object counts, no matter whether from the same row or not. You did not define exactly how to handle that. Consequently, rep_id can pop up multiple times in the array ids. Add DISTINCT to array_agg() to fold possible duplicates. The count ct can be greater than the length of the array ids in this case.
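
For example, a minimal sketch of the same query with duplicates folded:

SELECT array_agg(DISTINCT r.rep_id) AS ids, count(*) AS ct
FROM reports r
, json_populate_recordset(null::obj, r.data->'objects') o
GROUP BY r.data->>'background'
, o.album
, o.src
ORDER BY count(*) DESC
LIMIT 3;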

Requires Postgres 9.3 for the JSON functions and operators and the implicit JOIN LATERAL.

JSON records of unknown or untranslatable type

json_array_elements() just unnests the json array without transforming the result into an SQL row. Access individual fields with JSON operators accordingly.

SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM reports r
, json_array_elements(r.data->'objects') o
GROUP BY r.data->>'background'
, o->>'album'
, o->>'src'
ORDER BY count(*) DESC
LIMIT 3;

Query for array elements inside JSON type

jsonb in Postgres 9.4+

You can use the same query as below, just with jsonb_array_elements().

But rather use the jsonb "contains" operator @> in combination with a matching GIN index on the expression data->'objects':

CREATE INDEX reports_data_gin_idx ON reports
USING gin ((data->'objects') jsonb_path_ops);

SELECT * FROM reports WHERE data->'objects' @> '[{"src":"foo.png"}]';

Since the key objects holds a JSON array, we need to match the structure in the search term and wrap the array element into square brackets, too. Drop the array brackets when searching a plain record.
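
For instance, a minimal sketch of such a plain-object containment test, assuming the column is of type jsonb (or cast accordingly):

SELECT * FROM reports WHERE data::jsonb @> '{"background": "background.png"}';

Note that this matches against the whole document, so it would need its own GIN index (on data rather than data->'objects') to be supported by an index.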

More explanation and options:

  • Index for finding an element in a JSON array

json in Postgres 9.3+

Unnest the JSON array with the function json_array_elements() in a lateral join in the FROM clause and test for its elements:

SELECT data::text, obj
FROM reports r, json_array_elements(r.data#>'{objects}') obj
WHERE obj->>'src' = 'foo.png';


Or, equivalent for just a single level of nesting:

SELECT *
FROM reports r, json_array_elements(r.data->'objects') obj
WHERE obj->>'src' = 'foo.png';

->>, -> and #> operators are explained in the manual.
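
A quick illustration of the three operators against the sample table above (a minimal sketch):

SELECT data->'objects' AS json_array      -- -> returns a json value
, data->>'background' AS text_value       -- ->> returns text
, data#>'{objects,0}' AS first_element    -- #> follows a path of keys / array indexes
FROM reports;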

Both queries use an implicit JOIN LATERAL.
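
Spelled out, the comma in the FROM list is shorthand for an explicit CROSS JOIN LATERAL (a sketch equivalent to the second query above):

SELECT *
FROM reports r
CROSS JOIN LATERAL json_array_elements(r.data->'objects') obj
WHERE obj->>'src' = 'foo.png';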

Closely related:

  • Query for element of array in JSON column

How do I query using fields inside the new PostgreSQL JSON datatype?

Postgres 9.2

I quote Andrew Dunstan on the pgsql-hackers list:

At some stage there will possibly be some json-processing (as opposed
to json-producing) functions, but not in 9.2.

That doesn't prevent him from providing an example implementation in PLV8 that should solve your problem. (The link is dead now; see modern PLV8 instead.)

Postgres 9.3

Offers an arsenal of new functions and operators to add "json-processing".

  • The manual on new JSON functionality.
  • The Postgres Wiki on new features in pg 9.3.

The answer to the original question in Postgres 9.3:

SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';

Advanced example:

  • Query combinations with nested array of records in JSON datatype

For bigger tables you may want to add an expression index to increase performance:

  • Index for finding an element in a JSON array

Postgres 9.4

Adds jsonb (b for "binary", values are stored as native Postgres types) and yet more functionality for both types. In addition to expression indexes mentioned above, jsonb also supports GIN, btree and hash indexes, GIN being the most potent of these.
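
For example, a minimal sketch of both GIN variants, assuming the column has been converted to jsonb (index names are arbitrary):

CREATE INDEX reports_data_jsonb_idx ON reports USING gin (data);                      -- default jsonb_ops
CREATE INDEX reports_data_jsonb_path_idx ON reports USING gin (data jsonb_path_ops);  -- smaller; supports only @>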

  • The manual on json and jsonb data types and functions.
  • The Postgres Wiki on JSONB in pg 9.4

The manual goes as far as suggesting:

In general, most applications should prefer to store JSON data as jsonb,
unless there are quite specialized needs, such as legacy assumptions about
ordering of object keys.

Bold emphasis mine.

Performance benefits from general improvements to GIN indexes.

Postgres 9.5

Completes the set of jsonb functions and operators and adds more functions to manipulate jsonb in place and for display.
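
A few of the additions, as a minimal sketch:

SELECT '{"a":1,"b":2}'::jsonb || '{"c":3}'::jsonb;    -- concatenate / add keys
SELECT '{"a":1,"b":2}'::jsonb - 'b';                  -- delete a key
SELECT jsonb_set('{"a":{"x":1}}', '{a,x}', '42');     -- update a nested value in place
SELECT jsonb_pretty('{"a":{"x":1}}');                 -- formatted display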

  • Major good news in the release notes of Postgres 9.5.

Select data based on two constraints for each row, using JSON data as input

Main result from the tests and discussion below: the multiple OR query (as suggested by @KA_lin) is faster for small data sets (n < 1000 or so). That approach scales badly for larger data sets, however, so I will probably stick with the TEMPORARY TABLE approach below in case my data set grows large in the future. Its overhead is not that high.

CREATE TEMPORARY TABLE my_data_virtual(id INTEGER NOT NULL, version TINYINT(3) NOT NULL);

INSERT INTO my_data_virtual VALUES
(1,0), (2,1), (3,0), (4,0), (5,1),
(6,0), (7,0), (8,1), (9,2), (10,1);

SELECT md.id, md.version
FROM my_data AS md
INNER JOIN my_data_virtual AS mdv
ON md.id = mdv.id AND md.version > mdv.version;
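
For reference, the nested OR variant that was benchmarked against looks roughly like this (a hypothetical sketch, with one predicate per (id, version) pair taken from the JSON input):

SELECT id, version
FROM my_data
WHERE (id = 1 AND version > 0)
OR (id = 2 AND version > 1)
OR (id = 3 AND version > 0)
-- ... one OR branch per input pair ...
;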

I ran a series of tests using the MySQLdb and timeit modules in Python. I created 5 databases: test_100, test_500, test_1000, test_5000 and test_10000. Each database was given a single table, data, which contained the following columns.

+-------------+---------+------+-----+---------+----------------+
| Field       | Type    | Null | Key | Default | Extra          |
+-------------+---------+------+-----+---------+----------------+
| id          | int(11) | NO   | PRI | NULL    | auto_increment |
| version     | int(11) | NO   |     | 0       |                |
| description | text    | YES  |     | NULL    |                |
+-------------+---------+------+-----+---------+----------------+

The tables in the databases were then filled with random versions from 0 to 5 and a semi-random amount of lorem ipsum text. The test_100.data table got 100 rows, the test_500.data table got 500 rows, and so forth. I then ran tests for both the query using nested OR statements and the one using a temporary table with all ids and random versions between 0 and 5.

Results

Results for nested OR query. Number of repeats for each n was 1000.

+----------+-------------+-------------+-------------+-------------+-------------+
|          | n = 100     | n = 500     | n = 1000    | n = 5000    | n = 10000   |
+----------+-------------+-------------+-------------+-------------+-------------+
| max      | 0.00719     | 0.02213     | 0.04325     | 1.75707     | 8.91687     |
| min      | 0.00077     | 0.00781     | 0.02696     | 0.63565     | 5.29613     |
| median   | 0.00100     | 0.00917     | 0.02996     | 0.82732     | 5.92217     |
| average  | 0.00111     | 0.01001     | 0.03057     | 0.82540     | 5.89577     |
+----------+-------------+-------------+-------------+-------------+-------------+

Results for temporary table query. Number of repeats for each n was 1000.

+----------+-------------+-------------+-------------+-------------+-------------+
|          | n = 100     | n = 500     | n = 1000    | n = 5000    | n = 10000   |
+----------+-------------+-------------+-------------+-------------+-------------+
| max      | 0.06352     | 0.07192     | 0.08798     | 0.28648     | 0.26939     |
| min      | 0.02119     | 0.02027     | 0.03126     | 0.07677     | 0.12269     |
| median   | 0.03075     | 0.03210     | 0.043833    | 0.10068     | 0.15839     |
| average  | 0.03121     | 0.03258     | 0.044968    | 0.10342     | 0.16153     |
+----------+-------------+-------------+-------------+-------------+-------------+

It seems that using nested OR queries is faster up to about n = 1000. From there on, the nested OR scales badly and the temporary table approach wins solidly. In my case I am likely to have a maximum of around 1000 rows, so it seems that I can choose between these two approaches relatively freely.

I will probably go for the temporary table approach in case my data set should become larger than expected. The payload is small in any case.

Notes

  • Since the timeit module in Python is a bit ticklish, the database is opened and closed for each run/repeat. This may add some overhead to the timings.
  • The queries for the temporary table approach were done in 3 steps: one for creating the temporary table, one for inserting the data and one for joining the tables.
  • The creation of the queries is not part of the timing; they are created outside of the Python timeit call.
  • Since both the versions in the inserted data and the random "client" data are randomly chosen between 0 and 5, it is likely that between 33 % and 50 % of the rows are selected. I have not verified this. This is not really my actual case, as the client will at any point hold almost the same data as the server.
  • I tried adding WHERE id IN (1,2,3...,10) to both the temporary table approach and the nested OR approach, but it neither sped things up nor slowed them down in any of the tests, except for the larger data sets with the multiple OR approach, where the times were slightly lower than without this WHERE clause.

Populate multiple tables from a single JSON object with json_populate_recordset

3 Steps:

  1. Access the field of the JSON object with ->.
  2. Create a derived table from the JSON array of records using json_populate_recordset().
  3. Decompose the row-type for the INSERT command.
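
For a single target table, the three steps condense into one statement (a minimal sketch using the tablename_a layout from the example below):

INSERT INTO tablename_a
SELECT t.*
FROM json_populate_recordset(NULL::tablename_a,
     '{"tablename_a":[{"a_id":1,"b_id":2,"c_id":3}]}'::json -> 'tablename_a') t;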

To reuse the input value for all tables, wrap it in data-modifying CTEs:

WITH input AS (
SELECT '{
"tablename_a":[{"a_id":1,"b_id":2,"c_id":3},
{"a_id":2,"b_id":51,"c_id":3}],
"tablename_b":[{"b_id":2,"name":"John Doe", "z_id":123},
{"b_id":51,"name":"Mary Ann", "z_id":412}],
"tablename_c":[{"c_id":3, "OS type":"Windows 7"}],
"tablename_z":[{"z_id":123, "Whatever":"Something"},
{"z_id":123, "Whatever":"Something else"}]
}'::json AS j
)
, a AS (
INSERT INTO tablename_a
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_a, i.j->'tablename_a') t
)
, b AS (
INSERT INTO tablename_b
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_b, i.j->'tablename_b') t
)
-- ... more ...
INSERT INTO tablename_z
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_z, i.j->'tablename_z') t
;


Using implicit JOIN LATERAL. Related:

  • Query combinations with nested array of records in JSON datatype

Query nested arrays in ArangoDB

You can inspect what the filter does using a simple trick: you RETURN the actual filter condition:

db._query(`FOR r IN resource  RETURN r.properties[*].property[*]`).toArray()
[
  [
    [
      "urn:project:domain:attribute:surname"
    ],
    [
      "urn:project:domain:attribute:givennames"
    ]
  ]
]

which makes it pretty clear what's going on. The IN operator can only work on one-dimensional arrays. You could work around this by using FLATTEN() to remove the sub-layers:

db._query(`FOR r IN resource  RETURN FLATTEN(r.properties[*].property[*])`).toArray()
[
  [
    "urn:project:domain:attribute:surname",
    "urn:project:domain:attribute:givennames"
  ]
]

However, while your documents are valid JSON (I guess they were converted from XML?), you should alter the structure as one would do it in JSON:

"properties" : {
"urn:project:domain:attribute:surname":[
"Simpson"
],
"urn:project:domain:attribute:givennames": [
"Homer",
"Jay"
]
}

The FILTER combination you specified would also find any other Jay (not only those in givennames), and the use of FLATTEN() prevents your filter statement from using indexes. For performance reasons, you don't want to run queries that can't use indexes on reasonably sized collections.

In contrast, you can use an array index on givennames with the above document layout:

db.resource.ensureIndex({type: "hash",
fields:
["properties.urn:project:domain:attribute:givennames[*]"]
})

Now double-check the explain output for the query:

db._explain("FOR r IN resource FILTER 'Jay' IN " + 
"r.properties.`urn:project:domain:attribute:givennames` RETURN r")
...
6 IndexNode 1 - FOR r IN resource /* hash index scan */
...
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash resource false false 100.00 % \
[ `properties.urn:project:domain:attribute:givennames[*]` ] \
("Jay" in r.`properties`.`urn:project:domain:attribute:givennames`)

which confirms that it's using the index.

How can I access (and save to my database) nested objects/properties within a JSON array via an API GET request in Ruby on Rails?

This was resolved by creating a Property and then using its ID in Units:

require 'rest-client'
require 'json'

# Define Method - API Request
def properties

  response = RestClient.get('https://api.valoff.ie/api/Property/GetProperties?Fields=*&LocalAuthority=CAVAN%20COUNTY%20COUNCIL&CategorySelected=OFFICE&Format=csv&Download=false')
  json = JSON.parse(response)

  json.each do |property|
    puts "Creating property #{property['PropertyNumber']}"

    property_model = Property.create!(
      publication_date: property['PublicationDate'],
      property_number: property['PropertyNumber'],
      county: property['County'],
      local_authority: property['LocalAuthority'],
      valuation: property['Valuation'],
      category: property['Category'],
      uses: property['Uses'],
      address_1: property['Address1'],
      address_2: property['Address2'],
      address_3: property['Address3'],
      address_4: property['Address4'],
      address_5: property['Address5'],
      car_park: property['CarPark'],
      xitm: property['Xitm'],
      yitm: property['Yitm']
    )

    property['ValuationReport'].each do |unit|
      puts "Creating unit."
      property_model.units.create!(
        level: unit['Level'],
        floor_use: unit['FloorUse'],
        area: unit['Area'],
        nav_per_m2: unit['NavPerM2'],
        nav: unit['Nav']
      )
    end
  end
end

# Call Method
properties

Reference: Ruby on Rails: How can I use JSONPath to access (and save to database) nested objects/properties within a JSON array?

Custom ordering by JSON column's array of objects in PostgreSQL

With PG10, I would approach the problem something like this:

select mt.*
from my_table mt
cross join lateral json_array_elements(my_json_stuff) as j(obj)
where ((obj->>'year')::int, (obj->>'month')::int) = (2020, 3)
order by (obj->>'sessions')::int desc;

In response to your comment, if you want to keep rows that do not have a 2020-03 entry, you can use a left join lateral, which behaves like a left join would. The lateral gives you the implicit join back to the row containing the json value that was unnested.

select mt.*
from my_table mt
left join lateral json_array_elements(my_json_stuff) as j(obj)
on ((obj->>'year')::int, (obj->>'month')::int) = (2020, 3)
order by (obj->>'sessions')::int desc nulls last;

One other note: You do want to use jsonb instead of json going forward.
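
If the column is still json, converting it is a one-time change (a minimal sketch, assuming the my_table.my_json_stuff names from the queries above):

ALTER TABLE my_table
  ALTER COLUMN my_json_stuff TYPE jsonb USING my_json_stuff::jsonb;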


