Query combinations with nested array of records in JSON datatype
Given this table (which you should have provided in a form like this):
CREATE TABLE reports (rep_id int primary key, data json);
INSERT INTO reports (rep_id, data)
VALUES
(1, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barB.png", "pos": "top"}], "background":"background.png"}')
, (2, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barC.png", "pos": "top"}], "background":"bacakground.png"}')
, (3, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "middle"},{"album": 2, "src":"barB.png", "pos": "middle"}],"background":"background.png"}')
, (4, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 3, "src":"barB.png", "pos": "top"}], "background":"backgroundA.png"}')
;
JSON records of well-known, translatable type
Use json_populate_recordset() to unnest the recordset "objects". The function requires a registered row type to define the names and data types of the resulting columns. For the purpose of this demo, or generally for ad-hoc queries, a temp table modeled after "objects" can provide the same:
CREATE TEMP TABLE obj(album int, src text, pos text);
To find the top 3 most common combinations ... of entries that have the same album, src, and background:
SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM reports r
, json_populate_recordset(null::obj, r.data->'objects') o
GROUP BY r.data->>'background'
, o.album
, o.src
ORDER BY count(*) DESC
LIMIT 3;
Each object counts, no matter whether it comes from the same row or not. You did not define how to handle that exactly. Consequently, rep_id can pop up multiple times in the array ids. Add DISTINCT to array_agg() to fold possible duplicates. The count ct can be greater than the length of the array ids in this case.
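The grouping logic of the query above can be sketched in plain Python (an illustrative analogue, not the Postgres implementation); the sample data is the INSERT from the top of this answer:

```python
import json
from collections import defaultdict

rows = [
    (1, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barB.png", "pos": "top"}], "background":"background.png"}'),
    (2, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 2, "src":"barC.png", "pos": "top"}], "background":"bacakground.png"}'),
    (3, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "middle"},{"album": 2, "src":"barB.png", "pos": "middle"}],"background":"background.png"}'),
    (4, '{"objects":[{"album": 1, "src":"fooA.png", "pos": "top"}, {"album": 3, "src":"barB.png", "pos": "top"}], "background":"backgroundA.png"}'),
]

groups = defaultdict(lambda: {"ids": set(), "ct": 0})
for rep_id, data in rows:
    doc = json.loads(data)
    for obj in doc["objects"]:                 # json_populate_recordset() analogue
        key = (doc["background"], obj["album"], obj["src"])
        groups[key]["ids"].add(rep_id)         # a set acts like array_agg(DISTINCT ...)
        groups[key]["ct"] += 1                 # every object counts

# ORDER BY count(*) DESC LIMIT 3
top = sorted(groups.items(), key=lambda kv: -kv[1]["ct"])[:3]
```

Rows 1 and 3 share both combinations on "background.png", so the two top groups have ct = 2 while ids holds each rep_id only once.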
Requires Postgres 9.3 for the JSON functions and operators and the implicit JOIN LATERAL.
JSON records of unknown or untranslatable type
json_array_elements() just unnests the JSON array without transforming the result into an SQL row. Access individual fields with JSON operators accordingly.
SELECT array_agg(r.rep_id) AS ids, count(*) AS ct
FROM reports r
, json_array_elements(r.data->'objects') o
GROUP BY r.data->>'background'
, o->>'album'
, o->>'src'
ORDER BY count(*) DESC
LIMIT 3;
Query for array elements inside JSON type
jsonb in Postgres 9.4+
You can use the same query as below, just with jsonb_array_elements().
But rather use the jsonb "contains" operator @> in combination with a matching GIN index on the expression data->'objects':
CREATE INDEX reports_data_gin_idx ON reports
USING gin ((data->'objects') jsonb_path_ops);
SELECT * FROM reports WHERE data->'objects' @> '[{"src":"foo.png"}]';
Since the key objects holds a JSON array, we need to match the structure in the search term and wrap the array element in square brackets, too. Drop the array brackets when searching for a plain record.
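A rough Python sketch of this containment semantics, assuming flat key/value objects (real jsonb containment also recurses into nested values):

```python
import json

def array_contains(doc_array, search_array):
    """Rough analogue of jsonb `@>` for an array of flat objects:
    every object in the search term must be a subset of some element."""
    return all(
        any(search.items() <= elem.items() for elem in doc_array)
        for search in search_array
    )

objects = json.loads('[{"album": 1, "src": "foo.png", "pos": "top"}]')
assert array_contains(objects, json.loads('[{"src": "foo.png"}]'))
assert not array_contains(objects, json.loads('[{"src": "bar.png"}]'))
```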
More explanation and options:
- Index for finding an element in a JSON array
json in Postgres 9.3+
Unnest the JSON array with the function json_array_elements() in a lateral join in the FROM clause and test for its elements:
SELECT data::text, obj
FROM reports r, json_array_elements(r.data#>'{objects}') obj
WHERE obj->>'src' = 'foo.png';
Or, equivalent for just a single level of nesting:
SELECT *
FROM reports r, json_array_elements(r.data->'objects') obj
WHERE obj->>'src' = 'foo.png';
The operators ->>, -> and #> are explained in the manual. Both queries use an implicit JOIN LATERAL.
Closely related:
- Query for element of array in JSON column
How do I query using fields inside the new PostgreSQL JSON datatype?
Postgres 9.2
I quote Andrew Dunstan on the pgsql-hackers list:
At some stage there will possibly be some json-processing (as opposed
to json-producing) functions, but not in 9.2.
Doesn't prevent him from providing an example implementation in PLV8 that should solve your problem. (Link is dead now, see modern PLV8 instead.)
Postgres 9.3
Offers an arsenal of new functions and operators to add "json-processing".
- The manual on new JSON functionality.
- The Postgres Wiki on new features in pg 9.3.
The answer to the original question in Postgres 9.3:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
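The per-element filter translates directly into plain Python, if that helps to see what json_array_elements() plus ->> are doing (an illustrative analogue only):

```python
import json

elems = json.loads(
    '[{"name": "Toby", "occupation": "Software Engineer"},'
    ' {"name": "Zaphod", "occupation": "Galactic President"}]'
)
# WHERE elem->>'name' = 'Toby' becomes a plain field comparison per element:
matches = [e for e in elems if e.get("name") == "Toby"]
```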
Advanced example:
- Query combinations with nested array of records in JSON datatype
For bigger tables you may want to add an expression index to increase performance:
- Index for finding an element in a JSON array
Postgres 9.4
Adds jsonb (b for "binary", values are stored as native Postgres types) and yet more functionality for both types. In addition to the expression indexes mentioned above, jsonb also supports GIN, btree and hash indexes, GIN being the most potent of these.
- The manual on json and jsonb data types and functions.
- The Postgres Wiki on JSONB in pg 9.4
The manual goes as far as suggesting:
In general, most applications should prefer to store JSON data as
jsonb
, unless there are quite specialized needs, such as legacy
assumptions about ordering of object keys.
Bold emphasis mine.
Performance benefits from general improvements to GIN indexes.
Postgres 9.5
Completes the set of jsonb functions and operators and adds more functions to manipulate jsonb in place and for display.
- Major good news in the release notes of Postgres 9.5.
Select data based on two constraints for each row, using JSON data as input
Main result of the tests and discussion below: using the multiple OR query (as suggested by @KA_lin) is faster for small data sets (n < 1000 or so). This approach scales badly for larger data sets, however, so I will probably stick with the TEMPORARY TABLE approach below in case my data set grows large in the future. The payload for this is not that high.
CREATE TEMPORARY TABLE my_data_virtual(id INTEGER NOT NULL, version TINYINT(3) NOT NULL);
INSERT INTO my_data_virtual VALUES
(1,0), (2,1), (3,0), (4,0), (5,1),
(6,0), (7,0), (8,1), (9,2), (10,1);
SELECT md.id, md.version
FROM my_data AS md
INNER JOIN my_data_virtual AS mdv
ON md.id = mdv.id AND md.version > mdv.version;
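For a self-contained sketch of the two query shapes being compared, here is a SQLite version driven from Python's standard library. Table and column names mirror the example above, and the join condition assumes the intent is to select rows whose stored version is newer than the client's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_data (id INTEGER PRIMARY KEY, version INTEGER NOT NULL)")
conn.executemany("INSERT INTO my_data VALUES (?, ?)",
                 [(1, 2), (2, 0), (3, 1), (4, 3), (5, 0)])

client = [(1, 0), (2, 1), (3, 0), (4, 0), (5, 1)]  # (id, client_version) pairs

# Variant 1: one OR clause per client row.
ors = " OR ".join("(id = ? AND version > ?)" for _ in client)
params = [v for pair in client for v in pair]
with_or = conn.execute(f"SELECT id FROM my_data WHERE {ors}", params).fetchall()

# Variant 2: temporary table plus join, as in the answer above.
conn.execute("CREATE TEMP TABLE my_data_virtual (id INTEGER NOT NULL, version INTEGER NOT NULL)")
conn.executemany("INSERT INTO my_data_virtual VALUES (?, ?)", client)
with_join = conn.execute(
    "SELECT md.id FROM my_data AS md "
    "JOIN my_data_virtual AS mdv ON md.id = mdv.id AND md.version > mdv.version"
).fetchall()
```

Both variants return the rows whose stored version exceeds the client's version (ids 1, 3 and 4 with this sample data).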
I ran a series of tests using the MySQLdb and timeit modules in Python. I created 5 databases: test_100, test_500, test_1000, test_5000 and test_10000. Each database was given a single table, data, which contained the following columns.
+-------------+---------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| version | int(11) | NO | | 0 | |
| description | text | YES | | NULL | |
+-------------+---------+------+-----+---------+----------------+
The tables in the databases were then filled with random versions from 0 to 5 and a semi-random amount of lorem ipsum text. The test_100.data table got 100 rows, the test_500.data table got 500 rows and so forth. I then ran tests for both the query using nested OR statements and the query using a temporary table with all ids and random versions between 0 and 5.
Results
Results for the nested OR query. Number of repeats for each n was 1000.
+----------+-------------+-------------+-------------+-------------+-------------+
| | n = 100 | n = 500 | n = 1000 | n = 5000 | n = 10000 |
+----------+-------------+-------------+-------------+-------------+-------------+
| max | 0.00719 | 0.02213 | 0.04325 | 1.75707 | 8.91687 |
| min | 0.00077 | 0.00781 | 0.02696 | 0.63565 | 5.29613 |
| median | 0.00100 | 0.00917 | 0.02996 | 0.82732 | 5.92217 |
| average | 0.00111 | 0.01001 | 0.03057 | 0.82540 | 5.89577 |
+----------+-------------+-------------+-------------+-------------+-------------+
Results for the temporary table query. Number of repeats for each n was 1000.
+----------+-------------+-------------+-------------+-------------+-------------+
| | n = 100 | n = 500 | n = 1000 | n = 5000 | n = 10000 |
+----------+-------------+-------------+-------------+-------------+-------------+
| max | 0.06352 | 0.07192 | 0.08798 | 0.28648 | 0.26939 |
| min | 0.02119 | 0.02027 | 0.03126 | 0.07677 | 0.12269 |
| median | 0.03075 | 0.03210 | 0.043833 | 0.10068 | 0.15839 |
| average | 0.03121 | 0.03258 | 0.044968 | 0.10342 | 0.16153 |
+----------+-------------+-------------+-------------+-------------+-------------+
It seems that using nested OR queries is faster up to about n = 1000. From there on, the nested OR query scales badly and the temporary table approach wins solidly. In my case I am likely to have a maximum of around 1000 rows, so it seems that I can choose between these two approaches relatively freely.
I will probably go for the temporary table approach in case my data set should become larger than expected. The payload is small in any case.
Notes
- Since the timeit module in Python is a bit ticklish, the database is opened and closed for each run/repeat. This might add some overhead to the timings.
- The queries for the temporary table approach were done in 3 steps: 1 for creating the temporary table, 1 for inserting the data and 1 for joining the tables.
- The creation of the queries is not part of the timing; it happens outside of the Python timeit call.
- Since both the versions in the inserted data and the random "client" data are randomly chosen between 0 and 5, it is likely that between 33 % and 50 % of the rows are selected. I have not verified this. This is not really the case I have, as the client data will at any point have almost the same data as the server.
- I tried adding WHERE id IN (1,2,3...,10) to both the temporary table approach and the nested OR approach, but it neither sped things up nor slowed them down in any of the tests, except for the larger data sets with the multiple OR approach, where the times were slightly lower than without this WHERE clause.
Populate multiple tables from a single JSON object with json_populate_recordset
3 steps:
- Access the field of the JSON object with ->.
- Create a derived table from the JSON array of records using json_populate_recordset().
- Decompose the row type for the INSERT command.
To reuse the input value for all tables, wrap it in data-modifying CTEs:
WITH input AS (
SELECT '{
"tablename_a":[{"a_id":1,"b_id":2,"c_id":3},
{"a_id":2,"b_id":51,"c_id":3}],
"tablename_b":[{"b_id":2,"name":"John Doe", "z_id":123},
{"b_id":51,"name":"Mary Ann", "z_id":412}],
"tablename_c":[{"c_id":3, "OS type":"Windows 7"}],
"tablename_z":[{"z_id":123, "Whatever":"Something"},
{"z_id":123, "Whatever":"Something else"}]
}'::json AS j
)
, a AS (
INSERT INTO tablename_a
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_a, i.j->'tablename_a') t
)
, b AS (
INSERT INTO tablename_b
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_b, i.j->'tablename_b') t
)
-- ... more ...
INSERT INTO tablename_z
SELECT t.*
FROM input i
, json_populate_recordset(NULL::tablename_z, i.j->'tablename_z') t
;
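The dispatch pattern generalizes outside Postgres as well. Below is an illustrative Python/SQLite sketch where each top-level key of the JSON object feeds the table of the same name; only two of the sample tables are stood up, and the column handling assumes every record in an array has the same keys:

```python
import json
import sqlite3

payload = json.loads("""{
  "tablename_a": [{"a_id": 1, "b_id": 2, "c_id": 3},
                  {"a_id": 2, "b_id": 51, "c_id": 3}],
  "tablename_b": [{"b_id": 2, "name": "John Doe", "z_id": 123}]
}""")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename_a (a_id INTEGER, b_id INTEGER, c_id INTEGER)")
conn.execute("CREATE TABLE tablename_b (b_id INTEGER, name TEXT, z_id INTEGER)")

for table, records in payload.items():
    cols = list(records[0])               # column names from the first record
    placeholders = ", ".join("?" for _ in cols)
    conn.executemany(
        f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})",
        [tuple(rec[c] for c in cols) for rec in records],
    )
```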
Using an implicit JOIN LATERAL. Related:
- Query combinations with nested array of records in JSON datatype
Query nested arrays in ArangoDB
You can inspect what the filter does using a simple trick: you RETURN the actual filter condition:
db._query(`FOR r IN resource RETURN r.properties[*].property[*]`).toArray()
[
[
[
"urn:project:domain:attribute:surname"
],
[
"urn:project:domain:attribute:givennames"
]
]
]
which makes it pretty clear what's going on. The IN operator can only work on one-dimensional arrays. You can work around this by using FLATTEN() to remove the sub-layers:
db._query(`FOR r IN resource RETURN FLATTEN(r.properties[*].property[*])`).toArray()
[
[
"urn:project:domain:attribute:surname",
"urn:project:domain:attribute:givennames"
]
]
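What FLATTEN() does here can be mimicked in a few lines of Python (illustrative only; itertools.chain removes exactly one nesting level, like FLATTEN() with its default depth of 1):

```python
from itertools import chain

# Shape of r.properties[*].property[*] from the query output above:
properties = [
    {"property": ["urn:project:domain:attribute:surname"]},
    {"property": ["urn:project:domain:attribute:givennames"]},
]

nested = [p["property"] for p in properties]   # a list of lists
flat = list(chain.from_iterable(nested))       # one level removed, IN now works
```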
However, while your documents are valid JSON (I guess it's converted from XML?), you should alter the structure as one would do it in JSON:
"properties" : {
"urn:project:domain:attribute:surname":[
"Simpson"
],
"urn:project:domain:attribute:givennames": [
"Homer",
"Jay"
]
}
The FILTER combination you specify would also find any other Jay (not only those found in givennames), and the usage of FLATTEN() prohibits using indices in your filter statement. For performance reasons, you don't want to run queries that can't use indices on reasonably sized collections.
In contrast, with the above document layout you can use an array index on givennames:
db.resource.ensureIndex({type: "hash",
fields:
["properties.urn:project:domain:attribute:givennames[*]"]
})
Now double-check the explain output for the query:
db._explain("FOR r IN resource FILTER 'Jay' IN " +
"r.properties.`urn:project:domain:attribute:givennames` RETURN r")
...
6 IndexNode 1 - FOR r IN resource /* hash index scan */
...
Indexes used:
By Type Collection Unique Sparse Selectivity Fields Ranges
6 hash resource false false 100.00 % \
[ `properties.urn:project:domain:attribute:givennames[*]` ] \
("Jay" in r.`properties`.`urn:project:domain:attribute:givennames`)
The output shows that it is using the index.
How can I access (and save to my database) nested objects/properties within a JSON array via an API GET request in Ruby on Rails?
This was resolved by creating a Property and then using its ID in Units:
require 'rest-client'
require 'json'
# Define Method - API Request
def properties
response = RestClient.get('https://api.valoff.ie/api/Property/GetProperties?Fields=*&LocalAuthority=CAVAN%20COUNTY%20COUNCIL&CategorySelected=OFFICE&Format=csv&Download=false')
json = JSON.parse(response)
json.each do |property|
puts "Creating property #{property['PropertyNumber']}"
property_model = Property.create!(
publication_date: property['PublicationDate'],
property_number: property['PropertyNumber'],
county: property['County'],
local_authority: property['LocalAuthority'],
valuation: property['Valuation'],
category: property['Category'],
uses: property['Uses'],
address_1: property['Address1'],
address_2: property['Address2'],
address_3: property['Address3'],
address_4: property['Address4'],
address_5: property['Address5'],
car_park: property['CarPark'],
xitm: property['Xitm'],
yitm: property['Yitm']
)
property['ValuationReport'].each do |unit|
puts "Creating unit."
property_model.units.create!(
level: unit['Level'],
floor_use: unit['FloorUse'],
area: unit['Area'],
nav_per_m2: unit['NavPerM2'],
nav: unit['Nav']
)
end
end
end
# Call Method
properties
Reference: Ruby on Rails: How can I use JSONPath to access (and save to database) nested objects/properties within a JSON array?
Custom ordering by JSON column's array of objects in PostgreSQL
With PG10, I would approach the problem something like this:
select mt.*
from my_table mt
cross join lateral json_array_elements(my_json_stuff) as j(obj)
where ((obj->>'year')::int, (obj->>'month')::int) = (2020, 3)
order by (obj->>'sessions')::int desc;
In response to your comment: if you want to keep rows that do not have a 2020-03 entry, then you can do a left join lateral, which behaves like a left join would. The lateral buys you the implicit join to the row containing the json value that was blown out.
select mt.*
from my_table mt
left join lateral json_array_elements(my_json_stuff) as j(obj)
on ((obj->>'year')::int, (obj->>'month')::int) = (2020, 3)
order by (obj->>'sessions')::int desc nulls last;
One other note: you do want to use jsonb instead of json going forward.
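The semantics of the left-join variant (keep every row, match the 2020-03 element if present, sort by its sessions count with non-matches last) can be sketched in plain Python; the row shapes and the key names below are assumptions modeled on the query above:

```python
# Hypothetical rows: each has an id and a JSON-like array of monthly stats.
rows = [
    {"id": 1, "stats": [{"year": 2020, "month": 3, "sessions": 5}]},
    {"id": 2, "stats": [{"year": 2019, "month": 12, "sessions": 9}]},  # no 2020-03 entry
    {"id": 3, "stats": [{"year": 2020, "month": 3, "sessions": 8}]},
]

def march_2020(row):
    """The left lateral join: the matching element, or None (the NULL row)."""
    return next((o for o in row["stats"]
                 if (o["year"], o["month"]) == (2020, 3)), None)

# ORDER BY sessions DESC NULLS LAST: non-matches sort after all matches.
ordered = sorted(
    rows,
    key=lambda r: (march_2020(r) is None,
                   -(march_2020(r)["sessions"] if march_2020(r) else 0)),
)
```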