Store multiple elements in json files in AWS Athena
Question1: Store multiple elements in json files for AWS Athena
I need to rewrite my json file as
{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}, {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}, {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}
That means
Remove the square brackets [ ] Keep each element in one line
{.....................}
{.....................}
{.....................}
Question2. Access nonlinear json attributes
CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
`eventId` string,
`eventName` string,
`eventVersion` string,
`eventSource` string,
`awsRegion` string,
`image` struct <`Id` : string,
`Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'serialization.format' = '1',
"dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';
Query:
select image.Id, image.message from <tablename>;
Ref:
http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/
https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords
AWS Athena on S3 bucket with some JSON files
Yes, Athena (Presto, Hive) requires that the files stored within the table's LOCATION have a consistent format. I believe you need to move the files to make separate tables for each underlying data schema.
How AWS Athena deals with single line JSONs?
I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.
However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and Kinesis Firehose Processing
can be used in this case to add new line characters between JSON objects.
Each transformed record is supposed to contain recordId
, result
and Base64 encoded data
with the transformed payload.
There are multiple examples of such Lambda function, e.g. this python sample in Amazon AWS samples repos on GitHub.
AWS Athena query JSON array with AND Condition
unnest
flattens data into multiple rows, so you can process the array without unnesting using array functions. Version of Presto currently used by Athena does not support any_match
so you will need to use cardinality
+ filter
combination (and it does not support filtering via json path):
-- sample data
WITH dataset (json_str) AS (
VALUES (
json '{
"firstName":"Emily",
"address":{
"streetAddress":"101",
"city":"abc",
"state":"",
"phoneNumbers":[
{
"type":"home",
"number":"11"
},
{
"type":"city",
"number":"4"
}
]
}
}'
),
(
json '{
"firstName":"Smily",
"address":{
"streetAddress":"102",
"city":"def",
"state":"",
"phoneNumbers":[
{
"type":"home",
"number":"1"
},
{
"type":"city",
"number":"1"
}
]
}
}'
)
) -- query
select street_address,
city
from (
select JSON_EXTRACT_SCALAR(json_str, '$.address.streetAddress') as street_address,
JSON_EXTRACT_SCALAR(json_str, '$.address.city') as city,
cast(
JSON_EXTRACT(json_str, '$.address.phoneNumbers') as array(json)
) phones
from dataset
)
where cardinality(
filter(
phones,
js->json_extract_scalar(js, '$.type') = 'home'
and try_cast(json_extract_scalar(js, '$.number') as integer) > 2
)
) > 0 -- check for home
and
cardinality(
filter(
phones,
js->json_extract_scalar(js, '$.type') = 'city'
and json_extract_scalar(js, '$.number') = '4'
)
) > 0 -- check for city
Output:
street_address | city |
---|---|
101 | abc |
JSON SerDe in Hive/Athena: turning one JSON object into multiple rows?
You can't make the serde do it automatically, but you can achieve what you're after in a query. You can then create a view to simulate a table with the data elements unwrapped.
The way you do this is to use the UNNEST
keyword. This produces one new row per element in an array:
SELECT
foo,
bar,
element
FROM my_table, UNNEST(data) AS t(element)
If your JSON looked like this:
{"foo": "f1", "bar": "b1", "data": [1, 2, 3]}
{"foo": "f2", "bar": "b2", "data": [4, 5]}
The result of the query would look like this:
foo | bar | element
----+-----+--------
f1 | b1 | 1
f1 | b1 | 2
f1 | b1 | 3
f2 | b2 | 4
f2 | b2 | 5
Related Topics
How to Generate Crud Stored Procedures from a Table in SQL Server Management Studio
Bigquery SQL for Sliding Window Aggregate
How to Create an "On-The-Fly" Mapping Table Within a Select Statement in Postgresql
Use Row Values as Columns in Postgresql
Error Importing Azure Bacpac File to Local Db Error Incorrect Syntax Near External
Regular Expression to Remove Comments from SQL Statement
Reverse String Word by Word Using SQL
Pivot/Crosstab Query in Oracle 10G (Dynamic Column Number)
Extract Day of Week from Date Field in Postgresql Assuming Weeks Start on Monday
SQL Distinct Keyword Bogs Down Performance
Insert Multiple Rows into Single Column
What Free SQL Formatting Tools Exist
How to Further Optimize a Derived Table Query Which Performs Better Than the Joined Equivalent