Store Multiple Elements in JSON Files in Aws Athena

Store multiple elements in json files in AWS Athena

Question1: Store multiple elements in json files for AWS Athena

I need to rewrite my json file as

{"eventId":"1","eventName":"INSERT","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"New item!","Id":101}}, {"eventId":"2","eventName":"MODIFY","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}, {"eventId":"3","eventName":"REMOVE","eventVersion":"1.0","eventSource":"aws:dynamodb","awsRegion":"us-west-2","image":{"Message":"This item has changed","Id":101}}

That means

Remove the square brackets [ ] Keep each element in one line

{.....................}
{.....................}
{.....................}

Question2. Access nonlinear json attributes

CREATE EXTERNAL TABLE IF NOT EXISTS <tablename> (
  `eventId` string,
  `eventName` string,
  `eventVersion` string,
  `eventSource` string,
  `awsRegion` string,
  `image` struct <`Id` : string,
                  `Message` : string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = '1',
   "dots.in.keys" = "true"
) LOCATION 's3://exampletablewithstream-us-west-2/';

Query:

select image.Id, image.message from <tablename>;

Ref:

http://engineering.skybettingandgaming.com/2015/01/20/parsing-json-in-hive/

https://github.com/rcongiu/Hive-JSON-Serde#mapping-hive-keywords

AWS Athena on S3 bucket with some JSON files

Yes, Athena (Presto, Hive) requires that the files stored within the table's LOCATION have a consistent format. I believe you need to move the files to make separate tables for each underlying data schema.

How AWS Athena deals with single line JSONs?

I believe there is no way a file with such JSON can be processed properly because a separator is required in order to distribute work. There is no explicit information in documentation on how to provide a custom separator, and most likely it is not possible in supported JSON SerDe libraries. Besides that, there is no distinct separator between given JSON objects that is not used inside JSON itself. In fact, there is no separator at all.

However, it is possible to use Firehose Data Transformation to buffer incoming data and invoke a Lambda function with each buffer asynchronously. There are predefined Lambda blueprints, and Kinesis Firehose Processing can be used in this case to add new line characters between JSON objects.

Each transformed record is supposed to contain recordId, result and Base64 encoded data with the transformed payload.
There are multiple examples of such Lambda function, e.g. this python sample in Amazon AWS samples repos on GitHub.

AWS Athena query JSON array with AND Condition

unnest flattens data into multiple rows, so you can process the array without unnesting using array functions. Version of Presto currently used by Athena does not support any_match so you will need to use cardinality + filter combination (and it does not support filtering via json path):

-- sample data
WITH dataset (json_str) AS (
    VALUES (
            json '{
            "firstName":"Emily",
            "address":{
               "streetAddress":"101",
               "city":"abc",
               "state":"",
               "phoneNumbers":[
                  {
                     "type":"home",
                     "number":"11"
                  },
                  {
                     "type":"city",
                     "number":"4"
                  }
               ]
            }
         }'
        ),
        (
            json '{
            "firstName":"Smily",
            "address":{
               "streetAddress":"102",
               "city":"def",
               "state":"",
               "phoneNumbers":[
                  {
                     "type":"home",
                     "number":"1"
                  },
                  {
                     "type":"city",
                     "number":"1"
                  }
               ]
            }
         }'
        )
) -- query
select street_address,
    city
from (
        select JSON_EXTRACT_SCALAR(json_str, '$.address.streetAddress') as street_address,
            JSON_EXTRACT_SCALAR(json_str, '$.address.city') as city,
            cast(
                JSON_EXTRACT(json_str, '$.address.phoneNumbers') as array(json)
            ) phones
        from dataset
    )
where cardinality(
        filter(
            phones,
            js->json_extract_scalar(js, '$.type') = 'home'
                and try_cast(json_extract_scalar(js, '$.number') as integer) > 2
        )
    ) > 0 -- check for home
    and
    cardinality(
        filter(
            phones,
            js->json_extract_scalar(js, '$.type') = 'city'
                and json_extract_scalar(js, '$.number') = '4'
        )
    ) > 0 -- check for city

Output:

street_address	city
101	abc

JSON SerDe in Hive/Athena: turning one JSON object into multiple rows?

You can't make the serde do it automatically, but you can achieve what you're after in a query. You can then create a view to simulate a table with the data elements unwrapped.

The way you do this is to use the UNNEST keyword. This produces one new row per element in an array:

SELECT
  foo,
  bar,
  element
FROM my_table, UNNEST(data) AS t(element)

If your JSON looked like this:

{"foo": "f1", "bar": "b1", "data": [1, 2, 3]}
{"foo": "f2", "bar": "b2", "data": [4, 5]}

The result of the query would look like this:

foo | bar | element
----+-----+--------
f1  | b1  | 1
f1  | b1  | 2
f1  | b1  | 3
f2  | b2  | 4
f2  | b2  | 5

Store Multiple Elements in JSON Files in Aws Athena