How to Get Input File Name as Column in Aws Athena External Tables

How to get input file name as column in AWS Athena external tables

You can do this with the $path pseudo column.

select "$path" from table

Athena table - file's path

From Find the Source File for an Athena Table Row:

  • To find out which Amazon S3 file contains data that is returned by a specific row in an Athena table, run a SELECT $path query.
  • To find out which Athena table rows are associated with a specific Amazon S3 file, run a SELECT query with the WHERE $path condition.

From CREATE EXTERNAL TABLE - Amazon Redshift:

By default, Amazon Redshift creates external tables with the pseudocolumns $path and $size. Select these columns to view the path to the data files on Amazon S3 and the size of the data files for each row returned by a query. The $path and $size column names must be delimited with double quotation marks. A SELECT * clause doesn't return the pseudocolumns. You must explicitly include the $path and $size column names in your query, as the following example shows:

select "$path", "$size"
from spectrum.sales_part
where saledate = '2008-12-01';

AWS Redshift Spectrum - how to get the s3 filenames in the external table

Since recently, you can use specific pseudo-columns to access the path and the size of the object in S3 for lineage information.

http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html#c-spectrum-external-tables-pseudocolumns

An example for such a query would be:

>> select distinct "$path", "$size" from spectrum.sales_part;

$path | $size
---------------------------------------+-------
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444

Selecting specific files for athena

Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.

There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.

What you do is you create a table with

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'

and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.

The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.

You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109



Related Topics



Leave a reply



Submit