How to get input file name as column in AWS Athena external tables
You can do this with the $path pseudo column.
select "$path" from table
Athena table - file's path
From Find the Source File for an Athena Table Row:
- To find out which Amazon S3 file contains data that is returned by a specific row in an Athena table, run a
SELECT $path
query.- To find out which Athena table rows are associated with a specific Amazon S3 file, run a SELECT query with the
WHERE $path
condition.
From CREATE EXTERNAL TABLE - Amazon Redshift:
By default, Amazon Redshift creates external tables with the pseudocolumns
$path
and$size
. Select these columns to view the path to the data files on Amazon S3 and the size of the data files for each row returned by a query. The$path
and$size
column names must be delimited with double quotation marks. ASELECT *
clause doesn't return the pseudocolumns. You must explicitly include the$path
and$size
column names in your query, as the following example shows:
select "$path", "$size"
from spectrum.sales_part
where saledate = '2008-12-01';
AWS Redshift Spectrum - how to get the s3 filenames in the external table
Since recently, you can use specific pseudo-columns to access the path and the size of the object in S3 for lineage information.
http://docs.aws.amazon.com/redshift/latest/dg/c-spectrum-external-tables.html#c-spectrum-external-tables-pseudocolumns
An example for such a query would be:
>> select distinct "$path", "$size" from spectrum.sales_part;
$path | $size
---------------------------------------+-------
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-01/ | 1616
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444
s3://awssampledbuswest2/tickit/spectrum/sales_partition/saledate=2008-02/ | 1444
Selecting specific files for athena
Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.
There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.
What you do is you create a table with
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
and then instead of pointing the LOCATION
of the table to the actual files, you point it to a prefix with a single symlink.txt
file (or point each partition to a prefix with a single symlink.txt
). In the symlink.txt
file you add the S3 URIs of the files to include in the table, one per line.
The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.
You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109
Related Topics
SQL Column Definition: Default Value and Not Null Redundant
Is There Any Difference Between "!=" and "<>" in Oracle SQL
Sql: How to Select Earliest Row
"Order by ... Using" Clause in Postgresql
How to Get Week Start and End Date String in Postgresql
How to Count the Number of Times a Character Appears in a SQL Column
How to Execute a Stored Procedure in a SQL Agent Job
Exec Failed Because the Name Not a Valid Identifier
MySQL Slow on First Query, Then Fast for Related Queries
How to Give a Unique Constraint to a Combination of Columns in Oracle
New Line Issue When Copying Data from SQL Server 2012 to Excel
How to Insert Multiple Rows into Oracle with a Sequence Value
Which Is Better: Bookmark/Key Lookup or Index Seek
How to Select the Comparison of Two Columns as One Column in Oracle
SQL Server 2008 Query to Find Rows Containing Non-Alphanumeric Characters in a Column