How to Record Created_At and Updated_At Timestamps in Hive

How to record created_at and updated_at timestamps in Hive?

Hive does not provide such a mechanism. You can achieve this by calling a UDF in your select: from_unixtime(unix_timestamp()) as created_at. Note that this is evaluated in each mapper or reducer, so different rows may get different values. If you need the same value for the whole dataset (on Hive versions before 1.2.0), pass the value to the script as a variable and reference it inside as: '${hiveconf:created_at}' as created_at
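A minimal sketch of the variable approach described above. The script name load_target.hql, the tables source_table and target_table, and their columns are hypothetical and only illustrate where the substituted value ends up:

```sql
-- Hypothetical invocation (script name and value are assumptions):
--   hive --hiveconf created_at="2020-01-01 00:00:00" -f load_target.hql

-- Inside load_target.hql: the variable is substituted as plain text
-- before the query runs, so every row gets the same literal value.
insert overwrite table target_table
select s.id,
       s.payload,
       '${hiveconf:created_at}' as created_at
from source_table s;
```

Because the substitution is textual, the value arrives as a string literal; cast it if the target column is a timestamp.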

Update: as of Hive 1.2.0, current_timestamp returns the current timestamp at the start of query evaluation, and all calls of current_timestamp within the same query return the same value. About unix_timestamp(), the documentation says: "Gets current Unix timestamp in seconds. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP constant." So CURRENT_TIMESTAMP is not a function, it's a constant!
See the docs: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

For Hive queries, CURRENT_TIMESTAMP is preferable when you rewrite tables or partitions or insert into them: whole files are written, not individual records, so the created_at timestamp should be the same for all rows written by the query.
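As an illustration, a partition rewrite might look like the sketch below. The table names, columns, and the load_date partition are hypothetical; the only point is that current_timestamp is evaluated once per query (Hive 1.2.0 and later):

```sql
-- All rows written by this statement share one created_at value,
-- because current_timestamp is fixed at the start of query evaluation.
insert overwrite table target_table partition (load_date = '2020-01-01')
select s.id,
       s.payload,
       current_timestamp as created_at
from source_table s;
```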

Having both a Created and Last Updated timestamp columns in MySQL 4.0

From the MySQL 5.5 documentation:

One TIMESTAMP column in a table can have the current timestamp as the default value for initializing the column, as the auto-update value, or both. It is not possible to have the current timestamp be the default value for one column and the auto-update value for another column.

Changes in MySQL 5.6.5:

Previously, at most one TIMESTAMP column per table could be automatically initialized or updated to the current date and time. This restriction has been lifted. Any TIMESTAMP column definition can have any combination of DEFAULT CURRENT_TIMESTAMP and ON UPDATE CURRENT_TIMESTAMP clauses. In addition, these clauses now can be used with DATETIME column definitions. For more information, see Automatic Initialization and Updating for TIMESTAMP and DATETIME.
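For MySQL 5.6.5 and later, the lifted restriction means a table along the following lines should be accepted. The table name example_events and its columns are hypothetical:

```sql
-- Requires MySQL 5.6.5+; earlier versions reject two TIMESTAMP columns
-- that both use CURRENT_TIMESTAMP in DEFAULT or ON UPDATE clauses.
CREATE TABLE example_events (
  id         INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  payload    VARCHAR(255),
  created_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
  updated_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
                       ON UPDATE CURRENT_TIMESTAMP
);
```

Here created_at is set once on insert, while updated_at is also refreshed whenever the row is changed by an UPDATE.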

Get full data view for two tables in Hive?

You can use UNION ALL:

select tr_id, res_id, info_json, created_at, updated_at, src
from
(
  select tr_id, res_id, info_json, created_at, updated_at, 'NoArch' as src
  from Table2NoArch

  union all

  select tr_id, res_id, info_json, null as created_at, null as updated_at, 'Arch' as src
  from Table1Arch
) u
where res_id in (111,333,444)

created_at and updated_at are absent from Table1Arch, so NULLs are selected for them; you can use current_timestamp or current_date instead.

A src column is added so you can easily find out the source of each row.
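If the untyped NULLs cause a type mismatch between the two branches, one possible variant is to cast them explicitly. This sketch assumes created_at and updated_at are TIMESTAMP columns in Table2NoArch, which the original question does not state:

```sql
select tr_id, res_id, info_json, created_at, updated_at, src
from
(
  select tr_id, res_id, info_json, created_at, updated_at, 'NoArch' as src
  from Table2NoArch

  union all

  -- Assumption: the timestamp columns are of type TIMESTAMP in Table2NoArch.
  -- Replace the casts with current_timestamp to fill in a value instead of NULL.
  select tr_id, res_id, info_json,
         cast(null as timestamp) as created_at,
         cast(null as timestamp) as updated_at,
         'Arch' as src
  from Table1Arch
) u
where res_id in (111,333,444);
```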

Union two tables having unix_timestamp() function in both

From the documentation, unix_timestamp(): "Gets current Unix timestamp in seconds. This function is non-deterministic and prevents proper optimization of queries - this has been deprecated since 2.0 in favour of CURRENT_TIMESTAMP."

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

insert overwrite table TableC 
select field1,field2, unix_timestamp(current_timestamp) as field3 from table_A
UNION
select field1,field2, unix_timestamp(current_timestamp) as field3 from table_B

Additional work-arounds

insert overwrite table TableC
select field1, field2, unix_timestamp() as field3
from (
  select field1, field2 from table_A
  union all
  select field1, field2 from table_B
) t
group by field1, field2

or

insert overwrite table TableC
select field1, field2, unix_timestamp() as field3
from (
  select field1, field2 from table_A
  union
  select field1, field2 from table_B
) t

Not able to reference Hive date variable in later set statements

Hive variable substitution is simple text replacement. This statement:

set my_date=select to_date(date_sub(last_day(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')),1));

assigns the string 'select to_date(date_sub(last_day(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')),1))' to the variable my_date. Unfortunately, Hive does not evaluate variables, so your final statement will be resolved as

select date_format('select to_date(date_sub(last_day(FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd')),1))','yyyy-MM');

And this is not a valid select statement.

You can calculate the variable in a separate script and pass it to another script using the shell, like in this answer. See also https://stackoverflow.com/a/56450129/2700344

You can print the variable inside the Hive script using the shell echo command:

! echo my_date contains '${hiveconf:my_date}';

Also, do not use FROM_UNIXTIME(UNIX_TIMESTAMP(),'yyyy-MM-dd'). Use current_date() instead; see this answer for more details: https://stackoverflow.com/a/41140298/2700344.
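One way to sidestep the variable altogether is to inline the expression where the value is needed. This is a sketch of that idea, not the original poster's script; it assumes a Hive version where last_day and date_format are available (1.2.0 or later):

```sql
-- The date expression is evaluated by Hive itself, because it is part of
-- the query rather than text stored in a variable.
select date_format(date_sub(last_day(current_date), 1), 'yyyy-MM');
```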


