How to See Contents of Hive ORC Files in Linux

How to see contents of Hive ORC files in Linux

Updated answer (2020):

Per @Owen's answer, ORC has grown up and matured as its own Apache project. A complete list of ORC Adopters shows how widely it is now supported across many Big Data technologies.

Credit to @Owen and the Apache ORC project team: the ORC project site has fully maintained, up-to-date documentation on using either the Java or C++ standalone tools on ORC files stored on a local Linux file system, carrying the torch for the original Hive+ORC Apache wiki page.
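For orientation, here is a minimal sketch of those standalone tools; the jar version, tool availability, and the file path are assumptions you would adapt to your own install:

# Java tool shipped by the Apache ORC project (version number is a placeholder)
java -jar orc-tools-X.Y.Z-uber.jar meta /tmp/my_file.orc   # print file metadata
java -jar orc-tools-X.Y.Z-uber.jar data /tmp/my_file.orc   # print rows as JSON

# C++ tools, if built/installed from the ORC project
orc-metadata /tmp/my_file.orc
orc-contents /tmp/my_file.orc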

Original answer dated: May 30 '14 at 16:27

The ORC file dump utility comes with Hive (0.11 or higher):

hive --orcfiledump <hdfs-location-of-orc-file>


On HDFS, I want to display normal text for a Hive table stored in ORC format

From the Linux shell, there is a utility called hive --orcfiledump

To see the metadata of an ORC file in HDFS, you can invoke the command like this:

[@localhost ~ ]$ hive --orcfiledump <path to HDFS ORC file>;

To see the content of your ORC file in plain text, you can invoke the command with the -d option:

[@localhost ~ ]$ hive --orcfiledump -d <path to HDFS ORC file>;
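Depending on your Hive version, orcfiledump accepts a few more flags (they are listed on the Hive LanguageManual ORC wiki page linked later in this post); a brief sketch, assuming a reasonably recent Hive:

# metadata printed as pretty JSON instead of plain text
hive --orcfiledump -j -p <path to HDFS ORC file>

# also print the row indexes for the given comma-separated column ids
hive --orcfiledump --rowindex 1,2 <path to HDFS ORC file>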

As an example:

hive> DESCRIBE FORMATTED orders_orc;
Location: hdfs://localhost:8020/user/hive/warehouse/training_retail.db/orders_orc
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat

hive> exit;

[@localhost ~ ]$ hdfs dfs -ls /user/hive/warehouse/training_retail.db/orders_orc
Found 1 items
-rwxrwxrwx 1 training hive 163094 2020-04-20 09:39 /user/hive/warehouse/training_retail.db/orders_orc/000000_0

[@localhost ~ ]$ hdfs dfs -tail /user/hive/warehouse/training_retail.db/orders_orc/000000_0
��+"%ў�.�b.����8V$tߗ��\|�?�xM��
*�ڣ�������!�2���_���Ͳ�V���
r�E(����~�uM�/&��&x=-�&2�T��o��JD���Q��m5��#���8Iqe����A�^�ێ"���@�t�w�m�A ���3|�����NL�Q����p�d�#:}S-D�Wq�_"����

[@localhost ~ ]$ hive --orcfiledump /user/hive/warehouse/training_retail.db/orders_orc/000000_0;
Structure for /user/hive/warehouse/training_retail.db/orders_orc/000000_0
File Version: 0.12 with HIVE_8732
20/04/20 10:19:58 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse/training_retail.db/orders_orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 68883
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:int,_col3:string>
....
File length: 163094 bytes
Padding length: 0 bytes
Padding ratio: 0%

[@localhost ~ ]$ hive --orcfiledump -d /user/hive/warehouse/training_retail.db/orders_orc/000000_0 | head -n 5
{"_col0":1,"_col1":"2013-07-25 00:00:00.0","_col2":11599,"_col3":"CLOSED"}
{"_col0":2,"_col1":"2013-07-25 00:00:00.0","_col2":256,"_col3":"PENDING_PAYMENT"}
{"_col0":3,"_col1":"2013-07-25 00:00:00.0","_col2":12111,"_col3":"COMPLETE"}
{"_col0":4,"_col1":"2013-07-25 00:00:00.0","_col2":8827,"_col3":"CLOSED"}
{"_col0":5,"_col1":"2013-07-25 00:00:00.0","_col2":11318,"_col3":"COMPLETE"}


Is it possible to read an OrcSerde Hadoop file from the command line?

Looks like you'll have to dump it from Hive. This appears to be a duplicate of the question above: How to see contents of Hive ORC files in Linux.

Running hive --orcfiledump <location-of-orc-file>, or something similar depending on your Hive version, should do it. See https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC

How to get the schema (columns and their types) of ORC files stored in HDFS?

I found a way to get the details (column names, data types) I wanted via Spark:

data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`");
data.printSchema()

This prints output in the format below, which is exactly the information I want to extract from the ORC files on HDFS:

root
|-- <column_name1>: <datatype> (nullable = <true/false>)
|-- <column_name2>: <datatype> (nullable = <true/false>)
|-- <column_name3>: <datatype> (nullable = <true/false>)
|-- <column_name4>: <datatype> (nullable = <true/false>)
|-- <column_name5>: <datatype> (nullable = <true/false>)
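If you only need the column types and do not want to start a Spark session at all, the orcfiledump metadata shown earlier in this post also contains the schema in its Type: line; a small shell sketch (the path is a placeholder):

# the Type: line shows the struct of columns and their types
hive --orcfiledump <path to HDFS ORC file> 2>/dev/null | grep 'Type:'

Note that, as in the dump shown earlier, ORC files written by older Hive versions may carry generic _colN names in the file schema rather than the real column names.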

How to update a Hive table's data after copying ORC files with hdfs into that table's folder

If the table is not partitioned, then once the files are in the HDFS folder specified in the LOCATION clause, the data should be available for querying.

If the table is partitioned, then you first need to run an ADD PARTITION statement, as sketched below.
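For example (table name, partition column, and paths here are made-up placeholders), after copying ORC files into a new partition directory you can register that partition explicitly, or let Hive discover all partition directories under the table location:

-- register one specific partition whose files were copied in
ALTER TABLE my_orc_table ADD PARTITION (dt='2020-04-20')
LOCATION '/user/hive/warehouse/mydb.db/my_orc_table/dt=2020-04-20';

-- or scan the table location and add every partition directory found
MSCK REPAIR TABLE my_orc_table;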

Create Hive table from a file stored in HDFS in ORC format

You cannot do that in only one step. The CREATE TABLE statement doesn't process the data; it just specifies the format and the location.

My suggestion is to create a temporary staging table with a "STORED AS TEXTFILE" create statement, and to create the final table using ORC as the storage format (with an empty location).

Then insert into the ORC table all the rows from the temporary text table:

INSERT [OVERWRITE] TABLE orcTable SELECT col1, col2 FROM textTable;

INSERT OVERWRITE will replace all the data in the table with the new data. If you only want to append new data, use INSERT INTO TABLE instead.

After the import you can drop the temporary text table. A full sketch of the flow is shown below.
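A sketch of the whole flow, with made-up table and column names (adjust the delimiter and types to your file):

-- 1. staging table over the text file already sitting in HDFS
CREATE EXTERNAL TABLE my_table_text (
  id     INT,
  name   STRING,
  amount DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging/my_table_text';

-- 2. final table stored as ORC (Hive manages an empty location for it)
CREATE TABLE my_table_orc (
  id     INT,
  name   STRING,
  amount DOUBLE
)
STORED AS ORC;

-- 3. convert: read the text rows and rewrite them as ORC
INSERT OVERWRITE TABLE my_table_orc
SELECT id, name, amount FROM my_table_text;

-- 4. drop the staging table (EXTERNAL, so the original file is kept)
DROP TABLE my_table_text;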

Partitioned ORC table shows up empty in Hive

You just need to update the partition info on the table so Hive can list the partitions present. This is done with the MSCK REPAIR TABLE command:

spark.sql("MSCK REPAIR TABLE <tableName>")

More info on this command is in the Hive DDL documentation (Recover Partitions / MSCK REPAIR TABLE).

A quick example:

scala> spark.sql("select * from table").show
20/03/28 17:12:46 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------+------+
|column|LOADED|
+------+------+
+------+------+

scala> spark.sql("MSCK REPAIR TABLE table")


scala> spark.sql("select * from table").show
+------+----------+
|column| LOADED|
+------+----------+
| a|2019-11-16|
| c|2019-11-16|
| b|2019-11-17|
+------+----------+
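The same repair can of course be run from the Hive CLI or beeline instead of Spark; for example (with <tableName> as a placeholder):

hive -e "MSCK REPAIR TABLE <tableName>;"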

Sorted Table in Hive (ORC file format)

Bucketed tables are an outdated concept.

You do not need to write CLUSTERED BY in the table DDL.

When loading the table, use DISTRIBUTE BY partition key to reduce pressure on the reducers. This matters especially when writing ORC, which requires intermediate buffers for building the files; if each reducer writes many partitions, this may cause an OOM exception.

When the table is big, you can limit the maximum file size using hive.exec.reducers.bytes.per.reducer, like this:

set hive.exec.reducers.bytes.per.reducer=67108864; -- 64MB per reducer, or even less

If you have more data, more reducers will be started and more files created. This is more flexible than loading a fixed number of buckets.

This also works better for small tables, because you are not forced to split them into many small bucket files.

ORC has internal indexes and bloom filters. Applying SORT BY, you can improve index and bloom filter efficiency because similar data is stored together. This can also improve compression, depending on your data's entropy.
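If you also want ORC to build bloom filters for the columns you sort and filter by, they are enabled through table properties at create time; a sketch reusing the trades table from the insert statement below (the property values are illustrative):

-- row-group indexes are on by default; bloom filters go on the columns you filter by
CREATE TABLE trades (
  trade_id      BIGINT,
  name          STRING,
  contract_type STRING,
  ts            TIMESTAMP
)
PARTITIONED BY (dt STRING)
STORED AS ORC
TBLPROPERTIES (
  'orc.compress'='ZLIB',
  'orc.create.index'='true',
  'orc.bloom.filter.columns'='contract_type',
  'orc.bloom.filter.fpp'='0.05'
);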

If distributing by the partition key is not enough because you have some data skew and the data is big, you can additionally distribute by something random. It is better to distribute by a column if your data is evenly distributed; if not, distribute by random to avoid the single long-running reducer problem.

Finally, your insert statement may look like this:

set hive.exec.reducers.bytes.per.reducer=33554432; -- 32MB per reducer

INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt,                    -- partition key is a must for big data
              trade_id,              -- some other key if the data is too big and the key is
                                     -- evenly distributed (no skew)
              FLOOR(RAND()*100.0)%20 -- random, to additionally distribute into 20 equal parts
SORT BY contract_type;               -- sort data if you want filtering by this key
                                     -- to work better using the internal index

Do not use CLUSTERED BY in the table DDL: using DISTRIBUTE BY plus SORT during insert, together with ORC's indexes and bloom filters, you can achieve the same thing in a more flexible way.

Distribute + sort can reduce the size of ORC files dramatically, by 3x or 4x. Similar data compresses better and makes the internal indexes more efficient.

Read also this: https://stackoverflow.com/a/55375261/2700344
This is a related answer about sorting: https://stackoverflow.com/a/47416027/2700344

The only case where you would use CLUSTERED BY in the table DDL is when you are joining two big tables that can be bucketed into exactly the same number of buckets on the join key, so that a sort-merge-bucket map join can be used; in practice it is quite rare that you can bucket two big tables in the same way. Having only one bucket makes no sense, because for small tables you can use a map join; just sort the data during insert to reduce the compressed data size. A sketch of that bucketed setup follows below.
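For completeness, here is what that rare bucketed setup can look like (table names, join key, and bucket count are placeholders; the session settings needed vary a bit by Hive version):

CREATE TABLE big_a (k BIGINT, v STRING)
CLUSTERED BY (k) SORTED BY (k) INTO 64 BUCKETS
STORED AS ORC;

CREATE TABLE big_b (k BIGINT, w STRING)
CLUSTERED BY (k) SORTED BY (k) INTO 64 BUCKETS
STORED AS ORC;

-- settings commonly used to get a sort-merge-bucket map join
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;

-- with both tables bucketed and sorted the same way on the join key,
-- the join can proceed bucket by bucket without a full shuffle
SELECT a.k, a.v, b.w
FROM big_a a
JOIN big_b b ON a.k = b.k;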


