How to see contents of Hive ORC files in Linux
Updated answer in year 2020:
Per @Owen's answer, ORC has grown up and matured as its own Apache project. A complete list of ORC Adopters shows how widely it is now supported across many Big Data technologies.
Credit to @Owen and the Apache ORC project team: ORC's project site has fully maintained, up-to-date documentation on using either the Java or C++ standalone tool on ORC files stored on a Linux local file system, carrying the torch forward from the original Hive+ORC Apache wiki page.
Original answer dated: May 30 '14 at 16:27
The ORC file dump utility comes with Hive (0.11 or higher):
hive --orcfiledump <hdfs-location-of-orc-file>
On HDFS, I want to display normal text for a hive table stored in ORC format
From the Linux shell there is a utility called hive --orcfiledump. To see the metadata of an ORC file in HDFS you can invoke the command like:
[@localhost ~ ]$ hive --orcfiledump <path to HDFS ORC file>;
To see the content of your ORC file in plain text you can invoke the command with the -d option:
[@localhost ~ ]$ hive --orcfiledump -d <path to HDFS ORC file>;
As an example:
hive> DESCRIBE FORMATTED orders_orc;
Location: hdfs://localhost:8020/user/hive/warehouse/training_retail.db/orders_orc
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
hive> exit;
[@localhost ~ ]$ hdfs dfs -ls /user/hive/warehouse/training_retail.db/orders_orc
Found 1 items
-rwxrwxrwx 1 training hive 163094 2020-04-20 09:39 /user/hive/warehouse/training_retail.db/orders_orc/000000_0
[@localhost ~ ]$ hdfs dfs -tail /user/hive/warehouse/training_retail.db/orders_orc/000000_0
��+"%ў�.�b.����8V$tߗ��\|�?�xM��
*�ڣ�������!�2���_���Ͳ�V���
r�E(����~�uM�/&��&x=-�&2�T��o��JD���Q��m5��#���8Iqe����A�^�ێ"���@�t�w�m�A ���3|�����NL�Q����p�d�#:}S-D�Wq�_"����
[@localhost ~ ]$ hive --orcfiledump /user/hive/warehouse/training_retail.db/orders_orc/000000_0;
Structure for /user/hive/warehouse/training_retail.db/orders_orc/000000_0
File Version: 0.12 with HIVE_8732
20/04/20 10:19:58 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse/training_retail.db/orders_orc/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
Rows: 68883
Compression: ZLIB
Compression size: 262144
Type: struct<_col0:int,_col1:string,_col2:int,_col3:string>
....
File length: 163094 bytes
Padding length: 0 bytes
Padding ratio: 0%
[@localhost ~ ]$ hive --orcfiledump -d /user/hive/warehouse/training_retail.db/orders_orc/000000_0 | head -n 5
{"_col0":1,"_col1":"2013-07-25 00:00:00.0","_col2":11599,"_col3":"CLOSED"}
{"_col0":2,"_col1":"2013-07-25 00:00:00.0","_col2":256,"_col3":"PENDING_PAYMENT"}
{"_col0":3,"_col1":"2013-07-25 00:00:00.0","_col2":12111,"_col3":"COMPLETE"}
{"_col0":4,"_col1":"2013-07-25 00:00:00.0","_col2":8827,"_col3":"CLOSED"}
{"_col0":5,"_col1":"2013-07-25 00:00:00.0","_col2":11318,"_col3":"COMPLETE"}
You can follow this link for details:
How to see contents of Hive orc files in linux
Is it possible to read an OrcSerde Hadoop file from the command line?
Looks like you'll have to dump it from Hive. This appears to be a duplicate of this question - How to see contents of Hive orc files in linux
The command hive --orcfiledump <location-of-orc-file> (or something similar, depending on your Hive version) should do it. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+ORC
How to get the schema (columns and their types) of ORC files stored in HDFS?
Found a way to get the details (column names, data types) I wanted via Spark
data = sqlContext.sql("SELECT * FROM orc.`<HDFS_path>`");
data.printSchema()
This will print the output in the below format, which is exactly the information I want to extract from the ORC files on HDFS:
root
|-- <column_name1>: <datatype> (nullable = <true/false>)
|-- <column_name2>: <datatype> (nullable = <true/false>)
|-- <column_name3>: <datatype> (nullable = <true/false>)
|-- <column_name4>: <datatype> (nullable = <true/false>)
|-- <column_name5>: <datatype> (nullable = <true/false>)
How to update a Hive table's data after copying ORC files into that table's folder with HDFS
If the table is not partitioned, then once the files are in HDFS in the folder specified in the LOCATION clause, the data should be available for querying.
If the table is partitioned, then you first need to run an ADD PARTITION statement.
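As a sketch (table name, partition column and paths are hypothetical), the copied-in directory can be registered explicitly, or Hive can discover all partitions at once:

```sql
-- Register one copied-in directory as a partition
ALTER TABLE sales ADD IF NOT EXISTS PARTITION (dt='2020-04-20')
LOCATION '/user/hive/warehouse/sales/dt=2020-04-20';

-- Or let Hive discover every partition directory under the table location
MSCK REPAIR TABLE sales;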
Create hive table from file stored in hdfs in orc format
You cannot do that in a single step. The CREATE TABLE statement doesn't process the data; it just specifies the format and the location.
My suggestion is that you create a temporary table using a "STORED AS TEXTFILE" create statement, and create the final table using ORC as the storage format (with an empty location).
Then insert into the "ORC table" all the rows from the temporary "text table":
INSERT [OVERWRITE] TABLE orcTable SELECT col1, col2 FROM textTable;
INSERT OVERWRITE will replace all the data in the table with the new data. If you only want to add new data, use "INSERT INTO TABLE . . ." instead.
After the import you can delete the temporary "text table".
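Putting the steps together, a minimal HiveQL sketch (table names, columns and the staging path are hypothetical):

```sql
-- 1. Staging table over the raw delimited file already in HDFS
CREATE EXTERNAL TABLE orders_text (order_id INT, order_status STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/hive/staging/orders';

-- 2. Final table stored as ORC (empty location, managed by Hive)
CREATE TABLE orders_orc (order_id INT, order_status STRING)
STORED AS ORC;

-- 3. Rewrite the text rows as ORC
INSERT OVERWRITE TABLE orders_orc
SELECT order_id, order_status FROM orders_text;

-- 4. Drop the staging table (EXTERNAL, so the raw files stay in place)
DROP TABLE orders_text;
```

Declaring the staging table EXTERNAL means dropping it removes only the metadata, not the source files.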
Partitioned ORC table shows up empty in Hive
You just need to update the partition info on the table so Hive can list the partitions present. This is done through the MSCK REPAIR command:
spark.sql("MSCK REPAIR TABLE <tableName>")
A quick example:
scala> spark.sql("select * from table").show
20/03/28 17:12:46 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
+------+------+
|column|LOADED|
+------+------+
+------+------+
scala> spark.sql("MSCK REPAIR TABLE table")
scala> spark.sql("select * from table").show
+------+----------+
|column| LOADED|
+------+----------+
| a|2019-11-16|
| c|2019-11-16|
| b|2019-11-17|
+------+----------+
Sorted Table in Hive (ORC file format)
Bucketed tables are an outdated concept. You do not need to write CLUSTERED BY in the table DDL.
When loading the table, use DISTRIBUTE BY on the partition key to reduce pressure on the reducers, especially when writing ORC, which requires intermediate buffers for building the file; if each reducer loads many partitions, this may cause an OOM exception.
When the table is big, you can limit the maximum file size using bytes.per.reducer, like this:
set hive.exec.reducers.bytes.per.reducer=67108864; --or even less
If you have more data, more reducers will be started and more files created. This is more flexible than loading a fixed number of buckets, and it works better for small tables because you do not need to create unnecessarily small buckets.
ORC has internal indexes and bloom filters. By applying SORT BY you can improve index and bloom filter efficiency, because similar data is stored together. This can also improve compression, depending on your data entropy.
If distributing by the partition key is not enough because of data skew and the data is big, you can additionally distribute by a random value. Distributing by a column is better if your data is evenly distributed; if not, distribute by random to avoid the single long-running-reducer problem.
Finally your insert statement may look like this:
set hive.exec.reducers.bytes.per.reducer=33554432; --32Mb per reducer
INSERT OVERWRITE TABLE trades PARTITION (dt)
SELECT trade_id, name, contract_type, ts, dt
FROM raw_trades
DISTRIBUTE BY dt, --partition key is a must for big data
trade_id, --some other key if the data is too big and key is
--evenly distributed (no skew)
FLOOR(RAND()*100.0)%20 --random to distribute additionally on 20 equal parts
SORT BY contract_type; --sort data if you want filtering by this key
--to work better using internal index
Do not use CLUSTERED BY in the table DDL: using DISTRIBUTE BY, ORC with indexes and bloom filters, plus SORT BY during insert, you can achieve the same thing in a more flexible way.
Distribute + sort can reduce the size of ORC files dramatically, by 3x or 4x. Similar data compresses better and makes the internal indexes more efficient.
Read also this: https://stackoverflow.com/a/55375261/2700344
This is a related answer about sorting: https://stackoverflow.com/a/47416027/2700344
The only case when you can use CLUSTERED BY in the table DDL is when you are joining two big tables which can be bucketed by exactly the same number of buckets, so that a sort-merge-bucket map join can be used; but in practice it is rare that you can bucket two big tables in the same way. Having only 1 bucket makes no sense, because for small tables you can use a map join; just sort the data during insert to reduce the compressed data size.
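For completeness, if you do bucket two big tables identically to get a sort-merge-bucket map join, the session settings involved look roughly like this (a sketch; exact option names and defaults vary by Hive version):

```sql
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.auto.convert.sortmerge.join=true;
-- Both tables must be CLUSTERED BY the join key INTO the same number of
-- buckets and SORTED BY that key for the SMB join to apply.
```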