How to view an Apache Parquet file in Windows?
What is Apache Parquet?
Apache Parquet is a binary file format that stores data in a columnar fashion.
Data inside a Parquet file is similar to an RDBMS style table where you have columns and rows. But instead of accessing the data one row at a time, you typically access it one column at a time.
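To make the row-versus-column distinction concrete, here is a toy sketch in plain Python. It only illustrates the access pattern; it is not how Parquet actually encodes data, and the helper name to_columns is made up for this example. The sample records are the same ones that appear in the parquet-tools output further down.

```python
# Toy illustration only: plain Python lists, not the real Parquet encoding.
rows = [
    {"name": "gil", "age": 48, "city": "london"},
    {"name": "jane", "age": 30, "city": "new york"},
    {"name": "jordan", "age": 18, "city": "toronto"},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists (hypothetical helper)."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row-oriented access: every record is visited to pull out one field.
ages_from_rows = [record["age"] for record in rows]

# Column-oriented access: one list is read, the other columns are never touched.
ages_from_columns = columns["age"]

average_age = sum(ages_from_columns) / len(ages_from_columns)
```

With a columnar layout, a query such as "average age" only needs the age column, which is the main reason the format suits analytical workloads.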
Apache Parquet is one of the modern big data storage formats. It has several advantages, some of which are:
- Columnar storage: efficient data retrieval, efficient compression, etc.
- Metadata at the end of the file: a Parquet file can be written in a single pass from a stream of data, which is common in big data scenarios.
- Supported by all Apache big data products
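The "metadata at the end of the file" point can be checked by hand. As far as I know, an unencrypted Parquet file starts and ends with the 4-byte marker PAR1, and the 4 bytes just before the trailing marker hold the length of the footer metadata as a little-endian integer. The sketch below relies on that layout; the function name read_footer_length is made up for this example, and it does not decode the metadata itself.

```python
import os
import struct

PARQUET_MAGIC = b"PAR1"


def read_footer_length(path):
    """Return the footer metadata length stored at the end of a Parquet file.

    Assumes the plain (unencrypted footer) layout:
    PAR1 ... <metadata> <4-byte little-endian length> PAR1
    """
    # 4 bytes leading magic + 4 bytes length + 4 bytes trailing magic.
    if os.path.getsize(path) < 12:
        raise ValueError("file is too small to be a Parquet file")
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-8, os.SEEK_END)  # jump straight to the last 8 bytes
        tail = f.read(8)
    if head != PARQUET_MAGIC or tail[4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic bytes")
    (footer_length,) = struct.unpack("<I", tail[:4])
    return footer_length
```

Only the first 4 and last 8 bytes are read, so this also works as a quick sanity check on very large files.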
Do I need Hadoop or HDFS?
No. Parquet files can be stored in any file system, not just HDFS. As mentioned above, Parquet is a file format, so a Parquet file is just like any other file: it has a name and a .parquet extension. What usually happens in big data environments, though, is that one dataset is split (or partitioned) into multiple Parquet files for even more efficiency.
All Apache big data products support Parquet files by default, which is why it might seem like the format can only exist in the Apache ecosystem.
How can I create/read Parquet Files?
As mentioned, all current Apache big data products such as Hadoop, Hive, Spark, etc. support Parquet files by default.
So it's possible to leverage these systems to generate or read Parquet data. But this is far from practical. Imagine that in order to read or create a CSV file you had to install Hadoop/HDFS + Hive and configure them. Luckily there are other solutions.
To create your own parquet files:
- In Java please see my following post: Generate Parquet File using Java
- In .NET please see the following library: parquet-dotnet
To view parquet file contents:
- Please try the following Windows utility: https://github.com/mukunku/ParquetViewer
Are there other methods?
Possibly, but not many exist and they mostly aren't well documented. This is due to Parquet being a very complicated file format (I could not even find a formal definition). The ones I've listed are the only ones I'm aware of as I'm writing this response.
Inspect Parquet from command line
You can use parquet-tools with the cat command and the --json option to view the files in JSON format without making a local copy.
Here is an example:
parquet-tools cat --json hdfs://localhost/tmp/save/part-r-00000-6a3ccfae-5eb9-4a88-8ce8-b11b2644d5de.gz.parquet
This prints out the data in JSON format:
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
Disclaimer: this was tested in Cloudera CDH 5.12.0
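Since the output is one JSON object per line, it is easy to post-process. A minimal Python sketch, assuming the output was captured as text (here the three sample lines above are pasted in as a string; the helper name parse_json_lines is made up for this example):

```python
import json


def parse_json_lines(text):
    """Parse newline-delimited JSON, skipping blank lines (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# The sample output shown above, pasted in as a string.
sample_output = """\
{"name":"gil","age":48,"city":"london"}
{"name":"jane","age":30,"city":"new york"}
{"name":"jordan","age":18,"city":"toronto"}
"""

records = parse_json_lines(sample_output)

# Example filter: names of everyone older than 25.
older_than_25 = [record["name"] for record in records if record["age"] > 25]
```

In practice the text could come from redirecting the parquet-tools output to a file and reading that file instead of the inline string.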
How to view a Parquet file in IntelliJ
The short answer is no, at least for now.
But if the reason you want to view Parquet tables in IntelliJ is that you want a GUI tool, I suggest Bigdata File Viewer.
It is a desktop application for viewing Parquet and other binary data formats such as ORC and Avro. It is a pure Java application, so it runs on Linux, Mac, and Windows.
It supports complex data types such as array and map.
Finding Parquet File Created with Apache Drill
Check your dfs plugin via the web host (xx.xx.xx.xx:8047/storage/dfs).
By default, the tmp directory is configured as:
"tmp": {
  "location": "/tmp",
  "writable": true,
  "defaultInputFormat": null
}
Your file will be at the following location (assuming you have not changed the tmp directory):
/tmp/Users/[username]/Desktop/drill_example_parquet
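The path above appears to be the tmp location joined with the table name used when the file was created. A small Python sketch of that lookup, assuming Drill resolves the table name relative to the configured location (the function name resolve_table_path is made up, and [username] is left as a placeholder):

```python
import json
import posixpath


def resolve_table_path(workspace_config, table_name):
    """Join the workspace location with a table name (hypothetical helper).

    Assumption: the table name is resolved relative to "location",
    which is what the path shown above suggests.
    """
    location = workspace_config["location"]
    return posixpath.join(location, table_name.lstrip("/"))


# The tmp entry from the dfs plugin configuration shown above.
tmp_config = json.loads(
    '{"location": "/tmp", "writable": true, "defaultInputFormat": null}'
)

table_path = resolve_table_path(
    tmp_config, "/Users/[username]/Desktop/drill_example_parquet"
)
```

If the location in your dfs plugin has been changed, substitute that value and the result should point at the directory to look in.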