How can I efficiently read multiple JSON files into a DataFrame or JavaRDD?
You can use exactly the same code to read multiple JSON files. Just pass a path to a directory, or a path with wildcards, instead of the path to a single file.
DataFrameReader also provides a json method with the following signature:
json(jsonRDD: JavaRDD[String])
which can be used to parse JSON already loaded into a JavaRDD.
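A minimal sketch of both options, assuming the Spark 1.4 to 1.6 Java API used elsewhere on this page (DataFrame, SQLContext) and input files that hold one JSON object per line. The class name, the method names, and the command-line argument are placeholders for illustration, not part of the original answer.

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

final class ReadManyJsonFiles {

    // Reads every file matched by a directory path or a wildcard pattern into one DataFrame.
    static DataFrame fromPath(SQLContext sqlContext, String pathOrPattern) {
        return sqlContext.read().json(pathOrPattern);
    }

    // Parses JSON that is already held in an RDD, one JSON object per element.
    static DataFrame fromRdd(SQLContext sqlContext, JavaRDD<String> jsonLines) {
        return sqlContext.read().json(jsonLines);
    }

    public static void main(String[] args) {
        // args[0] is a hypothetical input location: a directory or a pattern such as "dir/*.json".
        String pathOrPattern = args[0];
        JavaSparkContext sc = new JavaSparkContext("local[*]", "read-many-json-files");
        try {
            SQLContext sqlContext = new SQLContext(sc);
            fromPath(sqlContext, pathOrPattern).printSchema();
            // textFile accepts the same kinds of paths and yields one element per line.
            fromRdd(sqlContext, sc.textFile(pathOrPattern)).printSchema();
        } finally {
            sc.stop();
        }
    }
}
```

Both calls should infer the same schema for the same input; the RDD variant is mainly useful when the lines need filtering or cleaning before they are parsed.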
Parse Json Object with an Array and Map to Multiple Pairs with Apache Spark in Java
If you load the JSON data into a DataFrame:
DataFrame df = sqlContext.read().json("/path/to/json");
you can do this easily with explode:
df.select(
df.col("device_id"),
df.col("timestamp"),
org.apache.spark.sql.functions.explode(df.col("rooms")).as("room")
);
For input:
{"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}
You will get:
+---------+----------+------+
|device_id| timestamp|  room|
+---------+----------+------+
|        1|1436941050|Office|
|        1|1436941050| Foyer|
|        2|1435677490|Office|
|        2|1435677490|   Lab|
|        3|1436673850|Office|
|        3|1436673850| Foyer|
+---------+----------+------+
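To make the row expansion concrete, here is a plain-Java sketch with no Spark dependency that mimics what explode does to each input record: the scalar columns are repeated once per array element. The class and method names are invented for illustration, and this shows the semantics only, not how Spark implements explode.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExplodeSketch {

    // One output row: the scalar columns plus a single element of the "rooms" array.
    static final class Row {
        final String deviceId;
        final long timestamp;
        final String room;

        Row(String deviceId, long timestamp, String room) {
            this.deviceId = deviceId;
            this.timestamp = timestamp;
            this.room = room;
        }

        @Override
        public String toString() {
            return deviceId + "," + timestamp + "," + room;
        }
    }

    // Emits one Row per element of rooms, repeating deviceId and timestamp on each.
    // An empty list produces no rows at all.
    static List<Row> explode(String deviceId, long timestamp, List<String> rooms) {
        List<Row> out = new ArrayList<>();
        for (String room : rooms) {
            out.add(new Row(deviceId, timestamp, room));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.addAll(explode("1", 1436941050L, Arrays.asList("Office", "Foyer")));
        rows.addAll(explode("2", 1435677490L, Arrays.asList("Office", "Lab")));
        for (Row row : rows) {
            System.out.println(row);
        }
    }
}
```

Note that a record with an empty rooms array contributes no output rows in this sketch, which is also how Spark's explode treats empty arrays.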
How to read multiple text files into a single RDD?
You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. E.g.:
sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")
As Nick Chammas points out, this is an exposure of Hadoop's FileInputFormat, and therefore it also works with Hadoop (and Scalding).
Parsing multiple files into SparkRDD
Why do you want a JavaPairRDD<Integer, List<UserActivity>>? Don't you think that a JavaPairRDD<Integer, UserActivity> would be enough? I think it would let you avoid many problems later on.
If you want to transform one JavaPairRDD into another JavaPairRDD, you can use a map transformation (mapToPair in the Java API).
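As a rough sketch of such a transformation, the snippet below uses mapToPair to turn one JavaPairRDD into another while keeping the keys. The question does not show UserActivity, so a hypothetical stand-in class with a single field is used here; the class names, the method name, and the sample data are all assumptions made for illustration.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

final class PairRddMapSketch {

    // Hypothetical stand-in for the UserActivity class from the question.
    static final class UserActivity implements Serializable {
        final String action;

        UserActivity(String action) {
            this.action = action;
        }
    }

    // Maps (userId, activity) pairs to (userId, action name) pairs; keys are unchanged.
    static JavaPairRDD<Integer, String> toActionNames(JavaPairRDD<Integer, UserActivity> activities) {
        return activities.mapToPair(
                pair -> new Tuple2<Integer, String>(pair._1(), pair._2().action));
    }

    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext("local[*]", "pair-rdd-map-sketch");
        try {
            // Flat pairs: the same key may appear several times, once per activity.
            List<Tuple2<Integer, UserActivity>> sample = new ArrayList<>();
            sample.add(new Tuple2<Integer, UserActivity>(1, new UserActivity("login")));
            sample.add(new Tuple2<Integer, UserActivity>(1, new UserActivity("search")));
            sample.add(new Tuple2<Integer, UserActivity>(2, new UserActivity("logout")));

            JavaPairRDD<Integer, UserActivity> activities = sc.parallelizePairs(sample);
            for (Tuple2<Integer, String> pair : toActionNames(activities).collect()) {
                System.out.println(pair._1() + " -> " + pair._2());
            }
        } finally {
            sc.stop();
        }
    }
}
```

If a per-key collection really is needed later, calling groupByKey() on the flat pair RDD yields a JavaPairRDD<Integer, Iterable<UserActivity>>, so the flat form does not rule that out.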