How to Efficiently Read Multiple JSON Files into a DataFrame or JavaRDD

How can I efficiently read multiple JSON files into a DataFrame or JavaRDD?

You can use exactly the same code to read multiple JSON files. Just pass a path to a directory, or a path containing wildcards, instead of a path to a single file.

DataFrameReader also provides a json method with the following signature:

json(jsonRDD: JavaRDD[String])

which can be used to parse JSON that has already been loaded into a JavaRDD.
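
For example, here is a minimal Java sketch assuming the Spark 1.x API (SQLContext / DataFrame); the paths are placeholders for your own data:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class ReadManyJson {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("read-many-json"));
        SQLContext sqlContext = new SQLContext(sc);

        // One DataFrame built from every file matching the wildcard pattern.
        DataFrame fromFiles = sqlContext.read().json("/data/events/*.json");

        // The same, but starting from JSON lines already loaded into a JavaRDD<String>.
        JavaRDD<String> rawLines = sc.textFile("/data/events/");
        DataFrame fromRdd = sqlContext.read().json(rawLines);

        fromFiles.printSchema();
        fromRdd.show();

        sc.stop();
    }
}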

Parse JSON Object with an Array and Map to Multiple Pairs with Apache Spark in Java

If you load the JSON data into a DataFrame:

DataFrame df = sqlContext.read().json("/path/to/json");

you can easily do this with explode:

// Explode the rooms array so each room gets its own row.
DataFrame exploded = df.select(
    df.col("device_id"),
    df.col("timestamp"),
    org.apache.spark.sql.functions.explode(df.col("rooms")).as("room")
);

For input:

{"device_id": "1", "timestamp": 1436941050, "rooms": ["Office", "Foyer"]}
{"device_id": "2", "timestamp": 1435677490, "rooms": ["Office", "Lab"]}
{"device_id": "3", "timestamp": 1436673850, "rooms": ["Office", "Foyer"]}

You will get:

+---------+------+----------+
|device_id| room| timestamp|
+---------+------+----------+
| 1|Office|1436941050|
| 1| Foyer|1436941050|
| 2|Office|1435677490|
| 2| Lab|1435677490|
| 3|Office|1436673850|
| 3| Foyer|1436673850|
+---------+------+----------+
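
If what you ultimately need are key/value pairs rather than a DataFrame, the exploded rows can then be turned into a JavaPairRDD. This is only a minimal sketch, assuming the exploded DataFrame from above; the (device_id, room) pairing is just an illustration:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.sql.Row;
import scala.Tuple2;

// One (device_id, room) pair per exploded row.
JavaPairRDD<String, String> deviceToRoom = exploded.javaRDD().mapToPair(
        new PairFunction<Row, String, String>() {
            @Override
            public Tuple2<String, String> call(Row row) {
                return new Tuple2<String, String>(
                        row.getString(row.fieldIndex("device_id")),
                        row.getString(row.fieldIndex("room")));
            }
        });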

How to read multiple text files into a single RDD?

You can specify whole directories, use wildcards, and even pass a comma-separated list of directories and wildcards. For example:

sc.textFile("/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file")

As Nick Chammas points out, this is an exposure of Hadoop's FileInputFormat, so the same path syntax also works with Hadoop (and Scalding).
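
The same works from the Java API. A minimal sketch, assuming a JavaSparkContext named jsc and placeholder paths:

// Every matching file is read into the one resulting RDD of lines.
JavaRDD<String> lines = jsc.textFile(
        "/my/dir1,/my/paths/part-00[0-5]*,/another/dir,/a/specific/file");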

Parsing multiple files into a Spark RDD

Why do you want a JavaPairRDD<Integer, List<UserActivity>>? Don't you think a JavaPairRDD<Integer, UserActivity> would be enough? I think it will allow you to avoid many problems later on.

If you want to transform a JavaPairRDD into another JavaPairRDD, you can use a map transformation; see this post.
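
For instance, here is a minimal sketch of turning one JavaPairRDD into another with mapValues; UserActivity here is a stand-in class and userActivities an assumed JavaPairRDD<Integer, UserActivity>, not the ones from your code:

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;

class UserActivity implements java.io.Serializable {
    String page;
    long durationMs;
}

// Keep the user id as the key and map each activity to the page it touched,
// producing a new JavaPairRDD<Integer, String>.
JavaPairRDD<Integer, String> userToPage = userActivities.mapValues(
        new Function<UserActivity, String>() {
            @Override
            public String call(UserActivity activity) {
                return activity.page;
            }
        });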


