Pyspark: Explode JSON in Column to Multiple Columns

Pyspark: explode json in column to multiple columns

As long as you are using Spark 2.1 or higher, pyspark.sql.functions.from_json should get you the desired result, but you first need to define the required schema:

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

schema = StructType(
    [
        StructField('key1', StringType(), True),
        StructField('key2', StringType(), True)
    ]
)

df.withColumn("data", from_json("data", schema))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()

which should give you

+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc| 6| 124| 345|
|df1| 7| 777| 888|
|4bd| 6| 111| 788|
+---+-----+----+----+
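If you would rather not write the schema by hand and are on Spark 2.4 or later, schema_of_json can infer it from a sample value. A minimal sketch, assuming every row's JSON has the same shape as the first row's:

from pyspark.sql.functions import from_json, schema_of_json, col

# Infer the schema from the first row's JSON string (Spark 2.4+);
# assumes all rows share that shape
sample = df.select('data').first()[0]

df.withColumn("data", from_json("data", schema_of_json(sample)))\
    .select(col('id'), col('point'), col('data.*'))\
    .show()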

PySpark Explode JSON String into Multiple Columns

If you can simplify the output as mentioned, you can define a simple JSON schema, convert the JSON string into an array of structs, and read each field:

Input

df = spark.createDataFrame([("[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]",1)], "col1:string, col2:int")

# +-----------------------------------------------------------------------------------------------------------------+----+
# |col1 |col2|
# +-----------------------------------------------------------------------------------------------------------------+----+
# |[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]|1 |
# +-----------------------------------------------------------------------------------------------------------------+----+

And this is the transformation:

from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.ArrayType(T.StructType([
    T.StructField('to', T.StringType()),
    T.StructField('position', T.StringType())
]))

(df
    .withColumn('temp', F.explode(F.from_json('col1', schema=schema)))
    .select(
        F.col('col2'),
        F.col('temp.to').alias('col3'),
        F.col('temp.position').alias('col4'),
    )
    .show()
)

# Output
# +----+------+-------+
# |col2| col3| col4|
# +----+------+-------+
# | 1| Sam| guard|
# | 1| John| center|
# | 1|Andrew|forward|
# +----+------+-------+
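If you want to skip the intermediate temp column, the SQL inline function can explode the parsed array straight into columns in one step. A sketch under the same column names as above:

# inline() explodes an array of structs into one column per struct field;
# the renames map those fields onto the col3/col4 names used above
(df
    .select(
        F.col('col2'),
        F.expr("inline(from_json(col1, 'array<struct<to:string,position:string>>'))"))
    .withColumnRenamed('to', 'col3')
    .withColumnRenamed('position', 'col4')
    .show())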

Explode multiple columns from nested JSON but it is giving extra records

You only need to explode the Data column; then you can select fields from the resulting struct column (Code, Id, ...). What duplicates the rows here is that you're exploding two arrays, Data.Code and Data.Id.

Try this instead:

import pyspark.sql.functions as F

df_Json.withColumn("Data", F.explode("Data")).select("Data.Code", "Data.Id").show()

#+----+------+
#|Code| Id|
#+----+------+
#| ABC|123456|
#| XYZ|987654|
#+----+------+

Or use the inline function directly on the Data array:

df_Json.selectExpr("inline(Data)").show()

#+----+----+------+----+
#|Code| Geo| ID|Type|
#+----+----+------+----+
#| ABC|East|123456| Yes|
#| XYZ|West|987654| No|
#+----+----+------+----+
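Note that inline drops rows whose Data array is null or empty. If those rows should be kept (with null field values), Spark also provides inline_outer:

# Same as inline(), but keeps rows with a null or empty Data array
df_Json.selectExpr("inline_outer(Data)").show()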

Pyspark exploding nested JSON into multiple columns and rows

As a first step, the JSON is transformed into an array of (level, tag, key, value) tuples using a UDF. The second step is to explode the array to get the individual rows:

from pyspark.sql import functions as F
from pyspark.sql import types as T

df = ...

def to_array(lvl):
    # Walk the nested Row structure and flatten it into
    # (level, tag, key, value) tuples
    def to_tuple(lvl):
        levels = lvl.asDict()
        for l in levels:
            tags = levels[l].asDict()
            for t in tags:
                keys = tags[t].asDict()
                for k in keys:
                    v = keys[k]
                    yield (l, t, k, v)
    return list(to_tuple(lvl))

outputschema = T.ArrayType(T.StructType([
    T.StructField("level", T.StringType(), True),
    T.StructField("tag", T.StringType(), True),
    T.StructField("key", T.StringType(), True),
    T.StructField("value", T.StringType(), True)
]))

to_array_udf = F.udf(to_array, outputschema)

df.withColumn("tmp", to_array_udf("Levels")) \
.withColumn("tmp", F.explode("tmp")) \
.select("Levels", "tmp.*") \
.show()

Output:

+--------------------+------+----+----+------+
| Levels| level| tag| key| value|
+--------------------+------+----+----+------+
|{{{value1, value2...|level1|tag1|key1|value1|
|{{{value1, value2...|level1|tag1|key2|value2|
|{{{value1, value2...|level1|tag1|key3|value3|
|{{{value1, value2...|level2|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key2|value2|
|{{{value1, value2...|level3|tag1|key3|value3|
|{{{value1, value2...|level3|tag2|key1|value1|
+--------------------+------+----+----+------+
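For reference, a minimal input that reproduces the output above could be built like this; the level/tag/key names are taken from the sample output, and the nesting (structs of structs of structs) is an assumption:

from pyspark.sql import Row

# Hypothetical nested input: Levels is a struct of levels, each level a
# struct of tags, each tag a struct of key/value pairs
df = spark.createDataFrame([
    Row(Levels=Row(
        level1=Row(tag1=Row(key1='value1', key2='value2', key3='value3')),
        level2=Row(tag1=Row(key1='value1')),
        level3=Row(tag1=Row(key1='value1', key2='value2', key3='value3'),
                   tag2=Row(key1='value1'))))
])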

How to explode column with multiple records into multiple Columns in Spark

Assume your dataset is df1. First, we have to explode the content of risk_table; if we don't, we will get arrays as column values, which is not what we want:

from pyspark.sql.functions import col, explode

df1 = df1.withColumn("explode", explode(col("risk_table")))

Now the explode column has one struct per row. There are a lot of ways to create columns from those structs, but I like to use selectExpr:

.selectExpr("id", "symbol_id", // or whatever other field you like
"explode.index as index_0", // then target the key with dot operator
"explode.risk_buy as risk_buy_index_0",
"explode.reward_buy as reward_buy_index_0"
// add your other wanted values
)

Dummy input:

+--------------------------+---+---------+
|risk_table |id |symbol_id|
+--------------------------+---+---------+
|[{1, 0.25, 0.3, 0.1, 0.3}]|1 |1 |
+--------------------------+---+---------+

Final output:

+---+---------+-------+----------------+------------------+
| id|symbol_id|index_0|risk_buy_index_0|reward_buy_index_0|
+---+---------+-------+----------------+------------------+
| 1| 1| 1| 0.25| 0.3|
+---+---------+-------+----------------+------------------+
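If risk_table can hold more than one struct per row, posexplode plus a pivot on the position generalises this to one column group per index. A sketch; note the pivoted columns come out named 0_index, 0_risk_buy, ... rather than index_0:

from pyspark.sql import functions as F

# One output row per (id, symbol_id), one column group per array position
(df1
    .select("id", "symbol_id", F.posexplode("risk_table").alias("pos", "r"))
    .groupBy("id", "symbol_id")
    .pivot("pos")
    .agg(F.first("r.index").alias("index"),
         F.first("r.risk_buy").alias("risk_buy"),
         F.first("r.reward_buy").alias("reward_buy"))
    .show())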

How to explode multiple columns (which are dictionaries with the same key) of a pyspark dataframe into rows

I would do something like this:

from pyspark.sql import functions as F

df.select(
    "s__",
    F.expr("""
        stack(
            4,
            "pct_ci_tr", pct_ci_tr,
            "pct_ci_rn", pct_ci_rn,
            "pct_ci_ttv", pct_ci_ttv,
            "pct_ci_comm", pct_ci_comm
        ) as (lib, map_values)"""),
).select("s__", "lib", F.explode(F.col("map_values")))

