Pyspark: explode json in column to multiple columns
As long as you are using Spark version 2.1 or higher, pyspark.sql.functions.from_json
should get you your desired result, but you first need to define the required schema:
from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StructField, StringType

# Schema of the JSON stored in the `data` column
schema = StructType([
    StructField('key1', StringType(), True),
    StructField('key2', StringType(), True)
])

# Parse the JSON string, then flatten the struct into top-level columns
df.withColumn("data", from_json("data", schema)) \
    .select(col('id'), col('point'), col('data.*')) \
    .show()
which should give you
+---+-----+----+----+
| id|point|key1|key2|
+---+-----+----+----+
|abc| 6| 124| 345|
|df1| 7| 777| 888|
|4bd| 6| 111| 788|
+---+-----+----+----+
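If you'd rather not write the schema by hand, Spark can derive it from a sample of the data. A minimal sketch, assuming Spark 3.0+ (where PySpark's from_json accepts the schema as a Column) and at least one non-null row:

from pyspark.sql import functions as F

# Infer the schema from one sample JSON string
# (assumption: the first row's `data` value is non-null and representative)
sample = df.select("data").first()["data"]

df.withColumn("data", F.from_json("data", F.schema_of_json(F.lit(sample)))) \
    .select("id", "point", "data.*") \
    .show()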
PySpark Explode JSON String into Multiple Columns
If you simplify the output as mentioned, you can define a simple JSON schema, convert the JSON string into a StructType, and read each field.
Input
df = spark.createDataFrame([("[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]",1)], "col1:string, col2:int")
# +-----------------------------------------------------------------------------------------------------------------+----+
# |col1 |col2|
# +-----------------------------------------------------------------------------------------------------------------+----+
# |[{'to': 'Sam', 'position': 'guard'},{'to': 'John', 'position': 'center'},{'to': 'Andrew', 'position': 'forward'}]|1 |
# +-----------------------------------------------------------------------------------------------------------------+----+
And this is the transformation
from pyspark.sql import functions as F
from pyspark.sql import types as T
schema = T.ArrayType(T.StructType([
    T.StructField('to', T.StringType()),
    T.StructField('position', T.StringType())
]))

(df
    .withColumn('temp', F.explode(F.from_json('col1', schema=schema)))
    .select(
        F.col('col2'),
        F.col('temp.to').alias('col3'),
        F.col('temp.position').alias('col4'),
    )
    .show()
)
# Output
# +----+------+-------+
# |col2| col3| col4|
# +----+------+-------+
# | 1| Sam| guard|
# | 1| John| center|
# | 1|Andrew|forward|
# +----+------+-------+
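Since Spark 2.4, the same schema can also be written as a compact DDL string, which saves the types import. An equivalent sketch under that assumption:

from pyspark.sql import functions as F

# Same transformation, schema given as a DDL string (assumes Spark 2.4+)
ddl = "array<struct<to: string, position: string>>"

(df
    .withColumn('temp', F.explode(F.from_json('col1', ddl)))
    .select(
        F.col('col2'),
        F.col('temp.to').alias('col3'),
        F.col('temp.position').alias('col4'),
    )
    .show()
)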
Explode multiple columns from nested JSON but it is giving extra records
You only need to explode the Data column; then you can select fields from the resulting struct column (Code, Id, ...). What duplicates the rows here is that you're exploding the two arrays Data.Code and Data.Id.
Try this instead:
import pyspark.sql.functions as F
df_Json.withColumn("Data", F.explode("Data")).select("Data.Code", "Data.Id").show()
#+----+------+
#|Code| Id|
#+----+------+
#| ABC|123456|
#| XYZ|987654|
#+----+------+
Or use the inline function directly on the Data array:
df_Json.selectExpr("inline(Data)").show()
#+----+----+------+----+
#|Code| Geo| ID|Type|
#+----+----+------+----+
#| ABC|East|123456| Yes|
#| XYZ|West|987654| No|
#+----+----+------+----+
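To see why the double explode duplicates rows, here is a hypothetical minimal reproduction (the demo frame and its two-element arrays are made up): exploding two arrays in sequence produces their cross product.

from pyspark.sql import functions as F

# Hypothetical two-element arrays: exploding both yields 2 x 2 = 4 rows
demo = spark.createDataFrame(
    [(["ABC", "XYZ"], ["123456", "987654"])],
    "Code: array<string>, Id: array<string>")

demo.withColumn("Code", F.explode("Code")) \
    .withColumn("Id", F.explode("Id")) \
    .show()
# 4 rows instead of 2: every Code is paired with every Id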
Pyspark exploding nested JSON into multiple columns and rows
As a first step, the JSON is transformed into an array of (level, tag, key, value) tuples using a UDF. The second step is to explode the array to get the individual rows:
from pyspark.sql import functions as F
from pyspark.sql import types as T
df = ...
def to_array(lvl):
    # Walk the nested struct: level -> tag -> key -> value
    def to_tuple(lvl):
        levels = lvl.asDict()
        for l in levels:
            tags = levels[l].asDict()
            for t in tags:
                keys = tags[t].asDict()
                for k in keys:
                    yield (l, t, k, keys[k])
    return list(to_tuple(lvl))

outputschema = T.ArrayType(T.StructType([
    T.StructField("level", T.StringType(), True),
    T.StructField("tag", T.StringType(), True),
    T.StructField("key", T.StringType(), True),
    T.StructField("value", T.StringType(), True)
]))

to_array_udf = F.udf(to_array, outputschema)

df.withColumn("tmp", to_array_udf("Levels")) \
    .withColumn("tmp", F.explode("tmp")) \
    .select("Levels", "tmp.*") \
    .show()
Output:
+--------------------+------+----+----+------+
| Levels| level| tag| key| value|
+--------------------+------+----+----+------+
|{{{value1, value2...|level1|tag1|key1|value1|
|{{{value1, value2...|level1|tag1|key2|value2|
|{{{value1, value2...|level1|tag1|key3|value3|
|{{{value1, value2...|level2|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key1|value1|
|{{{value1, value2...|level3|tag1|key2|value2|
|{{{value1, value2...|level3|tag1|key3|value3|
|{{{value1, value2...|level3|tag2|key1|value1|
+--------------------+------+----+----+------+
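The answer doesn't show the input frame. A minimal sketch that rebuilds the Levels column from a JSON literal could look like this; the literal and the DDL schema are assumptions reconstructed from the output above:

from pyspark.sql import functions as F

# Assumed input shape, reconstructed from the output table above
json_str = """{
  "level1": {"tag1": {"key1": "value1", "key2": "value2", "key3": "value3"}},
  "level2": {"tag1": {"key1": "value1"}},
  "level3": {"tag1": {"key1": "value1", "key2": "value2", "key3": "value3"},
             "tag2": {"key1": "value1"}}
}"""
ddl = ("struct<level1: struct<tag1: struct<key1: string, key2: string, key3: string>>, "
       "level2: struct<tag1: struct<key1: string>>, "
       "level3: struct<tag1: struct<key1: string, key2: string, key3: string>, "
       "tag2: struct<key1: string>>>")
df = spark.range(1).select(F.from_json(F.lit(json_str), ddl).alias("Levels"))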
How to explode column with multiple records into multiple Columns in Spark
Assume your dataset is df1. First, we have to explode the content of risk_table; if we don't, we will get arrays as column values, which we do not want:
from pyspark.sql.functions import col, explode

df1 = df1.withColumn("explode", explode(col("risk_table")))
Now the explode column has one object per row; there are a lot of ways to create columns from objects, but I like to use selectExpr:
.selectExpr("id", "symbol_id", // or whatever other field you like
"explode.index as index_0", // then target the key with dot operator
"explode.risk_buy as risk_buy_index_0",
"explode.reward_buy as reward_buy_index_0"
// add your other wanted values
)
Dummy input:
+--------------------------+---+---------+
|risk_table |id |symbol_id|
+--------------------------+---+---------+
|[{1, 0.25, 0.3, 0.1, 0.3}]|1 |1 |
+--------------------------+---+---------+
Final output:
+---+---------+-------+----------------+------------------+
| id|symbol_id|index_0|risk_buy_index_0|reward_buy_index_0|
+---+---------+-------+----------------+------------------+
| 1| 1| 1| 0.25| 0.3|
+---+---------+-------+----------------+------------------+
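If risk_table can hold several structs and you want one set of columns per index instead of extra rows, element_at (Spark 2.4+) is one option. A sketch, assuming a known fixed array length; the index_0/index_1 column names are illustrative:

from pyspark.sql import functions as F

# One column set per array position; element_at is 1-based and
# returns NULL past the end of the array
df1.select(
    "id", "symbol_id",
    F.element_at("risk_table", 1)["index"].alias("index_0"),
    F.element_at("risk_table", 1)["risk_buy"].alias("risk_buy_index_0"),
    F.element_at("risk_table", 2)["index"].alias("index_1"),
    F.element_at("risk_table", 2)["risk_buy"].alias("risk_buy_index_1"),
).show()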
How to explode multiple columns (which are dictionaries with the same key) of a pyspark dataframe into rows
I would do something like this:
from pyspark.sql import functions as F

df.select(
    "s__",
    F.expr("""
        stack(
            4,
            "pct_ci_tr",   pct_ci_tr,
            "pct_ci_rn",   pct_ci_rn,
            "pct_ci_ttv",  pct_ci_ttv,
            "pct_ci_comm", pct_ci_comm
        ) as (lib, map_values)"""
    ),
).select("s__", "lib", F.explode(F.col("map_values")))