get value out of dataframe
collect()
returns your results as a Python list. To get the value out of the list, you just need to take the first element, like this:
saleDF.groupBy("salesNum").mean().collect()[0]
Getting value in a dataframe in PySpark
Get the first Row from the collected list using index 0, then get the value using the key "count":
from pyspark.sql.functions import col
data.groupby("card_bank", "failed").count().filter(col("failed") == "true").collect()[0]["count"]
Spark - extracting single value from DataFrame
You can use head
df.head().getInt(0)
or first
df.first().getInt(0)
Check the DataFrame Scala docs for more details.
How to get a value from one pyspark dataframe using where clause
You should put the where
clause before the select
clause; otherwise it always returns nothing, because the column used in the where
clause no longer exists after the select.
datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))
Also, for this type of query it's better to do a join
instead of collecting DataFrames and comparing them row by row.
Pyspark dataframe column value dependent on value from another row
You can use the first
function with ignorenulls=True
over a Window, but you need to identify the groups of manufacturer
rows in order to partition by that group.
As you didn't give any ID
column, I'm using monotonically_increasing_id
and a cumulative conditional sum to create a group column:
from pyspark.sql import functions as F
from pyspark.sql import Window

df1 = df.withColumn(
    "row_id",
    F.monotonically_increasing_id()
).withColumn(
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).withColumn(
    "product_id",
    F.when(
        F.col("product_id") == 0,
        F.first("product_id", ignorenulls=True).over(Window.partitionBy("group").orderBy("row_id"))
    ).otherwise(F.col("product_id"))
).drop("row_id", "group")
df1.show()
#+-------------+----------+
#| manufacturer|product_id|
#+-------------+----------+
#| Factory| AE222|
#|Sub-Factory-1| AE222|
#|Sub-Factory-2| AE222|
#| Factory| AE333|
#|Sub-Factory-1| AE333|
#|Sub-Factory-2| AE333|
#+-------------+----------+
get value in pyspark dataframe
content
is an array that contains arrays ... nested arrays. You have to use explode
four times, I believe, to explode each level.
Assuming df is your dataframe, I would do something like this :
from pyspark.sql import functions as F
df = df.withColumn("content", F.explode("back.content"))
df = df.withColumn("content", F.explode("content.content"))
df = df.withColumn("content", F.explode("content.content"))
df = df.withColumn("content", F.explode("content.content"))
df = df.withColumn("text", F.col("content.text"))
Select values from MapType Column in UDF PySpark
It's because your map does not have anything at key=1.
from pyspark.sql import functions as F

df_temp = spark.createDataFrame([(100,), (101,), (102,)], ['CUSTOMER_ID']) \
    .withColumn('col_a', F.create_map(F.lit(0.0), F.lit(1.0)))
df_temp.show()
# +-----------+------------+
# |CUSTOMER_ID| col_a|
# +-----------+------------+
# | 100|{0.0 -> 1.0}|
# | 101|{0.0 -> 1.0}|
# | 102|{0.0 -> 1.0}|
# +-----------+------------+
df_temp = df_temp.withColumn('col_a_0', F.col('col_a')[0])
df_temp = df_temp.withColumn('col_a_1', F.col('col_a')[1])
df_temp.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID| col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# | 100|{0.0 -> 1.0}| 1.0| null|
# | 101|{0.0 -> 1.0}| 1.0| null|
# | 102|{0.0 -> 1.0}| 1.0| null|
# +-----------+------------+-------+-------+