Getting a Value in a DataFrame in PySpark

get value out of dataframe

collect() returns your results as a Python list. To get the value out of the list, you just need to take the first element, like this:

saleDF.groupBy("salesNum").mean().collect()[0]
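
collect()[0] still gives you a Row, not a plain value. A minimal sketch of going one step further and indexing into that Row (the column name used below is hypothetical, since the original DataFrame isn't shown):

row = saleDF.groupBy("salesNum").mean().collect()[0]

# A Row can be indexed positionally or by name. row[0] is the grouping
# key; the aggregated columns follow and are named "avg(<column>)" by
# Spark. "amount" is an assumed numeric column for illustration.
group_key = row[0]
mean_value = row["avg(amount)"]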

Getting value in a dataframe in PySpark

Get the first Row from the collected list using index 0, then get the value using the column name "count":

from pyspark.sql.functions import col
data.groupby("card_bank", "failed").count().filter(col("failed") == "true").collect()[0]["count"]

Spark - extracting single value from DataFrame

You can use head

df.head().getInt(0)

or first

df.first().getInt(0)

Check the DataFrame Scala docs for more details.
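
That answer uses the Scala Row API; a PySpark Row has no getInt, but plain indexing does the same job. A minimal sketch, assuming the first column holds the value you want:

# head() and first() both return the first Row (or None on an empty
# DataFrame); [0] then picks the first column as a Python value.
value = df.head()[0]
# or
value = df.first()[0]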

How to get a value from one pyspark dataframe using where clause

You should put the where clause before the select clause; otherwise it always returns nothing, because after the select the column referenced in the where clause no longer exists.

datasource_df = datasource_df.where(F.col('id') == account_id).select(F.col('host'))

Also, for this type of query it's better to do a join instead of collecting the DataFrames and comparing them row by row.
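
A minimal sketch of the join approach, assuming a second DataFrame accounts_df with an id column (not shown in the original question):

# Joining keeps the matching inside Spark instead of collecting both
# DataFrames to the driver and comparing rows in Python.
result_df = datasource_df.join(
    accounts_df,
    datasource_df["id"] == accounts_df["id"],
    "inner"
).select(datasource_df["host"])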

Pyspark dataframe column value dependent on value from another row

You can use the first function with ignorenulls=True over a Window, but you need to identify the groups of manufacturers in order to partition by that group.

As you didn't give any ID column, I'm using monotonically_increasing_id and a cumulative conditional sum to create a group column:

from pyspark.sql import Window
from pyspark.sql import functions as F

df1 = df.withColumn(
    "row_id",
    F.monotonically_increasing_id()
).withColumn(
    # cumulative count of "Factory" rows marks the start of each group
    "group",
    F.sum(F.when(F.col("manufacturer") == "Factory", 1)).over(Window.orderBy("row_id"))
).withColumn(
    "product_id",
    F.when(
        F.col("product_id") == 0,
        F.first("product_id", ignorenulls=True).over(Window.partitionBy("group").orderBy("row_id"))
    ).otherwise(F.col("product_id"))
).drop("row_id", "group")

df1.show()
#+-------------+----------+
#| manufacturer|product_id|
#+-------------+----------+
#|      Factory|     AE222|
#|Sub-Factory-1|     AE222|
#|Sub-Factory-2|     AE222|
#|      Factory|     AE333|
#|Sub-Factory-1|     AE333|
#|Sub-Factory-2|     AE333|
#+-------------+----------+

get value in pyspark dataframe

content is an array that contains arrays ... nested arrays. You have to use explode 4 times, I believe, to explode each level of nesting.
Assuming df is your DataFrame, I would do something like this:

from pyspark.sql import functions as F

df = df.withColumn(
    "content",
    F.explode("back.content")
)

df = df.withColumn(
    "content",
    F.explode("content.content")
)

df = df.withColumn(
    "content",
    F.explode("content.content")
)

df = df.withColumn(
    "content",
    F.explode("content.content")
)

df = df.withColumn(
    "text",
    F.col("content.text")
)

Select values from MapType Column in UDF PySpark

It's because your map does not have anything at key=1.

from pyspark.sql import functions as F

df_temp = spark.createDataFrame([(100,), (101,), (102,)], ['CUSTOMER_ID']) \
    .withColumn('col_a', F.create_map(F.lit(0.0), F.lit(1.0)))
df_temp.show()
# +-----------+------------+
# |CUSTOMER_ID|       col_a|
# +-----------+------------+
# |        100|{0.0 -> 1.0}|
# |        101|{0.0 -> 1.0}|
# |        102|{0.0 -> 1.0}|
# +-----------+------------+

df_temp = df_temp.withColumn('col_a_0', F.col('col_a')[0])
df_temp = df_temp.withColumn('col_a_1', F.col('col_a')[1])

df_temp.show()
# +-----------+------------+-------+-------+
# |CUSTOMER_ID|       col_a|col_a_0|col_a_1|
# +-----------+------------+-------+-------+
# |        100|{0.0 -> 1.0}|    1.0|   null|
# |        101|{0.0 -> 1.0}|    1.0|   null|
# |        102|{0.0 -> 1.0}|    1.0|   null|
# +-----------+------------+-------+-------+
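
If the key you look up may be missing, you can substitute a default instead of getting null. A minimal sketch using coalesce on the example above (the default value 0.0 is an arbitrary choice):

from pyspark.sql import functions as F

# col_a[1] is null because the map has no key 1.0; coalesce falls back
# to the literal default in that case.
df_temp = df_temp.withColumn(
    "col_a_1_or_default",
    F.coalesce(F.col("col_a")[1], F.lit(0.0))
)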

