Best Way to Get the Max Value in a Spark Dataframe Column

>df1.show()
+-----+--------------------+--------+----------+-----------+
|floor|           timestamp|     uid|         x|          y|
+-----+--------------------+--------+----------+-----------+
|    1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
|    1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
|    1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
|    1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|
+-----+--------------------+--------+----------+-----------+

>row1 = df1.agg({"x": "max"}).collect()[0]
>print(row1)
Row(max(x)=110.33613)
>print(row1["max(x)"])
110.33613

This answer is almost the same as method 3 in the question, but it seems the asDict() call in method 3 can be removed.
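
As a quick, self-contained illustration (a minimal sketch assuming an existing SparkSession named spark; the toy values below are not the original data), the Row returned by agg can be indexed directly, with no asDict() needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data standing in for df1
df1 = spark.createDataFrame([(103.79211,), (110.33613,), (110.066315,)], ["x"])

row1 = df1.agg({"x": "max"}).collect()[0]
print(row1["max(x)"])  # access the aggregated value by field name
print(row1[0])         # positional access works as well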

Is there any way to get max value from a column in Pyspark other than collect()?

No need to sort, you can just select the maximum:

from pyspark.sql.functions import col, max  # note: this import shadows Python's built-in max

res = df.select(max(col('col1')).alias('max_col1')).first().max_col1

Or you can use selectExpr:

res = df1.selectExpr('max(diff) as max_col1').first().max_col1
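
Here is a runnable sketch of both forms on throwaway data (the column name col1 and the SparkSession named spark are assumptions, not from the original question):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (5,), (3,)], ['col1'])

# select + first() brings back a single Row instead of collecting the whole DataFrame
print(df.select(F.max(F.col('col1')).alias('max_col1')).first().max_col1)  # 5
print(df.selectExpr('max(col1) as max_col1').first().max_col1)             # 5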

How to get the rows with Max value in Spark DataFrame

You can do this easily by extracting the maximum High value and then applying a filter against that value on the entire DataFrame.

Data Preparation

import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2021-02-09', '2009-09-19'],
    'High': [89, 90, 96],
    'Low': [43, 54, 50]
})

# 'sql' here is an existing SparkSession (or SQLContext)
sparkDF = sql.createDataFrame(df)

sparkDF.show()

+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|
+----------+----+---+

Filter

max_high = sparkDF.select(F.max(F.col('High')).alias('High')).collect()[0]['High']

>>> 96

sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()

+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2009-09-19| 96| 50|
+----------+----+---+
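
If you'd rather avoid the separate collect() step, a window-based variant of the same idea (a sketch, not part of the original answer) ranks rows by High and keeps the top one:

from pyspark.sql import Window
from pyspark.sql import functions as F

# Ties on High are broken by the most recent Date, matching the orderBy above
w = Window.orderBy(F.col('High').desc(), F.col('Date').desc())

# Note: a window without partitionBy moves all rows to one partition,
# which is fine for small data but can be a bottleneck at scale
sparkDF.withColumn('rn', F.row_number().over(w)) \
       .filter(F.col('rn') == 1) \
       .drop('rn') \
       .show()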

Add column to Spark dataframe with the max value that is less than the current record's value

You may try the following, which uses max as a window function together with when (a case expression), restricting the window frame to the preceding rows:

from pyspark.sql import functions as F
from pyspark.sql import Window


df = df.withColumn('previous_service_date', F.max(
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None)
).over(
    Window.partitionBy("product")
          .orderBy("service_date")  # order the frame so "preceding" means earlier service dates
          .rowsBetween(Window.unboundedPreceding, -1)
))

df.orderBy('service_date').show(truncate=False)
+---+--------------------+-------------------+------+-------+---------------------+
|id |claim_id            |service_date       |status|product|previous_service_date|
+---+--------------------+-------------------+------+-------+---------------------+
|123|10606134411906233408|2018-09-17 00:00:00|PD    |blue   |null                 |
|123|10606147900401009928|2019-01-24 00:00:00|PD    |yellow |null                 |
|123|10606160940704723994|2019-05-23 00:00:00|RV    |yellow |2019-01-24 00:00:00  |
|123|10606171648203079553|2019-08-29 00:00:00|RJ    |blue   |2018-09-17 00:00:00  |
|123|10606186611407311724|2020-01-13 00:00:00|PD    |blue   |2018-09-17 00:00:00  |
+---+--------------------+-------------------+------+-------+---------------------+

Edit 1

You may also use last with ignorenulls=True, as shown below:

df = df.withColumn('previous_service_date', F.last(
    F.when(F.col("status") == "PD", F.col("service_date")).otherwise(None),
    ignorenulls=True
).over(
    Window.partitionBy("product")
          .orderBy('service_date')
          .rowsBetween(Window.unboundedPreceding, -1)
))

Let me know if this works for you.
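
For reference, here is a self-contained sketch of the first approach, with toy rows reconstructed from the output above (the SparkSession name spark is an assumption):

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Toy rows reconstructed from the table shown above
data = [
    (123, '10606134411906233408', '2018-09-17 00:00:00', 'PD', 'blue'),
    (123, '10606147900401009928', '2019-01-24 00:00:00', 'PD', 'yellow'),
    (123, '10606160940704723994', '2019-05-23 00:00:00', 'RV', 'yellow'),
    (123, '10606171648203079553', '2019-08-29 00:00:00', 'RJ', 'blue'),
    (123, '10606186611407311724', '2020-01-13 00:00:00', 'PD', 'blue'),
]
df = spark.createDataFrame(data, ['id', 'claim_id', 'service_date', 'status', 'product'])

w = (Window.partitionBy('product')
           .orderBy('service_date')
           .rowsBetween(Window.unboundedPreceding, -1))

# For each row: the latest earlier service_date with status 'PD' within the same product
df = df.withColumn('previous_service_date',
                   F.max(F.when(F.col('status') == 'PD', F.col('service_date'))).over(w))
df.orderBy('service_date').show(truncate=False)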

get min and max from a specific column scala spark dataframe

How about getting the column name from the metadata:

import org.apache.spark.sql.functions.{min, max}

val selectedColumnName = df.columns(q) // pull the (q + 1)-th column from the columns array (0-based index)
df.agg(min(selectedColumnName), max(selectedColumnName))

Getting the maximum value of a column in an Apache Spark Dataframe (Scala)

In your code, Spark's max function was being mistaken for Scala's own max, so I fully qualified it to make clear that it comes from Spark.

This will give you the max of the id column as an integer value:

%scala
import org.apache.spark.sql.functions.col

val max = df.agg(org.apache.spark.sql.functions.max(col("id"))).collect()(0)(0).asInstanceOf[Int]

Output: max: Int = 6

OR

If you want to create a column that stores the max value of id:

%scala
import org.apache.spark.sql.functions._
import spark.implicits._ // needed for the $-column syntax and the Int encoder

df.withColumn("max", lit(df.agg(org.apache.spark.sql.functions.max($"id")).as[Int].first))

How to get the max value of date column in pyspark

I don't understand why you used try/except; the if-statement should be enough. Also, you need to use the Spark SQL min/max functions instead of Python's built-ins, and you should avoid naming your variables min/max, which shadows the built-in functions.

import pyspark.sql.functions as F

for col in df.columns:
    if dict(df.dtypes)[col] == 'string':
        minval, maxval = df.select(F.min(col), F.max(col)).first()
        print(maxval)
    else:
        print(col, 'NA')
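
If the column already has a date or timestamp type, a single aggregation is enough; a minimal sketch (date_col is a hypothetical column name):

import pyspark.sql.functions as F

# 'date_col' is a hypothetical column name; apply F.to_date first if the dates are stored as strings
max_date = df.select(F.max('date_col').alias('max_date')).first()['max_date']
print(max_date)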

how to find the max value of all columns in a spark dataframe

The code below will work irrespective of how many columns there are or what mix of datatypes they have.

Note: the OP suggested in her comments that for string columns, the first non-null value should be taken while grouping.

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 10| 5| null| 50|
| Bob| 15| 15|Simon| 10|
| Jack| 5| 1| Timo| 3|
+-----+----+----+-----+----+

# Lists all columns in the DataFrame
seq_of_columns = df.columns
print(seq_of_columns)
['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
print(string_columns)
['col1', 'col4']

# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
print(non_string_columns)
['col2', 'col3', 'col5']

See the documentation of first() and its ignorenulls option for details.

# Aggregate non-string columns with max and string columns with first(ignorenulls=True)
df = df.select(*[max(col(c)).alias(c) for c in non_string_columns],
               *[first(col(c), ignorenulls=True).alias(c) for c in string_columns])

# Restore the original column order
df = df[[seq_of_columns]]
df.show()
+-----+----+----+-----+----+
| col1|col2|col3| col4|col5|
+-----+----+----+-----+----+
|Alice| 15| 15|Simon| 50|
+-----+----+----+-----+----+

