Best Way to Get the Max Value in a Spark Dataframe Column


|floor| timestamp| uid| x| y|
| 1|2014-07-19T16:00:...|600dfbe2| 103.79211|71.50419418|
| 1|2014-07-19T16:00:...|5e7b40e1| 110.33613|100.6828393|
| 1|2014-07-19T16:00:...|285d22e4|110.066315|86.48873585|
| 1|2014-07-19T16:00:...|74d917a1| 103.78499|71.45633073|

row1 = df1.agg({"x": "max"}).collect()[0]
print(row1)
print(row1["max(x)"])

This answer is almost the same as method 3, but it seems the asDict() call in method 3 can be removed.

Is there any way to get max value from a column in Pyspark other than collect()?

No need to sort, you can just select the maximum:

from pyspark.sql import functions as F

res ='col1')).alias('max_col1')).first().max_col1

Or you can use selectExpr

res = df1.selectExpr('max(col1) as max_col1').first().max_col1

How to get the rows with Max value in Spark DataFrame

You can do this easily by extracting the max High value and then applying a filter against that value on the entire DataFrame.

Data Preparation

import pandas as pd

df = pd.DataFrame({
    'Date': ['2021-01-23', '2021-02-09', '2009-09-19'],
    'High': [89, 90, 96],
    'Low': [43, 54, 50],
})

sparkDF = sql.createDataFrame(df)  # 'sql' is the SparkSession

| Date|High|Low|
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|


max_high ='High')).alias('High')).collect()[0]['High']

>>> 96

sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()

| Date|High|Low|
|2009-09-19| 96| 50|

Add column to Spark dataframe with the max value that is less than the current record's value

You may try the following, which uses max as a window function together with when (a case expression) but restricts the window frame to the preceding rows:

from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy('id', 'product').orderBy('service_date') \
          .rowsBetween(Window.unboundedPreceding, -1)

df = df.withColumn(
    'previous_service_date',
    F.max(
        F.when(F.col('status') == 'PD', F.col('service_date'))
    ).over(w)
)

|id |claim_id |service_date |status|product|previous_service_date|
|123|10606134411906233408|2018-09-17 00:00:00|PD |blue |null |
|123|10606147900401009928|2019-01-24 00:00:00|PD |yellow |null |
|123|10606160940704723994|2019-05-23 00:00:00|RV |yellow |2019-01-24 00:00:00 |
|123|10606171648203079553|2019-08-29 00:00:00|RJ |blue |2018-09-17 00:00:00 |
|123|10606186611407311724|2020-01-13 00:00:00|PD |blue |2018-09-17 00:00:00 |

Edit 1

You may also use last with ignorenulls, as shown below:

w = Window.partitionBy('id', 'product').orderBy('service_date') \
          .rowsBetween(Window.unboundedPreceding, -1)

df = df.withColumn(
    'previous_service_date',
    F.last(
        F.when(F.col('status') == 'PD', F.col('service_date')).otherwise(None),
        True  # ignorenulls
    ).over(w)
)

Let me know if this works for you.

get min and max from a specific column scala spark dataframe

How about getting the column name from the metadata:

val selectedColumnName = df.columns(q) //pull the (q + 1)th column from the columns array
df.agg(min(selectedColumnName), max(selectedColumnName))

Getting the maximum value of a column in an Apache Spark Dataframe (Scala)

In your code, Spark's max was being mistaken for Scala's max, so I simply specified that max should come from Spark:

That will give you the max of id column as an integer value:

import org.apache.spark.sql.functions.col

val max = df.agg(org.apache.spark.sql.functions.max(col("id"))).collect()(0)(0).asInstanceOf[Int]

Output: max: Int = 6


If you want to create column to store max value of id:

import org.apache.spark.sql.functions._
df.withColumn("max", lit(df.agg(org.apache.spark.sql.functions.max($"id")).as[Int].first))

How to get the max value of date column in pyspark

I don't understand why you used try/except; the if-statement should be enough. Also, you need to use the Spark SQL min/max functions instead of the Python built-ins, and you should avoid naming your variables min/max, which shadows the built-in functions.

import pyspark.sql.functions as F

for col in df.columns:
    if dict(df.dtypes)[col] == 'string':
        print(col, 'NA')
    else:
        minval, maxval =, F.max(col)).first()
        print(col, minval, maxval)

how to find the max value of all columns in a spark dataframe

The code below works irrespective of how many columns there are or what mix of datatypes they have.

Note: OP suggested in her comments that for string columns, take the first non-Null value while grouping.

# Import relevant functions
from pyspark.sql.functions import max, first, col

# Take an example DataFrame
values = [('Alice',10,5,None,50),('Bob',15,15,'Simon',10),('Jack',5,1,'Timo',3)]
df = sqlContext.createDataFrame(values,['col1','col2','col3','col4','col5'])
| col1|col2|col3| col4|col5|
|Alice| 10| 5| null| 50|
| Bob| 15| 15|Simon| 10|
| Jack| 5| 1| Timo| 3|

# Lists all columns in the DataFrame
seq_of_columns = df.columns
['col1', 'col2', 'col3', 'col4', 'col5']

# Using List comprehensions to create a list of columns of String DataType
string_columns = [i[0] for i in df.dtypes if i[1]=='string']
['col1', 'col4']

# Using Set function to get non-string columns by subtracting one list from another.
non_string_columns = list(set(seq_of_columns) - set(string_columns))
['col2', 'col3', 'col5']

# See the documentation for first() and its ignorenulls argument

# Aggregating both string and non-string columns
df =
    *[max(col(c)).alias(c) for c in non_string_columns],
    *[first(col(c), ignorenulls=True).alias(c) for c in string_columns]
)

# Restore the original column order
df =*seq_of_columns)
| col1|col2|col3| col4|col5|
|Alice| 15| 15|Simon| 50|
