How to get max(date) from a given set of data grouped by some fields using PySpark?
For non-numeric but orderable types you can use agg with max directly:
from pyspark.sql.functions import col, max as max_
df = sc.parallelize([
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]).toDF(["datetime", "userId", "memberId", "value"])

(df.withColumn("datetime", col("datetime").cast("timestamp"))
    .groupBy("userId", "memberId")
    .agg(max_("datetime")))
## +------+--------+--------------------+
## |userId|memberId| max(datetime)|
## +------+--------+--------------------+
## | 1234| 111|2016-04-06 17:35:...|
## +------+--------+--------------------+
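The grouped max in the snippet above can be sketched in plain Python (no Spark required) to show the logic: for each (userId, memberId) key, keep the largest parsed timestamp. The rows mirror the example data; the dict-based helper is illustrative only, not part of the PySpark API.

```python
from datetime import datetime

rows = [
    ("2016-04-06 16:36", 1234, 111, 1),
    ("2016-04-06 17:35", 1234, 111, 5),
]

# Group by (userId, memberId) and keep the max parsed timestamp,
# mirroring groupBy(...).agg(max_("datetime")).
max_per_group = {}
for dt_str, user_id, member_id, _value in rows:
    dt = datetime.strptime(dt_str, "%Y-%m-%d %H:%M")
    key = (user_id, member_id)
    if key not in max_per_group or dt > max_per_group[key]:
        max_per_group[key] = dt

print(max_per_group)
# {(1234, 111): datetime.datetime(2016, 4, 6, 17, 35)}
```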
pyspark window min(date) and max(date) of group
Here is one way to approach the problem:
- Create a helper group column to distinguish between the consecutive rows in loc per user
- Then group the dataframe by the columns user, loc and group, and aggregate the column date using min and max
- Drop the group column and sort the dataframe by startdate
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy('user').orderBy('date')
b = F.lag('loc').over(w) != F.col('loc')

(
    df.withColumn('group', b.cast('int'))
      .fillna(0, 'group')
      .withColumn('group', F.sum('group').over(w))
      .groupBy('user', 'loc', 'group')
      .agg(F.min('date').alias('startdate'),
           F.max('date').alias('enddate'))
      .drop('group')
      .orderBy('startdate')
)
+----+---+----------+----------+
|user|loc| startdate| enddate|
+----+---+----------+----------+
| a| 1|2021-01-01|2021-01-02|
| a| 2|2021-01-03|2021-01-04|
| a| 1|2021-01-05|2021-01-06|
+----+---+----------+----------+
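The gaps-and-islands trick above (a lag-based change flag turned into a running group id, then min/max per group) can be mimicked in plain Python. The sample rows below are an assumption chosen to match the output table; the loop emulates F.lag and the running F.sum over the window.

```python
# Sample rows per user, already ordered by date (assumed to match the
# output table above).
rows = [
    ("a", 1, "2021-01-01"),
    ("a", 1, "2021-01-02"),
    ("a", 2, "2021-01-03"),
    ("a", 2, "2021-01-04"),
    ("a", 1, "2021-01-05"),
    ("a", 1, "2021-01-06"),
]

# Emulate: group = running sum of (lag(loc) != loc), then
# groupBy(user, loc, group).agg(min(date), max(date)).
result = {}
prev_key = None
group = -1
for user, loc, date in rows:
    if (user, loc) != prev_key:
        group += 1              # a new "island" starts when loc changes
        prev_key = (user, loc)
    start, end = result.get((user, loc, group), (date, date))
    result[(user, loc, group)] = (min(start, date), max(end, date))

for (user, loc, _g), (start, end) in result.items():
    print(user, loc, start, end)
# a 1 2021-01-01 2021-01-02
# a 2 2021-01-03 2021-01-04
# a 1 2021-01-05 2021-01-06
```

ISO-format date strings compare correctly lexicographically, which is why plain min/max on the strings works here.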
Retrieval of max date group by other column in spark-sql with scala
Applying max on a string type column will not give you the maximum date. You need to convert it to a date type column first:
val maxDateDF = spark.sql("SELECT name, max(to_date(birthDate, 'dd/MM/yyyy')) maxDate FROM people group by name")
If you wish to retain the original date formatting, you can convert it back to a string using date_format:
val maxDateDF = spark.sql("SELECT name, date_format(max(to_date(birthDate, 'dd/MM/yyyy')), 'dd/MM/yyyy') maxDate FROM people group by name")
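Why the cast matters can be seen with plain Python strings in the same dd/MM/yyyy format: a lexicographic max compares the day digits first, so it can pick the wrong date. The sample values below are assumptions for illustration only.

```python
from datetime import datetime

birth_dates = ["25/01/2019", "03/06/2020"]  # dd/MM/yyyy

# Lexicographic max compares "2" vs "0" first and picks the 2019 date.
print(max(birth_dates))                     # 25/01/2019 -- wrong

# Parsing to real dates first gives the true maximum.
parsed = [datetime.strptime(d, "%d/%m/%Y") for d in birth_dates]
latest = max(parsed)
print(latest.strftime("%d/%m/%Y"))          # 03/06/2020 -- correct
```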
Spark 2.0 groupBy column and then get max(date) on a datetype column
RelationalGroupedDataset.max is for numeric values only. You could try agg() with the accompanying max function instead. In Scala:
import org.apache.spark.sql.functions._
old_df.groupBy($"ID").agg(max("date"))
so in Java it should be:
import static org.apache.spark.sql.functions.*;
old_df.groupBy("ID").agg(max("date"))
GroupBy column and filter rows with maximum value in Pyspark
You can do this without a udf by using a Window. Consider the following example:
import pyspark.sql.functions as f
data = [
    ('a', 5),
    ('a', 8),
    ('a', 7),
    ('b', 1),
    ('b', 3)
]
df = sqlCtx.createDataFrame(data, ["A", "B"])
df.show()
#+---+---+
#| A| B|
#+---+---+
#| a| 5|
#| a| 8|
#| a| 7|
#| b| 1|
#| b| 3|
#+---+---+
Create a Window partitioned by column A and use it to compute the maximum of each group. Then filter out the rows such that the value in column B is equal to the max.
from pyspark.sql import Window
w = Window.partitionBy('A')
df.withColumn('maxB', f.max('B').over(w))\
    .where(f.col('B') == f.col('maxB'))\
    .drop('maxB')\
    .show()
#+---+---+
#| A| B|
#+---+---+
#| a| 8|
#| b| 3|
#+---+---+
Or equivalently using pyspark-sql:
df.registerTempTable('table')
q = "SELECT A, B FROM (SELECT *, MAX(B) OVER (PARTITION BY A) AS maxB FROM table) M WHERE B = maxB"
sqlCtx.sql(q).show()
#+---+---+
#| A| B|
#+---+---+
#| b| 3|
#| a| 8|
#+---+---+
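The window-max-then-filter pattern can be mimicked in plain Python to show exactly what the query computes: first the max of B per A, then keep only the rows equal to that max. The data matches the example above; the helper dict is illustrative, not a PySpark API.

```python
data = [('a', 5), ('a', 8), ('a', 7), ('b', 1), ('b', 3)]

# Step 1: MAX(B) OVER (PARTITION BY A) -- the max of B for each A.
max_b = {}
for a, b in data:
    max_b[a] = max(max_b.get(a, b), b)

# Step 2: WHERE B = maxB -- keep only rows at their group's maximum.
result = [(a, b) for a, b in data if b == max_b[a]]
print(result)  # [('a', 8), ('b', 3)]
```

Note that, unlike a plain groupBy().max(), this keeps whole rows, so any additional columns would survive the filter.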
How to calculate Max(Date) and Min(Date) for DateType in pyspark dataframe?
Aggregate with min and max:
from pyspark.sql.functions import min, max
df = spark.createDataFrame([
    "2017-01-01", "2018-02-08", "2019-01-03"], "string"
).selectExpr("CAST(value AS date) AS date")
min_date, max_date = df.select(min("date"), max("date")).first()
min_date, max_date
# (datetime.date(2017, 1, 1), datetime.date(2019, 1, 3))
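The same aggregation can be checked in plain Python with the builtin min and max over parsed dates (same sample values as above):

```python
from datetime import date

dates = [date.fromisoformat(s)
         for s in ["2017-01-01", "2018-02-08", "2019-01-03"]]
min_date, max_date = min(dates), max(dates)
print(min_date, max_date)  # 2017-01-01 2019-01-03
```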