Pandas Dense Rank

Pandas DENSE RANK

Use pd.Series.rank with method='dense':

df['Rank'] = df.Year.rank(method='dense').astype(int)

df
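
For a self-contained illustration, here is a minimal sketch with a made-up Year column (the sample data is an assumption, not from the original question):

import pandas as pd

df = pd.DataFrame({'Year': [2012, 2012, 2013, 2015, 2015, 2015]})

# method='dense' gives tied values the same rank and leaves no gaps afterwards
df['Rank'] = df.Year.rank(method='dense').astype(int)
print(df)

   Year  Rank
0  2012     1
1  2012     1
2  2013     2
3  2015     3
4  2015     3
5  2015     3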


How to get dense rank in each partition window in pandas

This is built-in with groupby:

df['dense_rank'] = (df.groupby('Dominant_Topic')['appearance']
                      .rank(method='dense', ascending=False)
                      .astype(int))

Output:

  Dominant_Topic            word  appearance  dense_rank
0        Topic 0         aaaawww          50           3
1        Topic 0            aacn         100           2
2        Topic 0           aaren          20           4
3        Topic 0    aarongoodwin         200           1
4        Topic 1  aaronjfentress          10           3
5        Topic 1     aaronrodger          20           2
6        Topic 1      aasmiitkap          30           1
7        Topic 2      aavqbketmh          10           1
8        Topic 2              ab          10           1
9        Topic 2         abandon           1           2
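
As a side note, method='dense' differs from method='min' (SQL's RANK()) in the value assigned after a tie. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b', 'b'],
                   'val': [10, 10, 5, 7, 3]})

g = df.groupby('grp')['val']
df['dense'] = g.rank(method='dense', ascending=False).astype(int)  # no gap after ties
df['min'] = g.rank(method='min', ascending=False).astype(int)      # gap after ties

print(df)

  grp  val  dense  min
0   a   10      1    1
1   a   10      1    1
2   a    5      2    3
3   b    7      1    1
4   b    3      2    2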

SQL to pandas: DENSE_RANK() OVER (PARTITION BY )

There is groupby.rank:

esg_group_dist['rank'] = (esg_group_dist.groupby(['company', 'topic'])['distance']
                          .rank(method='dense', ascending=False))

However, looking at your entire query, it looks like you're trying to extract the rows where distance is maximum within each group. You can do so faster with:

(esg_group_dist[['company', 'topic', 'probability', 'distance', 'sentence']]
 .sort_values('distance')                             # sort ascending by distance
 .drop_duplicates(['company', 'topic'], keep='last')  # keep the max-distance row per group
 .query('topic == "green energy"')                    # filter topic
)

Note: to find the minimum rows instead, remove ascending=False from the rank call and keep='last' from drop_duplicates (the default keep='first' then retains the minimum). There is also the groupby().idxmin()/idxmax() option.
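
A sketch of that idxmax route, reusing the assumed esg_group_dist columns from above:

# one row label per (company, topic) group: the row where distance is largest
idx = esg_group_dist.groupby(['company', 'topic'])['distance'].idxmax()
result = esg_group_dist.loc[idx].query('topic == "green energy"')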

Pandas - dense rank but keep current group numbers

Replace the 0 values with missing values, use GroupBy.transform with 'first' to fill in each name's existing id, and then replace the remaining missing values via Series.rank with method='dense', adding the maximal existing id and converting to integers:

import numpy as np

df = df.replace({'id': {0: np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')

s = df.loc[df['id'].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print(df)
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
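
For reference, a guessed input that reproduces this output (id 0 marks names without an assigned id; the exact values are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Andrew', 'Andrew', 'James', 'James',
                            'Mary', 'Andrew', 'Michael'],
                   'id': [3, 0, 1, 1, 0, 3, 2]})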

How to get top 5 dense ranked rows in Pandas?

You can create an array that has the 3 smallest values of your Score column, and then use isin to filter your dataframe:

some_vals = pd.Series(df['Score'].unique()).nsmallest(3).values
df[df['Score'].isin(some_vals)]

    Name  Score
0   John      1
1   Mark      2
2  Perry      2
3   Dion      3

This way you ensure that every Name whose Score equals any of the 3 smallest values is returned.
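
An equivalent route through rank itself, staying closer to the dense-rank wording of the question (assuming, as above, that "top" means the smallest scores):

df[df['Score'].rank(method='dense') <= 3]

For the top 5 of the question title, change the threshold to 5.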

pyspark - Dense-Rank ties method first

The distributed nature of Spark prevents implicitly relying on the order of appearance. If your input dataset contains a column like line_number or row_number, then the behaviour of rank(method='first') can be achieved.

Working Example

The following example uses the dataframe from the pandas rank documentation, with a Line_Number field included to provide explicit ordering.

The dataframe is repartitioned to simulate random ordering after reading data.

import pyspark.sql.functions as F
from pyspark.sql import Window

data = [
    {"Line_Number": 1, "Animal": "cat", "Number_legs": 4},
    {"Line_Number": 2, "Animal": "penguin", "Number_legs": 2},
    {"Line_Number": 3, "Animal": "dog", "Number_legs": 4},
    {"Line_Number": 4, "Animal": "spider", "Number_legs": 8},
    {"Line_Number": 5, "Animal": "snake", "Number_legs": None},
]

df = spark.createDataFrame(data).repartition(8)

window_spec = Window.orderBy(F.col("Number_legs").asc_nulls_last(), F.col("Line_Number"))

df.withColumn("rank", F.when(F.col("Number_legs").isNull(), F.lit(None)).otherwise(F.row_number().over(window_spec))).show()

Output

+-------+-----------+-----------+----+
| Animal|Line_Number|Number_legs|rank|
+-------+-----------+-----------+----+
|penguin|          2|          2|   1|
|    cat|          1|          4|   2|
|    dog|          3|          4|   3|
| spider|          4|          8|   4|
|  snake|          5|       null|null|
+-------+-----------+-----------+----+
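
For completeness, Spark does ship a built-in F.dense_rank() window function; the catch is that it gives tied rows the same rank instead of breaking ties by order of appearance, which is exactly why the answer above falls back on Line_Number with row_number. A sketch over the same dataframe:

# cat and dog (both 4 legs) share rank 2 here instead of getting 2 and 3;
# note dense_rank also ranks the null row, unlike the pandas default na_option
w = Window.orderBy(F.col("Number_legs").asc_nulls_last())
df.withColumn("dense_rank", F.dense_rank().over(w)).show()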

Python Equivalent to SQL Rank

If I understand correctly, use GroupBy.rank:

df['date'] = pd.to_datetime(df['date'])
df['rank'] = df.groupby('id')['date'].rank(method='dense').astype(int)
print(df)
   id       date  rank
0  12 2021-06-01     1
1  12 2021-06-15     2
2  12 2021-06-21     3
3  34 2021-06-05     1
4  87 2021-06-19     1
5  53 2021-06-05     1

If the datetimes are already sorted within each group, GroupBy.cumcount is also possible:

df = df.sort_values(['id','date'])
df['rank'] = df.groupby('id')['date'].cumcount().add(1)
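
A quick sanity check with made-up data that the two approaches agree when no id has duplicate dates:

import pandas as pd

df = pd.DataFrame({'id': [12, 12, 12, 34, 87, 53],
                   'date': ['2021-06-01', '2021-06-15', '2021-06-21',
                            '2021-06-05', '2021-06-19', '2021-06-05']})
df['date'] = pd.to_datetime(df['date'])

r1 = df.groupby('id')['date'].rank(method='dense').astype(int)
r2 = df.sort_values(['id', 'date']).groupby('id')['date'].cumcount().add(1)
# cumcount follows the sorted row order, so realign by index before comparing
assert (r1.sort_index().values == r2.sort_index().values).all()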

