Pandas Dense Rank

Pandas DENSE RANK

Use pd.Series.rank with method='dense':

df['Rank'] = df.Year.rank(method='dense').astype(int)

df
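
For a self-contained illustration, here is a minimal sketch with a made-up Year column (the sample data is an assumption, not from the original question):

import pandas as pd

df = pd.DataFrame({'Year': [2012, 2012, 2013, 2015, 2015, 2015]})

# method='dense' gives tied values the same rank and leaves no gaps afterwards
df['Rank'] = df.Year.rank(method='dense').astype(int)
print(df)

   Year  Rank
0  2012     1
1  2012     1
2  2013     2
3  2015     3
4  2015     3
5  2015     3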


How to get dense rank in each partition window in pandas

This is built-in with groupby:

df['dense_rank'] = (df.groupby('Dominant_Topic')['appearance']
                      .rank(method='dense', ascending=False)
                      .astype(int))

Output:

  Dominant_Topic            word  appearance  dense_rank
0        Topic 0         aaaawww          50           3
1        Topic 0            aacn         100           2
2        Topic 0           aaren          20           4
3        Topic 0    aarongoodwin         200           1
4        Topic 1  aaronjfentress          10           3
5        Topic 1     aaronrodger          20           2
6        Topic 1      aasmiitkap          30           1
7        Topic 2      aavqbketmh          10           1
8        Topic 2              ab          10           1
9        Topic 2         abandon           1           2
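
As a side note, method='dense' differs from method='min' (SQL's RANK()) in the value assigned after a tie. A small sketch with made-up data:

import pandas as pd

df = pd.DataFrame({'grp': ['a', 'a', 'a', 'b', 'b'],
                   'val': [10, 10, 5, 7, 3]})

g = df.groupby('grp')['val']
df['dense'] = g.rank(method='dense', ascending=False).astype(int)  # no gap after ties
df['min'] = g.rank(method='min', ascending=False).astype(int)      # gap after ties

print(df)

  grp  val  dense  min
0   a   10      1    1
1   a   10      1    1
2   a    5      2    3
3   b    7      1    1
4   b    3      2    2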

SQL to pandas: DENSE_RANK() OVER (PARTITION BY )

There is groupby.rank:

esg_group_dist['rank'] = (esg_group_dist.groupby(['company', 'topic'])['distance']
                          .rank(method='dense', ascending=False))

However, looking at your entire query, it looks like you're trying to extract the rows where distance is maximum within each group. You can do so faster with:

(esg_group_dist[['company', 'topic', 'probability', 'distance', 'sentence']]
 .sort_values('distance')                             # sort ascending by distance
 .drop_duplicates(['company', 'topic'], keep='last')  # keep the max-distance row per group
 .query('topic == "green energy"')                    # filter topic
)

Note: to find the minimum rows instead, remove ascending=False from the rank call and keep='last' from drop_duplicates (the default keep='first' then retains the minimum). There is also the groupby().idxmin()/idxmax() option.
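
A sketch of that idxmax route, reusing the assumed esg_group_dist columns from above:

# one row label per (company, topic) group: the row where distance is largest
idx = esg_group_dist.groupby(['company', 'topic'])['distance'].idxmax()
result = esg_group_dist.loc[idx].query('topic == "green energy"')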

Pandas - dense rank but keep current group numbers

Replace the 0 values with missing values, use GroupBy.transform with 'first' to fill in each name's existing id, and then replace the remaining missing values via Series.rank with method='dense', adding the maximal existing id and converting to integers:

import numpy as np

df = df.replace({'id': {0: np.nan}})
df['id'] = df.groupby('name')['id'].transform('first')

s = df.loc[df['id'].isna(), 'name'].rank(method='dense') + df['id'].max()
df['id'] = df['id'].fillna(s).astype(int)
print(df)
      name  id
0   Andrew   3
1   Andrew   3
2    James   1
3    James   1
4     Mary   4
5   Andrew   3
6  Michael   2
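
For reference, a guessed input that reproduces this output (id 0 marks names without an assigned id; the exact values are assumptions):

import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Andrew', 'Andrew', 'James', 'James',
                            'Mary', 'Andrew', 'Michael'],
                   'id': [3, 0, 1, 1, 0, 3, 2]})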

How to get top 5 dense ranked rows in Pandas?

You can create an array that has the 3 smallest values of your Score column, and then use isin to filter your dataframe:

some_vals = pd.Series(df['Score'].unique()).nsmallest(3).values
df[df['Score'].isin(some_vals)]

    Name  Score
0   John      1
1   Mark      2
2  Perry      2
3   Dion      3

This way you ensure that every Name whose Score equals any of the 3 smallest values is returned.
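
An equivalent route through rank itself, staying closer to the dense-rank wording of the question (assuming, as above, that "top" means the smallest scores):

df[df['Score'].rank(method='dense') <= 3]

For the top 5 of the question title, change the threshold to 5.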

pyspark - Dense-Rank ties method first

The distributed nature of Spark prevents implicitly relying on the order of appearance. If your input dataset contains a column like line_number or row_number, then the behaviour of rank(method='first') can be achieved.

Working Example

The following example uses the dataframe from the pandas rank documentation, with a Line_Number field included to provide explicit ordering.

The dataframe is repartitioned to simulate random ordering after reading data.

import pyspark.sql.functions as F
from pyspark.sql import Window

data = [
    {"Line_Number": 1, "Animal": "cat", "Number_legs": 4},
    {"Line_Number": 2, "Animal": "penguin", "Number_legs": 2},
    {"Line_Number": 3, "Animal": "dog", "Number_legs": 4},
    {"Line_Number": 4, "Animal": "spider", "Number_legs": 8},
    {"Line_Number": 5, "Animal": "snake", "Number_legs": None},
]

df = spark.createDataFrame(data).repartition(8)

window_spec = Window.orderBy(F.col("Number_legs").asc_nulls_last(), F.col("Line_Number"))

df.withColumn("rank", F.when(F.col("Number_legs").isNull(), F.lit(None)).otherwise(F.row_number().over(window_spec))).show()

Output

+-------+-----------+-----------+----+
| Animal|Line_Number|Number_legs|rank|
+-------+-----------+-----------+----+
|penguin|          2|          2|   1|
|    cat|          1|          4|   2|
|    dog|          3|          4|   3|
| spider|          4|          8|   4|
|  snake|          5|       null|null|
+-------+-----------+-----------+----+
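
For completeness, Spark does ship a built-in F.dense_rank() window function; the catch is that it gives tied rows the same rank instead of breaking ties by order of appearance, which is exactly why the answer above falls back on Line_Number with row_number. A sketch over the same dataframe:

# cat and dog (both 4 legs) share rank 2 here instead of getting 2 and 3;
# note dense_rank also ranks the null row, unlike the pandas default na_option
w = Window.orderBy(F.col("Number_legs").asc_nulls_last())
df.withColumn("dense_rank", F.dense_rank().over(w)).show()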

Python Equivalent to SQL Rank

If I understand correctly, use GroupBy.rank:

df['date'] = pd.to_datetime(df['date'])
df['rank'] = df.groupby('id')['date'].rank(method='dense').astype(int)
print(df)
   id       date  rank
0  12 2021-06-01     1
1  12 2021-06-15     2
2  12 2021-06-21     3
3  34 2021-06-05     1
4  87 2021-06-19     1
5  53 2021-06-05     1

If the datetimes are already sorted within each group, GroupBy.cumcount is also possible:

df = df.sort_values(['id','date'])
df['rank'] = df.groupby('id')['date'].cumcount().add(1)
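
A quick sanity check with made-up data that the two approaches agree when no id has duplicate dates:

import pandas as pd

df = pd.DataFrame({'id': [12, 12, 12, 34, 87, 53],
                   'date': ['2021-06-01', '2021-06-15', '2021-06-21',
                            '2021-06-05', '2021-06-19', '2021-06-05']})
df['date'] = pd.to_datetime(df['date'])

r1 = df.groupby('id')['date'].rank(method='dense').astype(int)
r2 = df.sort_values(['id', 'date']).groupby('id')['date'].cumcount().add(1)
# cumcount follows the sorted row order, so realign by index before comparing
assert (r1.sort_index().values == r2.sort_index().values).all()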

