Naturally Sorting Pandas Dataframe

Naturally sorting Pandas DataFrame

If you want to sort the df, just sort the index or the data and assign directly to the index of the df rather than trying to pass the df as an arg as that yields an empty list:

In [7]:

df.index = natsorted(a)
df.index
Out[7]:
Index(['0hr', '48hr', '72hr', '96hr', '128hr'], dtype='object')

Note that df.index = natsorted(df.index) also works

if you pass the df as an arg it yields an empty list, in this case because the df is empty (has no columns), otherwise it will return the columns sorted which is not what you want:

In [10]:

natsorted(df)
Out[10]:
[]

EDIT

If you want to sort the index so that the data is reordered along with the index then use reindex:

In [13]:

df=pd.DataFrame(index=a, data=np.arange(5))
df
Out[13]:
0
0hr 0
128hr 1
72hr 2
48hr 3
96hr 4
In [14]:

df = df*2
df
Out[14]:
0
0hr 0
128hr 2
72hr 4
48hr 6
96hr 8
In [15]:

df.reindex(index=natsorted(df.index))
Out[15]:
0
0hr 0
48hr 6
72hr 4
96hr 8
128hr 2

Note that you have to assign the result of reindex to either a new df or to itself, it does not accept the inplace param.

Natural sort a data frame column in pandas

You can convert values to ordered categorical with sorted catgories by natsorted and then use sort_values:

import natsort as ns

df['a'] = pd.Categorical(df['a'], ordered=True, categories= ns.natsorted(df['a'].unique()))
df = df.sort_values('a')
print (df)
a b
5 a1 b1
2 a1 b11
4 a3 b4
3 a10 b22
6 a11 b12
1 a20 b2
0 a22 b5

df['b'] = pd.Categorical(df['b'], ordered=True, categories= ns.natsorted(df['b'].unique()))

df = df.sort_values('b')
print (df)
a b
5 a1 b1
1 a20 b2
4 a3 b4
0 a22 b5
2 a1 b11
6 a11 b12
3 a10 b22

Natural sorting in pandas

Use str.extract, sort_values, then use the index to reindex df.

idx = (df.assign(ID2=df.ID.str.extract(r'(\d+)$').astype(int))
.sort_values(['ID2', 'Time'])
.index)

df.iloc[idx]

ID Time oneMissing singleValue empty oneEmpty
0 CS1-1 1 10000.0 NaN None 0.0
2 CS1-1 2 30000.0 NaN None 0.0
3 CS1-2 1 10000.0 NaN None NaN
1 CS1-2 2 20000.0 0.0 None 0.0
5 CS1-2 3 30000.0 NaN None NaN
4 CS1-11 1 NaN 0.0 None NaN

This is under the assumption that your ID column follows the pattern "XXX-NUMBER".


A fool-proof solution will involve the use of the natsort module, which excels at fast natural sorting. With a little elbow-grease, we can argsort your data.

from natsort import natsorted
idx, *_ = zip(*natsorted(
zip(df.index, df.ID, df.Time), key=lambda x: (x[1], x[2])))

df.iloc[list(idx)]

ID Time oneMissing singleValue empty oneEmpty
0 CS1-1 1 10000.0 NaN None 0.0
2 CS1-1 2 30000.0 NaN None 0.0
3 CS1-2 1 10000.0 NaN None NaN
1 CS1-2 2 20000.0 0.0 None 0.0
5 CS1-2 3 30000.0 NaN None NaN
4 CS1-11 1 NaN 0.0 None NaN

Use PyPi to install: pip install natsort.

Naturally sort pandas DataFrame by index

To handle the additional requirement that all "S" come before "N", use the natsort module, and pass a key=... argument to natsorted:

natsorted(df.index, lambda x: (x[:-1], x[-1] == 'N'))
# ['1S', '1N', '2S', '2N', '15N', '16S', '16N']
df.loc[natsorted(df.index, lambda x: (x[:-1], x[-1] == 'N'))]
material
1S 716.2
1N 724.5
2S 723.5
2N 721.5
15N 649.7
16S 703.2
16N 711.7

natsort is more generalised for natural sorting problems, and is a handy tool for times like this. Installation can be done using PyPi.


If the ordering of "S" and "N" is not important, consider argsort for performance:

df.iloc[df.index.str[:-1].astype(int).argsort()]

Or,

df.iloc[np.argsort([int(x[:-1]) for x in df.index])]

     material
1S 716.2
1N 724.5
2S 723.5
2N 721.5
15N 649.7
16S 703.2
16N 711.7

Sort pandas dataframe on index that is string+digits

Onne idea is convert index to Series with Series.str.split to DataFrame, convert second column to integers and sorting both columns and this index is used for change order in original df.index by DataFrame.reindex:

df1 = df.index.to_series().str.split('_',expand=True)
df1[1] = df1[1].astype(int)
df1 = df1.sort_values([0, 1], ascending=[True, False])
print (df1)
0 1
index
A_100 A 100
A_60 A 60
A_30 A 30
B_100 B 100
B_60 B 60
B_30 B 30

df = df.reindex(df1.index)
print (df)
vals
index
A_100 0
A_60 12
A_30 13
B_100 12
B_60 6
B_30 6

Sort a pandas dataframe by 2 columns (one with integers, one with alphanumerics) with priority for integer column

You can do it this way:

  1. Split the second column with alphanumeric strings into 2 columns: one column Letter to hold the first letter and another column Number to hold a number of one or two digits.
  2. Convert Number column from string to integer.
  3. Then, sort these 2 new columns together with the first column of integers

Let's illustrate the process with an example below:

Assume we have the dataframe df as follows:

print(df)

Col1 Col2
0 2 B12
1 11 C2
2 2 A1
3 11 B2
4 2 B1
5 11 C12
6 2 A12
7 11 C1
8 2 A2

Step 1 & 2: Split Col2 into 2 columns Letter & Number + Convert Number column from string to integer:

df['Letter'] = df['Col2'].str[0]               # take 1st char
df['Number'] = df['Col2'].str[1:].astype(int) # take 2nd char onwards and convert to integer

Result:

print(df)

Col1 Col2 Letter Number
0 2 B12 B 12
1 11 C2 C 2
2 2 A1 A 1
3 11 B2 B 2
4 2 B1 B 1
5 11 C12 C 12
6 2 A12 A 12
7 11 C1 C 1
8 2 A2 A 2

Step 3: Sort Col1, Letter and Number with priority: Col1 ---> Number ---> Letter:

df = df.sort_values(by=['Col1', 'Number', 'Letter'])

Result:

print(df)

Col1 Col2 Letter Number
2 2 A1 A 1
4 2 B1 B 1
8 2 A2 A 2
6 2 A12 A 12
0 2 B12 B 12
7 11 C1 C 1
3 11 B2 B 2
1 11 C2 C 2
5 11 C12 C 12

After sorting, you can remove the Letter and Number columns, as follows:

df = df.drop(['Letter', 'Number'], axis=1)

If you want to do all in one step, you can also chain the instructions, as follows:

df = (df.assign(Letter=df['Col2'].str[0], 
Number=df['Col2'].str[1:].astype(int))
.sort_values(by=['Col1', 'Number', 'Letter'])
.drop(['Letter', 'Number'], axis=1)
)

Result:

print(df)

Col1 Col2
2 2 A1
4 2 B1
8 2 A2
6 2 A12
0 2 B12
7 11 C1
3 11 B2
1 11 C2
5 11 C12

How can I sort strings in pandas data file according to the value in specific column?

For last pandas versions is possible use parameter key with split values by _ and converting values to integers:

df_sorted = df.sort_values('Source', key=lambda x: x.str.split('_').str[1].astype(int)) 

Or is possible get positions of sorted values by Series.argsort and pass to DataFrame.iloc:

df_sorted = df.iloc[df['Source'].str.split('_').str[1].astype(int).argsort()]
print (df_sorted)
Source LogP MolWt HBA HBD
0 cne_1 1.1732 263.405 3 1
1 cne_10 2.6639 197.237 2 0
2 cne_100 -0.2886 170.193 4 2
1031 cne_995 3.0179 347.219 4 2
1032 cne_996 4.8419 407.495 6 2
1033 cne_997 3.3560 354.524 3 1
1034 cne_998 7.5465 635.316 4 2
1035 cne_999 3.3514 389.556 4 1
3 cne_1000 1.9644 304.709 5 1
4 cne_1001 1.4986 162.144 3 1

Sort a Pandas Dataframe by Multiple Columns Using Key Argument

You can split column one into its constituent parts, add them as columns to the dataframe and then sort on them with column two. Finally, remove the temporary columns.

>>> (df.assign(lhs=df['one'].str[0], rhs=df['one'].str[1:].astype(int))
.sort_values(['two', 'rhs', 'lhs'])
.drop(columns=['lhs', 'rhs']))
one two
5 A1 1
1 B1 1
3 A2 1
2 A1 2
4 B1 2
0 A2 2


Related Topics



Leave a reply



Submit