How to Get Maximum Length of Each Column in the Data Frame Using Pandas Python

One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions.

You can use pd.DataFrame.select_dtypes to select object columns.

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

measurer = np.vectorize(len)

Max length for all columns

res1 = measurer(df.values.astype(str)).max(axis=0)

array([4, 5, 3])

Max length for object columns

res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)

array([4, 5])

Or if you need output as a dictionary:

res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5, 'C': 3}

df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))

{'A': 4, 'B': 5}
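For reference, the pieces above combine into one self-contained script (same sample frame as at the top):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
                   'B': ['a', 'abcde', 'abc'],
                   'C': [1, 2.5, 1.5]})

# Vectorized len over the string-cast cell values
measurer = np.vectorize(len)

# Max length per column, as a dict keyed by column name
res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))

# Same, restricted to object (string) columns
df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))
```

Note that column C is float64, so its values stringify as '1.0', '2.5', '1.5' (length 3).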

How to get the max length of each column in a dataframe

You could apply str.len per column and get the max:

df.astype(str).apply(lambda s: s.str.len()).max()

output:

a    3
b    4
c    4
dtype: int64

As dictionary:

d = df.astype(str).apply(lambda s: s.str.len()).max().to_dict()

output: {'a': 3, 'b': 4, 'c': 4}
Or using a dictionary comprehension (note the astype(str), which is needed because the .str accessor fails on non-string columns):

d = {k: df[k].astype(str).str.len().max() for k in df}
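A self-contained sketch of the apply-based approach, using a hypothetical frame with lowercase column names:

```python
import pandas as pd

df = pd.DataFrame({'a': ['x', 'abc', 'yz'],
                   'b': ['abcd', 'a', 'ab'],
                   'c': [1, 22, 333]})

# Cast everything to string, measure each cell, take the column-wise max
lengths = df.astype(str).apply(lambda s: s.str.len()).max()
d = lengths.to_dict()
```

The astype(str) step is what makes the numeric column c measurable at all.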

Find length of longest string in Pandas dataframe column

DSM's suggestion seems to be about the best you're going to get without doing some manual micro-optimization:

%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop

%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop

%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop

Note that explicitly using the str.len() method doesn't seem to be much of an improvement. If you're not familiar with IPython, which is where that very convenient %timeit syntax comes from, I'd definitely suggest giving it a shot for quick testing of things like this.


pandas - groupby a column and get the max length of another string column with nulls

The first idea is to use a lambda function with Series.str.len and max:

df = (df.groupby('source')['text_column']
        .agg(lambda x: x.str.len().max())
        .reset_index(name='something'))
print(df)

  source  something
0      a        9.0
1      b       14.0
2      c        9.0

Or you can first use Series.str.len and then aggregate max:

df = (df['text_column'].str.len()
        .groupby(df['source'])
        .max()
        .reset_index(name='something'))
print(df)
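A minimal runnable sketch of this variant, with a hypothetical two-group frame containing a null (note the float result, since str.len leaves NaN in place):

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'a', 'b'],
                   'text_column': ['abcdefghi', None, 'qwerty']})

# str.len() maps the null to NaN, so the per-group max comes back as float
out = (df['text_column'].str.len()
         .groupby(df['source'])
         .max()
         .reset_index(name='something'))
```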

Also, if you need integers, first use DataFrame.dropna:

df = (df.dropna(subset=['text_column'])
        .assign(text_column=lambda x: x['text_column'].str.len())
        .groupby('source', as_index=False)['text_column']
        .max())
print(df)

  source  text_column
0      a            9
1      b           14
2      c            9
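A runnable sketch of the dropna-first variant, on a hypothetical frame with a null; dropping nulls before measuring keeps the lengths as integers:

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'a', 'b'],
                   'text_column': ['abcdefghi', None, 'qwerty']})

# Drop nulls first, then replace each string with its length and take the max
out = (df.dropna(subset=['text_column'])
         .assign(text_column=lambda x: x['text_column'].str.len())
         .groupby('source', as_index=False)['text_column']
         .max())
```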

EDIT: for the first and second top values, use DataFrame.sort_values with GroupBy.head:

df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .sort_values(['source', 'something'], ascending=[True, False])
         .groupby('source', as_index=False)
         .head(2))
print(df1)

  source     text_column  something
0      a       abcdefghi          9
1      a           abcde          5
7      b  qazxswedcdcvfr         14
2      b       qwertyiop          9
3      c       plmnkoijb          9
5      c           abcde          5
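A runnable sketch of the head(2) approach, on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'a', 'a', 'b', 'b'],
                   'text_column': ['abcdefghi', 'abcde', 'ab',
                                   'qwertyiop', None]})

# Sort each group longest-first, then keep the two longest rows per source
df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .sort_values(['source', 'something'], ascending=[True, False])
         .groupby('source', as_index=False)
         .head(2))
```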

Alternative solution with SeriesGroupBy.nlargest, which is noticeably slower:

df1 = (df.dropna(subset=['text_column'])
         .assign(something=lambda x: x['text_column'].str.len())
         .groupby('source')['something']
         .nlargest(2)
         .reset_index(level=1, drop=True)
         .reset_index())
print(df1)

  source  something
0      a          9
1      a          5
2      b         14
3      b          9
4      c          9
5      c          5

Finally, to get the top two lengths as new columns top1 and top2 (note the keyword arguments to pivot, which are required in pandas 2.0+):

df = (df.dropna(subset=['text_column'])
        .assign(something=lambda x: x['text_column'].str.len()))

df = df.sort_values(['source', 'something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)

df = (df[df['g'].le(2)]
        .pivot(index='source', columns='g', values='something')
        .add_prefix('top')
        .rename_axis(index=None, columns=None))
print(df)

   top1  top2
a     9     5
b    14     9
c     9     5
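A runnable sketch of the pivot-based variant, on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'source': ['a', 'a', 'b', 'b', 'b'],
                   'text_column': ['abcdefghi', 'abcde',
                                   'qazxswedcdcvfr', 'qwertyiop', 'abc']})

# Measure lengths and sort each group from longest to shortest
df = df.assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source', 'something'], ascending=[True, False])

# Rank within each group (1 = longest), keep the top two, pivot to columns
df['g'] = df.groupby('source').cumcount().add(1)
out = (df[df['g'].le(2)]
         .pivot(index='source', columns='g', values='something')
         .add_prefix('top')
         .rename_axis(index=None, columns=None))
```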

Find length of the longest column in pandas

You can use count followed by max. According to the pandas documentation for count:

Count non-NA cells for each column or row.

print(dataFrame.count().max())
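Note that this measures how many non-NA values each column has, not string lengths. A minimal sketch on a hypothetical frame with one missing cell:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan],
                   'b': [1, 2, 3]})

# count() tallies non-NA cells per column; max() picks the fullest column
longest = df.count().max()
```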

Create a column which is the string with the max length within each row - Pandas Dataframe

Option 1:
You can simply replace NaNs with an empty string '' and find a max (note that max on strings compares lexicographically, so this picks the alphabetically last value, which is not necessarily the longest; Option 2 compares by length):

df = df.fillna('')
df['Name'] = df.max(axis=1)

Option 2:
Use apply and explicitly skip NaNs while finding a maximum:

df['Name'] = df.apply(lambda x: max([l for l in x.values if not pd.isnull(l)], key=len), axis=1)
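A runnable sketch of Option 2 on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': ['abc', np.nan],
                   'y': ['ab', 'defgh']})

# Per row, drop nulls and keep the longest remaining string
df['Name'] = df.apply(
    lambda x: max([l for l in x.values if not pd.isnull(l)], key=len),
    axis=1)
```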

How to determine the length of lists in a pandas dataframe column

You can use the str accessor for some list operations as well. In this example,

df['CreationDate'].str.len()

returns the length of each list. See the docs for str.len.

df['Length'] = df['CreationDate'].str.len()
df
Out:
                                                    CreationDate  Length
2013-12-22 15:25:02                  [ubuntu, mac-osx, syslinux]       3
2009-12-14 14:29:32  [ubuntu, mod-rewrite, laconica, apache-2.2]       4
2013-12-22 15:42:00               [ubuntu, nat, squid, mikrotik]       4
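A minimal sketch with hypothetical list data:

```python
import pandas as pd

ser = pd.Series([['ubuntu', 'mac-osx', 'syslinux'],
                 ['ubuntu', 'nat', 'squid', 'mikrotik']])

# str.len works element-wise on list values too, returning each list's length
lengths = ser.str.len()
```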

For these operations, vanilla Python is generally faster. pandas handles NaNs though. Here are timings:

import random
import string

ser = pd.Series([random.sample(string.ascii_letters,
                               random.randint(1, 20))
                 for _ in range(10**6)])

%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop

%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop

%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop

%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop

