How to get maximum length of each column in the data frame using pandas python
One solution is to use numpy.vectorize. This may be more efficient than pandas-based solutions.
You can use pd.DataFrame.select_dtypes to select object columns.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': ['abc', 'de', 'abcd'],
'B': ['a', 'abcde', 'abc'],
'C': [1, 2.5, 1.5]})
measurer = np.vectorize(len)
Max length for all columns
res1 = measurer(df.values.astype(str)).max(axis=0)
array([4, 5, 3])
Max length for object columns
res2 = measurer(df.select_dtypes(include=[object]).values.astype(str)).max(axis=0)
array([4, 5])
Or if you need output as a dictionary:
res1 = dict(zip(df, measurer(df.values.astype(str)).max(axis=0)))
{'A': 4, 'B': 5, 'C': 3}
df_object = df.select_dtypes(include=[object])
res2 = dict(zip(df_object, measurer(df_object.values.astype(str)).max(axis=0)))
{'A': 4, 'B': 5}
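One thing to watch with astype(str): missing values are turned into the string 'nan', which then counts as three characters. The sketch below uses a made-up frame (df_nan and its contents are assumptions, not part of the answer above) to show the effect, along with one possible workaround: filling missing values with an empty string first.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing values, for illustration only.
df_nan = pd.DataFrame({'A': ['a', np.nan, 'b'],
                       'B': ['xy', 'z', np.nan]})

measurer = np.vectorize(len)

# NaN becomes the string 'nan', so every column reports at least 3.
naive = dict(zip(df_nan, measurer(df_nan.values.astype(str)).max(axis=0)))

# Filling with '' first makes missing cells count as length 0.
filled = dict(zip(df_nan, measurer(df_nan.fillna('').values.astype(str)).max(axis=0)))

print(naive)
print(filled)
```

Whether a missing cell should count as length 0 depends on what the result is used for, so treat the fillna step as one option rather than the rule.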
How to get the max length of each column in a dataframe
You could apply str.len per column and get the max:
df.astype(str).apply(lambda s: s.str.len()).max()
output:
a 3
b 4
c 4
dtype: int64
As dictionary:
d = df.astype(str).apply(lambda s: s.str.len()).max().to_dict()
output: {'a': 3, 'b': 4, 'c': 4}
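Or using a dictionary comprehension (this assumes every column holds strings, since the .str accessor raises an AttributeError on numeric columns):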
d = {k: df[k].str.len().max() for k in df}
Find length of longest string in Pandas dataframe column
DSM's suggestion seems to be about the best you're going to get without doing some manual microoptimization:
%timeit -n 100 df.col1.str.len().max()
100 loops, best of 3: 11.7 ms per loop
%timeit -n 100 df.col1.map(lambda x: len(x)).max()
100 loops, best of 3: 16.4 ms per loop
%timeit -n 100 df.col1.map(len).max()
100 loops, best of 3: 10.1 ms per loop
Note that explicitly using the str.len() method doesn't seem to be much of an improvement. If you're not familiar with IPython, which is where that very convenient %timeit syntax comes from, I'd definitely suggest giving it a shot for quick testing of things like this.
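If IPython isn't available, roughly the same comparison can be run with the standard-library timeit module. This is only a sketch: the test frame below (100,000 short strings in a column named col1) is an assumption, and the numbers will differ from the ones quoted above depending on machine and pandas version.

```python
import timeit

import pandas as pd

# Hypothetical test data: strings of length 0..49 in a column called col1.
df = pd.DataFrame({'col1': ['x' * (i % 50) for i in range(100_000)]})

# The three variants timed above, wrapped as zero-argument callables.
candidates = {
    'str.len': lambda: df.col1.str.len().max(),
    'map(lambda)': lambda: df.col1.map(lambda x: len(x)).max(),
    'map(len)': lambda: df.col1.map(len).max(),
}

# Sanity check: all variants should agree on the answer.
results = {name: fn() for name, fn in candidates.items()}

for name, fn in candidates.items():
    seconds = timeit.timeit(fn, number=10) / 10
    print(f'{name:12s} {seconds * 1e3:.2f} ms per loop')
```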
pandas - groupby a column and get the max length of another string column with nulls
The first idea is to use a lambda function with Series.str.len and max:
df = (df.groupby('source')['text_column']
.agg(lambda x: x.str.len().max())
.reset_index(name='something'))
print (df)
source something
0 a 9.0
1 b 14.0
2 c 9.0
Or you can first use Series.str.len and then aggregate with max:
df = (df['text_column'].str.len()
.groupby(df['source'])
.max()
.reset_index(name='something'))
print (df)
If you need integers, first drop the missing values with DataFrame.dropna:
df = (df.dropna(subset=['text_column'])
.assign(text_column=lambda x: x['text_column'].str.len())
.groupby('source', as_index=False)['text_column']
.max())
print (df)
source text_column
0 a 9
1 b 14
2 c 9
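The input frame isn't shown in this answer, so here is a self-contained sketch with sample data reconstructed from the printed outputs. The two null rows and their positions are assumptions. It contrasts the float result of the lambda approach with the integer result after dropna.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the outputs shown; the NaN rows are assumed.
df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', np.nan, 'qazxswedcdcvfr'],
})

# Nulls present: str.len() yields NaN for them, so the maxima come back as floats.
as_float = df.groupby('source')['text_column'].agg(lambda x: x.str.len().max())

# Nulls dropped first: the lengths stay integers.
as_int = (df.dropna(subset=['text_column'])
            .assign(text_column=lambda x: x['text_column'].str.len())
            .groupby('source')['text_column']
            .max())

print(as_float)
print(as_int)
```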
EDIT: for the top two values per group, use DataFrame.sort_values with GroupBy.head:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.sort_values(['source','something'], ascending=[True, False])
.groupby('source', as_index=False)
.head(2))
print (df1)
source text_column something
0 a abcdefghi 9
1 a abcde 5
7 b qazxswedcdcvfr 14
2 b qwertyiop 9
3 c plmnkoijb 9
5 c abcde 5
An alternative solution uses SeriesGroupBy.nlargest, which is slower:
df1 = (df.dropna(subset=['text_column'])
.assign(something=lambda x: x['text_column'].str.len())
.groupby('source')['something']
.nlargest(2)
.reset_index(level=1, drop=True)
.reset_index())
print (df1)
source something
0 a 9
1 a 5
2 b 14
3 b 9
4 c 9
5 c 5
A last solution, which puts the top two values into new columns top1 and top2:
df=df.dropna(subset=['text_column']).assign(something=lambda x: x['text_column'].str.len())
df = df.sort_values(['source','something'], ascending=[True, False])
df['g'] = df.groupby('source').cumcount().add(1)
df = (df[df['g'].le(2)].pivot(index='source', columns='g', values='something')
.add_prefix('top')
.rename_axis(index=None, columns=None))
print (df)
top1 top2
a 9 5
b 14 9
c 9 5
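For comparison, here is a different way to reach the same top1/top2 layout without pivot: collect the two largest lengths per group as a list, then expand the lists into columns. This is an alternative sketch, not the method above. The sample data is an assumption reconstructed from the printed outputs, and it relies on every group having at least two non-null rows.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the outputs shown; the NaN rows are assumed.
df = pd.DataFrame({
    'source': ['a', 'a', 'b', 'c', 'a', 'c', 'b', 'b'],
    'text_column': ['abcdefghi', 'abcde', 'qwertyiop', 'plmnkoijb',
                    np.nan, 'abcde', np.nan, 'qazxswedcdcvfr'],
})

lengths = (df.dropna(subset=['text_column'])
             .assign(something=lambda x: x['text_column'].str.len()))

# One list per group holding its two largest lengths, largest first.
top2 = lengths.groupby('source')['something'].apply(lambda s: s.nlargest(2).tolist())

# Expand the lists into columns; assumes each list has exactly two items.
wide = pd.DataFrame(top2.tolist(), index=top2.index, columns=['top1', 'top2'])
print(wide)
```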
Find length of the longest column in pandas
You can try using count followed by max. According to the pandas documentation for count:
Count non-NA cells for each column or row.
print(dataFrame.count().max())
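A small sketch of what this measures, using a made-up frame (the column names and values are assumptions): count gives the number of non-NA cells per column, so the "longest" column is the one with the fewest missing values, and idxmax gives its name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a different number of missing values per column.
dataFrame = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'y': [1.0, np.nan, np.nan, 4.0],
    'z': ['a', None, 'c', 'd'],
})

per_column = dataFrame.count()       # non-NA cells in each column
longest = per_column.max()           # size of the fullest column
longest_name = per_column.idxmax()   # label of that column

print(per_column)
print(longest, longest_name)
```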
Create a column holding the string with the max length within each row - Pandas Dataframe
Option 1:
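You can simply replace NaNs with an empty string '' and find a max. Note that max compares strings lexicographically rather than by length, so this is only guaranteed to return the longest string when each row has at most one non-empty value: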
df = df.fillna('')
df['Name'] = df.max(axis=1)
Option 2:
Use apply and explicitly skip NaNs while finding a maximum:
df['Name'] = df.apply(lambda x: max([l for l in x.values if not pd.isnull(l)], key=len), axis=1)
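To make the difference between a lexicographic max and a length-based max concrete, here is a sketch on a made-up frame (the column names and values are assumptions). It uses Python's built-in max per row in both cases and adds default='' so that an all-NaN row would not raise.

```python
import numpy as np
import pandas as pd

# Hypothetical data: row 0 has two candidates, rows 1 and 2 have one each.
df = pd.DataFrame({
    'first': ['Zoe', np.nan, 'Christopher'],
    'second': ['Alexander', 'Bo', np.nan],
})

# Lexicographic max after filling NaNs: 'Zoe' sorts after 'Alexander'.
lexicographic = df.fillna('').apply(lambda row: max(row), axis=1)

# Length-based max, skipping NaNs: picks 'Alexander' for row 0.
longest = df.apply(lambda row: max(row.dropna(), key=len, default=''), axis=1)

print(lexicographic.tolist())
print(longest.tolist())
```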
How to determine the length of lists in a pandas dataframe column
You can use the str accessor for some list operations as well. In this example, df['CreationDate'].str.len() returns the length of each list. See the docs for str.len.
df['Length'] = df['CreationDate'].str.len()
df
Out:
CreationDate Length
2013-12-22 15:25:02 [ubuntu, mac-osx, syslinux] 3
2009-12-14 14:29:32 [ubuntu, mod-rewrite, laconica, apache-2.2] 4
2013-12-22 15:42:00 [ubuntu, nat, squid, mikrotik] 4
For these operations, vanilla Python is generally faster. pandas handles NaNs though. Here are timings:
import random
import string
import pandas as pd
ser = pd.Series([random.sample(string.ascii_letters,
random.randint(1, 20)) for _ in range(10**6)])
%timeit ser.apply(lambda x: len(x))
1 loop, best of 3: 425 ms per loop
%timeit ser.str.len()
1 loop, best of 3: 248 ms per loop
%timeit [len(x) for x in ser]
10 loops, best of 3: 84 ms per loop
%timeit pd.Series([len(x) for x in ser], index=ser.index)
1 loop, best of 3: 236 ms per loop
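The NaN handling mentioned above is the main trade-off. Here is a minimal sketch on a made-up Series (the values are assumptions): str.len passes missing entries through as NaN, while a plain list comprehension needs an explicit guard, because len() on a float NaN raises a TypeError.

```python
import numpy as np
import pandas as pd

# Hypothetical Series of lists with one missing entry.
ser = pd.Series([['ubuntu', 'mac-osx', 'syslinux'],
                 np.nan,
                 ['ubuntu', 'nat', 'squid', 'mikrotik']])

# The accessor returns NaN for the missing entry (so the result is float).
via_accessor = ser.str.len()

# The plain-Python version has to skip non-list entries itself.
via_python = [len(x) if isinstance(x, list) else np.nan for x in ser]

print(via_accessor.tolist())
print(via_python)
```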