How to Count the Total Number of Words in a Pandas Dataframe Cell and Add Those to a New Column

How do I count the total number of words in a Pandas dataframe cell and add those to a new column?

Let's say you have a dataframe df that you've generated using

df = pandas.read_csv('dataset.csv')

You would then add a new column with the word count by doing the following:

df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))

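Keep in mind that split(' ') splits on every single space, so consecutive spaces produce empty strings that get counted as words; calling split() with no argument splits on any run of whitespace instead. You may also want to lowercase the text and remove punctuation and numbers before counting: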

df = df.apply(lambda x: x.astype(str).str.lower())
df = df.replace(r'\d+', '', regex=True)
df = df.replace(r'[^\w\s\+]', '', regex=True)
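The cleanup above rewrites every column of the dataframe. Here is a minimal sketch that cleans and counts a single text column instead, so numeric columns stay untouched. The column name `text` and the sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["Hello, World! 123", "It's 2 late...", "one  two"]})

cleaned = (
    df["text"]
    .astype(str)
    .str.lower()
    .str.replace(r"\d+", "", regex=True)      # drop digits
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
)

# split() with no argument splits on any run of whitespace, so the
# double spaces left behind by the cleanup do not create empty "words".
df["word_count"] = cleaned.str.split().str.len()
print(df)
```

Whether digits and punctuation should be dropped before counting depends on what you consider a word, so treat the two replace steps as optional.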

Count number of words per row

str.split + str.len

str.len works nicely for any non-numeric column.

df['totalwords'] = df['col'].str.split().str.len()
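One thing to watch for with this approach is missing values: the .str methods pass them through as NaN, so the result is no longer an integer column. A small sketch, assuming you want missing cells to count as zero words (the sample data is invented):

```python
import pandas as pd

# Hypothetical data with a missing value and an empty string.
df = pd.DataFrame({"col": ["a b c", None, "single", ""]})

counts = df["col"].str.split().str.len()

# Missing cells come back as NaN; fill them with 0 and restore an int dtype.
df["totalwords"] = counts.fillna(0).astype(int)
print(df)
```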


str.count

If your words are single-space separated, you may simply count the spaces plus 1.

df['totalwords'] = df['col'].str.count(' ') + 1


List Comprehension

A plain list comprehension is often faster than the .str accessor here, provided the column contains only strings (no NaN).

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
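The three approaches above only agree when words are separated by exactly one space and there is no leading or trailing whitespace. A quick sketch on made-up data shows where str.count(' ') + 1 drifts from the other two:

```python
import pandas as pd

# Hypothetical rows: clean text, a double space, and padded text.
df = pd.DataFrame({"col": ["one two three", "double  space", " padded "]})

via_split = df["col"].str.split().str.len()
via_count = df["col"].str.count(" ") + 1
via_listcomp = pd.Series([len(x.split()) for x in df["col"].tolist()])

print(pd.DataFrame({
    "split": via_split,
    "count": via_count,
    "listcomp": via_listcomp,
}))
```

If your data may contain irregular spacing, the split-based versions are the safer choice.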

How to find the maximum number of words in a pandas dataframe column of strings?

You can use the .str accessor to count the words, and .idxmax to get the index of the row with the most:

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())

Prints:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object
Max number of words =  4
Index =  0
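Note that idxmax returns only the first row with the maximum, and in this sample rows 0 and 1 both have four words. A small sketch of how you might fetch the winning text, or all tied rows, using the same data (the variable names are my own):

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

word_counts = test["Response"].str.split().str.len()

# idxmax gives the label of the FIRST row that reaches the maximum.
longest_idx = word_counts.idxmax()
longest_row = test.loc[longest_idx, "Response"]

# A boolean mask keeps every row tied for the maximum.
tied = test.loc[word_counts == word_counts.max(), "Response"]

print(longest_idx, repr(longest_row))
print(tied)
```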

Pandas sum of all word counts in column

You could use the vectorized string operations:

In [7]: df["a"].str.split().str.len().sum()
Out[7]: 6

which comes from

In [8]: df["a"].str.split()
Out[8]:
0          [some, words]
1    [lots, more, words]
2                   [hi]
Name: a, dtype: object

In [9]: df["a"].str.split().str.len()
Out[9]:
0    2
1    3
2    1
Name: a, dtype: int64

In [10]: df["a"].str.split().str.len().sum()
Out[10]: 6
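For reference, here is a self-contained version of the session above, with the dataframe rebuilt from the output shown. The second variant, which counts runs of non-whitespace with a regex instead of building lists, is my own addition and not part of the original answer.

```python
import pandas as pd

# Data reconstructed from the split() output shown above.
df = pd.DataFrame({"a": ["some words", "lots more words", "hi"]})

# Split each cell into words, count per row, then add the rows up.
total_from_lists = df["a"].str.split().str.len().sum()

# Alternative: count runs of non-whitespace characters per row.
total_from_regex = df["a"].str.count(r"\S+").sum()

print(total_from_lists, total_from_regex)
```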

Return the list of each word in a pandas cell and the total count of that word in the entire column

Here is one way that gives the result you want, although it avoids sklearn entirely:

import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, used for the column-wide totals
    total_words = " ".join(datr).split(' ')
    # per row
    for i in range(len(datr)):
        # first get the words in this row
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # then pair each word with its total count in the column
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df,'column1')
df
             column1                                    column2
0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1        fruit sucks                   [(fruit, 3), (sucks, 1)]
2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6       that is true            [(that, 1), (is, 2), (true, 1)]
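The function above calls list.count once per word, which rescans the whole word list every time. As a possible alternative, here is a sketch that tallies the column once with collections.Counter and then looks each word up. It uses the same sample data; the helper name `tokenize` is my own.

```python
import re
from collections import Counter

import pandas as pd

df = pd.DataFrame({"column1": [
    "apple is a fruit", "fruit sucks", "apple tasty fruit",
    "fruits what else", "yup apple map", "fire in the hole", "that is true",
]})

def tokenize(text):
    # Hypothetical helper: words are runs of letters, digits, or underscores.
    return re.findall(r"\w+", text)

# One pass over the column to count every word.
totals = Counter(word for text in df["column1"] for word in tokenize(text))

# Pair each word in a row with its column-wide total.
df["column2"] = pd.Series(
    [[(word, totals[word]) for word in tokenize(text)] for text in df["column1"]],
    index=df.index,
)
print(df)
```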

Count occurrences of each of certain words in pandas dataframe

Update: The original answer (kept further down) counts the rows that contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
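Because str.count takes a regular expression, a pattern like "he" also matches inside longer words such as "hello" or "the". If you want whole words only, one option is to wrap the alternatives in word boundaries. A sketch on invented data, where the `targets` list is an assumption:

```python
import re

import pandas as pd

df = pd.DataFrame({"words": ["he said hello", "the world", "he he"]})

targets = ["he", "world"]  # hypothetical list of words to look for

# \b anchors each alternative at word boundaries; re.escape guards
# against target words that contain regex metacharacters.
pattern = r"\b(?:" + "|".join(re.escape(w) for w in targets) + r")\b"

substring_hits = df["words"].str.count("he|world")
whole_word_hits = df["words"].str.count(pattern)

print(pd.DataFrame({"substring": substring_hits, "whole_word": whole_word_hits}))
```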

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match (not the total number of occurrences), you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
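The case and na parameters from the docstring matter once the column has mixed case or missing values. A short sketch on made-up data, assuming missing cells should count as "no match":

```python
import pandas as pd

# Hypothetical data with mixed case and a missing value.
df = pd.DataFrame({"words": ["Hello", "world", None, "HELLO there"]})

# case=False ignores case; na=False turns missing cells into False
# so the result is a clean boolean mask.
mask = df["words"].str.contains("hello", case=False, na=False)

rows_matching = int(mask.sum())
print(mask)
print(rows_matching)
```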

Counting the Frequency of words in a pandas data frame

IIUC, use value_counts()

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64

Or,

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N words, for example N = 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
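As another variation, not from the original answer, Series.explode (available since pandas 0.25) turns each list of words into its own rows before counting. A self-contained sketch using the dataframe shown above:

```python
import pandas as pd

df = pd.DataFrame({
    "URN": [104472, 104873, 109986, 114058, 113438],
    "Firm_Name": [
        "R.X. Yah & Co",
        "Big Building Society",
        "St James's Society",
        "The Kensington Society Ltd",
        "MMV Oil Associates Ltd",
    ],
})

# Split into lists, give every word its own row, then count.
freq = df["Firm_Name"].str.split().explode().value_counts()

# head(n) is a readable way to take the top n words.
top = freq.head(2)
print(top)
```

The order of words that share the same count is not something to rely on, so only the first two entries are stable for this data.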

