How to Count the Total Number of Words in a Pandas Dataframe Cell and Add Those to a New Column

How do I count the total number of words in a Pandas dataframe cell and add those to a new column?

Let's say you have a dataframe df that you've generated using

import pandas
df = pandas.read_csv('dataset.csv')

You would then add a new column with the word count by doing the following:

df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))

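Keep in mind that splitting on a single space (' ') counts an empty string for every extra, leading, or trailing space; calling split() with no argument splits on any run of whitespace instead. You may also want to remove punctuation and numbers and convert the text to lowercase before counting: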
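To make the effect of the split argument concrete, here is a minimal sketch. The sample rows are invented for illustration; only the column name columnToCount comes from the snippet above.

```python
import pandas as pd

# Hypothetical sample data; the column name mirrors the snippet above.
df = pd.DataFrame({"columnToCount": ["one two", "three  four five", " padded "]})

# split(' ') cuts at every single space, so repeated, leading, or trailing
# spaces produce empty strings that are counted as words.
df["split_on_space"] = df["columnToCount"].apply(lambda x: len(str(x).split(" ")))

# split() with no argument cuts on runs of whitespace and drops empty strings.
df["split_on_whitespace"] = df["columnToCount"].apply(lambda x: len(str(x).split()))

print(df)
```

On this sample the first column over-counts the second and third rows, while the second column gives the count most people expect.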

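# Lowercase every column (this also converts every column to strings)
df = df.apply(lambda x: x.astype(str).str.lower())
# Remove digits
df = df.replace(r'\d+', '', regex=True)
# Remove punctuation, keeping word characters, whitespace, and '+'
df = df.replace(r'[^\w\s+]', '', regex=True)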
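The cleaning steps above rewrite every column of the frame. If only one text column needs cleaning, a sketch along the following lines may be safer. The column names text and word_count and the sample rows are assumptions made for this example.

```python
import pandas as pd

# Hypothetical sample data.
df = pd.DataFrame({"text": ["Hello, World! 123", "It's 2 late..."]})

# Clean a copy of the one column, leaving the rest of the frame untouched.
cleaned = (
    df["text"]
    .astype(str)
    .str.lower()
    .str.replace(r"\d+", "", regex=True)      # drop digits
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
)

# Count whitespace-separated words in the cleaned text.
df["word_count"] = cleaned.str.split().str.len()

print(df)
```

Note that removing a standalone number also removes it from the word count, which may or may not be what you want.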

Count number of words per row

`str.split` + `str.len`

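str.split() with no arguments splits on any run of whitespace, and str.len() then counts the items in each resulting list. This works nicely for any string (non-numeric) column.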

df['totalwords'] = df['col'].str.split().str.len()
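One detail worth checking is how missing values behave. The sketch below uses made-up rows, and treating a missing cell as zero words is an assumption that may not suit every dataset.

```python
import pandas as pd

# Hypothetical sample data with an empty string and a missing value.
df = pd.DataFrame({"col": ["a b c", "", None, "  two   words "]})

# Missing values stay missing after split/len, so the raw result contains NaN.
counts = df["col"].str.split().str.len()

# Assumption for this example: a missing cell counts as zero words.
df["totalwords"] = counts.fillna(0).astype(int)

print(df)
```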

`str.count`

If your words are single-space separated, you may simply count the spaces plus 1.

df['totalwords'] = df['col'].str.count(' ') + 1
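The spaces-plus-one shortcut breaks down on repeated spaces and on empty strings. As a hedged alternative, str.count also accepts a regular expression, so counting runs of non-whitespace characters gives a more robust count. The sample rows and the new column names below are invented for illustration.

```python
import pandas as pd

# Hypothetical sample data: a clean row, a double space, and an empty string.
df = pd.DataFrame({"col": ["one two", "three  four", ""]})

# Shortcut from above: only correct for single-space separated, non-empty text.
df["spaces_plus_one"] = df["col"].str.count(" ") + 1

# Count runs of non-whitespace characters instead.
df["nonspace_runs"] = df["col"].str.count(r"\S+")

print(df)
```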

List Comprehension

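A plain list comprehension is often faster than the .str methods, which also loop over the values in Python. Note that it raises an AttributeError if the column contains missing values (NaN).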

df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
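If the column may contain missing values, the comprehension needs a guard. This is a small sketch with invented rows, and counting a missing cell as zero words is an assumption.

```python
import pandas as pd

# Hypothetical sample data with one missing value.
df = pd.DataFrame({"col": ["a b", None, "c d e"]})

# Only call split() on real strings; anything else counts as zero words.
df["totalwords"] = [
    len(x.split()) if isinstance(x, str) else 0
    for x in df["col"]
]

print(df)
```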

How to find the maximum number of words in a pandas dataframe column of strings?

You can use the .str accessor to get the word counts, and .idxmax to get the index of the longest entry:

import pandas as pd

something = ["Hello how are you", "I am doing great", "Lets go camping"]

test = pd.DataFrame(something)
test.columns = ["Response"]

length_of_the_messages = test["Response"].str.split("\\s+")

print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())

Prints:

0    [Hello, how, are, you]
1     [I, am, doing, great]
2       [Lets, go, camping]
Name: Response, dtype: object

Max number of words =  4
Index =  0
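In the output above, rows 0 and 1 both have four words, and idxmax reports only the first of them. The following sketch, which reuses the same sample data, shows one way to fetch the text of the longest row and to list every row that ties for the maximum. The variable names are chosen for this example.

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

# Number of whitespace-separated words per row.
word_counts = test["Response"].str.split().str.len()

# idxmax returns the label of the first maximum only.
longest = test.loc[word_counts.idxmax(), "Response"]

# A boolean mask keeps every row that reaches the maximum.
ties = test[word_counts == word_counts.max()]

print(longest)
print(ties)
```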

Pandas sum of all word counts in column

You could use the vectorized string operations:

In [7]: df["a"].str.split().str.len().sum()
Out[7]: 6

which comes from

In [8]: df["a"].str.split()
Out[8]: 
0          [some, words]
1    [lots, more, words]
2                   [hi]
Name: a, dtype: object

In [9]: df["a"].str.split().str.len()
Out[9]: 
0    2
1    3
2    1
Name: a, dtype: int64

In [10]: df["a"].str.split().str.len().sum()
Out[10]: 6
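The session above does not show how df was built. A self-contained version, with the frame reconstructed from the printed output, might look like this; the join-based alternative is an addition for comparison and assumes the column has no missing values.

```python
import pandas as pd

# Frame reconstructed from the output shown above.
df = pd.DataFrame({"a": ["some words", "lots more words", "hi"]})

# Vectorized: words per row, then the column total.
total = df["a"].str.split().str.len().sum()

# Alternative: join all rows into one string and split once.
total_joined = len(" ".join(df["a"]).split())

print(total, total_joined)
```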

Return the list of each word in a pandas cell and the total count of that word in the entire column

Here is one way to get the result you want, although it avoids sklearn entirely:

import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, used for the column-wide totals
    total_words = " ".join(datr).split(' ')
    # per row
    for i in range(len(datr)):
        # first get the words in this row
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # for each word, pair it with its count across the whole column
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list

df['column2'] = counts(df,'column1')
df
         column1                                    column2
0   apple is a fruit  [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1        fruit sucks                   [(fruit, 3), (sucks, 1)]
2  apple tasty fruit       [(apple, 3), (tasty, 1), (fruit, 3)]
3   fruits what else        [(fruits, 1), (what, 1), (else, 1)]
4      yup apple map           [(yup, 1), (apple, 3), (map, 1)]
5   fire in the hole  [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6       that is true            [(that, 1), (is, 2), (true, 1)]
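Calling list.count for every word rescans the whole column each time, which can get slow on large data. As a possible alternative, the sketch below tallies the column once with collections.Counter. The function name counts_with_counter is made up for this example, the sample frame is a three-row subset of the data above, and unlike the original it strips punctuation before computing the column-wide totals as well.

```python
import re
from collections import Counter

import pandas as pd


def counts_with_counter(data, column):
    # Tokenise every row once, replacing non-word characters with spaces.
    rows = [re.sub(r"[^\w]", " ", text).split() for text in data[column]]
    # Tally each word across the whole column in a single pass.
    totals = Counter(word for row in rows for word in row)
    # Pair each word in each row with its column-wide total.
    return [[(word, totals[word]) for word in row] for row in rows]


# Three rows taken from the example above.
df = pd.DataFrame(
    {"column1": ["apple is a fruit", "fruit sucks", "apple tasty fruit"]}
)
df["column2"] = counts_with_counter(df, "column1")

print(df)
```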

Count occurrences of each of certain words in pandas dataframe

Update: the original answer, kept below, counts the rows that contain a substring.

To count all the occurrences of a substring you can use .str.count:

In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])

In [22]: df.words.str.count("he|wo")
Out[22]:
0    1
1    1
2    2
Name: words, dtype: int64

In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
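The pattern "he|wo" matches substrings, so "hehe" counts twice. If the goal is to count whole words, and to get a separate total per word, a sketch like the following may help. The sample rows and the targets list are assumptions for illustration.

```python
import re

import pandas as pd

# Hypothetical sample data and target words.
df = pd.DataFrame({"words": ["hello he", "the world", "hehe he"]})
targets = ["he", "world"]

# \b anchors the match at word boundaries; re.escape protects any
# special characters in the target word.
totals = {
    word: int(df["words"].str.count(rf"\b{re.escape(word)}\b").sum())
    for word in targets
}

print(totals)
```

Here "he" inside "hello", "the", and "hehe" is ignored, and only the standalone occurrences are counted.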

The str.contains method accepts a regular expression:

Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.

For example:

In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])

In [12]: df
Out[12]:
   words
0  hello
1  world

In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0    True
1    True
Name: words, dtype: bool

In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0    True
1    True
Name: words, dtype: bool

To count the rows that contain a match, you can just sum this boolean Series:

In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2

In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
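To see the difference between the two approaches side by side, here is a short sketch on the three-row frame used in the update. The variable names are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame(["hello", "world", "hehe"], columns=["words"])

# str.contains gives one True/False per row, so the sum is a row count.
rows_with_match = int(df["words"].str.contains(r"he|wo").sum())

# str.count gives the number of matches per row, so the sum is a match count.
total_matches = int(df["words"].str.count(r"he|wo").sum())

print(rows_with_match, total_matches)
```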

Counting the Frequency of words in a pandas data frame

If I understand correctly, use value_counts():

In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society       3
Ltd           2
James's       1
R.X.          1
Yah           1
Associates    1
St            1
Kensington    1
MMV           1
Big           1
&             1
The           1
Co            1
Oil           1
Building      1
dtype: int64

Or,

pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()

Or,

pd.Series(' '.join(df.Firm_Name).split()).value_counts()

For the top N words, for example N = 3:

In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society    3
Ltd        2
James's    1
dtype: int64

Details

In [3380]: df
Out[3380]:
      URN                   Firm_Name
0  104472               R.X. Yah & Co
1  104873        Big Building Society
2  109986          St James's Society
3  114058  The Kensington Society Ltd
4  113438      MMV Oil Associates Ltd
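For a version that can be run as is, the frame can be rebuilt from the listing above. The sketch below uses explode instead of expand=True plus stack, which is a swapped-in technique rather than the approach shown earlier; it should give the same frequencies on this data.

```python
import pandas as pd

# Frame rebuilt from the "Details" listing above.
df = pd.DataFrame(
    {
        "URN": [104472, 104873, 109986, 114058, 113438],
        "Firm_Name": [
            "R.X. Yah & Co",
            "Big Building Society",
            "St James's Society",
            "The Kensington Society Ltd",
            "MMV Oil Associates Ltd",
        ],
    }
)

# Split each name into a list of words, put one word per row, then tally.
word_freq = df["Firm_Name"].str.split().explode().value_counts()

# The two most frequent words.
top = word_freq.head(2)

print(word_freq)
```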
