How do I count the total number of words in a Pandas dataframe cell and add those to a new column?
Let's say you have a dataframe df that you've generated using
df = pandas.read_csv('dataset.csv')
You would then add a new column with the word count by doing the following:
df['new_column'] = df.columnToCount.apply(lambda x: len(str(x).split(' ')))
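Keep in mind that split(' ') treats every single space as a separator, so consecutive spaces produce empty strings that inflate the count; str(x).split() with no argument splits on any run of whitespace instead. You may also want to remove punctuation and digits and convert to lowercase before counting: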
df = df.apply(lambda x: x.astype(str).str.lower())
df = df.replace(r'\d+', '', regex=True)
df = df.replace(r'[^\w\s\+]', '', regex=True)
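As a rough sketch of the same idea restricted to a single column, the cleaning and counting can be chained. The column name "text" and the sample rows are made up for illustration, and missing cells are assumed to count as zero words:

```python
import pandas as pd

# Hypothetical data; the column name "text" is an assumption.
df = pd.DataFrame({"text": ["Hello, World! 123", "It's 2 late", None]})

cleaned = (
    df["text"]
    .fillna("")                               # treat missing cells as empty text
    .str.lower()
    .str.replace(r"\d+", "", regex=True)      # drop digits
    .str.replace(r"[^\w\s]", "", regex=True)  # drop punctuation
)

# split() with no argument splits on any run of whitespace
df["word_count"] = cleaned.str.split().str.len()
```

Working on one column leaves the rest of the dataframe (numbers, dates) untouched, unlike the whole-frame version above.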
Count number of words per row
str.split + str.len
str.len works nicely for any non-numeric column.
df['totalwords'] = df['col'].str.split().str.len()
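One thing to watch for: missing values stay missing after .str.split().str.len(), so the result becomes a float column. A minimal sketch, assuming you want missing cells to count as 0 (the sample data is invented):

```python
import pandas as pd

# Hypothetical data with a missing cell.
df = pd.DataFrame({"col": ["one two  three", "single", None]})

counts = df["col"].str.split().str.len()   # the missing cell yields NaN here

# Assumption: a missing cell should count as zero words.
df["totalwords"] = counts.fillna(0).astype(int)
```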
str.count
If your words are single-space separated, you may simply count the spaces plus 1.
df['totalwords'] = df['col'].str.count(' ') + 1
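The single-space assumption matters. Here is a small comparison, on invented strings, of where counting spaces and splitting disagree:

```python
import pandas as pd

# Hypothetical strings: clean, double space, leading space, empty.
s = pd.Series(["a b c", "a  b", " lead", ""])

by_spaces = s.str.count(" ") + 1        # every space counts, even repeated ones
by_split = s.str.split().str.len()      # runs of whitespace collapse

comparison = pd.DataFrame({"text": s, "by_spaces": by_spaces, "by_split": by_split})
```

Only the first string gets the same answer from both methods, so str.count is best kept for data you know is clean.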
List Comprehension
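This is often faster than you'd expect: for small to medium columns a plain list comprehension can match or beat the .str methods. It assumes every value is a string (no NaN).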
df['totalwords'] = [len(x.split()) for x in df['col'].tolist()]
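The comprehension raises an AttributeError on anything that is not a string. A possible guard, assuming non-string cells should simply count as 0 (sample data is made up):

```python
import pandas as pd

# Hypothetical column mixing strings, a missing value, and a number.
df = pd.DataFrame({"col": ["some words", "lots more words", None, 42]})

# Assumption: anything that is not a string counts as zero words.
df["totalwords"] = [
    len(x.split()) if isinstance(x, str) else 0
    for x in df["col"]
]
```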
How to find the maximum number of words in a pandas dataframe column of strings?
You can use the .str accessor for the word counts and .idxmax for the index:
import pandas as pd
something = ["Hello how are you", "I am doing great", "Lets go camping"]
test = pd.DataFrame(something)
test.columns = ["Response"]
length_of_the_messages = test["Response"].str.split("\\s+")
print(length_of_the_messages)
print("Max number of words = ", length_of_the_messages.str.len().max())
print("Index = ", length_of_the_messages.str.len().idxmax())
Prints:
0 [Hello, how, are, you]
1 [I, am, doing, great]
2 [Lets, go, camping]
Name: Response, dtype: object
Max number of words = 4
Index = 0
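Note that the second row also has 4 words; idxmax only returns the first index with the maximum. A short sketch, reusing the same sample data, that fetches the longest row and also lists any ties:

```python
import pandas as pd

test = pd.DataFrame(
    {"Response": ["Hello how are you", "I am doing great", "Lets go camping"]}
)

word_counts = test["Response"].str.split().str.len()

longest_idx = word_counts.idxmax()                  # first label holding the maximum
longest_text = test.loc[longest_idx, "Response"]

# All rows that share the maximum word count
ties = test[word_counts == word_counts.max()]
```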
Pandas sum of all word counts in column
You could use the vectorized string operations:
In [7]: df["a"].str.split().str.len().sum()
Out[7]: 6
which comes from
In [8]: df["a"].str.split()
Out[8]:
0 [some, words]
1 [lots, more, words]
2 [hi]
Name: a, dtype: object
In [9]: df["a"].str.split().str.len()
Out[9]:
0 2
1 3
2 1
Name: a, dtype: int64
In [10]: df["a"].str.split().str.len().sum()
Out[10]: 6
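To try this yourself, here is a self-contained version. The dataframe is reconstructed from the output shown above, so the exact strings are an assumption:

```python
import pandas as pd

# Reconstructed from the printed output; the original df was not shown.
df = pd.DataFrame({"a": ["some words", "lots more words", "hi"]})

per_row = df["a"].str.split().str.len()   # words in each row
total = per_row.sum()                     # words in the whole column
```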
Return the list of each word in a pandas cell and the total count of that word in the entire column
Here is one way that gives the result you want, although it avoids sklearn entirely:
import re

def counts(data, column):
    full_list = []
    datr = data[column].tolist()
    # every word in the column, used for the column-wide totals
    total_words = " ".join(datr).split(' ')
    # per row
    for i in range(len(datr)):
        # first get the words of this row
        word_list = re.sub(r"[^\w]", " ", datr[i]).split()
        # pair each word with its total count in the column
        total_row = []
        for word in word_list:
            count = total_words.count(word)
            val = (word, count)
            total_row.append(val)
        full_list.append(total_row)
    return full_list
df['column2'] = counts(df,'column1')
df
column1 column2
0 apple is a fruit [(apple, 3), (is, 2), (a, 1), (fruit, 3)]
1 fruit sucks [(fruit, 3), (sucks, 1)]
2 apple tasty fruit [(apple, 3), (tasty, 1), (fruit, 3)]
3 fruits what else [(fruits, 1), (what, 1), (else, 1)]
4 yup apple map [(yup, 1), (apple, 3), (map, 1)]
5 fire in the hole [(fire, 1), (in, 1), (the, 1), (hole, 1)]
6 that is true [(that, 1), (is, 2), (true, 1)]
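Calling list.count for every word rescans the whole column each time, which gets slow on large data. A possible alternative, not the method above, is to tally once with collections.Counter. This sketch tokenizes rows and totals with the same regex, so results can differ from the function above when the text contains punctuation; the function name and the three-row sample are made up:

```python
import re
from collections import Counter

import pandas as pd


def counts_with_counter(data, column):
    # Tokenize each row once; \w+ keeps letters, digits and underscores.
    rows = [re.findall(r"\w+", text) for text in data[column]]
    # One pass over all tokens gives the column-wide totals.
    totals = Counter(word for words in rows for word in words)
    return [[(word, totals[word]) for word in words] for words in rows]


# Hypothetical subset of the data shown above.
df = pd.DataFrame({"column1": ["apple is a fruit", "fruit sucks", "apple tasty fruit"]})
result = counts_with_counter(df, "column1")
df["column2"] = result
```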
Count occurrences of each of certain words in pandas dataframe
Update: The original answer counts the rows that contain a substring.
To count all occurrences of a substring, use .str.count:
In [21]: df = pd.DataFrame(['hello', 'world', 'hehe'], columns=['words'])
In [22]: df.words.str.count("he|wo")
Out[22]:
0 1
1 1
2 2
Name: words, dtype: int64
In [23]: df.words.str.count("he|wo").sum()
Out[23]: 4
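The pattern above matches substrings, so "he" is also found inside "hello". If you want whole words, counted separately for each target, something like the following may work. The target list and sample rows are hypothetical; \b is the regex word boundary and re.escape protects any special characters in the targets:

```python
import re

import pandas as pd

# Hypothetical data and target words.
df = pd.DataFrame({"words": ["hello world", "he said hello", "hehe"]})
targets = ["he", "hello"]

# One total per target, matching whole words only.
totals = {
    w: int(df["words"].str.count(rf"\b{re.escape(w)}\b").sum())
    for w in targets
}
```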
The str.contains method accepts a regular expression:
Definition: df.words.str.contains(self, pat, case=True, flags=0, na=nan)
Docstring:
Check whether given pattern is contained in each string in the array

Parameters
----------
pat : string
    Character sequence or regular expression
case : boolean, default True
    If True, case sensitive
flags : int, default 0 (no flags)
    re module flags, e.g. re.IGNORECASE
na : default NaN, fill value for missing values.
For example:
In [11]: df = pd.DataFrame(['hello', 'world'], columns=['words'])
In [12]: df
Out[12]:
words
0 hello
1 world
In [13]: df.words.str.contains(r'[hw]')
Out[13]:
0 True
1 True
Name: words, dtype: bool
In [14]: df.words.str.contains(r'he|wo')
Out[14]:
0 True
1 True
Name: words, dtype: bool
To count the matching rows (not every occurrence), you can just sum this boolean Series:
In [15]: df.words.str.contains(r'he|wo').sum()
Out[15]: 2
In [16]: df.words.str.contains(r'he').sum()
Out[16]: 1
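The difference between the two approaches shows up as soon as a string contains the pattern more than once. A small side-by-side sketch on the same three words used earlier:

```python
import pandas as pd

s = pd.Series(["hello", "world", "hehe"])

# contains: one True per row that matches at least once
rows_matching = int(s.str.contains(r"he|wo").sum())

# count: every non-overlapping match, so "hehe" contributes 2
total_matches = int(s.str.count(r"he|wo").sum())
```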
Counting the Frequency of words in a pandas data frame
If I understand correctly, use value_counts():
In [3361]: df.Firm_Name.str.split(expand=True).stack().value_counts()
Out[3361]:
Society 3
Ltd 2
James's 1
R.X. 1
Yah 1
Associates 1
St 1
Kensington 1
MMV 1
Big 1
& 1
The 1
Co 1
Oil 1
Building 1
dtype: int64
Or,
pd.Series(np.concatenate([x.split() for x in df.Firm_Name])).value_counts()
Or,
pd.Series(' '.join(df.Firm_Name).split()).value_counts()
For the top N, for example N = 3:
In [3379]: pd.Series(' '.join(df.Firm_Name).split()).value_counts()[:3]
Out[3379]:
Society 3
Ltd 2
James's 1
dtype: int64
Details
In [3380]: df
Out[3380]:
URN Firm_Name
0 104472 R.X. Yah & Co
1 104873 Big Building Society
2 109986 St James's Society
3 114058 The Kensington Society Ltd
4 113438 MMV Oil Associates Ltd
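On pandas 0.25 or later, Series.explode is another option that avoids the wide intermediate frame created by split(expand=True).stack(). A sketch using the firm names shown above:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Firm_Name": [
            "R.X. Yah & Co",
            "Big Building Society",
            "St James's Society",
            "The Kensington Society Ltd",
            "MMV Oil Associates Ltd",
        ]
    }
)

# One row per word, then tally; value_counts sorts by frequency, descending.
word_freq = df["Firm_Name"].str.split().explode().value_counts()
top3 = word_freq.head(3)
```

The order among words that occur the same number of times is not something to rely on.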