How to Find the Count of Multiple Words in a Text File

How do I find the count of multiple words in a text file?

Since you have a couple of names, regular expressions are the way to go here. At first I thought a simple grep count on a regular expression matching joe or tom would be enough, but I found that it does not account for the case where tom and joe are on the same line (or tom and tom, for that matter), because grep -c counts matching lines, not individual matches.

test.txt:

tom is really really cool!  joe for the win!
tom is actually lame.

$ grep -c '\<\(tom\|joe\)\>' test.txt
2

As you can see from test.txt, 2 is the wrong answer: the names appear three times in total, so we need to account for several names on the same line.

I then used grep -o, which prints only the matched part of each line, one match per output line, so every occurrence of tom or joe in the file gets its own line. Piping that output into wc -l counts those lines.

$ grep -o '\<\(joe\|tom\)\>' test.txt | wc -l
3

3...the correct answer! Hope this helps
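
The difference between counting matching lines and counting matches can also be reproduced in Python. This is only a minimal sketch and not part of the original answer: count_lines_and_matches is a hypothetical helper name, and SAMPLE simply mirrors the two lines of test.txt above.

```python
import re

# Same two lines as test.txt above.
SAMPLE = "tom is really really cool!  joe for the win!\ntom is actually lame.\n"

# Whole-word match for either name; \b plays the role of grep's \< and \>.
NAME_PATTERN = re.compile(r"\b(?:tom|joe)\b")


def count_lines_and_matches(text):
    """Return (matching_lines, total_matches) for NAME_PATTERN in text."""
    lines = text.splitlines()
    # Like grep -c: a line counts once, however many names it contains.
    matching_lines = sum(1 for line in lines if NAME_PATTERN.search(line))
    # Like grep -o | wc -l: every individual match counts.
    total_matches = sum(len(NAME_PATTERN.findall(line)) for line in lines)
    return matching_lines, total_matches


print(count_lines_and_matches(SAMPLE))  # (2, 3)
```

The first number matches the grep -c result and the second matches the grep -o | wc -l result.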

Count of Specific words in Multiple text files

Edit: As per your new request, I have added the "total_words" column. The code has been updated.

The code below works. Change the "folderpath" variable to the path of the folder containing the text files, and change the "target_file" variable to where you want the output CSV file to be created.

Code:

from collections import Counter
import glob
import os
import re

header = ['annual', 'investment', 'statement', 'range', 'deposit', 'supercalifragilisticexpialidocious']
folderpath = r'C:\Users\USERname4\Desktop\myfolder'
target_file = r'C:\Users\USERname4\Desktop\mycsv.csv'

queueWAP = []

def writeAndPrint(fileObject, toBeWAP, opCode=0):
    # opCode 0: write and print the text immediately
    # opCode 1: queue the text for later
    # opCode 2: flush the queue (write and print everything queued so far)
    global queueWAP
    if opCode == 0:
        fileObject.write(toBeWAP)
        print(toBeWAP)
    if opCode == 1:
        queueWAP.append(toBeWAP)
    if opCode == 2:
        for queued in queueWAP:
            fileObject.write(queued)
            print(queued)
        queueWAP = []

mycsvfile = open(target_file, 'w')
# Header row: file name, total word count, then one column per target word
writeAndPrint(mycsvfile, "file_name,total_words")
for temp1 in header:
    writeAndPrint(mycsvfile, "," + temp1)
writeAndPrint(mycsvfile, "\n")

filepaths = glob.glob(os.path.join(folderpath, "*.txt"))
for file in filepaths:
    with open(file) as f:
        writeAndPrint(mycsvfile, os.path.basename(file))
        words = re.findall(r'\w+', f.read().lower())
        counter = Counter(words)
        for temp2 in header:
            # A Counter returns 0 for words that never appear in the file
            writeAndPrint(mycsvfile, "," + str(counter[temp2]), 1)
        # total_words (the number of words in the file) comes right after
        # file_name, so write it first and then flush the queued per-word counts
        writeAndPrint(mycsvfile, "," + str(len(words)))
        writeAndPrint(mycsvfile, "", 2)
        writeAndPrint(mycsvfile, "\n")
mycsvfile.close()
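
The same table could also be produced with the standard csv module, which takes care of quoting and line endings. The sketch below is an alternative and not the original author's code: count_targets and write_counts_csv are hypothetical names, and it assumes the input files are UTF-8 text.

```python
import csv
import re
from collections import Counter
from pathlib import Path


def count_targets(text, targets):
    """Return [total_words, count_of_target_1, count_of_target_2, ...] for text."""
    words = re.findall(r"\w+", text.lower())
    counts = Counter(words)
    return [len(words)] + [counts[target] for target in targets]


def write_counts_csv(folder, target_file, targets):
    """Write one CSV row per .txt file in folder (files assumed to be UTF-8)."""
    with open(target_file, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["file_name", "total_words"] + list(targets))
        for path in sorted(Path(folder).glob("*.txt")):
            text = path.read_text(encoding="utf-8")
            writer.writerow([path.name] + count_targets(text, targets))


# Quick check on an in-memory string: 5 words, "annual" twice, "deposit" once.
print(count_targets("Annual statement: the annual deposit.", ["annual", "deposit", "range"]))
# [5, 2, 1, 0]
```

Calling write_counts_csv(folderpath, target_file, header) with the variables from the script above should give the same columns.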

How to look for multiple strings in a text and count the number of strings found?

You can use a dict comprehension to build a dictionary with the required counts:

text = "some random text apple, some text ginger, some other blob data"
words = "some", "text", "blob"
result = {word: text.count(word) for word in words}

Output:

{'some': 3, 'text': 2, 'blob': 1}

Update:

str.count also matches substrings inside longer words (for example, "text" inside "context"). To count whole words only, I recommend using a regular expression:

import re
...
result = {word: re.subn(r"\b{}\b".format(word), "", text)[1] for word in words}
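
If a search term may contain regex metacharacters such as a dot, escaping it first avoids surprises. Here is a minimal sketch under that assumption: count_whole_words is a hypothetical helper name, and it uses re.findall rather than re.subn, which yields the same counts.

```python
import re


def count_whole_words(text, words):
    """Count whole-word occurrences of each word; words are treated as literals."""
    return {
        word: len(re.findall(r"\b{}\b".format(re.escape(word)), text))
        for word in words
    }


text = "some random text apple, some text ginger, some other blob data, context"
print(count_whole_words(text, ("some", "text", "blob")))
# {'some': 3, 'text': 2, 'blob': 1}
print(text.count("text"))  # 3, because str.count also sees "text" inside "context"
```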

Count words in multiple files and show the count and in how many files it appeared

This is untested, but this shows the theory:

from collections import Counter

files = Counter()
words = Counter()

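# list_of_files is the list of file paths you want to scan
for fn in list_of_files:
    thisfile = set()
    with open(fn) as f:
        for word in f.read().split():
            words[word] += 1      # total occurrences across all files
            thisfile.add(word)    # distinct words seen in this file
    for word in thisfile:
        files[word] += 1          # number of files containing the word

with open("Output.txt", "w", encoding="utf8") as writer:
    for word in files.keys():
        print(f"{word};{words[word]};{files[word]}", file=writer)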
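
To check the idea without touching the disk, the same two counters can be run over in-memory strings. This is only a sketch: count_words_and_files and the docs dictionary are hypothetical, with the dictionary standing in for real files.

```python
from collections import Counter


def count_words_and_files(documents):
    """documents maps a name to its text; returns (total_counts, file_counts)."""
    total_counts = Counter()  # how many times each word occurs overall
    file_counts = Counter()   # in how many documents each word occurs
    for text in documents.values():
        tokens = text.split()
        total_counts.update(tokens)
        file_counts.update(set(tokens))  # each document counts at most once per word
    return total_counts, file_counts


docs = {"a.txt": "red blue red", "b.txt": "blue green"}
totals, in_files = count_words_and_files(docs)
for word in sorted(totals):
    print(f"{word};{totals[word]};{in_files[word]}")
# blue;2;2
# green;1;1
# red;2;1
```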

Counting a list of words in multiple textfiles in a directory Python

I would suggest using the re module, the Counter object from the collections module, and the Path object from the pathlib module.

import re
from collections import Counter
from pathlib import Path

counter = Counter()  # Create a counter object for keeping track of word counts.
for word in your_list_of_words:  # Iterate through your list of words and for each word...
    for file in Path("your_directory").glob("*.txt"):  # Iterate through all .txt files in "your_directory"
        with open(file, 'r') as stream:  # Open the file
            counter.update(re.findall(word, stream.read()))  # Update the counter with every instance of 'word' found in 'file'.

This will give you a total count of each word across all files. If you want a count of each word for each file you may want to use a dictionary and update it each time. e.g.

counter = {}
for word in your_list_of_words:
    counter[word] = {}
    for file in Path("your_directory").glob("*.txt"):
        with open(file, 'r') as stream:
            counter[word][file] = len(re.findall(word, stream.read()))

Worth noting this will find all instances of that word, even if it's in the middle of another word e.g.

re.findall('cat', "catastrophic catcalling cattery cats")
returns
['cat', 'cat', 'cat', 'cat']

so you may want to play with the regex, e.g.

word = 'cat'
re.findall(fr"\b{word}\b", "catastrophic catcalling cattery cats")
returns []

which may be more what you're looking for.
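
The loops above re-read every file once per word. One possible variation is to combine the words into a single whole-word pattern so that each text is scanned only once. This is a sketch, not the original answer's method: count_words_in_text is a hypothetical helper, matching is case-sensitive, and it assumes the word list is non-empty.

```python
import re
from collections import Counter


def count_words_in_text(text, words):
    """Count whole-word occurrences of every word in words with one regex pass."""
    # One alternation of escaped literals, e.g. \b(?:cat|dog|bird)\b
    pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")
    counts = Counter(pattern.findall(text))
    # Include words that never matched, with a count of 0.
    return {word: counts[word] for word in words}


print(count_words_in_text("the cat saw a dog and another cat, not a catalog", ["cat", "dog", "bird"]))
# {'cat': 2, 'dog': 1, 'bird': 0}
```

For per-file counts, the same function could be called on the contents of each file returned by Path("your_directory").glob("*.txt").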

How to count all the words in a textfile with multiple space characters

Basically, as Tom outlines in his answer, you need a state machine with the two states In_A_Word and Not_In_A_Word and then count whenever your state changes from Not_In_A_Word to In_A_Word.

Something along the lines of (pseudo-code):

var
  InWord: Boolean;
  Ch: Char;
  Words: Integer;
begin
  InWord := False;
  Words := 0;
  while not eof(file) do begin
    read(file, Ch);
    if Ch in ['A'..'Z', 'a'..'z'] then begin
      if not InWord then begin
        { Transition from Not_In_A_Word to In_A_Word: a new word starts }
        InWord := True;
        Words := Words + 1;
      end;
    end else
      InWord := False
  end;
end;
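
The same two-state scan can be written in Python. This is a minimal sketch rather than a definitive implementation: count_words is a hypothetical name, and, like the pseudo-code, it treats only ASCII letters as word characters, so digits and apostrophes end a word.

```python
def count_words(text):
    """Count runs of ASCII letters using a two-state (in word / not in word) scan."""
    in_word = False
    words = 0
    for ch in text:
        if ("A" <= ch <= "Z") or ("a" <= ch <= "z"):
            if not in_word:  # transition Not_In_A_Word -> In_A_Word
                in_word = True
                words += 1
        else:
            in_word = False
    return words


print(count_words("  several   spaces\tand\n\nnewlines  here "))  # 5
```

Because the counter only increases on the transition into a word, any number of consecutive spaces, tabs, or newlines between words is handled the same way.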

