Word Count from a Txt File Program

Python word count program from txt file

Easy, you just need to find the 5 most common words in the file.

So you could do something like this:

wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)

This sorts the dictionary items by value (remember that sorted() returns a list of tuples, not a dictionary).
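
For example, with a small made-up dictionary (purely to illustrate what sorted() gives back), the result is a list of (word, count) tuples:

wordcount = {'the': 3, 'cat': 1, 'sat': 2}
print(sorted(wordcount.items(), key=lambda x: x[1], reverse=True))
# [('the', 3), ('sat', 2), ('cat', 1)]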

You can use the following code to get the 5 most common words:

for k, v in wordcount[:5]:
    print(k, v)

So the full code looks like:

wordcount = {}

with open('alice.txt') as file:  # with closes the file automatically
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1

wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)

for k, v in wordcount[:5]:
    print(k, v)

Also, here is a simpler way to do this using collections.Counter:

from collections import Counter

with open('alice.txt') as file:  # with closes the file automatically
    wordcount = Counter(file.read().split())

for k, v in wordcount.most_common(5):
    print(k, v)

The output is the same as with the first solution.
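
Note that both versions are case-sensitive and keep punctuation attached to words, so "The" and "the," count as separate entries. If that matters, one small tweak of the Counter version (still assuming the same alice.txt) is to lower-case the text before splitting:

from collections import Counter

with open('alice.txt') as file:
    # lower-case everything so "The" and "the" are counted together
    wordcount = Counter(file.read().lower().split())

for k, v in wordcount.most_common(5):
    print(k, v)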

PHP to count words from a txt file

No need to reinvent the wheel. PHP has a built-in function for counting words in a string: str_word_count().

Using it in combination with file_get_contents() to read the file, you can make the code much shorter.

This should do what you want:

$wordCount = str_word_count(file_get_contents('trial.txt'));
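
By default str_word_count() returns the number of words it found, so $wordCount ends up holding the total word count of trial.txt; echo it or use it however you need.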

C - How to count words in a txt file?

Use a simple finite state machine (FSM) coded in C:

#include <stdio.h>
#include <ctype.h>

enum {INITIAL, WORD, SPACE};   /* machine states: nothing classified yet, inside a word, between words */

int main(void)
{
    int c;
    int state = INITIAL;
    int wcount = 0;

    c = getchar();
    while (c != EOF)
    {
        switch (state)
        {
        case INITIAL:                  /* classify the very first character */
            wcount = 0;
            if (isalpha(c) || c == '\'')
            {
                wcount++;
                state = WORD;
            }
            else
                state = SPACE;
            break;

        case WORD:                     /* leave the word when a non-word character appears */
            if (!isalpha(c) && c != '\'')
                state = SPACE;
            break;

        case SPACE:                    /* a new word starts on the next letter or apostrophe */
            if (isalpha(c) || c == '\'')
            {
                wcount++;
                state = WORD;
            }
            break;
        }
        c = getchar();
    }
    printf("%d words\n", wcount);
    return 0;
}
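
To try it out, compile the program and feed the text file to it on standard input, e.g. ./a.out < input.txt; since it reads with getchar, it counts whatever arrives on stdin.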

Python: Counting words from a directory of txt files and writing word counts to a separate txt file

I would strongly urge you not to repurpose stdout for writing data to a file as part of the normal course of your program. I also wonder how you could ever have a word "count < 0"; I assume you meant "count == 0".

The main problem with your code is in this line:

for filepath in glob.iglob(os.path.join("path", '*.txt')):

I'm pretty sure the string constant "path" doesn't belong there; I think you want filepath there instead. This problem alone would prevent your code from working at all.

Here's a version of your code where I fixed these issues and added the logic to write to two different output files based on the count:

import sys
import os
import glob

out1 = open("/tmp/so/seen.txt", "w")
out2 = open("/tmp/so/missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                # print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

filepath = sys.argv[1]
keys = ["country", "friend", "turnip"]
words = dict.fromkeys(keys, 0)

count_words_in_dir(filepath, words, action=print_summary)

out1.close()
out2.close()

Result:

file seen.txt:

/Users/steve/tmp/so/dir/data2.txt
friend: 1
/Users/steve/tmp/so/dir/data.txt
country: 2
/Users/steve/tmp/so/dir/data.txt
friend: 1

file missing.txt:

/Users/steve/tmp/so/dir/data2.txt
country: 0
/Users/steve/tmp/so/dir/data2.txt
turnip: 0
/Users/steve/tmp/so/dir/data.txt
turnip: 0

(excuse me for using some search words that were a bit more interesting than yours)
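
To run it, pass the directory containing the .txt files as the first command-line argument, e.g. python count_words.py /Users/steve/tmp/so/dir (the script name here is just an example); sys.argv[1] is what ends up as dirpath in count_words_in_dir.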

Python project on TXT file, how to read-count words-lines and sorting

To be honest, there are multiple problems with your code.
You are calling the built-in open three times, which means your code reads the whole file three times when once would be enough. And every time you call file.read() you read the whole file into memory. While this works fine for small files, a file that is too large to fit into memory will result in a MemoryError.
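
If memory is a concern, you can also iterate over the file object itself, which reads one line at a time; a minimal sketch (the file name is just an example):

path = 'text_file.txt'  # example file name
with open(path) as f:
    for line in f:  # the file object yields one line at a time, so the whole file never sits in memory
        words = line.split()
        # process the words of this line here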

Your functions do way too much. They:

  • open a file,
  • parse the file's content, and
  • print the calculated statistics.

As a general advice, functions and objects should follow the Single-responsibility principle.

Currently your code does not work at all because the parentheses for the call to the print function are missing in your function most_appear_words. Also, you should never import any item with a name starting with an underscore, like collections._OrderedDictValuesView; the underscore indicates that this view is for internal use only. You probably want to import collections.Counter here.

You do not provide a minimal reproducible example. So it is not clear how you are actually calling the functions in your code sample.

However, it looks like word_frequency is missing a return statement. In order to make your code work as it is, you would have to do something like

import collections

def word_frequency(path):
    dictionary = {}

    # <insert your code here that updates dictionary>
    return dictionary

def most_appear_words(dictionary):
    new_d = collections.Counter()
    # <insert your code here that updates and prints new_d>

if __name__ == '__main__':
    # <insert your code here>

    # feed the return of word_frequency to most_appear_words:
    d = word_frequency(your_path)
    most_appear_words(d)

I hope this helps you get your code to work.


Please note, however, that I suggest a different approach:

  • one function responsible for opening and processing the file (word_iterator),
  • one function responsible for the statistics, i.e. counting words and letters (word_count), and
  • one function to print the results to the console (print_statistics).

My suggested solution to the task would be:

from collections import Counter
import string

def word_iterator(fp):
    # translation table that strips punctuation and digits from each line
    t = str.maketrans('', '', string.punctuation + string.digits)

    word_no = 0
    with open(fp) as in_file:
        for line_no, line in enumerate(in_file, start=1):
            line = line.translate(t)
            words = line.split()
            for w in words:
                word_no += 1
                # yield the running line/word numbers together with the word
                yield line_no, word_no, w.lower()

def word_count(word_iter):
    words = Counter()
    line_no = 0
    word_no = 0
    n_chars = 0

    for line_no, word_no, word in word_iter:
        n_chars += len(word)
        words.update([word])

    # after the loop, line_no and word_no hold the totals
    result = {
        'n_lines': line_no,
        'n_words': word_no,
        'n_chars': n_chars,
        'words': words
    }

    return result

def print_statistics(wc, top_n1=3, top_n2=None):
    print(' Word Count '.center(20, '='))
    print(f'File {fn} consists of')
    print(f' {wc["n_lines"]:5} lines')
    print(f' {wc["n_words"]:5} words')
    print(f' {wc["n_chars"]:5} characters')

    print()
    print(' Word Frequency '.center(20, '='))

    print(f'The {top_n1} most frequent words are:')
    for word, count in wc['words'].most_common(top_n1):
        print(f' {word} ({count} times)')

    if top_n2:
        print()
        print(f'The {top_n2} most frequent words are:')
        top_words = [w for w, _ in wc['words'].most_common(top_n2)]
        print(', '.join(top_words))

if __name__ == '__main__':
    fn = 'text_file.txt'

    stat = word_count(word_iterator(fn))

    print_statistics(stat, top_n1=3, top_n2=1000)

With the sample output

==== Word Count ====
File text_file.txt consists of
7 lines
104 words
492 characters

== Word Frequency ==
The 3 most frequent words are:
a (5 times)
the (4 times)
it (3 times)

The 1000 most frequent words are:
a, the, it, content, of, lorem, ipsum, and, is, that, will, by, readable, page, using, as, here, like, many, web, their, sometimes, long, established, fact, reader, be, distracted, when, looking, at, its, layout, point, has, moreorless, normal, distribution, letters, opposed, to, making, look, english, desktop, publishing, packages, editors, now, use, default, model, text, search, for, uncover, sites, still, in, infancy, various, versions, have, evolved, over, years, accident, on, purpose, injected, humour
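
Since the sample file contains far fewer than 1000 distinct words, that last list is simply every distinct word in the file, ordered by frequency (most_common(n) returns at most n entries).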

Counting the number of words of a given text file

fgets only reads up to the first newline or until the buffer is filled.

If you want to read all the lines in your file, use the fact that fgets returns NULL when it cannot read anything more, as BLUEPIXY points out in his comment:

/* str, len, i, count, ignoreSpace and txtFile are declared as in the original
   code; count must start at 0. isspace() comes from <ctype.h>. */
while (fgets(str, 1000000, txtFile))
{
    len = strlen(str);
    ignoreSpace = 1;
    for (i = 0; i < len; i++)
    {
        /* treat any whitespace (space, tab, and the newline fgets keeps) as a separator */
        if (isspace((unsigned char)str[i]))
        {
            if (!ignoreSpace)
            {
                count++;
                ignoreSpace = 1;
            }
        }
        else
        {
            ignoreSpace = 0;
        }
    }
    /* count a word that runs to the end of the buffer (e.g. a line without a trailing newline) */
    if (!ignoreSpace)
        count++;
}

