Python word count program from txt file
Easy: you just need to find the 5 most common words in the file.
So you could do something like this:
wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)
This sorts the dictionary items by value (remember that sorted returns a list, not a dictionary).
You can then use the following code to get the 5 most common words:
for k, v in wordcount[:5]:
    print(k, v)
So the full code looks like:
wordcount = {}
with open('alice.txt') as file:  # with auto-closes the file
    for word in file.read().split():
        if word not in wordcount:
            wordcount[word] = 1
        else:
            wordcount[word] += 1
wordcount = sorted(wordcount.items(), key=lambda x: x[1], reverse=True)
for k, v in wordcount[:5]:
    print(k, v)
Also, here is a simpler way to do this using collections.Counter:
from collections import Counter
with open('alice.txt') as file:  # with auto-closes the file
    wordcount = Counter(file.read().split())
for k, v in wordcount.most_common(5):
    print(k, v)
The output is the same as for the first solution.
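One caveat with both snippets: file.read().split() treats "The", "the", and "the," as three different words. If you want case- and punctuation-insensitive counts, a small variation could look like this (a sketch; top_words is a hypothetical helper, and alice.txt is assumed to exist as above):

```python
from collections import Counter
import string

def top_words(text, n=5):
    # Lowercase everything and strip surrounding punctuation,
    # so "The", "the", and "the," are counted as one word.
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return Counter(w for w in words if w).most_common(n)

# Usage, assuming the same alice.txt:
# with open('alice.txt') as file:
#     for k, v in top_words(file.read()):
#         print(k, v)
```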
PHP to count words from a txt file
No need to reinvent the wheel. PHP has a built-in function for counting words in a string: str_word_count().
Combined with file_get_contents() to read the file contents, the code becomes much shorter.
This should do what you want:
$wordCount = str_word_count(file_get_contents('trial.txt'));
C - How to count words in a txt file?
Use a simple finite-state machine (FSM) coded in C:
#include <stdio.h>
#include <ctype.h>

enum {INITIAL, WORD, SPACE};

int main()
{
    int c;
    int state = INITIAL;
    int wcount = 0;

    c = getchar();
    while (c != EOF)
    {
        switch (state)
        {
        case INITIAL:
            wcount = 0;
            if (isalpha(c) || c == '\'')
            {
                wcount++;
                state = WORD;
            }
            else
                state = SPACE;
            break;
        case WORD:   /* inside a word: wait for a non-word character */
            if (!isalpha(c) && c != '\'')
                state = SPACE;
            break;
        case SPACE:  /* between words: a word character starts a new word */
            if (isalpha(c) || c == '\'')
            {
                wcount++;
                state = WORD;
            }
        }
        c = getchar();
    }
    printf("%d words\n", wcount);
    return 0;
}
Python: Counting words from a directory of txt files and writing word counts to a separate txt file
I would strongly urge you not to repurpose stdout for writing data to a file as part of the normal course of your program. I also wonder how you can ever have a word "count < 0". I assume you meant "count == 0".
The main problem with your code is this line:
for filepath in glob.iglob(os.path.join("path", '*.txt')):
The string constant "path" almost certainly doesn't belong there; I think you want dirpath there instead. This problem alone would prevent your code from working at all.
Here's a version of your code where I fixed these issues and added the logic to write to two different output files based on the count:
import sys
import os
import glob

out1 = open("/tmp/so/seen.txt", "w")
out2 = open("/tmp/so/missing.txt", "w")

def count_words_in_dir(dirpath, words, action=None):
    for filepath in glob.iglob(os.path.join(dirpath, '*.txt')):
        with open(filepath) as f:
            data = f.read()
            for key, val in words.items():
                # print("key is " + key + "\n")
                ct = data.count(key)
                words[key] = ct
            if action:
                action(filepath, words)

def print_summary(filepath, words):
    for key, val in sorted(words.items()):
        whichout = out1 if val > 0 else out2
        print(filepath, file=whichout)
        print('{0}: {1}'.format(key, val), file=whichout)

filepath = sys.argv[1]
keys = ["country", "friend", "turnip"]
words = dict.fromkeys(keys, 0)
count_words_in_dir(filepath, words, action=print_summary)

out1.close()
out2.close()
Result:
file seen.txt:
/Users/steve/tmp/so/dir/data2.txt
friend: 1
/Users/steve/tmp/so/dir/data.txt
country: 2
/Users/steve/tmp/so/dir/data.txt
friend: 1
file missing.txt:
/Users/steve/tmp/so/dir/data2.txt
country: 0
/Users/steve/tmp/so/dir/data2.txt
turnip: 0
/Users/steve/tmp/so/dir/data.txt
turnip: 0
(excuse me for using some search words that were a bit more interesting than yours)
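One more caveat worth noting: data.count(key) counts substrings, so a search word like "country" would also match inside "countryside". If whole-word matches are wanted, a regex with word boundaries could be swapped in (a sketch; count_word is a hypothetical helper, not part of the code above):

```python
import re

def count_word(data, word):
    # \b anchors the match at word boundaries, so 'country'
    # no longer matches inside 'countryside'.
    return len(re.findall(r'\b' + re.escape(word) + r'\b', data))
```

Inside count_words_in_dir, ct = data.count(key) would then become ct = count_word(data, key).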
Python project on TXT file, how to read-count words-lines and sorting
To be honest, there are multiple problems with your code.
You are calling the builtin open three times. This means your code reads the whole file three times when once should be enough. And whenever you do file.read(), you are trying to read the whole file into memory. While this works fine for small files, a file that is too large to fit into memory will result in a MemoryError.
Your functions do way too much. They
- open a file,
- parse the file's content, and
- print the calculated statistics.
As general advice, functions and objects should follow the single-responsibility principle.
Currently your code does not work at all because in your function most_appear_words the parentheses for the call to the print function are missing. Also, you should never import any name starting with an underscore, like collections._OrderedDictValuesView. The underscore indicates that this view is for internal use only. You probably want collections.Counter here.
You do not provide a minimal reproducible example, so it is not clear how you are actually calling the functions in your code sample. However, it looks like word_frequency is missing a return statement. To make your code work as it is, you would have to do something like
def word_frequency(path):
    dictionary = {}
    # <insert your code here that updates dictionary>
    return dictionary

def most_appear_words(dictionary):
    new_d = collections.Counter()
    # <insert your code here that updates and prints new_d>

if __name__ == '__main__':
    # <insert your code here>
    # feed the return of word_frequency to most_appear_words:
    d = word_frequency(your_path)
    most_appear_words(d)
I hope this helps you get your code working. Please note, however, that I suggest a different approach:
- Have one function responsible for opening and processing the file (word_iterator).
- Have one function responsible for the statistics, i.e. counting words and letters (word_count).
- Have one function to print the results to the console (print_statistics).
My suggested solution to the task would be:
from collections import Counter
import string

def word_iterator(fp):
    t = str.maketrans('', '', string.punctuation + string.digits)
    word_no = 0
    with open(fp) as in_file:
        for line_no, line in enumerate(in_file, start=1):
            line = line.translate(t)
            words = line.split()
            for w in words:
                word_no += 1
                yield line_no, word_no, w.lower()

def word_count(word_iter):
    words = Counter()
    line_no = 0
    word_no = 0
    n_chars = 0
    for line_no, word_no, word in word_iter:
        n_chars += len(word)
        words.update([word])
    result = {
        'n_lines': line_no,
        'n_words': word_no,
        'n_chars': n_chars,
        'words': words
    }
    return result

def print_statistics(wc, top_n1=3, top_n2=None):
    print(' Word Count '.center(20, '='))
    print(f'File {fn} consists of')
    print(f'  {wc["n_lines"]:5} lines')
    print(f'  {wc["n_words"]:5} words')
    print(f'  {wc["n_chars"]:5} characters')
    print()
    print(' Word Frequency '.center(20, '='))
    print(f'The {top_n1} most frequent words are:')
    for word, count in wc['words'].most_common(top_n1):
        print(f'  {word} ({count} times)')
    if top_n2:
        print()
        print(f'The {top_n2} most frequent words are:')
        top_words = [w for w, _ in wc['words'].most_common(top_n2)]
        print(', '.join(top_words))

if __name__ == '__main__':
    fn = 'text_file.txt'
    stat = word_count(word_iterator(fn))
    print_statistics(stat, top_n1=3, top_n2=1000)
With the sample output:
==== Word Count ====
File text_file.txt consists of
7 lines
104 words
492 characters
== Word Frequency ==
The 3 most frequent words are:
a (5 times)
the (4 times)
it (3 times)
The 1000 most frequent words are:
a, the, it, content, of, lorem, ipsum, and, is, that, will, by, readable, page, using, as, here, like, many, web, their, sometimes, long, established, fact, reader, be, distracted, when, looking, at, its, layout, point, has, moreorless, normal, distribution, letters, opposed, to, making, look, english, desktop, publishing, packages, editors, now, use, default, model, text, search, for, uncover, sites, still, in, infancy, various, versions, have, evolved, over, years, accident, on, purpose, injected, humour
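A side note on the str.maketrans trick used in word_iterator: when called with three arguments, the third string lists the characters that translate should delete. A tiny demonstration:

```python
import string

# The third argument of str.maketrans is the set of characters to delete.
t = str.maketrans('', '', string.punctuation + string.digits)
print("It's 2024, folks!".translate(t))  # Its  folks
```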
Counting the number of words of a given text file
fgets only reads up until the first newline is met or the buffer is filled.
If you want to read all the lines in your file, use the fact that fgets returns NULL when it cannot read anything more, as BLUEPIXY points out in his comment:
while (fgets(str, 1000000, txtFile))
{
    len = strlen(str);
    ignoreSpace = 1;
    for (i = 0; i < len; i++)
    {
        if (str[i] == ' ')
        {
            if (!ignoreSpace)
            {
                count++;
                ignoreSpace = 1;
            }
        }
        else
        {
            ignoreSpace = 0;
        }
    }
    if (!ignoreSpace)
        count++;
}