Count Unique Words in a Text File (Python)

There seems to be an error in the code snippet, since k is never declared. I am assuming you were trying to count the number of unique words instead.

Also, there is a better way to find unique values in a list: convert it into a set. A set cannot contain duplicate values.

Check out the code snippet below.

words = set()

with open("text.txt", "r") as f:
    # Split each line into words and add them to a set;
    # a set automatically discards duplicates
    for line in f:
        words.update(line.split())

count = len(words)
print(count)
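
If you also want case and punctuation handled (so that "Word" and "word." are not counted separately), here is a minimal variation of the same idea, still using only the standard library:

import string

words = set()

with open("text.txt", "r") as f:
    for line in f:
        # Lowercase and strip punctuation before splitting
        line = line.lower().translate(str.maketrans("", "", string.punctuation))
        words.update(line.split())

print(len(words))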

How to count unique words from a text file after a specific string in every line?

Your original text file is unfortunate, as it seems to contain string representations of Python dicts, one per line!

This is a very bad way of generating a text data file. You should change the code that generates this file to emit a format like CSV or JSON instead of naively writing string representations to a text file. With CSV or JSON, there are libraries, already written and tested, that parse the contents and extract each element for you.
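
For instance, a minimal sketch of the JSON Lines idea, one json.dumps call per record (the field names and data here are purely illustrative):

import json

rows = [{'Character': 'Alice', 'Emotion': 'joy'}]  # illustrative data

with open('results.txt', 'w') as f:
    for row in rows:
        f.write(json.dumps(row) + '\n')  # each line is now valid JSON

Reading it back is then just json.loads(line) per line, with no fragile parsing.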

If you still want to parse the file as-is, you can use ast.literal_eval to safely evaluate each line as a Python literal:

import ast
import collections

with open(filename) as infile:
    print(collections.Counter(ast.literal_eval(line)['Character'] for line in infile))
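
For illustration, assuming each line of the file looks like a dict literal with a 'Character' key (these sample lines are made up), the expression behaves like this:

import ast
import collections

lines = [
    "{'Character': 'Ahab'}",      # made-up sample lines
    "{'Character': 'Ishmael'}",
    "{'Character': 'Ahab'}",
]
print(collections.Counter(ast.literal_eval(line)['Character'] for line in lines))
# Counter({'Ahab': 2, 'Ishmael': 1})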

EDIT: Now that you added an example of the file generation, I can suggest you use another format, like json:

import json

def stimcount():
    results = []
    for rel_node in root.findall("emospan:CharacterRelation", ns):
        if rel_node.attrib['Relation'] == "Stimulus":
            source = rel_node.attrib['Governor']
            target = rel_node.attrib['Dependent']
            for span_node in root.findall("emospan:CharacterEmotion", ns):
                if span_node.attrib[my_id] == source:
                    print(span_node.attrib['Emotion'])
                if span_node.attrib[my_id] == target:
                    print(span_node.attrib)
                    results.append(span_node.attrib)

    with open('results.txt', 'w') as f:
        json.dump(results, f)

Then your code that reads the data could be as simple as:

import collections
import json

with open('results.txt') as f:
    results = json.load(f)

r = collections.Counter(d['Character'] for d in results)
for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))
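
If you'd rather list the characters by descending frequency, Counter.most_common gives you that ordering (a small variation on the loop above):

for n, (ch, number) in enumerate(r.most_common()):
    print('{} - {}, {}'.format(n, ch, number))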

Another option is to use csv format. It allows you to specify a list of interesting columns and ignore the rest:

import csv

def stimcount():
    # newline='' is recommended when writing CSV files in Python 3
    with open('results.txt', 'w', newline='') as f:
        cf = csv.DictWriter(f, ['begin', 'end', 'Character'], extrasaction='ignore')
        cf.writeheader()
        for rel_node in root.findall("emospan:CharacterRelation", ns):
            if rel_node.attrib['Relation'] == "Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion", ns):
                    if span_node.attrib[my_id] == source:
                        print(span_node.attrib['Emotion'])
                    if span_node.attrib[my_id] == target:
                        print(span_node.attrib)
                        cf.writerow(span_node.attrib)

Then to read it easily:

import collections
import csv

with open('results.txt', newline='') as f:
    cf = csv.DictReader(f)
    # Consume the reader while the file is still open
    r = collections.Counter(d['Character'] for d in cf)

for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))

Pyspark operations on text, counting words, unique words, most common words

I've downloaded the book from the Gutenberg Project: Moby Dick; Or, The Whale by Herman Melville in Plain Text UTF-8.

Delete the obvious additional text from top and bottom and save it to a file: mobydick.

There's a function, spark.read.text, which reads a text file and creates a new row for each line. The idea is to split each row into words, explode them, and group by word; after that, just perform the needed calculations.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("mobydick")
df = df.filter(F.col("value") != "")  # Remove empty rows

word_counts = (
    df.withColumn("word", F.explode(F.split(F.col("value"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

# Top 10
word_counts.show(10)

# All words count
word_counts.agg(F.sum("count").alias("count_all_words")).show()

# Whale count
word_counts.filter(F.col("word").rlike("(?i)whale")).agg(
    F.sum("count").alias("whale_count")
).show()

# Unique count
print("Unique words: ", word_counts.count())

Result:

+----+-----+
|word|count|
+----+-----+
| the|13701|
|  of| 6551|
| and| 5992|
|  to| 4513|
|   a| 4491|
|  in| 3905|
|that| 2865|
| his| 2462|
|  it| 2089|
|   I| 1942|
+----+-----+

+---------------+
|count_all_words|
+---------------+
|         212469|
+---------------+

+-----------+
|whale_count|
+-----------+
|       1687|
+-----------+

Unique words: 21837

With more cleaning you can get exact results. I suspect the unique-word count is a bit high, since it would need more cleaning and maybe stemming.
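
For example, one sketch of such cleaning, reusing the pipeline above: lowercase every token and drop tokens that were pure punctuation before grouping (F.lower is a standard pyspark.sql.functions call; the rest is unchanged):

word_counts = (
    df.withColumn("word", F.explode(F.split(F.col("value"), r"\s+")))
    .withColumn("word", F.lower(F.regexp_replace("word", r"[^\w]", "")))
    .filter(F.col("word") != "")  # tokens that were only punctuation become ""
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)

This should merge variants like "The" and "the"; stemming would need an external library such as NLTK.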

Counting distinct words in a text file: different results in Shell and Python

Okay, so the problem in the shell script (assuming you want the shell script to behave like the Python does) is in the very first command you're supplying.

Consider the input

apple cherry bone0 cherry

The Python function will, at the step that strips out words containing non-alphabetic characters, turn that into

apple cherry cherry

while your shell script will simply produce

apple cherry bone cherry

This is because of the first line of the shell script, which simply knocks the digits out of words (from my quick test of it in isolation). Instead, you want the first line to be something like grep -wo -E '[a-zA-Z]+', which rejects words that don't match that specific regex, i.e. any word that contains anything other than letters.

Also, credit where it's due: I got the patch from here.

So, the fixed shell script is (in nice function form):

function count_vocab() {
    grep -wo -E '[a-zA-Z]+' |
        tr ' [:upper:]' '\n[:lower:]' |
        tr -s '\n' |
        sed "s/^['-]*//;s/['-]$//" |
        sort |
        uniq -c |
        wc -l
}

Invoked like this (after you have defined the function):

count_vocab < INPUT_TEXT_FILE > COUNT_FILE
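
For comparison, here is a rough Python equivalent of that pipeline (my own sketch, not from the original question; INPUT_TEXT_FILE is the same placeholder as above):

import re

def count_vocab(path):
    vocab = set()
    with open(path) as f:
        for line in f:
            # \b[a-zA-Z]+\b keeps only whole words made purely of letters,
            # mirroring grep -wo -E '[a-zA-Z]+'; lower() mirrors the tr step
            vocab.update(w.lower() for w in re.findall(r'\b[a-zA-Z]+\b', line))
    return len(vocab)

print(count_vocab('INPUT_TEXT_FILE'))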

How to count unique words in python with function?

The answer from Kohelet neglects punctuation such as commas and double quotes, which in the OP's case would count "people" and "people," as two distinct words. To make sure you only get actual words, you need to take care of the unwanted characters. To remove the comma and double-quote characters, you could do the following:

text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = list(set(words))
    return len(unique)

unique_words(text)

# out
3

There are numerous ways to extend the set of characters to be replaced.
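
For instance, one alternative (my sketch) uses a regex so you don't have to chain a replace call per character:

import re

def unique_words(text):
    # Strip everything that is not a word character or whitespace
    words = re.sub(r'[^\w\s]', '', text).split()
    return len(set(words))

print(unique_words('aa, aa "bb" cc'))  # 3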

JSON File: Count Unique Words Instead of Single Letters with Python

If I understood correctly, you should loop through your data, taking each object (I called it row), taking its data element Text Main, and doing the rest of your processing on it:

import string

# your importing code, etc...

d = {}

# processing:
for row in data:
    line = row['Text Main']
    # Remove the leading spaces and newline character
    line = line.strip()

    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()

    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))

    # Split the line into words
    words = line.split(" ")

    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
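
As a side note, collections.Counter can replace the manual if/else bookkeeping; here is a sketch under the same assumptions about data and Text Main:

import collections
import string

d = collections.Counter()
for row in data:  # 'data' as loaded by your importing code
    line = row['Text Main'].strip().lower()
    line = line.translate(line.maketrans("", "", string.punctuation))
    # split() with no argument also collapses repeated whitespace
    d.update(line.split())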

how do I count unique words of text files in specific directory with Python?

unique_words = set()
with open('somefile.txt', 'r') as textfile:
    for line in textfile:
        # A set keeps only one copy of each word
        unique_words.update(line.split())
print(len(unique_words))

That's the general gist of it
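
Since the question asks about a whole directory, a minimal sketch using pathlib (the directory name some_dir is a placeholder) could be:

from pathlib import Path

unique_words = set()
for path in Path('some_dir').glob('*.txt'):  # 'some_dir' is a placeholder
    with open(path, 'r') as textfile:
        for line in textfile:
            unique_words.update(line.split())

print(len(unique_words))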


