Count unique words in a text file (Python)
There seems to be an error in the code snippet, since k
is not declared. I am assuming you were trying to count
the number of unique words instead.
Also, there is a better way to find unique values in a list: convert it into a set. A set cannot contain duplicate values.
Check out the code snippet below.
with open("text.txt", "r") as f:
    # Read the whole file, split it into words, and convert the
    # list into a set -- a set discards duplicate values
    words = set(f.read().split())

count = len(words)
print(count)
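The same set-based deduplication can be seen on an in-memory string (a minimal sketch; the sample text is invented):

```python
# Splitting on whitespace and converting to a set removes duplicates.
text = "the cat sat on the mat the end"
unique = set(text.split())
count = len(unique)
print(count)  # 6 distinct words
```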
How to count unique words from a text file after a specific string in every line?
Your original text file is unfortunate: it seems to contain string representations of Python dicts, one per line!
This is a poor way to generate a data file. You should change the code that generates this file to write a standard format such as CSV or JSON instead of naively writing string representations to a text file. With CSV or JSON you get libraries that are already written and tested to parse the contents and extract each element easily.
If you still want that, you can use ast.literal_eval to safely parse each line back into a dict:
import ast
import collections

with open(filename) as infile:
    print(collections.Counter(ast.literal_eval(line)['Character'] for line in infile))
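As a self-contained sketch of what literal_eval does with such lines (the sample dicts below are invented, assuming each line holds the repr of a dict with a 'Character' key):

```python
import ast
import collections

# Each line looks like the string repr of a Python dict.
lines = [
    "{'Character': 'Alice', 'Emotion': 'joy'}",
    "{'Character': 'Bob', 'Emotion': 'fear'}",
    "{'Character': 'Alice', 'Emotion': 'anger'}",
]

# literal_eval safely parses each literal back into a real dict.
counts = collections.Counter(ast.literal_eval(line)['Character'] for line in lines)
print(counts)  # Counter({'Alice': 2, 'Bob': 1})
```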
EDIT: Now that you added an example of the file generation, I can suggest you use another format, like json:
import json

def stimcount():
    results = []
    for rel_node in root.findall("emospan:CharacterRelation", ns):
        if rel_node.attrib['Relation'] == "Stimulus":
            source = rel_node.attrib['Governor']
            target = rel_node.attrib['Dependent']
            for span_node in root.findall("emospan:CharacterEmotion", ns):
                if span_node.attrib[my_id] == source:
                    print(span_node.attrib['Emotion'])
                if span_node.attrib[my_id] == target:
                    print(span_node.attrib)
                    results.append(span_node.attrib)
    with open('results.txt', 'w') as f:
        json.dump(results, f)
Then your code that reads the data could be as simple as:
import collections
import json

with open('results.txt') as f:
    results = json.load(f)

r = collections.Counter(d['Character'] for d in results)
for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))
Another option is to use csv format. It allows you to specify a list of interesting columns and ignore the rest:
import csv

def stimcount():
    with open('results.txt', 'w') as f:
        cf = csv.DictWriter(f, ['begin', 'end', 'Character'], extrasaction='ignore')
        cf.writeheader()
        for rel_node in root.findall("emospan:CharacterRelation", ns):
            if rel_node.attrib['Relation'] == "Stimulus":
                source = rel_node.attrib['Governor']
                target = rel_node.attrib['Dependent']
                for span_node in root.findall("emospan:CharacterEmotion", ns):
                    if span_node.attrib[my_id] == source:
                        print(span_node.attrib['Emotion'])
                    if span_node.attrib[my_id] == target:
                        print(span_node.attrib)
                        cf.writerow(span_node.attrib)
Then to read it easily:
import collections
import csv

with open('results.txt') as f:
    cf = csv.DictReader(f)
    r = collections.Counter(d['Character'] for d in cf)

for n, (ch, number) in enumerate(r.items()):
    print('{} - {}, {}'.format(n, ch, number))
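A minimal, self-contained sketch of the DictWriter/DictReader round trip, using io.StringIO in place of a real file (the sample rows are invented; the column names mirror those above):

```python
import collections
import csv
import io

rows = [
    {'begin': '0', 'end': '4', 'Character': 'Ahab', 'Extra': 'ignored'},
    {'begin': '5', 'end': '9', 'Character': 'Ishmael', 'Extra': 'ignored'},
    {'begin': '10', 'end': '14', 'Character': 'Ahab', 'Extra': 'ignored'},
]

buf = io.StringIO()
# extrasaction='ignore' silently drops keys not listed in the header.
writer = csv.DictWriter(buf, ['begin', 'end', 'Character'], extrasaction='ignore')
writer.writeheader()
for row in rows:
    writer.writerow(row)

buf.seek(0)
reader = csv.DictReader(buf)
r = collections.Counter(d['Character'] for d in reader)
print(r)  # Counter({'Ahab': 2, 'Ishmael': 1})
```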
Pyspark operations on text, counting words, unique words, most common words
I've downloaded the book from the Gutenberg Project: Moby Dick; Or, The Whale by Herman Melville in Plain Text UTF-8.
Delete the obvious additional text from the top and bottom and save it to a file named mobydick.
There's a function spark.read.text which reads a text file and creates a new row for each line. The idea is to split the rows, explode them, and group them by word; after that, just perform the needed calculations.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.read.text("mobydick")
df = df.filter(F.col("value") != "")  # Remove empty rows

word_counts = (
    df.withColumn("word", F.explode(F.split(F.col("value"), r"\s+")))
    .withColumn("word", F.regexp_replace("word", r"[^\w]", ""))
    .groupBy("word")
    .count()
    .sort("count", ascending=False)
)
# Top 10
word_counts.show(10)

# All words count
word_counts.agg(F.sum("count").alias("count_all_words")).show()

# Whale count
word_counts.filter(F.col("word").rlike("(?i)whale")).agg(
    F.sum("count").alias("whale_count")
).show()

# Unique count
print("Unique words: ", word_counts.count())
Result:
+----+-----+
|word|count|
+----+-----+
|the |13701|
|of |6551 |
|and |5992 |
|to |4513 |
|a |4491 |
|in |3905 |
|that|2865 |
|his |2462 |
|it |2089 |
|I |1942 |
+----+-----+
+---------------+
|count_all_words|
+---------------+
|212469 |
+---------------+
+-----------+
|whale_count|
+-----------+
|1687 |
+-----------+
Unique words: 21837
With more cleaning you can get more exact results. The unique-word count is probably a bit high, since it would need further cleaning and perhaps stemming.
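For comparison, the same word counting can be sketched in plain Python with collections.Counter (shown here on an invented in-memory string rather than the mobydick file, mirroring the split-and-strip steps of the Spark pipeline):

```python
import collections
import re

text = "Call me Ishmael. Some years ago, the whale, the White Whale!"

# Split on whitespace, then strip non-word characters from each token,
# keeping the original case, as the Spark pipeline above does.
words = [re.sub(r"[^\w]", "", w) for w in re.split(r"\s+", text)]
words = [w for w in words if w]

counts = collections.Counter(words)
print(counts.most_common(3))
print("Unique words:", len(counts))
```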
Counting distinct words in a text file: different results in Shell and Python
Okay, so the problem in the shell script (assuming you want it to behave like the Python version does) is in the very first command you're supplying.
consider the input
apple cherry bone0 cherry
The Python function, at the step that strips out words containing non-alphabetic characters, will turn that into
apple cherry cherry
while your shell script will simply produce
apple cherry bone cherry
This is because the first line of the shell script simply knocks out digits (from my quick test of it in isolation). Instead, you want the first line to be something like grep -wo -E '[a-zA-Z]+', which rejects words that don't match that specific regex (i.e., any word containing anything other than letters).
Also, credit where it's due: I got the patch from here.
So the fixed shell script is (in nice function form):
function count_vocab() {
    grep -wo -E '[a-zA-Z]+' |
    tr ' [:upper:]' '\n[:lower:]' |
    tr -s '\n' |
    sed "s/^['-]*//;s/['-]$//" |
    sort |
    uniq -c |
    wc -l
}
Invoked like this (after you have defined the function):
count_vocab < INPUT_TEXT_FILE > COUNT_FILE
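The fixed pipeline can be mirrored in Python as a sanity check (a sketch operating on a string rather than a file; \b word boundaries stand in for grep's -w flag, so letter runs glued to digits, like bone0, are rejected):

```python
import re

def count_vocab(text):
    # Like grep -wo -E '[a-zA-Z]+' plus lowercasing: keep only
    # whole words made of letters, then deduplicate with a set.
    words = re.findall(r"\b[a-zA-Z]+\b", text.lower())
    return len(set(words))

print(count_vocab("apple cherry bone0 cherry"))  # 2
```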
How to count unique words in python with function?
The answer from Kohelet neglects punctuation characters such as commas and double quotes, which in OP's case would count people and people, as two unique words. To make sure you only count actual words, you need to take care of the unwanted characters. To remove the commas and double quotes, you could add the following:
text = 'aa, aa bb cc'

def unique_words(text):
    words = text.replace('"', '').replace(',', '').split()
    unique = set(words)
    return len(unique)

unique_words(text)
# out
3
There are numerous ways to extend this to additional characters that should be replaced.
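One general option is str.translate with string.punctuation, which strips all ASCII punctuation in one pass rather than chaining replace calls (a sketch; the sample text is invented):

```python
import string

def unique_words(text):
    # Remove every ASCII punctuation character, then split and deduplicate.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    return len(set(cleaned.split()))

print(unique_words('aa, aa "bb" cc!'))  # 3
```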
JSON File: Count Unique Words Instead of Single Letters with Python
If I understood correctly, you should loop through your data, taking each object (I called it row), taking its data element Text Main, and do the rest of your processing.
# your importing code, etc...
import string

d = {}

# processing:
for row in data:
    line = row['Text Main']
    # Remove the leading/trailing spaces and newline character
    line = line.strip()
    # Convert the characters in line to
    # lowercase to avoid case mismatch
    line = line.lower()
    # Remove the punctuation marks from the line
    line = line.translate(line.maketrans("", "", string.punctuation))
    # Split the line into words
    words = line.split(" ")
    # Iterate over each word in line
    for word in words:
        # Check if the word is already in dictionary
        if word in d:
            # Increment count of word by 1
            d[word] = d[word] + 1
        else:
            # Add the word to dictionary with count 1
            d[word] = 1
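The manual if/else counting can also be done with collections.Counter, which handles the "new key" case for you; a sketch on invented data shaped like OP's (a list of dicts with a 'Text Main' key):

```python
import collections
import string

data = [
    {'Text Main': 'Hello, world! Hello.'},
    {'Text Main': 'World of words.'},
]

d = collections.Counter()
for row in data:
    # Same cleanup as above: strip, lowercase, drop punctuation.
    line = row['Text Main'].strip().lower()
    line = line.translate(line.maketrans("", "", string.punctuation))
    d.update(line.split())

print(d)  # Counter({'hello': 2, 'world': 2, 'of': 1, 'words': 1})
```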
how do I count unique words of text files in specific directory with Python?
unique_words = set()
with open('somefile.txt', 'r') as textfile:
    for line in textfile:
        # Add this line's words; the set keeps only unique ones
        unique_words.update(line.split())
print(len(unique_words))
That's the general gist of it.