Get only unique words from a sentence in Python
seq = "mango mango peach".split()
[x for i, x in enumerate(seq) if x not in seq[:i]]  # ['mango', 'peach'] - keeps the first occurrence of each word
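A common alternative sketch (not from the original answer) uses dict.fromkeys, since dict keys are unique and preserve insertion order in Python 3.7+:

```python
seq = "mango mango peach".split()
# dict keys are unique and keep insertion order, so this dedupes in order
unique = list(dict.fromkeys(seq))
print(unique)  # ['mango', 'peach']
```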
How to print unique words from an inputted string
str.split(' ') takes a string and creates a list of elements divided by a space (' ').
set(foo) takes a collection foo and returns a set: a collection of only the distinct elements in foo.
What you want is this: unique_words = set(str1.split(' '))
The default value for the split separator is whitespace. I wanted to show that you can supply your own value to this method.
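A minimal runnable sketch of the above (the sample sentence is invented for illustration):

```python
str1 = "the cat and the hat"
# split on single spaces, then dedupe with set()
unique_words = set(str1.split(' '))
print(sorted(unique_words))  # ['and', 'cat', 'hat', 'the']
```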
How to get unique words from a list quickly?
You need to do it all lazily, with as few intermediate lists as possible (reducing allocations and processing time).
All unique words from a file:
import itertools

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(str.split, f)))
Let's explain the ideas here.
File objects are iterable, which means that you can iterate over the lines of a file!
Then we want the words from each line, which means splitting them. In this case, we use map in Python 3 (or itertools.imap in Python 2) to create an object with that computation over our file lines. map and imap are also lazy, which means that no intermediate list is allocated by default, and that is awesome because we will not be spending any resources on something we don't need!
Since str.split returns a list, our map result would be a succession of lists of strings, but we need to iterate over each of those strings. For doing that there is no need to build another list; we can use itertools.chain to flatten that result!
Finally, we call set, which will iterate over those words and keep just a single copy of each. Voilà!
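The same pipeline works on any iterable of lines, so a small sketch with an in-memory list (sample lines invented here) shows the flattening step:

```python
import itertools

lines = ["mango mango peach", "peach plum"]
# map splits each line lazily; chain.from_iterable flattens the lists of words
words = itertools.chain.from_iterable(map(str.split, lines))
print(set(words))  # {'mango', 'peach', 'plum'} (set display order varies)
```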
Let's make an improvement! Can we make str.split lazy too?
Yes! Check this SO answer:
import itertools
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(split_iter, f)))
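A quick way to try the function end to end (the sample file contents are made up; tempfile avoids depending on a real file):

```python
import itertools
import os
import re
import tempfile

def split_iter(string):
    # lazily yield words instead of building a list per line
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(split_iter, f)))

# write a small throwaway sample file, run the function, then clean up
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("mango mango peach\npeach plum\n")
    path = tmp.name
print(unique_words_from_file(path))  # {'mango', 'peach', 'plum'} (set order varies)
os.remove(path)
```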
Print a list of unique words from a text file after removing punctuation, and find longest word
Here's a solution that uses str.translate() to throw away all the bad characters (plus newlines) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
#with open("doc.txt") as reader:
#    cleaned_input = reader.read().translate(bad_transtable)

# Get the list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))

with open("unique.txt", "w") as writer:
    for word in unique_words:
        writer.write(f'{word}\n')

max_length = len(unique_words[0])
print([word for word in unique_words if len(word) == max_length])
Notes:
- Since the input is already 100% cleaned and split, there's no need to append to a list/insert into a set as we go, then make another cleaning pass later. We can just create unique_words directly (using set() to keep only uniques). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it in decreasing length. We only need to sort once, and there's no iterative append to lists.
- Hence we guarantee that max_length = len(unique_words[0]).
- This approach is also going to be more performant than nested loops (for line in <lines>: for word in line.split(): ... with an iterative append() to a wordlist).
- No need to do explicit writer/reader .open()/.close(); that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
- You could also merge the printing of the max_length words into the writer loop, but it's cleaner code to keep them separate.
- Note we use f-string formatting, f'{word}\n', to add the newline back when we write() an output line.
- In Python we use lower_case_with_underscores for variable names, hence max_length not maxLength. See PEP 8.
- In fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines at a time.)
- str.maketrans() is a builtin, but if your teacher objects to the class reference, you can also call it on a bound string, e.g. ' '.maketrans(...).
- str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; regex on Unicode is easier, since you can define entire character classes.
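A tiny sketch of what the translation table does (the sample string is invented here):

```python
bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# every "bad" character becomes a space, so a later split() discards it
print("Hello, world.[1]".translate(bad_transtable).split())  # ['Hello', 'world', '1']
```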
Alternative solution if you don't yet know str.translate()
bad = "[],.\n"
dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', you have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, got an IOError, or unique_words wasn't a list, etc.:
open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])
How can I get unique words from a DataFrame column of strings?
If you have strings in a column then you have to split every sentence into a list of words and then put all the lists into one list - you can use sum() for this - it should give you all the words. To get unique words you can convert it to set() - and later you can convert it back to list().
But first you have to clean the sentences to remove chars like ., ?, etc. I use a regex to keep only some chars and spaces. Eventually you may also want to convert all words to lower or upper case.
import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())
print(list(sorted(unique)))
Result
['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']
EDIT: as @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of split(), and it also removes the need for replace():
unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())
EDIT: I tested it with data from
http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/
Everything works fast except column.sum() or sum(column) - I measured the time for 1000 rows, and extrapolated to 1,500,000 rows it would need about 35 minutes.
Much faster is to use itertools.chain() - it would need about 8 seconds.
import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
words = list(itertools.chain.from_iterable(words))
unique = set(words)
But it can be converted to a set() directly:
words = df['sentences'].str.lower().str.findall(r"\w+")
unique = set()
for x in words:
    unique.update(x)
and it takes about 5 seconds
Full code:
import pandas as pd
import time
print(time.strftime('%H:%M:%S'), 'start')
print('-----')
#------------------------------------------------------------------------------
start = time.time()
# `read_csv()` can read directly from internet and compressed to zip
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'
# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])
end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words
end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
start = end
unique = set()
for x in words:
    unique.update(x)
end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')
print('-----')
#------------------------------------------------------------------------------
print(list(sorted(unique))[:10])
Result
00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']
Add unique words from a text file to a list in python
Here are the problems with your code and a corrected version follows:
fname = open("romeo.txt")    # better to open files in a `with` statement
lst = list()                 # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()     # not required, `split()` will do this anyway
    words = line.split(' ')  # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words    # this is the reason that you end up with duplicates... `words` is the list of all words for this line!
    lst.sort()               # don't sort in the for loop, just once afterwards
print lst
So it almost works; however, you should be appending only the current word to the list, not all of the words that you got from the line with split(). If you simply changed the line:
lst = lst + words
to
lst.append(word)
it will work.
Here is a corrected version:
with open("romeo.txt") as infile:
    lst = []
    for line in infile:
        words = line.split()
        for word in words:
            if word not in lst:
                lst.append(word)  # append only this word to the list, not all words on this line
    lst.sort()
    print(lst)
As others have suggested, a set is a good way to handle this. This is about as simple as it gets:
with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))
Using sorted() you don't need to keep a reference to the list. If you do want to use the sorted list elsewhere, just do this:
with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
print(unique_words)
Reading the entire file into memory may not be viable for large files. You can use a generator to efficiently read the file without cluttering up the main code. This generator will read the file one line at a time and it will yield one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):
def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))