Get Only Unique Words from a Sentence in Python

Get only unique words from a sentence in Python

seq = "mango mango peach".split()
[x for x in seq if x not in seq[seq.index(x)+1:]]

How to print unique words from an inputted string

str.split(' ') takes a string and creates a list of the elements separated by a space (' ').

set(foo) takes a collection foo and returns a set collection of only the distinct elements in foo.

What you want is this: unique_words = set(str1.split(' '))

By default, split() splits on any run of whitespace; I passed ' ' explicitly only to show that you can supply your own separator to this method.
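For example (str1 here is just a made-up input string):

str1 = "the quick brown fox jumps over the lazy dog the fox"
unique_words = set(str1.split())
print(unique_words)   # e.g. {'the', 'quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog'} (set order is arbitrary)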

How to get unique words from a list quickly?

You need to do it all lazily, with as few intermediate lists as possible (reducing allocations and processing time).
All unique words from a file:

import itertools

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(str.split, f)))

Let's explain the ideas here.

File objects are iterable objects, which means that you can iterate over the lines of a file!

Then we want the words from each line, which means splitting each line. Here we use map in Python 3 (or itertools.imap in Python 2) to apply str.split lazily over the file's lines. map and imap are lazy too, which means that no intermediate list is allocated, and that is awesome because we will not be spending any resources on something we don't need!

Since str.split returns a list, our map result is a succession of lists of strings, but we need to iterate over each of those strings. There is no need to build another list for that; we can use itertools.chain.from_iterable to flatten the result!

Finally, we call set(), which iterates over those words and keeps just one copy of each. Voila!
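A minimal sketch of what the pipeline does, using an in-memory list of lines as a stand-in for the file object:

import itertools

lines = ["mango mango peach\n", "peach apple\n"]   # stand-in for a file object
words = itertools.chain.from_iterable(map(str.split, lines))
print(set(words))   # {'mango', 'peach', 'apple'} (element order may vary)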

Let's make an improvement! Can we make str.split lazy as well?
Yes! Check this SO answer:

import itertools
import re

def split_iter(string):
    return (x.group(0) for x in re.finditer(r"[A-Za-z']+", string))

def unique_words_from_file(fpath):
    with open(fpath, "r") as f:
        return set(itertools.chain.from_iterable(map(split_iter, f)))
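split_iter yields words one at a time instead of building a list per line. A quick illustration (the file name below is just a placeholder):

print(list(split_iter("it's a mango, a mango!")))   # ["it's", 'a', 'mango', 'a', 'mango']
# unique_words = unique_words_from_file("words.txt")   # hypothetical input file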

Print a list of unique words from a text file after removing punctuation, and find longest word

Here's a solution that uses str.translate() to throw away all bad characters (+ newline) before we ever do the split(). (Normally we'd use a regex with re.sub(), but you're not allowed.) This makes the cleaning a one-liner, which is really neat:

bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))

# We can directly read and clean the entire input, without a reader object:
cleaned_input = open('doc.txt').read().translate(bad_transtable)
# with open("doc.txt") as reader:
#     cleaned_input = reader.read().translate(bad_transtable)

# Get list of unique words, in decreasing length
unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))

with open("unique.txt", "w") as writer:
for word in unique_words:
writer.write(f'{word}\n')

max_length = len(unique_words[0])
print ([word for word in unique_words if len(word) == max_length])

Notes:

  • since the input is already 100% cleaned and split, there's no need to append to a list / insert into a set as we go and then make another cleaning pass later. We can just create unique_words directly (using set() to keep only the unique words). And while we're at it, we might as well use sorted(..., key=lambda w: -len(w)) to sort it by decreasing length. We only need to call sorted() once, and there is no iterative append to lists.
  • hence we guarantee that max_length = len(unique_words[0])
  • this approach is also going to be more performant than the nested-loop version (for line in <lines>: for word in line.split(): ...) with an iterative append() to a word list
  • no need to do explicit writer/reader.open()/.close(), that's what the with statement does for you. (It's also more elegant for handling IO when exceptions happen.)
  • you could also merge the printing of the max_length words inside the writer loop. But it's cleaner code to keep them separate.
  • note we use f-string formatting f'{word}\n' to add the newline back when we write() an output line
  • in Python we use lower_case_with_underscores for variable names, hence max_length not maxLength. See PEP8
  • in fact here, we don't strictly need a with-statement for the reader, if all we're going to do is slurp its entire contents in one go with open('doc.txt').read(). (That's not scalable for huge files; you'd have to read in chunks or n lines.)
  • str.maketrans() is a builtin, but if your teacher objects to referencing str directly, you can also call it on a bound string, e.g. ' '.maketrans() (see the small example after these notes)
  • str.maketrans() is really a throwback to the days when we only had 95 printable ASCII characters, not Unicode. It still works on Unicode, but building and using huge translation dicts is annoying and uses memory; a regex on Unicode is easier because you can define entire character classes.
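To make the translation step concrete, here is a tiny illustration (the sample string is made up):

bad = "[],.\n"
bad_transtable = str.maketrans(bad, ' ' * len(bad))
print("mango, peach.\n[apple]".translate(bad_transtable).split())   # ['mango', 'peach', 'apple']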

Alternative solution if you don't yet know str.translate()

dirty_input = open('doc.txt').read()
cleaned_input = dirty_input
# If you can't use either 're.sub()' or 'str.translate()', have to manually
# str.replace() each bad char one-by-one (or else use a method like str.isalpha())
for bad_char in bad:
    cleaned_input = cleaned_input.replace(bad_char, ' ')
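From here you can build unique_words exactly as in the main solution:

unique_words = sorted(set(cleaned_input.split()), key=lambda w: -len(w))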

And if you wanted to be ridiculously minimalist, you could write the entire output file in one line with a list comprehension. Don't do this; it would be terrible for debugging, e.g. if you couldn't open/write/overwrite the output file, got an IOError, or unique_words wasn't a list, etc.:

open("unique.txt", "w").writelines([f'{word}\n' for word in unique_words])

How can I get unique words from a DataFrame column of strings?

If you have strings in a column then you have to split every sentence into a list of words and then put all the lists into one list - you can use sum() for this - which gives you all the words. To get unique words you can convert that to a set(), and later you can convert it back to a list().

But first you have to clean the sentences to remove characters like ., ?, etc. I use a regex to keep only letters and spaces. You will probably also want to convert all words to lower or upper case.

import pandas as pd

df = pd.DataFrame({
    'sentences': [
        'is so sad for my apl friend.',
        'omg this is terrible.',
        'what is this new song?',
    ]
})

# regex=True is needed so that str.replace() treats the pattern as a regular expression
unique = set(df['sentences'].str.replace('[^a-zA-Z ]', '', regex=True).str.lower().str.split(' ').sum())

print(sorted(unique))

Result

['apl', 'for', 'friend', 'is', 'my', 'new', 'omg', 'sad', 'so', 'song', 'terrible', 'this', 'what']

EDIT: as @HenryYik mentioned in a comment, findall(r'\w+') can be used instead of split(), and it also removes the need for replace():

unique = set(df['sentences'].str.lower().str.findall(r"\w+").sum())

EDIT: I tested it with data from

http://thinknook.com/twitter-sentiment-analysis-training-corpus-dataset-2012-09-22/

Everything works fast except column.sum() (or sum(column)): I measured the time for 1,000 rows and extrapolated to 1,500,000 rows, and it would need about 35 minutes.

Much faster is to use itertools.chain.from_iterable() - it needs only about 8 seconds.

import itertools

words = df['sentences'].str.lower().str.findall(r"\w+")
words = list(itertools.chain.from_iterable(words))
unique = set(words)

But the words can be added to a set() directly, without building the intermediate list:

words = df['sentences'].str.lower().str.findall(r"\w+")

unique = set()

for x in words:
    unique.update(x)

This approach takes about 5 seconds.


Full code:

import pandas as pd
import time

print(time.strftime('%H:%M:%S'), 'start')

print('-----')
#------------------------------------------------------------------------------

start = time.time()

# `read_csv()` can read directly from the internet, and from a zip-compressed file
#url = 'http://thinknook.com/wp-content/uploads/2012/09/Sentiment-Analysis-Dataset.zip'
url = 'SentimentAnalysisDataset.csv'

# need to skip two rows which are incorrect
df = pd.read_csv(url, sep=',', dtype={'ItemID':int, 'Sentiment':int, 'SentimentSource':str, 'SentimentText':str}, skiprows=[8835, 535881])

end = time.time()
print(time.strftime('%H:%M:%S'), 'load:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

words = df['SentimentText'].str.lower().str.findall(r"\w+")
#df['words'] = words

end = time.time()
print(time.strftime('%H:%M:%S'), 'words:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

start = end

unique = set()
for x in words:
    unique.update(x)

end = time.time()
print(time.strftime('%H:%M:%S'), 'set:', end-start, 's')

print('-----')
#------------------------------------------------------------------------------

print(list(sorted(unique))[:10])

Result

00:27:04 start
-----
00:27:08 load: 4.10780930519104 s
-----
00:27:23 words: 14.803470849990845 s
-----
00:27:27 set: 4.338541269302368 s
-----
['0', '00', '000', '0000', '00000', '000000000000', '0000001', '000001', '000014', '00004873337e0033fea60']

Add unique words from a text file to a list in python

Here are the problems with your code; a corrected version follows below:

fname = open("romeo.txt")       # better to open files in a `with` statement
lst = list()                    # lst = [] is more Pythonic
for line in fname:
    line = line.rstrip()        # not required, `split()` will do this anyway
    words = line.split(' ')     # don't specify a delimiter, `line.split()` will split on all white space
    for word in words:
        if word in lst: continue
        lst = lst + words       # this is the reason that you end up with duplicates... words is the list of all words for this line!
        lst.sort()              # don't sort in the for loop, just once afterwards
print(lst)

So it almost works; however, you should append only the current word to the list, not all of the words that you got from the line with split(). If you simply changed the line:

lst = lst + words

to

lst.append(word)

it will work.

Here is a corrected version:

with open("romeo.txt") as infile:
lst = []
for line in infile:
words = line.split()
for word in words:
if word not in lst:
lst.append(word) # append only this word to the list, not all words on this line
lst.sort()
print(lst)

As others have suggested, a set is a good way to handle this. This is about as simple as it gets:

with open('romeo.txt') as infile:
    print(sorted(set(infile.read().split())))

Using sorted() means you don't need to keep a reference to the list at all. If you do want to use the sorted list elsewhere, just do this:

with open('romeo.txt') as infile:
    unique_words = sorted(set(infile.read().split()))
print(unique_words)

Reading the entire file into memory may not be viable for large files. You can use a generator to efficiently read the file without cluttering up the main code. This generator will read the file one line at a time and it will yield one word at a time. It will not read the entire file in one go, unless the file consists of one long line (which your sample data clearly doesn't):

def get_words(f):
    for line in f:
        for word in line.split():
            yield word

with open('romeo.txt') as infile:
    unique_words = sorted(set(get_words(infile)))

