How to Remove Duplicate Words in a String with Python

How can I remove duplicate words in a string with Python?

def unique_list(l):
    # Keep the first occurrence of each item, preserving order
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist

a="calvin klein design dress calvin klein"
a=' '.join(unique_list(a.split()))
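
The `x not in ulist` test rescans the list for every word, which can get slow on long inputs. A minimal sketch of the same idea with a set for the membership check follows; the name `unique_words` is only illustrative and not part of the original answer.

```python
def unique_words(text):
    """Return text with repeated words removed, keeping first occurrences."""
    seen = set()
    kept = []
    for word in text.split():
        if word not in seen:  # set lookup instead of scanning a list
            seen.add(word)
            kept.append(word)
    return " ".join(kept)


print(unique_words("calvin klein design dress calvin klein"))
# calvin klein design dress
```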

Remove adjacent duplicate words in a string with Python?

Using re.sub with a backreference we can try:

import re

inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output)  # Hey there

The regex pattern used here says to:

(\w+)  match and capture a word
[ ]    followed by a space
\1     then followed by the same word (ignoring case)

Then, we just replace with the first adjacent word.
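
One caveat: without word boundaries, the pattern `(\w+) \1` also matches when the second word merely starts with the first, so 'the then' becomes 'then'. It also collapses only one repeat per match, so 'go go go now' becomes 'go go now'. A possible variant that guards against both is sketched below; the names `ADJACENT_REPEATS` and `collapse_adjacent` are made up for illustration.

```python
import re

# \b keeps the backreference from matching the start of a longer word;
# (?:\s+\1\b)+ absorbs every further repeat, not just the first one.
ADJACENT_REPEATS = re.compile(r'\b(\w+)(?:\s+\1\b)+', flags=re.IGNORECASE)


def collapse_adjacent(text):
    """Collapse runs of the same adjacent word into a single copy."""
    return ADJACENT_REPEATS.sub(r'\1', text)


print(collapse_adjacent('Hey there There'))  # Hey there
print(collapse_adjacent('go go go now'))     # go now
print(collapse_adjacent('the then'))         # the then
```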

Shortest way to remove duplicate words from string

A regex-based approach could be shorter (the code below is R, not Python). Match one or more non-whitespace characters (\\S+) followed by a whitespace character (\\s) and capture them as a group, then match one or more occurrences of the backreference. In the replacement, specify the backreference so that only a single copy of the match is kept.

gsub("(\\S+\\s)\\1+", "\\1", x)
# [1] "A B C"

Alternatively, split the string with strsplit, flatten the result with unlist, keep the unique words, and paste them back together:

paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
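
For readers working in Python, a rough equivalent of the regex version might look like the sketch below. The original value of x is not shown, so the sample inputs here are assumptions, and `squeeze` is a made-up helper name.

```python
import re


def squeeze(text):
    # (\S+\s) captures a word plus its trailing whitespace;
    # \1+ then matches one or more immediate repeats of that capture.
    return re.sub(r"(\S+\s)\1+", r"\1", text)


print(squeeze("A A B B C"))  # A B C

# Caveat: a repeat at the very end has no trailing whitespace,
# so the final pair is left untouched.
print(squeeze("A B C C"))    # A B C C
```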

Python Dataframe: Remove duplicate words in the same cell within a column in Python

If you're looking to get rid of consecutive duplicates only, this should suffice:

df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken

Details

\b        # word boundary
(\w+)     # 1st capture group of a single word
(
  \s+     # 1 or more spaces
  \1      # reference to first group
)+        # one or more repeats
\b        # word boundary
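
The annotated layout above is close to what Python's re.VERBOSE flag accepts directly. As a sketch that can be tried outside pandas, the same pattern can be compiled with its comments intact; the names `CONSECUTIVE_DUPES` and `collapse` are illustrative only.

```python
import re

CONSECUTIVE_DUPES = re.compile(r"""
    \b          # word boundary
    (\w+)       # group 1: a single word
    (
        \s+     # one or more whitespace characters
        \1      # the same word again
    )+          # one or more repeats
    \b          # closing word boundary
""", re.VERBOSE)


def collapse(text):
    """Replace each run of a repeated word with a single copy."""
    return CONSECUTIVE_DUPES.sub(r"\1", text)


for cell in ["Racoon Dog", "Cat Cat", "Dog Dog Dog Dog", "Rat Fox Chicken"]:
    print(repr(cell), "->", repr(collapse(cell)))
```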

Regex from here.


To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict data structure:

from collections import OrderedDict

df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df

           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
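
On Python 3.7 and later, a plain dict keeps insertion order, so OrderedDict is not strictly needed. The sketch below wraps the per-cell logic in a hypothetical helper, `dedupe_words`, which also passes non-string cells (such as a missing value) through unchanged instead of raising. It is shown on a plain list of assumed cell values; with a DataFrame it could presumably be used as df['Current'].apply(dedupe_words).

```python
def dedupe_words(cell, sep=" "):
    """Drop repeated words from one cell, keeping the first occurrence."""
    if not isinstance(cell, str):  # e.g. None or NaN for a missing value
        return cell
    # dict.fromkeys keeps one key per word, in first-seen order
    return sep.join(dict.fromkeys(cell.split()))


rows = ["Racoon Dog", "Cat Cat", "Dog Cat Dog", None]  # hypothetical cells
print([dedupe_words(r) for r in rows])
# ['Racoon Dog', 'Cat', 'Dog Cat', None]
```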

Remove first occurrence of word in string

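This keeps a word only if it does not appear again later in the string, so every earlier occurrence of a repeated word is dropped and the last one stays: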

test = 'User Key Account Department Account Start Date'

words = test.split()

# if word doesn't exist in the rest of the word list, add it
test = ' '.join([word for i, word in enumerate(words) if word not in words[i+1:]])

print(test) # User Key Department Account Start Date
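
The `words[i+1:]` slice copies and rescans the tail of the list for every word. An alternative sketch that gives the same result in a single pass is to deduplicate from the right and then reverse back; `keep_last_occurrence` is a made-up name for illustration.

```python
def keep_last_occurrence(text):
    """Remove repeated words, keeping only the last occurrence of each."""
    words = text.split()
    # Deduplicate from the right, then restore the original direction.
    kept = dict.fromkeys(reversed(words))
    return " ".join(reversed(list(kept)))


print(keep_last_occurrence('User Key Account Department Account Start Date'))
# User Key Department Account Start Date
```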

Remove duplicate words in strings in column in every row in data frame

You can use (assuming column name is 0):

from collections import OrderedDict
df[0].str.split().apply(lambda x: ','.join(OrderedDict.fromkeys(x).keys()))

0    Yes,Absolutely
1           No,Nope
2          Win,Lose

Note that you can also use set:

df[0].str.split().apply(lambda x: ','.join(list(set(x))))

But set doesn't guarantee the order.
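
If set is still preferred, one way to recover the original order is to sort the unique words by their first position in the list. This is only a sketch: the input words are assumed, `ordered_unique` is a hypothetical helper, and `words.index` rescans the list for each word, so it suits short cells.

```python
def ordered_unique(words):
    """Unique words, sorted by where each one first appears in the list."""
    return sorted(set(words), key=words.index)


words = "Yes Absolutely Yes".split()  # assumed cell contents
print(",".join(ordered_unique(words)))  # Yes,Absolutely
```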

Removing duplicate words from a string in python

In your 1st approach:

data="".join(OrderedDict.fromkeys(data))

treats the variable data as an iterable. Because data is a string, iterating over it yields its individual characters. The unique characters are t, h, e, the space character, a, and n, so the ordered dictionary is created with 6 keys.


In your 2nd approach:

data = "".join(OrderedDict.fromkeys(data.split(" ")))

you split the string into a list, which is also an iterable. The list elements are the, an, a, so the ordered dictionary is created with 3 unique values as keys.

And in the final step you are joining them, which means just the keys will be returned as a string.
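
The difference between the two approaches can be seen side by side in the sketch below. The original input string is not shown, so "the an a the" is an assumption chosen to match the keys described above.

```python
from collections import OrderedDict

data = "the an a the"  # assumed input; the original string is not shown

by_char = OrderedDict.fromkeys(data)             # iterates characters
by_word = OrderedDict.fromkeys(data.split(" "))  # iterates words

print(list(by_char))      # ['t', 'h', 'e', ' ', 'a', 'n']
print(list(by_word))      # ['the', 'an', 'a']
print("".join(by_char))   # the an
print(" ".join(by_word))  # the an a
```

Joining the words with "" would run them together as "theana", so a " " separator is probably what is wanted when deduplicating words.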

Hope this helps.

Removing duplicate characters from a string

If order does not matter, you can use

"".join(set(foo))

set() will create a set of unique letters in the string, and "".join() will join the letters back to a string in arbitrary order.

If order does matter, you can use a dict instead of a set, which since Python 3.7 preserves the insertion order of the keys. (In the CPython implementation, this is already supported in Python 3.6 as an implementation detail.)

foo = "mppmt"
result = "".join(dict.fromkeys(foo))

resulting in the string "mpt". In earlier versions of Python, you can use collections.OrderedDict, which has been available starting from Python 2.7.


