How can I remove duplicate words in a string with Python?
def unique_list(l):
    # Keep the first occurrence of each item, preserving order
    ulist = []
    for x in l:
        if x not in ulist:
            ulist.append(x)
    return ulist
a="calvin klein design dress calvin klein"
a=' '.join(unique_list(a.split()))
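A shorter variant of the same idea, as a sketch: since Python 3.7 a plain dict preserves insertion order, so dict.fromkeys can drop the repeats in a single pass instead of scanning a list for every word. The function name unique_words is only for illustration.

```python
def unique_words(text):
    # dict.fromkeys keeps the first occurrence of each word, in order
    return ' '.join(dict.fromkeys(text.split()))

a = "calvin klein design dress calvin klein"
print(unique_words(a))  # calvin klein design dress
```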
Remove adjacent duplicate words in a string with Python?
Using re.sub with a backreference we can try:
import re
inp = 'Hey there There'
output = re.sub(r'(\w+) \1', r'\1', inp, flags=re.IGNORECASE)
print(output) # Hey there
The regex pattern used here says to:
(\w+) match and capture a word
[ ] followed by a space
\1 then followed by the same word (ignoring case)
Then, we just replace with the first adjacent word.
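One caveat with the pattern as written: it has no word boundaries, so an input such as 'the then' would be rewritten to 'then'. A slightly stricter sketch is below. It adds \b anchors and a repeated group so that runs of three or more copies collapse as well. The helper name collapse_adjacent is hypothetical.

```python
import re

def collapse_adjacent(text):
    # \b(\w+) captures a whole word; (?:\s+\1\b)+ matches one or more
    # immediately following repeats of it (compared case-insensitively)
    return re.sub(r'\b(\w+)(?:\s+\1\b)+', r'\1', text, flags=re.IGNORECASE)

print(collapse_adjacent('Hey there There'))  # Hey there
print(collapse_adjacent('the the the end'))  # the end
print(collapse_adjacent('the then'))         # the then
```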
Shortest way to remove duplicate words from string
A regex-based approach could be shorter: match the non-whitespace characters (\\S+) followed by a whitespace character (\\s), capture them, follow that with one or more occurrences of the backreference, and in the replacement specify the backreference to return only a single copy of the match.
gsub("(\\S+\\s)\\1+", "\\1", x)
[1] "A B C"
Or we may need to split the string with strsplit, unlist the result, get the unique elements, and then paste them back together:
paste(unique(unlist(strsplit(x, " "))), collapse = " ")
# [1] "A B C"
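For readers following the Python answers on this page, the two R ideas above translate roughly as follows. This is a sketch rather than a literal port: itertools.groupby collapses runs of adjacent equal words, and dict.fromkeys plays the role of unique. The sample string x is an assumption, since the original input is not shown.

```python
from itertools import groupby

x = "A A B B C A"  # assumed sample input; the original x is not shown

# Collapse runs of adjacent duplicates only
adjacent = ' '.join(word for word, _ in groupby(x.split()))

# Drop every repeated word, keeping first occurrences
overall = ' '.join(dict.fromkeys(x.split()))

print(adjacent)  # A B C A
print(overall)   # A B C
```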
Python Dataframe: Remove duplicate words in the same cell within a column in Python
If you're looking to get rid of consecutive duplicates only, this should suffice:
df['Desired'] = df['Current'].str.replace(r'\b(\w+)(\s+\1)+\b', r'\1', regex=True)
df
           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
Details
\b         # word boundary
(\w+)      # 1st capture group of a single word
(
  \s+      # 1 or more spaces
  \1       # reference to first group
)+         # one or more repeats
\b         # word boundary
Regex from here.
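To try the pattern without pandas, here is a minimal sketch using the standard re module in re.VERBOSE mode, so the comments can live inside the regex itself. The repeated group is made non-capturing here, which is a small deviation from the pattern above, and the list cells simply mirrors the Current column.

```python
import re

# Same idea as the pattern above, spelled out with re.VERBOSE
CONSECUTIVE_DUPES = re.compile(r"""
    \b          # word boundary
    (\w+)       # group 1: a single word
    (?:
        \s+     # one or more whitespace characters
        \1      # the same word again
    )+          # one or more repeats
    \b          # word boundary
""", re.VERBOSE)

cells = ["Racoon Dog", "Cat Cat", "Dog Dog Dog Dog", "Rat Fox Chicken"]
cleaned = [CONSECUTIVE_DUPES.sub(r"\1", cell) for cell in cells]
print(cleaned)  # ['Racoon Dog', 'Cat', 'Dog', 'Rat Fox Chicken']
```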
To remove non-consecutive duplicates, I'd suggest a solution involving the OrderedDict data structure:
from collections import OrderedDict
df['Desired'] = (df['Current'].str.split()
                              .apply(lambda x: OrderedDict.fromkeys(x).keys())
                              .str.join(' '))
df
           Current          Desired
0       Racoon Dog       Racoon Dog
1          Cat Cat              Cat
2  Dog Dog Dog Dog              Dog
3  Rat Fox Chicken  Rat Fox Chicken
Remove first occurrence of word in string
This should do the job:
test = 'User Key Account Department Account Start Date'
words = test.split()
# if word doesn't exist in the rest of the word list, add it
test = ' '.join([word for i, word in enumerate(words) if word not in words[i+1:]])
print(test) # User Key Department Account Start Date
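The list comprehension above rescans the rest of the list for every word, which is fine for short strings. For longer inputs, a single-pass sketch of the same "keep only the last occurrence" behaviour is to de-duplicate the reversed word list and then reverse it back. The helper name keep_last_occurrences is hypothetical.

```python
def keep_last_occurrences(text):
    # Walk the words right-to-left so the first time a word is seen
    # is its last occurrence, then restore the original direction
    words = text.split()
    kept_reversed = dict.fromkeys(reversed(words))
    return ' '.join(reversed(list(kept_reversed)))

test = 'User Key Account Department Account Start Date'
print(keep_last_occurrences(test))  # User Key Department Account Start Date
```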
Remove duplicate words in strings in column in every row in data frame
You can use (assuming column name is 0):
from collections import OrderedDict
df[0].str.split().apply(lambda x: ','.join(OrderedDict.fromkeys(x).keys()))
0 Yes,Absolutely
1 No,Nope
2 Win,Lose
Note that you can also use set:
df[0].str.split().apply(lambda x: ','.join(list(set(x))))
But set doesn't guarantee the order.
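If you still want to go through a set, one way to get a stable order back is to sort the unique words by their first position in the original list. This is only a sketch on a plain list (the cell contents are hypothetical), and since words.index rescans the list it is best suited to short cells.

```python
words = "Yes Absolutely Yes".split()  # hypothetical cell contents

unordered = set(words)                        # iteration order is not guaranteed
ordered = sorted(unordered, key=words.index)  # restore first-seen order

print(','.join(ordered))  # Yes,Absolutely
```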
Removing duplicate words from a string in python
In your 1st approach:
data="".join(OrderedDict.fromkeys(data))
basically considers the variable data as an iterable. In this case the string is iterated character by character, so only its unique characters are kept. The unique characters would be t, h, e, the space character, a, and n, and the ordered dictionary is created with 6 keys in total.
In your 2nd approach:
data = "".join(OrderedDict.fromkeys(data.split(" ")))
you are splitting the string into a list (which is also an iterable), and the distinct list elements are the, an, and a, so the ordered dictionary is created with 3 unique values as keys.
And in the final step you are joining them, which means just the keys will be returned as a string.
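The difference between the two approaches can be seen in a small sketch. The input string here is hypothetical, chosen only so that it contains the words mentioned above.

```python
from collections import OrderedDict

data = "the an the a"  # hypothetical input; the original string is not shown

# Iterating a string yields characters, so the keys are unique characters
char_keys = list(OrderedDict.fromkeys(data))
print(char_keys)  # ['t', 'h', 'e', ' ', 'a', 'n']

# Splitting first yields words, so the keys are unique words
word_keys = list(OrderedDict.fromkeys(data.split(" ")))
print(word_keys)  # ['the', 'an', 'a']

# Joining with a space (rather than "") keeps the words separated
print(" ".join(word_keys))  # the an a
```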
Hope this helps.
Removing duplicate characters from a string
If order does not matter, you can use
"".join(set(foo))
set() will create a set of unique letters in the string, and "".join() will join the letters back to a string in arbitrary order.
If order does matter, you can use a dict instead of a set, which since Python 3.7 preserves the insertion order of the keys. (In the CPython implementation, this is already supported in Python 3.6 as an implementation detail.)
foo = "mppmt"
result = "".join(dict.fromkeys(foo))
resulting in the string "mpt". In earlier versions of Python, you can use collections.OrderedDict, which has been available starting from Python 2.7.