How to remove duplicate words from a plain text file using Linux commands
Assuming that the words are one per line, and the file is already sorted:
uniq filename
If the file's not sorted:
sort filename | uniq
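Equivalently, sort's -u flag folds the uniq step into the sort itself:

```shell
# sort -u sorts and drops duplicate lines in one step,
# equivalent to "sort filename | uniq"
printf 'pear\napple\npear\n' | sort -u
# apple
# pear
```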
If they're not one per line, and you don't mind them being one per line:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that removes the hyphen from hyphenated words. "man tr" for more options.
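If you want to strip most punctuation while keeping hyphenated words intact, one workaround is to delete an explicit list of punctuation characters instead of the whole [:punct:] class. A sketch (the sample sentence is made up for illustration):

```shell
# Split on whitespace, delete common punctuation but keep hyphens,
# then sort and deduplicate
printf 'well-known words, words and well-known phrases\n' |
  tr -s '[:space:]' '\n' |
  tr -d '.,;:!?' |
  sort -u
# and
# phrases
# well-known
# words
```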
How to remove duplicate words from a plain text file using Unix commands on Windows
- Put the words on different lines in your file, say, f1.txt. You may refer to How to replace a character by a newline in Vim? for this.
- Then execute the command "sort -u f1.txt > f2.txt"
- Combine the words of f2.txt into a line or lines if required.
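The three steps above can be sketched end to end; the file names and sample words are placeholders, and paste -sd ' ' is one way to handle the final recombination:

```shell
# Step 1 (assumed already done): one word per line in f1.txt
printf 'cat\ndog\ncat\nbird\n' > f1.txt
# Step 2: sort and deduplicate
sort -u f1.txt > f2.txt
# Step 3: rejoin the words into a single line
paste -sd ' ' f2.txt
# bird cat dog
```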
How can I delete duplicate words in a text file
You can try:
grep -o '\w*' a.txt | sort | uniq
where a.txt is your file.
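For example, with a small a.txt (sample contents assumed here):

```shell
# \w* matches runs of word characters; -o prints each match on its own line
printf 'the cat and the dog\n' > a.txt
grep -o '\w*' a.txt | sort | uniq
# and
# cat
# dog
# the
```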
Removing duplicate words from sentences in a file
You can do:
awk '{for(i=1;i<=NF;i++) if(++arr[$i]==1) print $i}' file
Prints:
hello
every
body
word
I
should
remove
the
how
can
i
it
?
To maintain the line structure:
awk '{for(i=1;i<=NF;i++)
if(++arr[$i]==1)
printf "%s%s", $i, OFS
print ""}' file
Prints:
hello every body
word I should remove the
how can i it ?
If the deduplication is only on a per-line basis:
awk '{delete arr
for(i=1;i<=NF;i++)
if(++arr[$i]==1) printf "%s%s", $i, OFS
print ""}' file
Prints:
hello every body
word I should remove the
how can i remove it ?
Bash/sed to remove duplicates from a line-separated word list in a text file
This might work for you (GNU sed):
sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}' file
Keep a list of unique keys in the hold space and if the current key is not in the list print it and add it to the list.
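For example (sample input assumed), the order of first appearance is preserved:

```shell
# Prints each line the first time it is seen, keeping original order
printf 'apple\nbanana\napple\ncherry\nbanana\n' |
  sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}'
# apple
# banana
# cherry
```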
Remove consecutive duplicate words from a file using awk or sed
With GNU awk for the 4th arg to split():
$ cat tst.awk
{
n = split($0,words,/[^[:alpha:]]+/,seps)
prev = ""
for (i=1; i<=n; i++) {
word = words[i]
if (word != prev) {
printf "%s%s", seps[i-1], word
}
prev = word
}
print ""
}
$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”
How to remove duplicate words from a string in a Bash script?
We have this test file:
$ cat file
abc, def, abc, def
To remove duplicate words:
$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def
How it works
:a
This defines a label a.
s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g
This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.
ta
If the last substitution command resulted in a change, this jumps back to label a to try again. In this way, the code keeps looking for duplicates until none remain.
s/(, )+/, /g; s/, *$//
These two substitution commands clean up any leftover comma-space combinations.
Mac OSX or other BSD System
For Mac OSX or other BSD system, try:
sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file
Using a string instead of a file
sed easily handles input either from a file, as shown above, or from a shell string as shown below:
$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef
How to remove duplicated words from sentences using a shell script?
Your code would remove repeated lines; both sort and uniq operate on lines, not words. (And even then, the loop is superfluous; if you wanted to do that, your code should be simplified to just sort -u my_text.txt.)
The usual fix is to split the input to one word per line; there are some complications with real-world text, but the first basic Unix 101 implementation looks like
tr ' ' '\n' <my_text.txt | sort -u
Of course, this gives you the words in a different order than in the original, and saves the first occurrence of every word. If you wanted to discard any words which occur more than once, maybe try
tr ' ' '\n' <my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'
(If your tr doesn't recognize \n as newline, maybe try '\012'.)
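As a worked example (file name and contents are made up), only the words that occur exactly once survive:

```shell
printf 'red blue red green\n' > my_text.txt
# uniq -c prefixes each line with its count; awk keeps only count-1 words
tr ' ' '\n' < my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'
# blue
# green
```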
Here is a dead simple two-pass Awk script which hopefully is a little bit more useful. It collects all the words into memory during the first pass over the file, then on the second, removes any words which occurred more than once.
awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
{ for (i=1; i<=NF; ++i) if (a[$i] > 1) $i="" } 1' my_test.txt my_test.txt
This leaves whitespace where words were removed; fixing that should be easy enough with a final sub().
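One way to do that cleanup (a sketch; the gsub/sub calls squeeze the leftover spaces, and the file contents are made up):

```shell
printf 'the cat\nthe dog\n' > my_test.txt
# First pass counts words; second pass blanks repeated words,
# then squeezes the leftover whitespace before printing
awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
     { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i=""
       gsub(/  +/, " "); sub(/^ +/, ""); sub(/ +$/, ""); print }' my_test.txt my_test.txt
# cat
# dog
```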
A somewhat more useful program would split off any punctuation, and reduce words to lowercase (so that Word, word, Word!, and word? don't count as separate).
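A minimal sketch of that normalization (lowercase everything, then split on both whitespace and punctuation before deduplicating):

```shell
# All four variants collapse to the single word "word"
printf 'Word, word! WORD? word\n' |
  tr '[:upper:]' '[:lower:]' |
  tr -s '[:space:][:punct:]' '\n' |
  sort -u
# word
```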
Finding and Listing Duplicate Words in a Plain Text file
Split each line on /, and take the last item (cut cannot do it, so reverse each line and take the first one); then sort and run uniq with -d, which shows only duplicates.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
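For example, given paths on standard input (sample paths made up), only the basename that appears more than once is printed:

```shell
# rev/cut/rev extracts the last /-separated field (the basename),
# then uniq -d keeps only the duplicated ones
printf '/a/b/file1\n/c/d/file2\n/e/f/file1\n' |
  rev | cut -f1 -d/ | rev | sort | uniq -d
# file1
```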