How to Remove Duplicate Words from a Plain Text File Using Linux Command

How to remove duplicate words from a plain text file using a Linux command

Assuming that the words are one per line, and the file is already sorted:

uniq filename
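
For example, with a sorted one-word-per-line file (assumed sample content):

$ cat filename
apple
apple
banana
cherry
cherry
$ uniq filename
apple
banana
cherry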

If the file's not sorted:

sort filename | uniq
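
Note that sort's -u flag does both steps in one command:

sort -u filename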

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that also splits hyphenated words at the hyphen, since the hyphen is punctuation. See "man tr" for more options.
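
For example (a made-up input; note how well-known is split at the hyphen and its parts merge with the standalone words):

$ echo 'well-known well known' | tr -s '[:space:][:punct:]' '\n' | sort | uniq
known
well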

How to remove duplicate words from a plain text file using Unix commands on Windows

  1. Put the words on different lines in your file, say, f1.txt. You may refer to "How to replace a character by a newline in Vim?" for this.
  2. Then execute the command "sort -u f1.txt > f2.txt".
  3. Combine the words of f2.txt into a line or lines if required.
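
If the words are space-separated, the three steps above can also be sketched as a single pipeline (assuming the same file names; paste -s rejoins the result into one line):

tr -s ' ' '\n' < f1.txt | sort -u | paste -sd' ' - > f2.txt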

How can I delete duplicate words in a text file

You can try:

grep -o '\w*' a.txt | sort | uniq

where a.txt is your file.
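
For example (assumed content for a.txt; GNU grep's \w matches word characters):

$ cat a.txt
foo bar foo
bar baz
$ grep -o '\w*' a.txt | sort | uniq
bar
baz
foo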

Removing duplicate words from sentences in a file
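
The examples below assume an input file along these lines (the original question's file isn't shown; this content is consistent with the outputs):

$ cat file
hello hello every body body
word word I should remove the the
how can i remove it it ?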

You can do:

awk '{for(i=1;i<=NF;i++) if(++arr[$i]==1) print $i}' file

Prints:

hello
every
body
word
I
should
remove
the
how
can
i
it
?

To maintain the line structure:

awk '{for(i=1;i<=NF;i++)
          if(++arr[$i]==1)
              printf "%s%s", $i, OFS
      print ""}' file

Prints:

hello every body 
word I should remove the
how can i it ?

If the deduplication is only on a per line basis:

awk '{delete arr
      for(i=1;i<=NF;i++)
          if(++arr[$i]==1) printf "%s%s", $i, OFS
      print ""}' file

Prints:

hello every body 
word I should remove the
how can i remove it ?

Bash/sed to remove duplicates from a separated-by-line word list in a text file

This might work for you (GNU sed):

sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}' file

Keep a list of unique keys in the hold space; if the current key is not in the list, print it and add it to the list.
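
For example (a made-up word list):

$ printf 'apple\nbanana\napple\ncherry\nbanana\n' | sed -nr 'G;/^([^\n]+\n)([^\n]+\n)*\1/!{P;h}'
apple
banana
cherry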

Remove consecutive duplicate words from a file using awk or sed

With GNU awk for the 4th arg to split():

$ cat tst.awk
{
    n = split($0,words,/[^[:alpha:]]+/,seps)
    prev = ""
    for (i=1; i<=n; i++) {
        word = words[i]
        if (word != prev) {
            printf "%s%s", seps[i-1], word
        }
        prev = word
    }
    print ""
}
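
Given an input file like this (assumed; the original question's input isn't shown, but it reproduces the output below):

$ cat file
“true, true, rohith Rohith;
cold cold burn, and fact and fact good?”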

$ awk -f tst.awk file
“true, rohith Rohith;
cold burn, and fact and fact good?”

How to remove duplicate words from a string in a Bash script?

We have this test file:

$ cat file
abc, def, abc, def

To remove duplicate words:

$ sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//' file
abc, def

How it works

  • :a

    This defines a label a.

  • s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g

    This looks for a duplicated word consisting of alphanumeric characters and removes the second occurrence.

  • ta

    If the last substitution command resulted in a change, this jumps back to label a to try again.

    In this way, the code keeps looking for duplicates until none remain.

  • s/(, )+/, /g; s/, *$//

    These two substitution commands clean up any leftover comma-space combinations.

Mac OS X or other BSD systems

For Mac OS X or other BSD systems, try:

sed -E -e ':a' -e 's/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g' -e 'ta' -e 's/(, )+/, /g' -e 's/, *$//' file

Using a string instead of a file

sed easily handles input either from a file, as shown above, or from a shell string as shown below:

$ echo 'ab, cd, cd, ab, ef' | sed -r ':a; s/\b([[:alnum:]]+)\b(.*)\b\1\b/\1\2/g; ta; s/(, )+/, /g; s/, *$//'
ab, cd, ef

How to remove duplicated words from sentences using a shell script?

Your code would remove repeated lines; both sort and uniq operate on lines, not words. (And even then, the loop is superfluous; if you wanted to do that, your code should be simplified to just sort -u my_text.txt.)

The usual fix is to split the input to one word per line; there are some complications with real-world text, but the first basic Unix 101 implementation looks like

tr ' ' '\n' <my_text.txt | sort -u

Of course, this gives you the words in a different order than in the original, and saves the first occurrence of every word. If you wanted to discard any words which occur more than once, maybe try

tr ' ' '\n' <my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'

(If your tr doesn't recognize \n as newline, maybe try '\012'.)

Here is a dead simple two-pass Awk script which hopefully is a little bit more useful. It collects all the words into memory during the first pass over the file, then on the second, removes any words which occurred more than once.

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
     { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i="" } 1' my_text.txt my_text.txt

This leaves whitespace where words were removed; fixing that should be easy enough with a final sub().
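
A minimal sketch of that cleanup (my assumption; squeeze the leftover spaces and trim the edges before printing):

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
     { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i=""
       gsub(/ +/, " "); sub(/^ /, ""); sub(/ $/, ""); print }' my_text.txt my_text.txt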

A somewhat more useful program would split off any punctuation, and reduce words to lowercase (so that Word, word, Word!, and word? don't count as separate).
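
A hedged sketch of that normalization with tr (lowercase everything first, then split on whitespace and punctuation before deduplicating):

tr '[:upper:]' '[:lower:]' <my_text.txt | tr -s '[:space:][:punct:]' '\n' | sort -u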

Finding and Listing Duplicate Words in a Plain Text File

Split each line on /, and take the last item (cut cannot select the last field directly, so reverse each line and take the first one); then sort and run uniq with -d, which only shows duplicated lines.

rev FILE | cut -f1 -d/ | rev | sort | uniq -d
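
For example, with a file of paths (assumed sample content):

$ cat FILE
/usr/bin/python
/usr/local/bin/python
/usr/bin/perl
$ rev FILE | cut -f1 -d/ | rev | sort | uniq -d
python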

