Finding and Listing Duplicate Words in a Plain Text File

Split each line on /, then take the last item (cut cannot select the last field directly, so reverse each line, take the first field, and reverse again), then sort and run uniq with -d, which prints only duplicated lines.

rev FILE | cut -f1 -d/ | rev | sort | uniq -d
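For example, run against a small file of paths (the file paths.txt and its contents here are invented for illustration):

```shell
# Invented sample data: three paths, two sharing the same basename.
printf '%s\n' /tmp/a/report.txt /var/log/syslog /home/u/report.txt > paths.txt

# Last path component of each line, then keep only the repeated names.
rev paths.txt | cut -f1 -d/ | rev | sort | uniq -d
# → report.txt
```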

Find Duplicate/Repeated or Unique words in a file spanning multiple lines

You can tokenize the words with grep -wo and find consecutive duplicates with uniq -d; add -c to count the number of occurrences, e.g.:

grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd

Output:

2 abc
2 line
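For reference, here is a self-contained run (the contents of infile are invented to reproduce the counts above; note that uniq -c left-pads the counts with spaces):

```shell
# Invented sample text containing two repeated words.
cat > infile <<'EOF'
abc line
line abc other
EOF

# Tokenize into one word per line, sort, then count duplicated words only.
grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd
```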

Python - Locating Duplicate Words in a Text File

You might also want to track previous locations, something like this:

with open(fname) as fh:
    vocab = {}
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)
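If you would rather stay on the command line, the same bookkeeping can be sketched in awk (a rough equivalent, not the Python above: line:field pairs stand in for the (i, j) tuples, and FILE is a placeholder):

```shell
# Collect every word's line:field positions, then print only words
# that occur more than once (their position list contains a space).
awk '{ for (i = 1; i <= NF; ++i)
         loc[$i] = loc[$i] (loc[$i] ? " " : "") NR ":" i }
     END { for (w in loc) if (loc[w] ~ / /) print w, "occurs at", loc[w] }' FILE
```

The for-in loop over the array prints words in no particular order; pipe through sort if you need stable output.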

How do I find and print duplicate words in a document with the linux Find command?

Create a script findDup.sh with the following code:

#!/bin/bash
for n in $(seq 1 13)
do
    no_of_lines=$(grep -n "color$n=" test.php | wc -l)
    if [ "$no_of_lines" -gt 1 ]
    then
        grep -n "color$n=" test.php
        echo "--------"
    fi
done

When you run it in the directory containing your test.php, it will print the duplicated lines along with their line numbers.

Example:

$ ./findDup.sh
3:$color3=$_POST['color3'] ?? '';
6:$color3=$_POST['color3'] ?? '';
--------
9:$color13=$_POST['color13'] ?? '';
13:$color13=$_POST['color13'] ?? '';
--------

You can change the limit from 13 to any value you need in the script above.
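A more general alternative (a sketch, not tied to the colorN pattern) is to read the file twice with awk: the first pass counts each line, the second prints every line whose content repeats, with its number:

```shell
# Invented sample file with one repeated line.
printf 'x\ny\nx\n' > test.php

# Pass 1 (NR == FNR) counts each line; pass 2 prints repeated ones.
awk 'NR == FNR { count[$0]++; next }
     count[$0] > 1 { print FNR ": " $0 }' test.php test.php
```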

How to remove duplicate words from a plain text file using linux command

Assuming that the words are one per line, and the file is already sorted:

uniq filename

If the file's not sorted:

sort filename | uniq

If they're not one per line, and you don't mind them being one per line:

tr -s '[:space:]' '\n' < filename | sort | uniq

That doesn't remove punctuation, though, so maybe you want:

tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq

But that removes the hyphen from hyphenated words. "man tr" for more options.
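One way to keep hyphenated words intact is tr's complement mode: squeeze every character that is *not* alphanumeric or a hyphen into a newline (a sketch; stray leading or trailing hyphens are still kept):

```shell
# Invented sample text with a hyphenated word.
printf 'well-known word word\n' > filename

# -c complements the set, -s squeezes runs: each run of non-word
# characters becomes one newline, but hyphens survive inside words.
tr -cs '[:alnum:]-' '\n' < filename | sort | uniq
```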

Find duplicate words in two text files using command line

Well, then. With awk:

awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt

This works as follows:

NR == FNR {                                   # While processing the first file:
    for(i = 1; i <= NF; ++i) {                # remember which fields were in
        a[NR,tolower($i)] = 1                 # each line (lower-cased)
    }
    next                                      # and do nothing else.
}
{                                             # After that (when processing the
                                              # second file)
    flag = 0                                  # reset flag so we know we haven't
                                              # printed anything yet
    for(i = 1; i <= NF; ++i) {                # wade through fields (words)
        if(a[FNR,tolower($i)]) {              # if this field was in the
                                              # corresponding line in the first
                                              # file, then
            printf("%s%s", flag ? OFS : "", $i)   # print it (with a separator
                                              # if it isn't the first)
            flag = 1                          # and raise the flag
        }
    }
    if(flag) {                                # and if we printed anything,
        print ""                              # add a newline at the end.
    }
}
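A quick demonstration with two invented sample files — a word from f2.txt is printed only when the same word (case-insensitively) appears on the corresponding line of f1.txt:

```shell
# Invented inputs: line 1 shares "foo" and "bar", line 2 shares "green".
printf 'Foo bar baz\nred green\n' > f1.txt
printf 'foo qux BAR\nblue green\n' > f2.txt

awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next }
     { flag = 0
       for(i = 1; i <= NF; ++i)
         if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 }
       if(flag) print "" }' f1.txt f2.txt
# → foo BAR
# → green
```

Note that the words are printed as they appear in the second file (BAR, not bar), since only the array lookup is lower-cased.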

