Finding and Listing Duplicate Words in a Plain Text File
Split each line on /, then take the last item (cut cannot select the last field directly, so reverse each line, take the first field, and reverse it back), then sort and run uniq with -d, which prints only the duplicated lines.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
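A quick sanity check with a made-up file of paths (the file names below are invented for illustration):

```shell
# Build a small sample file, one path per line (hypothetical data)
printf '%s\n' \
    '/usr/bin/sort' \
    '/usr/local/bin/sort' \
    '/usr/bin/uniq' > paths.txt

# "sort" is the only final path component that occurs twice,
# so it is the only line uniq -d prints
rev paths.txt | cut -f1 -d/ | rev | sort | uniq -d
```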
Find Duplicate/Repeated or Unique Words in a File Spanning Multiple Lines
You can tokenize the words with grep -wo, sort them, and find the duplicates with uniq -d; add -c to count the number of occurrences, e.g.:
grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd
Output:
2 abc
2 line
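The question's infile is not shown; here is an assumed input that reproduces that output ("abc" and "line" each appear twice, "second" only once so it is filtered out):

```shell
# Hypothetical infile (the original was not included in the question)
cat > infile <<'EOF'
abc line abc
second line
EOF

# -w matches whole words, -o prints each match on its own line
grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd
```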
Python - Locating Duplicate Words in a Text File
You might also want to track previous locations, something like this (Python 3):
with open(fname) as fh:
    vocab = {}
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)
How do I find and print duplicate words in a document with the Linux find command?
Create a script findDup.sh with the following code:
#!/bin/bash
for n in $(seq 1 13)
do
    no_of_lines=$(grep -n "color$n=" test.php | wc -l)
    if [ "$no_of_lines" -gt 1 ]
    then
        grep -n "color$n=" test.php
        echo "--------"
    fi
done
When you run it in the directory containing your test.php, it prints each set of duplicate lines along with their line numbers.
Example:
$ ./findDup.sh
3:$color3=$_POST['color3'] ?? '';
6:$color3=$_POST['color3'] ?? '';
--------
9:$color13=$_POST['color13'] ?? '';
13:$color13=$_POST['color13'] ?? '';
--------
You can change the limit from 13 to any value you need in the script above.
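To see it in action, here is a made-up test.php (the question's original file is not shown) together with the same loop, using grep -c as a small shortcut for grep -n | wc -l:

```shell
# Hypothetical test.php with one duplicated assignment ($color3)
cat > test.php <<'EOF'
<?php
$color3=$_POST['color3'] ?? '';
$color5=$_POST['color5'] ?? '';
$color3=$_POST['color3'] ?? '';
EOF

# Same logic as findDup.sh: report any color$n= that occurs more than once
for n in $(seq 1 13)
do
    if [ "$(grep -c "color$n=" test.php)" -gt 1 ]
    then
        grep -n "color$n=" test.php   # reports lines 2 and 4 for color3
        echo "--------"
    fi
done
```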
How to remove duplicate words from a plain text file using Linux commands
Assuming that the words are one per line, and the file is already sorted:
uniq filename
If the file's not sorted:
sort filename | uniq
If they're not one per line, and you don't mind them being one per line:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that also removes the hyphen from hyphenated words. See man tr for more options.
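A short demonstration of the whitespace-only variant, with an invented input file:

```shell
# Sample input with words repeated across lines (made-up data)
printf 'the quick the lazy\nquick dog\n' > words.txt

# Put one word per line, then keep each distinct word exactly once
tr -s '[:space:]' '\n' < words.txt | sort | uniq
```

Here "the" and "quick" each collapse to a single line; the output is the sorted list dog, lazy, quick, the.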
Find duplicate words in two text files using command line
Well, then. With awk:
awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt
This works as follows:
NR == FNR {                          # While processing the first file:
    for(i = 1; i <= NF; ++i) {       # remember which fields were in
        a[NR,tolower($i)] = 1        # each line (lower-cased)
    }
    next                             # and do nothing else.
}
{                                    # After that (when processing the
                                     # second file)
    flag = 0                         # reset flag so we know we haven't
                                     # printed anything yet
    for(i = 1; i <= NF; ++i) {       # wade through fields (words)
        if(a[FNR,tolower($i)]) {     # if this field was in the
                                     # corresponding line in the first
                                     # file, then
            printf("%s%s", flag ? OFS : "", $i)
                                     # print it (with a separator if it
                                     # isn't the first)
            flag = 1                 # raise flag
        }
    }
    if(flag) {                       # and if we printed anything,
        print ""                     # add a newline at the end.
    }
}
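A quick run with made-up input files (f1.txt and f2.txt below are invented; the question's files are not shown). Matching is per-line and case-insensitive, and the second file's capitalization is kept:

```shell
# Hypothetical inputs: words shared per line are "cat" (line 1)
# and "green"/"blue" (line 2, differing only in case)
printf 'The cat sat\nred green blue\n' > f1.txt
printf 'a cat ran\nGreen and BLUE\n' > f2.txt

awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt
```

This prints "cat" for the first line and "Green BLUE" for the second.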