Finding and Listing Duplicate Words in a Plain Text File
Split each line on /, then take the last item (cut cannot select the last field directly, so reverse each line, take the first field, and reverse it back), then sort and run uniq with -d, which prints only the duplicated lines.
rev FILE | cut -f1 -d/ | rev | sort | uniq -d
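A quick sanity check with a made-up file of paths (the file names below are invented for illustration):

```shell
# Build a small sample file, one path per line (hypothetical data)
printf '%s\n' \
    '/usr/bin/sort' \
    '/usr/local/bin/sort' \
    '/usr/bin/uniq' > paths.txt

# "sort" is the only final path component that occurs twice,
# so it is the only line uniq -d prints
rev paths.txt | cut -f1 -d/ | rev | sort | uniq -d
```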
Find Duplicate/Repeated or Unique Words in a File Spanning Multiple Lines
You can tokenize the words with grep -wo, sort them, and find the duplicates with uniq -d; add -c to count the number of occurrences, e.g.:
grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd
Output:
2 abc
2 line
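The question's infile is not shown; here is an assumed input that reproduces that output ("abc" and "line" each appear twice, "second" only once so it is filtered out):

```shell
# Hypothetical infile (the original was not included in the question)
cat > infile <<'EOF'
abc line abc
second line
EOF

# -w matches whole words, -o prints each match on its own line
grep -wo '[[:alnum:]]\+' infile | sort | uniq -cd
```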
Python - Locating Duplicate Words in a Text File
You might also want to track previous locations, something like this (Python 3):
with open(fname) as fh:
    vocab = {}
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)
How do I find and print duplicate words in a document with the Linux find command?
Create a script findDup.sh with the following code:
#!/bin/bash
for n in $(seq 1 13)
do
    no_of_lines=$(grep -n "color$n=" test.php | wc -l)
    if [ "$no_of_lines" -gt 1 ]
    then
        grep -n "color$n=" test.php
        echo "--------"
    fi
done
When you run it in the directory containing your test.php, it prints each set of duplicate lines along with their line numbers.
Example:
$ ./findDup.sh
3:$color3=$_POST['color3'] ?? '';
6:$color3=$_POST['color3'] ?? '';
--------
9:$color13=$_POST['color13'] ?? '';
13:$color13=$_POST['color13'] ?? '';
--------
You can change the limit from 13 to any value you need in the script above.
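To see it in action, here is a made-up test.php (the question's original file is not shown) together with the same loop, using grep -c as a small shortcut for grep -n | wc -l:

```shell
# Hypothetical test.php with one duplicated assignment ($color3)
cat > test.php <<'EOF'
<?php
$color3=$_POST['color3'] ?? '';
$color5=$_POST['color5'] ?? '';
$color3=$_POST['color3'] ?? '';
EOF

# Same logic as findDup.sh: report any color$n= that occurs more than once
for n in $(seq 1 13)
do
    if [ "$(grep -c "color$n=" test.php)" -gt 1 ]
    then
        grep -n "color$n=" test.php   # reports lines 2 and 4 for color3
        echo "--------"
    fi
done
```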
How to remove duplicate words from a plain text file using Linux commands
Assuming that the words are one per line, and the file is already sorted:
uniq filename
If the file's not sorted:
sort filename | uniq
If they're not one per line, and you don't mind them being one per line:
tr -s '[:space:]' '\n' < filename | sort | uniq
That doesn't remove punctuation, though, so maybe you want:
tr -s '[:space:][:punct:]' '\n' < filename | sort | uniq
But that also removes the hyphen from hyphenated words. See man tr for more options.
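A short demonstration of the whitespace-only variant, with an invented input file:

```shell
# Sample input with words repeated across lines (made-up data)
printf 'the quick the lazy\nquick dog\n' > words.txt

# Put one word per line, then keep each distinct word exactly once
tr -s '[:space:]' '\n' < words.txt | sort | uniq
```

Here "the" and "quick" each collapse to a single line; the output is the sorted list dog, lazy, quick, the.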
Find duplicate words in two text files using command line
Well, then. With awk:
awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt
This works as follows:
NR == FNR {                          # While processing the first file:
    for(i = 1; i <= NF; ++i) {       # remember which fields were in
        a[NR,tolower($i)] = 1        # each line (lower-cased)
    }
    next                             # and do nothing else.
}
{                                    # After that (when processing the
                                     # second file)
    flag = 0                         # reset flag so we know we haven't
                                     # printed anything yet
    for(i = 1; i <= NF; ++i) {       # wade through fields (words)
        if(a[FNR,tolower($i)]) {     # if this field was in the
                                     # corresponding line in the first
                                     # file, then
            printf("%s%s", flag ? OFS : "", $i)
                                     # print it (with a separator if it
                                     # isn't the first)
            flag = 1                 # raise flag
        }
    }
    if(flag) {                       # and if we printed anything,
        print ""                     # add a newline at the end.
    }
}
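A quick run with made-up input files (f1.txt and f2.txt below are invented; the question's files are not shown). Matching is per-line and case-insensitive, and the second file's capitalization is kept:

```shell
# Hypothetical inputs: words shared per line are "cat" (line 1)
# and "green"/"blue" (line 2, differing only in case)
printf 'The cat sat\nred green blue\n' > f1.txt
printf 'a cat ran\nGreen and BLUE\n' > f2.txt

awk 'NR == FNR { for(i = 1; i <= NF; ++i) { a[NR,tolower($i)] = 1 }; next } { flag = 0; for(i = 1; i <= NF; ++i) { if(a[FNR,tolower($i)]) { printf("%s%s", flag ? OFS : "", $i); flag = 1 } } if(flag) print "" }' f1.txt f2.txt
```

This prints "cat" for the first line and "Green BLUE" for the second.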