Why Does "Uniq" Count Identical Words as Different

Why does uniq count identical words as different?

Try to sort first: uniq only collapses identical lines when they are adjacent, so unsorted input makes the same word show up several times in the counts.

cat .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
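
To see why the sort matters, compare the two on a small sample (a sketch; the exact alignment of the counts varies between uniq implementations):

$ printf 'apple\nbanana\napple\n' | uniq -c
      1 apple
      1 banana
      1 apple
$ printf 'apple\nbanana\napple\n' | sort | uniq -c
      2 apple
      1 banana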

How can I count and display only the words that are repeated more than once using Unix commands?

You can filter out the rows with a count of 1 using grep: the pattern '^ *1 ' matches lines whose leading count is exactly 1, and -v drops them.

cut -d ":" -f 5 file1 | cut -d "," -f 1 | sort | uniq -c | grep -v '^ *1 '

How to count the number of unique lines, duplicate lines and lines that appear three times in a text file

$ echo 'Donald
Donald
Lisa
John
Lisa
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
1 1
1 2
1 3

The right column is the repetition count, and the left column is the number of unique names with that repetition count. E.g. “Donald” has a repetition count of 3.

Bigger example:

echo 'Donald
Donald
Rob
Lisa
WhatAmIDoing
John
Obama
Obama
Lisa
Washington
Donald' | sort | uniq -c | awk '{print $1}' | sort | uniq -c
4 1
2 2
1 3

Four names (“Rob”, “WhatAmIDoing”, “John”, and “Washington”) each have a repetition count of 1. Two names (“Lisa” and “Obama”) each have a repetition count of 2. One name (“Donald”) has a repetition count of 3.
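
If you would rather have labelled output than bare numbers, one more awk stage appended to the pipeline above can format the summary (a sketch):

awk '{printf "%d name(s) occur %d time(s)\n", $1, $2}'

For the bigger example this turns the final three lines into:

4 name(s) occur 1 time(s)
2 name(s) occur 2 time(s)
1 name(s) occur 3 time(s)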

elisp implementation of the uniq -c Unix command to count unique lines

I suppose a common method would be to just hash the strings and then print the contents of the table. This approach is easily accomplished in Emacs.

;; See the Emacs Lisp manual for defining a hash-table test:
;; https://www.gnu.org/software/emacs/manual/html_node/elisp/Defining-Hash.html
(defun case-fold-string= (a b)
  (eq t (compare-strings a nil nil b nil nil t)))

(defun case-fold-string-hash (a)
  (sxhash (upcase a)))

;; Case-insensitive test, so "Flower" and "flower" count as the same string.
(define-hash-table-test 'case-fold
  'case-fold-string= 'case-fold-string-hash)

(defun uniq (beg end)
  "Print counts of the lines in the region between BEG and END.
With a prefix argument, insert the counts at point instead."
  (interactive "r")
  (let ((h (make-hash-table :test 'case-fold))
        (lst (split-string (buffer-substring-no-properties beg end) "\n"
                           'omit-nulls " "))
        (output-func (if current-prefix-arg 'insert 'princ)))
    ;; First count every line, then emit one "count: line" pair per entry.
    (dolist (str lst)
      (puthash str (1+ (gethash str h 0)) h))
    (maphash (lambda (key val)
               (apply output-func (list (format "%d: %s\n" val key))))
             h)))

Output when the selected region contains "flower" four times, "park" once, and "stone" three times:

4: flower
1: park
3: stone

Finding the number of unique values that contain another set of unique values

The problem with your pipeline is that while uniq -c will provide a count of the unique occurrences, "James, I play Baseball" and "James, I play football" are different lines and will therefore be counted separately. You can limit the comparison to the first N characters with uniq's -w N option (in your case -w3), but you are much better off (and much, much more efficient) using a single call to awk.
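
For illustration, this is roughly how the -w variant behaves with GNU uniq (a sketch; only the first three characters are compared, so names that merely share a prefix would be merged as well):

printf 'James, I play Baseball\nJames, I play football\n' | sort | uniq -c -w3
      2 James, I play Baseball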

Here you are concerned with the 2nd field (the name) and whether "play" occurs in the record. You can use /play[[:blank:]]/ (or /[[:blank:]]play[[:blank:]]/) as the test for "play" on its own. Each time a record containing "play" alone is encountered, you increment the count stored in the array a[] indexed by the name (e.g. a[$2]++), and then in the END rule you output each name and its number of occurrences.

That makes the task quite simple, e.g.

awk -F, '/[[:blank:]]play[[:blank:]]/{a[$2]++} END {for (i in a) print i, a[i]}' Dataset.txt 

Output

 James 2
Bob 1

Count unique words in an array in JavaScript

You can consider using a Set: a Set stores each value only once, so spreading it back into an array gives you the unique entries.

const array = [1, 1, 2, 3, 4, 4, 5];

// A Set keeps each value only once; spread it back into an array.
const unique = [...new Set(array)];

console.log(unique.length); // 5

How to remove duplicated words from both sentences using a shell script?

Your code would remove repeated lines; both sort and uniq operate on lines, not words. (And even then, the loop is superfluous; if you wanted to do that, your code should be simplified to just sort -u my_text.txt.)

The usual fix is to split the input to one word per line; there are some complications with real-world text, but the first basic Unix 101 implementation looks like

tr ' ' '\n' <my_text.txt | sort -u

Of course, this gives you the words in a different order than in the original, and keeps a single copy of every word. If you instead want to discard any word which occurs more than once, maybe try

tr ' ' '\n' <my_text.txt | sort | uniq -c | awk '$1 == 1 { print $2 }'

(If your tr doesn't recognize \n as newline, maybe try '\012'.)

Here is a dead simple two-pass Awk script which hopefully is a little bit more useful. It collects all the words into memory during the first pass over the file, then on the second, removes any words which occurred more than once.

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
    { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i="" } 1' my_text.txt my_text.txt

This leaves whitespace where words were removed; fixing that should be easy enough with a final sub().
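
For instance, a variant that also cleans up the leftover blanks might look like this (a sketch; after emptying the duplicated fields it squeezes runs of spaces and trims the ends of each record):

awk 'NR==FNR { for (i=1; i<=NF; ++i) ++a[$i]; next }
     { for (i=1; i<=NF; ++i) if (a[$i] > 1) $i=""
       gsub(/ +/, " "); sub(/^ /, ""); sub(/ $/, ""); print }' my_text.txt my_text.txt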

A somewhat more useful program would split off any punctuation and reduce words to lowercase (so that "Word", "word", "Word!", and "word?" don't count as separate words).
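
One way to sketch that is the classic tr idiom: lowercase everything, then turn every run of non-letters into a newline (note it also treats digits and apostrophes as separators, so "don't" splits into two words):

tr '[:upper:]' '[:lower:]' <my_text.txt | tr -cs '[:alpha:]' '\n' | sort | uniq -c

From there you can reuse the awk '$1 == 1 { print $2 }' stage above if you only want the words that occur exactly once.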


