Use Grep to Remove Words from Dictionary Whose Roots Are Already Present


If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.

awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words

If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.

If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.

awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
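Reading the two scripts above, each word is only compared against the most recently printed word, so they depend on derived words directly following their root, as they do in a lexically sorted list. They also treat the root as a regular expression, so a word containing `.` or `+` could match more than intended. A minimal sketch of the same idea with a literal prefix test; the sample list and the names `dedupe_by_root` and `root` are made up for illustration:

```shell
# Hypothetical sample list, assumed to be sorted lexically (C locale) so that
# each root is immediately followed by the words that start with it.
words='cat
catalog
cats
dog
dogma'

# index($1, root) == 1 is true only when root occurs at the very start of $1.
# No regex is built, so regex metacharacters in a word are taken literally.
dedupe_by_root() {
    awk 'root != "" && index($1, root) == 1 { next } { print; root = $1 }'
}

result=$(printf '%s\n' "$words" | dedupe_by_root)
printf '%s\n' "$result"
```

With this input only `cat` and `dog` should be printed.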

Linux: remove strings from list if they have substrings elsewhere in list

If the list is ordered so that every string comes after any of its substrings (for example, sorted by length), it's pretty simple:

awk '{for(i in a)if(index($0,i))next;a[$0]}1' file

apple
kiwi
mango
oranges

For each line, this loops over the array of lines kept so far and skips the line if any of them occurs inside it. Otherwise the line is added to the array and printed.
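The one-liner is dense, so here is the same logic spelled out as a sketch. The function name `drop_superstrings`, the variable names, and the sample words are illustrative only, and the input is assumed to list every string after any of its substrings:

```shell
drop_superstrings() {
    awk '{
        for (kept in seen)          # every line kept so far
            if (index($0, kept))    # a kept line occurs inside this one
                next                # so drop this line
        seen[$0]                    # referencing the element creates it
        print
    }'
}

result=$(printf '%s\n' ant bee plant beetle wasp | drop_superstrings)
printf '%s\n' "$result"
```

Here `plant` and `beetle` should be dropped because they contain `ant` and `bee`.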

For an unsorted list, this should work:

awk '{for(i in a){if(index(i,$0)&&$0!=i)delete a[i];if(index($0,i))next}a[$0];next}
END{for(i in a)print i}' file

Tested on Wordlist for performance.

real    0m29.932s
user    0m29.918s
sys     0m0.008s
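A more readable sketch of the unsorted variant, under the same logic as the script above. The order in which awk walks `for (i in a)` is unspecified, so this version pipes the result through `sort` to get a stable order; the function name `minimal_strings` and the sample words are made up for illustration:

```shell
minimal_strings() {
    awk '{
        for (kept in seen) {
            # The new line is a proper substring of a kept one: drop the kept one.
            if (index(kept, $0) && $0 != kept)
                delete seen[kept]
            # A kept line occurs inside the new line: drop the new line.
            if (index($0, kept))
                next
        }
        seen[$0]
    }
    END { for (kept in seen) print kept }' | sort
}

result=$(printf '%s\n' plant ant beetle wasp bee | minimal_strings)
printf '%s\n' "$result"
```

Every line is compared against every kept line, so the running time grows roughly with the square of the list size, which fits the timing above.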

Extract words from a file

You could use grep:

  • -E '\w+' searches for words
  • -o only prints the portion of the line that matches

% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.
# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
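Note that `\w` is not defined by POSIX extended regular expressions, although GNU grep accepts it as a synonym for `[_[:alnum:]]`. If that matters, a sketch of the same extraction with the bracket expression spelled out might look like this (the function name `extract_words` and the sample sentence are made up):

```shell
extract_words() {
    # -o prints each match on its own line; the class covers letters, digits and "_"
    grep -o -E '[[:alnum:]_]+'
}

result=$(printf '%s\n' 'The quick, "brown" fox_1 jumped.' | extract_words)
printf '%s\n' "$result"
```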

If you want to print each word only once, disregarding case, you can use sort:

  • -u only prints each word once
  • -f tells sort to ignore case when comparing words

# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
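When two words differ only in case ("The" and "the" above), which spelling survives `sort -u -f` is up to the sort implementation. If a predictable result is preferred, one option is to fold everything to lowercase first. A minimal sketch, assuming lowercase output is acceptable; `unique_words_lower` is a made-up name:

```shell
unique_words_lower() {
    # extract words, fold to lowercase, then sort and de-duplicate bytewise
    grep -o -E '\w+' | tr '[:upper:]' '[:lower:]' | LC_ALL=C sort -u
}

result=$(printf '%s\n' 'The fox and the Fox' | unique_words_lower)
printf '%s\n' "$result"
```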

Use grep or sed to keep only the words that are in another word list file

In Python I would just read the word list file, create a list of strings with the words, then read the input file and output the word if it exists in the array.

And that's how you'd do it in awk too:

$ awk 'FNR == NR { dict[$0] = 1; next } # Read the dictionary file
  { # And for each word of each line of the sentence file
    for (word = 1; word <= NF; word++) {
      if ($word in dict) # See if it is in the dictionary
        printf "%s ", $word
    }
    printf "\n"
  }' dict.txt input.txt
I miss dog
I buy computer

(This does leave a trailing space on each line, but that's easy to filter out if it matters.)
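One way to avoid the trailing space is to build each output line in a variable and only add a separator between words. A sketch under assumed inputs: the contents of `dict.txt` and `input.txt` below are guesses chosen to reproduce the output shown, and `keep_dict_words` is a made-up name:

```shell
# Hypothetical file contents, created in a temporary directory.
tmpdir=$(mktemp -d)
printf '%s\n' I miss dog buy computer > "$tmpdir/dict.txt"
printf '%s\n' 'I miss my dog' 'I want to buy a computer' > "$tmpdir/input.txt"

keep_dict_words() {
    awk 'FNR == NR { dict[$0] = 1; next }   # first file: load the dictionary
    {
        out = ""
        for (word = 1; word <= NF; word++)
            if ($word in dict)               # keep only dictionary words
                out = (out == "" ? $word : out " " $word)
        print out                            # no trailing space
    }' "$1" "$2"
}

result=$(keep_dict_words "$tmpdir/dict.txt" "$tmpdir/input.txt")
printf '%s\n' "$result"
rm -r "$tmpdir"
```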


