Use grep to remove words from dictionary whose roots are already present
If the list is sorted so that shorter strings always precede longer strings, you might be able to get fairly good performance out of a simple Awk script.
awk '$1~r && p in k { next } { k[$1]++; print; r= "^" $1; p=$1 }' words
If the current word matches the prefix regex r (defined in a moment) and the prefix p (ditto) is in the list of seen keys, skip. Otherwise, add the current word to the prefix keys, print the current line, create a regex which matches the current word at beginning of line (this is now the prefix regex r) and also remember the prefix string in p.
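Note that r and p are overwritten every time a line is printed, so the one-liner only ever compares a word with the most recently printed one. If a root and its derivatives are not adjacent (for example in a list sorted purely by length), a variant along these lines compares each word against every root kept so far. This is a rough sketch, not a tuned solution: the function name and sample words are made up, and it uses index() for a literal prefix test instead of a regex.

```shell
# Sketch only: drop_words_with_known_root and the sample words are hypothetical.
# Assumes shorter words come before the longer words derived from them.
drop_words_with_known_root() {
  awk '{
         for (root in kept)                  # compare against every kept root
           if (index($1, root) == 1) next    # literal prefix match: skip this word
         kept[$1]; print                     # new root: remember and print it
       }'
}

printf '%s\n' walk talk walked talked walking | drop_words_with_known_root
# prints:
# walk
# talk
```

The loop over all kept roots makes this quadratic in the worst case, so it trades speed for not depending on adjacency.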
If all the similar strings are always adjacent (as they would be if you sort the file lexically), you could do away with k and p entirely too, I guess.
awk 'NR>1 && $1~r { next } { print; r="^" $1 }' words
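One caveat with building the regex from the word itself: a word containing regex metacharacters (such as . or +) is then interpreted as a pattern rather than literally. A possible workaround, assuming the same "similar words are adjacent" precondition, is to compare prefixes with index(). The function name and the sample words below are invented for illustration.

```shell
# Sketch only: drop_prefixed and the sample words are made up for illustration.
drop_prefixed() {
  LC_ALL=C sort |
    awk 'NR > 1 && index($1, p) == 1 { next }   # starts with the last kept word: skip
         { print; p = $1 }'                      # otherwise print it and keep it as the prefix
}

printf '%s\n' walked walk talk walking talks c++ c++11 | drop_prefixed
# prints:
# c++
# talk
# walk
```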
Linux: remove strings from list if they have substrings elsewhere in list
If the list is sorted it's pretty simple
awk '{for(i in a)if(index($0,i))next;a[$0]}1' file
apple
kiwi
mango
oranges
Basically, for each line this loops over the array of lines kept so far and skips the line if any of them occurs in it as a substring. If none does, the line is added to the array and printed.
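For this to work, every string has to come before any longer string that contains it. A plain lexical sort does not guarantee that (b sorts after ab, yet b is a substring of ab), so sorting by length first is the safer precondition. A rough sketch of doing that in a pipeline, with a made-up function name and sample data:

```shell
# Sketch only: remove_superstrings and the sample words are hypothetical.
remove_superstrings() {
  awk '{ print length($0), $0 }' |   # prefix each line with its length
    LC_ALL=C sort -n |               # shortest lines first
    cut -d" " -f2- |                 # drop the length column again
    awk '{ for (i in a) if (index($0, i)) next; a[$0] } 1'
}

printf '%s\n' pineapple kiwi mango apple kiwifruit | remove_superstrings
# prints:
# kiwi
# apple
# mango
```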
For an unsorted list, this should work:
awk '{for(i in a){if(index(i,$0)&&$0!=i)delete a[i];if(index($0,i))next}a[$0];next}
END{for(i in a)print i}' file
Tested on Wordlist for performance.
real 0m29.932s
user 0m29.918s
sys 0m0.008s
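To make the unsorted version easier to follow, here is the same logic spread over several lines with comments, run on a small made-up list. Since the order of a for (i in a) loop is unspecified in awk, the sketch pipes the result through sort to get stable output. The function name and sample words are assumptions.

```shell
# Sketch only: remove_superstrings_unsorted and the sample words are hypothetical.
remove_superstrings_unsorted() {
  awk '{
         for (i in a) {
           if (index(i, $0) && $0 != i) delete a[i]   # kept entry contains this line: drop it
           if (index($0, i)) next                      # this line contains a kept entry: skip
         }
         a[$0]                                         # nothing matched: keep this line
       }
       END { for (i in a) print i }' | LC_ALL=C sort   # for-in order is unspecified, so sort
}

printf '%s\n' pineapple kiwifruit apple mango kiwi | remove_superstrings_unsorted
# prints:
# apple
# kiwi
# mango
```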
Extract words from a file
You could use grep:
-E '\w+' searches for words
-o only prints the portion of the line that matches
% cat temp
Some examples use "The quick brown fox jumped over the lazy dog,"
rather than "Lorem ipsum dolor sit amet, consectetur adipiscing elit"
for example text.
# if you don't care whether words repeat
% grep -o -E '\w+' temp
Some
examples
use
The
quick
brown
fox
jumped
over
the
lazy
dog
rather
than
Lorem
ipsum
dolor
sit
amet
consectetur
adipiscing
elit
for
example
text
If you want to only print each word once, disregarding case, you can use sort:
-u only prints each word once
-f tells sort to ignore case when comparing words
# if you only want each word once
% grep -o -E '\w+' temp | sort -u -f
adipiscing
amet
brown
consectetur
dog
dolor
elit
example
examples
for
fox
ipsum
jumped
lazy
Lorem
over
quick
rather
sit
Some
text
than
The
use
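Neither -o nor \w is part of POSIX grep, so if your grep lacks them, something like the following tr pipeline may do as a fallback. It is only a sketch: the function name is made up, and it treats a "word" as a run of letters, digits and underscores, which is an assumption about what \w covers in your locale.

```shell
# Sketch only: extract_words is a hypothetical helper reading from stdin.
extract_words() {
  tr -cs '[:alnum:]_' '[\n*]' |   # turn every run of non-word characters into one newline
    sed '/^$/d'                   # drop the empty line left by leading punctuation
}

printf '%s\n' 'Some "quick" fox,' 'the dog.' | extract_words
# prints:
# Some
# quick
# fox
# the
# dog
```

The output can be piped into sort -u -f just like the grep version.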
Use grep or sed to keep only the words that are in another word list file
In Python I would just read the word list file, create a list of strings with the words, then read the input file and output the word if it exists in the array.
And that's how you'd do it in awk too:
$ awk 'FNR == NR { dict[$0] = 1; next } # Read the dictionary file
       { # And for each word of each line of the sentence file
         for (word = 1; word <= NF; word++) {
           if ($word in dict) # See if it is in the dictionary
             printf "%s ", $word
         }
         printf "\n"
       }' dict.txt input.txt
I miss dog
I buy computer
(This does leave a trailing space on each line, but that's easy to filter out if it matters)
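If you'd rather not produce the trailing space in the first place, one option is to print a separator before every kept word except the first. A minimal sketch, assuming a dictionary with one word per line. The function name, the temporary file names and the sample sentences are invented for the example.

```shell
# Sketch only: keep_dictionary_words and the sample files are hypothetical.
keep_dictionary_words() {
  # $1: dictionary file (one word per line), $2: input file
  awk 'FNR == NR { dict[$0] = 1; next }    # first file: load the dictionary
       {
         sep = ""                           # no separator before the first kept word
         for (i = 1; i <= NF; i++)
           if ($i in dict) { printf "%s%s", sep, $i; sep = " " }
         printf "\n"
       }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '%s\n' I miss dog buy computer > "$tmp/dict.txt"
printf '%s\n' 'I really miss my dog' 'I want to buy a computer' > "$tmp/input.txt"
keep_dictionary_words "$tmp/dict.txt" "$tmp/input.txt"
rm -r "$tmp"
# prints:
# I miss dog
# I buy computer
```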