How to Grep in Parallel

How do I grep in parallel

There will not be speed improvement if you are using a HDD to store that directory you are searching in. Hard drives are pretty much single-threaded access units.

But if you really want to do parallel grep, then this website gives two hints of how to do it with find and xargs. E.g.

find . -type f -print0 | xargs -0 -P 4 -n 40 grep -i foobar

How to properly run find | parallel with grep + escape characters?

As @Mark Setchell pointed out, I missed the "--quote" argument! This solution works:

find . -regex "filename pattern" -print0 | parallel -0 --quote grep -P "pattern1\t|pattern2\t|pattern3\t|...|patternN\t"

How to grep a very large file containing a single line in parallel?

parallel --recend '' --pipepart -a filename --block -1 -q grep -Eo "[0-9]+"

How to run grep in parallel on single lines from a list

A little trickier than I first thought since matched lines need to go to separate files - please post performance if you get the chance - this solution can be used in parallel too - the species list file can be chunked and/or the fasta file can be chunked and fed to parallel runs of the script

This takes about 1 minute on an Intel Xeon E5 with a 6GB fake data file checked for 10,000 species - but increasing the species list to 100,0000 even in chunks of 10,000 was problematic as I ran into disk issues with that many files being created and appended to in one directory - the problems began when the species list crossed 50,000 - this number will be different on other systems - I modified the script to create 100 subdirectories and limited each directory to 1000 files - this worked well and all 100,000 files were generated without having to chunk the species list or the 6GB datafile

Also to give you an idea of how fast grep is - it took 6 seconds to match 100,000 species in the 6GB file

specieslist=$1
nspecies=$(wc -l $specieslist|cut -f1 -d' ')
echo -e "grep $nspecies species from $specieslist\n"
grep -A1 -F -f $specieslist|
awk '
# skip context marker
/^--$/{next}
# process pair of lines
# first line is matching species header line
# species is semicolon-delimited field 7 of first line
# second line is sequence - both lines are written to a file with sanitized species name
{
split($0, flds, ";")
species=flds[7]
filekey=gensub(/\W/,".","g",species)
file="fastaout." filekey
if(!(filekey in outfiles)) {
outfiles[filekey]=file
printf("species \"%s\" outfile \"%s\" first match line %d: \"%s\"\n", species, file, NR, $0)
print >file
}
getline; print >>file
# close may be needed on systems where awk cannot juggle too many open files
close(outfile)
}
'
outfiles=(fastaout.*)
noutfiles=${#outfiles[*]}
echo -e "\ncreated $noutfiles fastaout.* files"
head -5 fastaout*

output and slightly modified test inputs follow - species list has some actual matches - fasta file sequence line prefixed with lowercased species to verify correctness and avoid matching species again

output

$ head out.*
==> out.Brachyspira_innocens <==
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
brachyspira_innocens.2:-UC-U-G--G-U---------------------------

==> out.Methanohalophilus_portucalensis_FDF-1 <==
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------

==> out.Pucciniomycotina <==
pucciniomycotina:-UC-U-G--G-U---------------------------

species list

Allobacillus_halotolerans
Allochromatium_truperi
Allochromatium_vinosum
Methanohalophilus_portucalensis_FDF-1
Brachyspira_innocens
Pucciniomycotina

fasta file

HP451749.6.1794_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Pucciniomycotina;Pucciniomycetes;Pucciniales;Pucciniaceae;Puccinia;Puccinia_triticina;.............................................................................
pucciniomycotina:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.1:-UC-U-G--G-U---------------------------
>X96499.1.1810_Eukaryota;Archaeplastida;Chloroplastida;Charophyta;Phragmoplastophyta;Streptophyta;Embryophyta;Marchantiophyta;Jungermanniales;Calypogeia;Plagiochila_adiantoides
plagiochila_adiantoides:-UC-U-G--G-U---------------------------
>AB034906.1.1763_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Ascomycota;Saccharomycotina;Saccharomycetes;Saccharomycetales;Saccharomycetaceae;Citeromyces;Citeromyces_siamensis
citeromyces_siamensis:-UC-U-G--G-U---------------------------
>AY290717.1.1208_Archaea;Euryarchaeota;Methanomicrobia;Methanosarcinales;Methanosarcinaceae;Methanohalophilus;Methanohalophilus_portucalensis_FDF-1
methanohalophilus_portucalensis_fdf-1:-UC-U-G--G-U---------------------------
>EF164984.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_pulli
brachyspira_pulli:-UC-U-G--G-U---------------------------
>AY291120.1.1477_Bacteria;Proteobacteria;Betaproteobacteria;Burkholderiales;Comamonadaceae;Lampropedia;Lampropedia_hyalina
lampropedia_hyalina:-UC-U-G--G-U---------------------------
>EF164987.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838073.1.1461_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS01
streptomyces_sp._qls01:-UC-U-G--G-U---------------------------
>EF164989.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_alvinipulli
brachyspira_alvinipulli:-UC-U-G--G-U---------------------------
>JQ838076.1.1460_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS04
streptomyces_sp._qls04:-UC-U-G--G-U---------------------------
>AB035584.1.1789_Eukaryota;Opisthokonta;Nucletmycea;Fungi;Dikarya;Basidiomycota;Agaricomycotina;Tremellomycetes;Tremellales;Trichosporonaceae;Trichosporon;Trichosporon_debeurmannianum
trichosporon_debeurmannianum:-UC-U-G--G-U---------------------------
>JQ838080.1.1457_Bacteria;Actinobacteria;Actinobacteria;Streptomycetales;Streptomycetaceae;Streptomyces;Streptomyces_sp._QLS11
streptomyces_sp._qls11:-UC-U-G--G-U---------------------------
>EF165015.1.1527_Bacteria;Firmicutes;Clostridia;Clostridiales;Family_XI;Tepidimicrobium;Clostridium_sp._PML3-1
clostridium_sp._pml3-1:-UC-U-G--G-U---------------------------
>U85867.1.1424_Bacteria;Proteobacteria;Gammaproteobacteria;Alteromonadales;Alteromonadaceae;Marinobacter;Marinobacter_sp.
Marinobacter_sp.:-UC-U-G--G-U---------------------------
>EF165044.1.1398_Bacteria;Proteobacteria;Alphaproteobacteria;Rhizobiales;Methylobacteriaceae;Methylobacterium;Methylobacterium_sp._CBMB38
methylobacterium_sp._cbmb38:-UC-U-G--G-U---------------------------
>U85870.1.1458_Bacteria;Proteobacteria;Gammaproteobacteria;Pseudomonadales;Pseudomonadaceae;Pseudomonas;Pseudomonas_sp.
pseudomonas_sp.:-UC-U-G--G-U---------------------------
>EF165046.1.1380_Bacteria;Proteobacteria;Gammaproteobacteria;Enterobacteriales;Enterobacteriaceae;Pantoea;Pantoea_sp._CBMB55
pantoea_sp._cbmb55:-UC-U-G--G-U---------------------------
>EF164983.1.1433_Bacteria;Spirochaetae;Spirochaetes;Spirochaetales;Brachyspiraceae;Brachyspira;Brachyspira_innocens
brachyspira_innocens.2:-UC-U-G--G-U---------------------------

Boosting the grep search using GNU parallel

If you do not need to use grep try:

build_k_mers() {
k="$1"
slot="$2"
perl -ne 'for $n (0..(length $_)-'"$k"') {
$prefix = substr($_,$n,2);
$fh{$prefix} or open $fh{$prefix}, ">>", "tmp/kmer.$prefix.'"$slot"'";
$fh = $fh{$prefix};
print $fh substr($_,$n,'"$k"'),"\n"
}'
}
export -f build_k_mers

rm -rf tmp
mkdir tmp
export LC_ALL=C
# search strings must be sorted for comm
parsort patterns.txt | awk '{print >>"tmp/patterns."substr($1,1,2)}' &

# make shorter lines: Insert \n(last 12 char before \n) for every 32k
# This makes it easier for --pipepart to find a newline
# It will not change the kmers generated
perl -pe 's/(.{32000})(.{12})/$1$2\n$2/g' large_strings.txt > large_lines.txt
# Build 12-mers
parallel --pipepart --block -1 -a large_lines.txt 'build_k_mers 12 {%}'
# -j10 and 20s may be adjusted depending on hardware
parallel -j10 --delay 20s 'parsort -u tmp/kmer.{}.* > tmp/kmer.{}; rm tmp/kmer.{}.*' ::: `perl -e 'map { printf "%02x ",$_ } 0..255'`
wait
parallel comm -23 {} {=s/patterns./kmer./=} ::: tmp/patterns.??

I have tested this on patterns.txt: 9GBytes/725937231 lines, large_strings.txt: 19GBytes/184 lines and on my 64-core machine it completes in 3 hours.

GNU Parallel -- How to understand block-size setting, and guess what to set it to?

parallel --pipepart --block -1 --jobs 10 -a large_file.csv grep -f small_file.csv

--block -1 will split large_file.csv into one block per jobslot (here 10 chunks). The splitting will be done on the fly, so it will not be read into RAM to do the splitting.

Splitting into n evenly sized blocks (where n = number of jobs to run in parallel) often makes sense if the time spent per line is roughly the same. If it varies a lot (say, some lines take 100 times longer to process than others), then it may make more sense to chop into more bits. E.g. --block -10 will split into 10 times as many blocks as --block -1.

The optimal value can seldom be guessed in advance, because it may also depend on how fast your disk is. So try different values and identify where the bottleneck is. It is typically one of disk I/O, CPU, RAM, command startup time.

XARGS, GREP and GNU parallel

cat filepaths | parallel -j 30 grep -Pl '\d+,\d+,\d+,\d+' {} > FoundPatternsInFile.out

In this case you can even leave out {}.



Related Topics



Leave a reply



Submit