How to Remove the Lines Which Appear in File B from Another File A

How to remove the lines which appear in file B from another file A?

If the files are sorted (they are in your example):

comm -23 file1 file2

-2 suppresses the lines that appear only in file2, and -3 suppresses the lines that appear in both files, leaving only the lines unique to file1. If the files are not sorted, pipe them through sort first...
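
For unsorted files, a one-line sketch using process substitution (this assumes bash or another shell that supports <(...)):

comm -23 <(sort file1) <(sort file2)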

See the comm man page for details.

Deleting lines from one file which are in another file

grep -v -x -f f2 f1 should do the trick.

Explanation:

  • -v to select non-matching lines
  • -x to match whole lines only
  • -f f2 to get patterns from f2

One can instead use grep -F (or fgrep) to match fixed strings from f2 rather than patterns, in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns.
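
A minimal illustration of the difference (the file contents here are made up for the sketch):

$ printf 'foo.bar\n' > f2
$ printf 'foo.bar\nfooXbar\nbaz\n' > f1
$ grep -v -x -f f2 f1     # "foo.bar" is a regex: "." matches X, so fooXbar is removed too
baz
$ grep -Fvxf f2 f1        # fixed strings: only the literal line is removed
fooXbar
baz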

How can I remove lines in one file that exist in another?

You could likely use grep with the -v (invert-match) and -f (file) options:

grep -v -f oldfile newfile > newstrip

This keeps any lines in newfile that are not in oldfile and saves them to newstrip. If you are happy with the results, you can simply run afterward:

mv newstrip newfile

This will overwrite newfile with newstrip (removing newstrip).
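
A quick self-contained run-through of the grep step (the file contents are made up for illustration; note that without -x a pattern also matches as a substring):

$ printf 'alpha\nbeta\ngamma\n' > newfile
$ printf 'beta\n' > oldfile
$ grep -v -f oldfile newfile > newstrip
$ cat newstrip
alpha
gamma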

How to remove the lines which appear in file 1 from another file 2, KEEPING empty lines?

With awk, please try the following:

awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1

Explanation: a detailed explanation of the above code.

awk '                  ##Starting the awk program from here.
FNR==NR{               ##Checking FNR==NR, which is TRUE only while file2 is being read.
  arr[$0]              ##Creating array arr with the whole line ($0) as its index.
  next                 ##next skips all further statements for this line.
}
!NF || !($0 in arr)    ##If the line is empty OR not present in arr, print it.
' file2 file1          ##Mentioning the Input_file names here.
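
For instance, with made-up files where file1 contains a blank line; the blank line survives in the output:

$ printf 'keep\n\ndrop\n' > file1
$ printf 'drop\n' > file2
$ awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1
keep
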

Removing lines which match a specific pattern from another file

Another awk:

$ awk -F/ '                                 # / delimiter
NR==FNR {
    a[$1,$2]                                # hash patterns to a
    next
}
{
    if( tf=((substr($1,2),$2) in a) )       # if first part found in hash, result stored in var tf
        print                               # output the header
    if(getline && tf)                       # read the next record; if the previous one was found
        print                               # output the sequence too
}' patterns myfile

Output:

>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA

Edit: To output the ones not found:

$ awk -F/ '                                 # / delimiter
NR==FNR {
    a[$1,$2]                                # hash patterns to a
    next
}
{
    if( tf=((substr($1,2),$2) in a) ) {     # if first part found in hash
        getline                             # consume the next record too
        next
    }
    print                                   # otherwise output
}' patterns myfile

Output:

>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
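
For reference, input files consistent with both outputs above would look like this (a reconstruction, so treat the exact contents as an assumption):

$ cat patterns
m64071_201130_104452/13
m64071_201130_104452/26
$ cat myfile
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
>m64071_201130_104452/26/ccs
TAGACAATGTA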

Delete lines from a file which do not have a key present in another file

A different approach with awk: make objectId= OR & the field separators for fileA, assuming your Input_files look exactly like the samples shown.

awk 'FNR==NR{a[$0];next} ($4 in a)' fileB FS="objectId=|&" fileA
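
To see why $4 holds the objectId value, here is a sketch with made-up sample data (the key-value layout is an assumption, not the asker's actual files):

$ cat fileB
abc-123
$ cat fileA
a=1&b=2&objectId=abc-123&c=3
a=1&b=2&objectId=def-456&c=3
$ awk 'FNR==NR{a[$0];next} ($4 in a)' fileB FS="objectId=|&" fileA
a=1&b=2&objectId=abc-123&c=3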


2nd solution: Using match.

awk '
FNR==NR{                                    ##While reading fileB, store each key line in array a.
  a[$0]
  next
}
match($0,/objectId=[a-zA-Z]+-[0-9]+/){      ##In fileA, locate the objectId=<value> token.
  var=substr($0,RSTART+9,RLENGTH-9)         ##Extract the value after the 9-character "objectId=".
}
var in a                                    ##Print the line if that value is a key from fileB.
' fileB fileA

Remove block of text stored in one file from another file

Finally found the answer:

pcregrep -v -F -f <(seq 2 4) <(for J in {1..5};do seq 5;done)

Here <(seq 2 4) supplies the block of lines to remove (2 through 4) and the second process substitution generates a test input of five repeated 1-5 sequences. For large files you may need to raise the buffer size (see pcregrep's --buffer-size option).

Removing Lines of Text That Exist in Another File

Well, I ended up writing a PHP script after all.

I read both files into a string, then exploded the strings into arrays using \r\n as the delimiter. I then iterated through the arrays to remove any elements of the first that exist in the second, and finally dumped the result back out to a file.

The only problem: when I tried to refactor the stripping routine into a function, passing the array that gets changed (elements removed) by reference slowed it down to the point of needing to be Ctrl-C'd, so I passed it by value and returned the new array instead (counterintuitive). Also, using unset to delete the elements was slow no matter what, so I just set each element to an empty string and skipped those during the dump.

Remove Lines from File which do not appear in another File, error

I wrote a small Python script in a few minutes. It works well; I have tested it with 42,000-character lines and it works fine.

import sys

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()

# first read file 1, discard all fields except the first one (the key);
# str.split() with no argument splits on runs of whitespace like awk fields
# and returns an empty list for blank lines
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()
        if toks:  # robustness against empty lines
            present.add(toks[0])

# now read the second file and write a line to the third one
# only if its key is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)

(First install python if not already present.)

Call my sample script mytool.py and run it like this:

python mytool.py file1.txt file2.txt file3.txt

To process several sets of files at once from a bash script (replacing the original solution), it's easy, although not optimal, because it could all be done in one pass in Python:

<whatever for loop you need>; do
    python mytool.py "$1" "$2" "$3"
done

exactly like you would call awk with 3 files.
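
A quick end-to-end check with made-up inputs (file names and contents are just for illustration):

$ printf 'k1 x\nk2 y\n' > file1.txt
$ printf 'k1 valueA\nk3 valueB\n' > file2.txt
$ python mytool.py file1.txt file2.txt file3.txt
$ cat file3.txt
k1 valueA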


