How to remove the lines which appear on file B from another file A?
If the files are sorted (they are in your example):
comm -23 file1 file2
-23 suppresses the lines that appear in both files (column 3) and the lines unique to file2 (column 2), leaving only the lines unique to file1. If the files are not sorted, pipe them through sort first.
See the comm man page for details.
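A quick sketch with throwaway sample files (names here are illustrative); process substitution covers the unsorted case:

```shell
printf 'a\nb\nc\n' > file1
printf 'b\nc\nd\n' > file2

comm -23 file1 file2                  # prints: a  (the only line unique to file1)

# if the inputs are not sorted, sort them on the fly:
comm -23 <(sort file1) <(sort file2)
```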
Deleting lines from one file which are in another file
grep -v -x -f f2 f1
should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
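A small worked example (file names and contents are made up for illustration):

```shell
printf 'apple\nbanana\ncherry\n' > f1
printf 'banana\n' > f2

# drop from f1 every line that appears verbatim in f2
grep -v -x -F -f f2 f1      # prints: apple, cherry
```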
How can I remove lines in one file that exist in another?
You could likely use grep with the -v (invert-match) and -f (file) options:
grep -v -f oldfile newfile > newstrip
It matches any lines in newfile that are not in oldfile and saves them to newstrip. If you are happy with the results you could easily do afterward:
mv newstrip newfile
This will overwrite newfile with newstrip (removing newstrip).
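The two steps can be chained, with the overwrite guarded on grep succeeding (note that grep exits non-zero when no lines are selected at all, in which case the mv is skipped; sample files below are placeholders):

```shell
printf 'a\nb\nc\n' > newfile
printf 'b\n' > oldfile

# filter, and only replace the original if grep produced output
grep -v -f oldfile newfile > newstrip && mv newstrip newfile
cat newfile                 # prints: a, c
```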
How to remove the lines which appear on file 1 from another file 2 KEEPING empty lines?
With awk, please try the following:
awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1
Explanation: Adding detailed explanation for above code.
awk ' ##Mentioning awk program from here.
FNR==NR{ ##Checking if FNR==NR which will be TRUE when file2 is being read.
arr[$0] ##Creating array with index of $0 here.
next ##next will skip all further statements from here.
}
(!NF || !($0 in arr)) ##If line is empty OR not in arr then print it.
' file2 file1 ##Mentioning Input_file names here.
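A quick check with made-up sample files shows the empty line surviving the filter:

```shell
printf 'keep\nremove\n\nalso keep\n' > file1
printf 'remove\n' > file2

# empty lines (!NF) pass through; non-empty lines must not be in file2
awk 'FNR==NR{arr[$0];next} !NF || !($0 in arr)' file2 file1
# prints: keep, <empty line>, also keep
```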
Removing lines which match with specific pattern from another file
Another awk:
$ awk -F/ '                # / is the field delimiter
NR==FNR {                  # while reading the patterns file
    a[$1,$2]               # hash patterns to a
    next
}
{
    if( tf=((substr($1,2),$2) in a) )  # strip the ">", test the hash, remember the result in tf
        print                          # output the matching header
    if(getline && tf)                  # read the next record; if the header matched
        print                          # output it too
}' patterns myfile
Output:
>m64071_201130_104452/13/ccs
ACAGTCGAGCG
>m64071_201130_104452/26/ccs
TAGACAATGTA
Edit: To output the ones not found:
$ awk -F/ '                # / is the field delimiter
NR==FNR {
    a[$1,$2]               # hash patterns to a
    next
}
{
    if( tf=((substr($1,2),$2) in a) ) {  # if first part found in hash
        getline                          # consume the next record too
        next
    }
    print                                # otherwise output
}' patterns myfile
Output:
>m64071_201130_104452/16/ccs
ACAGTCGAGCG
>m64071_201130_104452/20/ccs
CAGTCGAGCGC
>m64071_201130_104452/22/ccs
CACACATCTCG
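The patterns file is assumed to hold id/number pairs, one per line, matching the record headers without the leading ">". A condensed, self-contained run of the "output the ones not found" variant on tiny made-up data:

```shell
printf 'm64071_201130_104452/13\n' > patterns
printf '>m64071_201130_104452/13/ccs\nAAAA\n>m64071_201130_104452/16/ccs\nTTTT\n' > myfile

# matching header: swallow the sequence line with getline and skip both;
# everything else falls through to print
awk -F/ 'NR==FNR{a[$1,$2];next}
         {if(tf=((substr($1,2),$2) in a)){getline;next} print}' patterns myfile
# prints the /16/ record only
```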
Delete lines from a file which does not have a key present in another file
A different approach with awk, making objectId= or & the field separators for fileA, assuming that your Input_files are the same as the samples shown.
awk 'FNR==NR{a[$0];next} ($4 in a)' fileB FS="objectId=|&" fileA
2nd solution: Using match.
awk '
FNR==NR{
a[$0]
next
}
match($0,/objectId=[a-zA-Z]+-[0-9]+/){
var=substr($0,RSTART+9,RLENGTH-9)
}
var in a
' fileB fileA
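A sketch of the match-based solution on assumed sample data (the fileA line format below, URL-style key=value pairs carrying an objectId, is a guess for illustration):

```shell
printf 'id=1&objectId=abc-123&x=2\nid=2&objectId=zzz-999&x=3\n' > fileA
printf 'abc-123\n' > fileB

# substr skips the 9 characters of "objectId=", leaving just the value,
# which is then looked up in the keys collected from fileB
awk 'FNR==NR{a[$0];next}
     match($0,/objectId=[a-zA-Z]+-[0-9]+/){var=substr($0,RSTART+9,RLENGTH-9)}
     var in a' fileB fileA
# prints only the abc-123 line
```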
Remove block of text stored in one file from another file
Finally found the answer:
pcregrep -v -F -f <(seq 2 4) <(for J in {1..5};do seq 5;done)
For large files you need to raise pcregrep's buffer size (see the --buffer-size option).
Removing Lines of Text That Exist in Another File
Well, I ended up writing a PHP script after all.
I read both files into a string, then exploded the strings into arrays using \r\n
as the delimiter. I then iterated through the arrays to remove any elements that exist in the other file, and finally dumped them back out to a file.
The only problem was that by trying to refactor the stripping routine to a function, I found that passing the array that gets changed (elements removed) by reference caused it to slow down to the point of needing to be Ctrl-C’d, so I just passed by value and returned the new array (counterintuitive). Also, using unset
to delete the elements was slow no matter what, so I just set the element to an empty string and skipped those during the dump.
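The same effect (drop from one file every line that appears in another, tolerating the CRLF endings described) can be had in a short pipeline; the file names here are placeholders:

```shell
printf 'keep\r\ndrop\r\n' > new.txt
printf 'drop\r\n' > old.txt

# normalize line endings, then remove exact whole-line matches
tr -d '\r' < old.txt > old.lf
tr -d '\r' < new.txt | grep -vxFf old.lf     # prints: keep
```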
Remove Lines from File which not appear in another File, error
I wrote a small python script in a few minutes. Works well, I have tested with 42000-char lines and it works fine.
import sys

# rudimentary argument parsing
file1 = sys.argv[1]
file2 = sys.argv[2]
file3 = sys.argv[3]

present = set()
# first read file 1, discard all fields except the first one (the key)
with open(file1, "r") as f1:
    for l in f1:
        toks = l.split()  # same as awk fields
        if toks:  # robustness against empty lines
            present.add(toks[0])

# now read the second one and write into the third only if the id is in the set
with open(file2, "r") as f2:
    with open(file3, "w") as f3:
        for l in f2:
            toks = l.split()
            if toks and toks[0] in present:
                f3.write(l)
(First install python if not already present.)
Call my sample script mytool.py
and run it like this:
python mytool.py file1.txt file2.txt file3.txt
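For a self-contained check, a condensed copy of the script can be exercised on tiny sample files (file names and contents are made up):

```shell
cat > mytool.py <<'EOF'
import sys
file1, file2, file3 = sys.argv[1:4]
present = set()
with open(file1) as f1:
    for l in f1:
        toks = l.split()          # awk-style fields
        if toks:
            present.add(toks[0])
with open(file2) as f2, open(file3, "w") as f3:
    for l in f2:
        toks = l.split()
        if toks and toks[0] in present:
            f3.write(l)
EOF
printf 'k1 x\nk2 y\n' > file1.txt
printf 'k1 data\nk3 data\n' > file2.txt
python3 mytool.py file1.txt file2.txt file3.txt
cat file3.txt                     # prints: k1 data
```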
To process several files at once from a bash script (replacing the original solution) is easy, although not optimal, since it could all be done in a single pass in Python:
<whatever the for loop you need>; do
python mytool.py "$1" "$2" "$3"
done
exactly like you would call awk with 3 files.