Two files: keep lines with identical first n characters only
Your code doesn't work because your pattern doesn't match anything. The regular expression ^[5] means "the character '5' at the beginning of the string" (the square brackets define a character class), not "5 characters at the beginning of the string". The latter would be ^.{5}. Also, you never match the content of a.txt against the content of b.txt.
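The difference between the two patterns can be checked quickly. The following is a small illustrative sketch in Python rather than PowerShell (both regex engines treat these two patterns the same way); the sample strings are made up for the demonstration.

```python
import re

# Made-up sample strings, not taken from the question's files.
lines = ["5abcdef", "abcdefgh", "abc"]

# ^[5] matches only a literal '5' at the start of the string.
starts_with_five = [s for s in lines if re.search(r"^[5]", s)]

# ^.{5} matches any five characters at the start of the string,
# so strings shorter than five characters do not match.
has_five_chars = [s for s in lines if re.search(r"^.{5}", s)]

print(starts_with_five)  # ['5abcdef']
print(has_five_chars)    # ['5abcdef', 'abcdefgh']
```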
There are several ways to do what you want:
Extract the first 5 characters from each line of b.txt to an array and compare the lines of a.txt against that array. Esperento57's answer sort of uses this approach, but in a way that requires PowerShell v3 or newer. A variant that'll work on all PowerShell versions could look like this:
$pattern = '^(.{5}).*'
$ref = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
       Get-Unique
Get-Content 'a.txt' | Where-Object {
    $ref -contains ($_ -replace $pattern, '$1')
} | Set-Content 'results.txt'
Since lookups in arrays are comparatively slow and don't scale well (they get significantly slower with increasing number of elements in the array), you could also put the reference values in a hashtable so you can do index lookups (which are significantly faster):
$pattern = '^(.{5}).*'
$ref = @{}
(Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
    ForEach-Object { $ref[$_] = $true }
Get-Content 'a.txt' | Where-Object {
    $ref.ContainsKey(($_ -replace $pattern, '$1'))
} | Set-Content 'results.txt'
Another alternative would be to build a second regular expression from the substrings extracted from b.txt and compare the content of a.txt against that expression:
$pattern = '^(.{5}).*'
$list = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
        Get-Unique |
        ForEach-Object { [regex]::Escape($_) }
$ref = '^({0})' -f ($list -join '|')
(Get-Content 'a.txt') -match $ref | Set-Content 'results.txt'
Note that each of these approaches will ignore lines shorter than 5 characters.
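For comparison, the hashtable idea translates almost directly to other languages. Below is a minimal Python sketch of the same approach, using a set for constant-time lookups. The function name `keep_matching_prefix` and the sample data are hypothetical, and lines are assumed to have their trailing newline already stripped.

```python
def keep_matching_prefix(a_lines, b_lines, n=5):
    """Return the lines of a_lines whose first n characters also
    start at least one line of b_lines (hypothetical helper)."""
    # Collect the reference prefixes once; lines shorter than n are ignored.
    prefixes = {line[:n] for line in b_lines if len(line) >= n}
    # Set membership is a hash lookup, so this scales with len(a_lines).
    return [line for line in a_lines if len(line) >= n and line[:n] in prefixes]


# Made-up sample data for illustration.
a = ["abcde000", "0123456xxx", "xyz"]
b = ["012345aabbcc", "kkkkkkkhhkkvv", "xyz"]
print(keep_matching_prefix(a, b))  # ['0123456xxx']
```

As with the PowerShell variants, lines shorter than 5 characters are dropped.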
Compare two files, and keep only if first word of each line is the same
Use awk where you iterate over both files:
$ awk 'NR == FNR { a[$1] = 1; next } a[$1]' a.txt b.txt
hello dolly 1
tom sawyer 2
super man 4
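NR == FNR is only true while awk reads the first file, so { a[$1] = 1; next } runs only on that file. The trailing a[$1] pattern then prints each line of the second file whose first word was recorded from the first.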
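If awk is not available, the same two-pass logic can be sketched in Python. This is only an approximation of the one-liner above: the helper names and the sample input are invented for illustration, since the original input files are not shown.

```python
def first_word(line):
    """Return the first whitespace-separated word, or '' for a blank line."""
    parts = line.split()
    return parts[0] if parts else ""


def lines_with_shared_first_word(first_file_lines, second_file_lines):
    # Pass 1: remember every first word seen in the first file.
    seen = {first_word(line) for line in first_file_lines}
    # Pass 2: keep lines of the second file whose first word was seen.
    return [line for line in second_file_lines if first_word(line) in seen]


# Hypothetical sample input.
a = ["hello world", "tom jones", "super girl"]
b = ["hello dolly 1", "tom sawyer 2", "bat man 3", "super man 4"]
for line in lines_with_shared_first_word(a, b):
    print(line)
```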
How to compare two files containing many long strings then extract lines with at least n consecutive identical chars?
Given the format of the files, the most efficient implementation would be something like this:
- Load all b strings into a [hashtable] or [HashSet[string]]
- Filter the contents of a by:
  - Extracting the substring from each line with String.Split(':') or similar
  - Checking whether it exists in the set from step 1
$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)
Get-Content .\path\to\a |Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # Discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)
    if($FilterStrings.Contains($searchString)){
        # we found a match, write it to the new file
        $searchString |Add-Content .\path\to\matchedStrings.txt
        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} |Set-Content .\path\to\filteredStrings.txt
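The two steps above can also be sketched in Python. This assumes, as the PowerShell comments suggest, that each line of a has the form prefix:middle:suffix; the function name and sample data are hypothetical.

```python
def partition_by_middle_field(a_lines, b_strings):
    """Split a_lines into (matched middle fields, remaining lines).

    Assumes each line looks like 'prefix:middle:suffix' (an assumption
    inferred from the PowerShell comments, not confirmed by the question).
    """
    wanted = set(b_strings)          # step 1: load all b strings into a set
    matched, remaining = [], []
    for line in a_lines:
        parts = line.split(":", 2)   # at most three parts, like Split(":", 3)
        middle = parts[1] if len(parts) > 1 else None
        if middle in wanted:
            matched.append(middle)   # would go to matchedStrings.txt
        else:
            remaining.append(line)   # would go to filteredStrings.txt
    return matched, remaining


# Made-up sample data.
a = ["id1:AAAA:x", "id2:BBBB:y", "id3:CCCC:z"]
b = ["BBBB"]
matched, remaining = partition_by_middle_field(a, b)
print(matched)    # ['BBBB']
print(remaining)  # ['id1:AAAA:x', 'id3:CCCC:z']
```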
Compare and delete lines based on first x character match (between two files)
You can try this:
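with open('a.txt') as f1, open('b.txt') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()
result = []
for line1 in lines1:
    for line2 in lines2:
        # keep line1 only if it has at least 5 non-blank characters
        # and shares its first 5 characters with a line of b.txt
        if len(line1.strip()) >= 5 and line1[:5] == line2[:5]:
            result.append(line1)
            break  # avoid appending line1 once per matching line of b.txt
with open('a.txt', 'w') as f1:
    f1.writelines(result)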
Note that Python's slices can be deceptive: s[:100] on a string shorter than 100 characters silently returns the whole string instead of raising an error. You should therefore check that each line contains a sufficient number of characters. In the method above, this is done with the condition len(line1.strip()) >= 5, which guarantees that lines shorter than 5 characters, as well as longer lines consisting only of spaces, are eliminated.
For example:
a.txt
---------------
abcde000
0123456xxx
xyzxyzxyz
kkkkkkkkkkk
1
# <== 10 spaces here
2
3
b.txt
---------------
012345aabbcc
kkkkkkkhhkkvv
nnnnnnnnnnn
# <== 12 spaces here
1
2
3
result (a.txt)
---------------
0123456xxx
kkkkkkkkkkk
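The slicing behaviour mentioned in the note is easy to verify. The sketch below shows why the length check matters; `same_prefix` is a hypothetical helper that checks plain length rather than the stripped length used in the answer.

```python
short = "abc"
prefix = short[:5]   # slicing past the end is not an error
print(repr(prefix))  # 'abc'


def same_prefix(line1, line2, n=5):
    """True only if both lines have at least n characters and share the first n."""
    return len(line1) >= n and len(line2) >= n and line1[:n] == line2[:n]


# Without the length check, "1"[:5] == "1"[:5] would count as a match.
print(same_prefix("1", "1"))                      # False
print(same_prefix("0123456xxx", "012345aabbcc"))  # True
```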
Linux: Comparing two files but not caring what line only content
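awk can also help. The following prints every line of fileB whose first field does not occur as a first field in fileA:
awk 'NR==FNR {a[$1]=$1; next} !($1 in a) {print $0}' fileA fileB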
How do I compare lines in two files WITHOUT respect to their position in those files (set difference operation)
For simple line-oriented comparisons, the comm command might be all you need:
$ tail a.txt b.txt
==> a.txt <==
a
b
c
d
f
g
==> b.txt <==
a
b
c
e
g
h
$ comm -23 <(sort a.txt) <(sort b.txt)
d
f
$ comm -13 <(sort a.txt) <(sort b.txt)
e
h
Also, it's probably worth it to enable the --unique flag on sort in order to remove duplicate lines:
comm -23 <(sort --unique a.txt) <(sort --unique b.txt)
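The same set difference can be sketched in Python, which may be handy where process substitution is not available. The function name `only_in_first` is hypothetical; the data is the a.txt/b.txt content shown above. Note that Python sorts by code point, which may differ from the locale-dependent order used by sort.

```python
def only_in_first(first_lines, second_lines):
    """Lines present in first_lines but not in second_lines, deduplicated
    and sorted (roughly comm -23 on sort --unique output)."""
    return sorted(set(first_lines) - set(second_lines))


a = ["a", "b", "c", "d", "f", "g"]
b = ["a", "b", "c", "e", "g", "h"]
print(only_in_first(a, b))  # ['d', 'f']   like comm -23
print(only_in_first(b, a))  # ['e', 'h']   like comm -13
```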