Diff Files Comparing Only First N Characters of Each Line

Two files: keep lines with identical first n characters only

Your code doesn't work because your pattern doesn't match anything. The regular expression ^[5] means "the character '5' at the beginning of the string" (the square brackets define a character class), not "5 characters at the beginning of the string". The latter would be ^.{5}. Also, you never match the content of a.txt against the content of b.txt.
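To see the difference between the two patterns concretely, here is a small sketch in Python rather than PowerShell (an assumption made purely for illustration; both regex engines treat [5] as a character class and .{5} as a quantifier). The sample strings are made up.

```python
import re

char_class = re.compile(r'^[5]')    # a literal '5' at the start of the string
first_five = re.compile(r'^.{5}')   # any five characters at the start of the string

# Hypothetical sample lines, not taken from the question
samples = ['5 apples', 'hello world', 'abc']

class_hits = [s for s in samples if char_class.search(s)]
five_hits = [m.group(0) for m in map(first_five.search, samples) if m]

print(class_hits)  # ['5 apples']
print(five_hits)   # ['5 app', 'hello']
```

Only the line that literally starts with 5 matches the first pattern, while the second pattern matches any line with at least five characters and captures exactly those five.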

There are several ways to do what you want:

  • Extract the first 5 characters from each line of b.txt into an array and compare the lines of a.txt against that array. Esperento57's answer sort of uses this approach, but in a way that requires PowerShell v3 or newer. A variant that'll work on all PowerShell versions could look like this:

    $pattern = '^(.{5}).*'

    # Get-Unique only collapses adjacent duplicates, so sort and deduplicate in one step
    $ref = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
        Sort-Object -Unique

    Get-Content 'a.txt' | Where-Object {
        $ref -contains ($_ -replace $pattern, '$1')
    } | Set-Content 'results.txt'
  • Array lookups with -contains scan the array element by element, so they get slower as the number of elements grows. You could instead put the reference values in a hashtable, where key lookups are significantly faster and stay fast as the table grows:

    $pattern = '^(.{5}).*'

    $ref = @{}
    (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
        ForEach-Object { $ref[$_] = $true }

    Get-Content 'a.txt' | Where-Object {
        $ref.ContainsKey(($_ -replace $pattern, '$1'))
    } | Set-Content 'results.txt'
  • Another alternative would be to build a second regular expression from the substrings extracted from b.txt and compare the content of a.txt against that expression:

    $pattern = '^(.{5}).*'

    # Sort-Object -Unique removes all duplicates; Get-Unique would only remove adjacent ones
    $list = (Get-Content 'b.txt') -match $pattern -replace $pattern, '$1' |
        Sort-Object -Unique |
        ForEach-Object { [regex]::Escape($_) }
    $ref = '^({0})' -f ($list -join '|')

    (Get-Content 'a.txt') -match $ref | Set-Content 'results.txt'

Note that each of these approaches will ignore lines shorter than 5 characters.
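For comparison, the hashtable idea can be sketched in Python with a set. This is only an illustration under assumptions: the function name keep_matching_prefixes and the sample lines are hypothetical, and the lines are assumed to have had their line endings stripped already.

```python
def keep_matching_prefixes(a_lines, b_lines, n=5):
    """Return the lines of a_lines whose first n characters also start a line of b_lines."""
    # Only lines with at least n characters contribute a reference prefix.
    prefixes = {line[:n] for line in b_lines if len(line) >= n}
    # Set membership tests do not slow down noticeably as the set grows.
    return [line for line in a_lines if len(line) >= n and line[:n] in prefixes]


# Hypothetical sample data
a_lines = ['abcde000', '0123456xxx', 'abc', 'kkkkkkkkkkk']
b_lines = ['012345aabbcc', 'kkkkkkkhhkkvv', 'abc']

kept = keep_matching_prefixes(a_lines, b_lines)
print(kept)  # ['0123456xxx', 'kkkkkkkkkkk']
```

As with the PowerShell variants, the line 'abc' is ignored on both sides because it is shorter than 5 characters.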

Compare two files, and keep only if first word of each line is the same

Use awk where you iterate over both files:

$ awk 'NR == FNR { a[$1] = 1; next } a[$1]' a.txt b.txt
hello dolly 1
tom sawyer 2
super man 4

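NR == FNR is only true while awk reads the first file, so { a[$1] = 1; next } runs only on that file. For the second file, the bare pattern a[$1] prints every line whose first word was recorded.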
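The two-pass logic of that one-liner can be spelled out in Python as a rough sketch. The function name and the input lines are invented for illustration (the original input files are not shown), and unlike awk this version simply skips blank lines.

```python
def lines_with_known_first_word(first_lines, second_lines):
    # Pass 1 (awk: NR == FNR): remember the first word of every line in the first file.
    seen = {line.split()[0] for line in first_lines if line.split()}
    # Pass 2 (awk: a[$1]): keep lines of the second file whose first word was seen.
    return [line for line in second_lines
            if line.split() and line.split()[0] in seen]


# Hypothetical contents for a.txt and b.txt
a_lines = ['hello world', 'tom jones', 'super mario']
b_lines = ['hello dolly 1', 'tom sawyer 2', 'lorem ipsum 3', 'super man 4']

matches = lines_with_known_first_word(a_lines, b_lines)
print(matches)  # ['hello dolly 1', 'tom sawyer 2', 'super man 4']
```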

How to compare two files containing many long strings then extract lines with at least n consecutive identical chars?

Given the format of the files, the most efficient implementation would be something like this:

  1. Load all b strings into a [hashtable] or [HashSet[string]]
  2. Filter the contents of a by:

    • Extracting the substring from each line with String.Split(':') or similar
    • Checking whether it exists in the set from step 1

$FilterStrings = [System.Collections.Generic.HashSet[string]]::new(
    [string[]]@(
        Get-Content .\path\to\b
    )
)

Get-Content .\path\to\a |Where-Object {
    # Split the line into the prefix, middle, and suffix;
    # Discard the prefix and suffix
    $null,$searchString,$null = $_.Split(":", 3)

    if($FilterStrings.Contains($searchString)){
        # we found a match, write it to the new file
        $searchString |Add-Content .\path\to\matchedStrings.txt

        # make sure it isn't passed through
        $false
    }
    else {
        # substring wasn't found to be in `b`, let's pass it through
        $true
    }
} |Set-Content .\path\to\filteredStrings.txt
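The same partitioning can be sketched in Python with in-memory lists instead of files. It assumes, as the Split(":", 3) call suggests, that each line of a has the form prefix:middle:suffix; the function name and the sample lines are made up for illustration.

```python
def partition_by_middle_field(a_lines, b_strings):
    """Split a_lines into (matched middle fields, lines passed through)."""
    wanted = set(b_strings)            # plays the role of the HashSet
    matched, remaining = [], []
    for line in a_lines:
        parts = line.split(':', 2)     # at most three pieces, like Split(":", 3)
        middle = parts[1] if len(parts) > 1 else None
        if middle in wanted:
            matched.append(middle)     # would go to matchedStrings.txt
        else:
            remaining.append(line)     # would go to filteredStrings.txt
    return matched, remaining


# Hypothetical sample data
a_lines = ['id1:alpha:x', 'id2:beta:y:z', 'no-colon']
b_strings = ['alpha', 'gamma']

matched, remaining = partition_by_middle_field(a_lines, b_strings)
print(matched)    # ['alpha']
print(remaining)  # ['id2:beta:y:z', 'no-colon']
```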

Compare and delete lines based on first x character match (between two files)

You can try this:

with open('a.txt') as f1, open('b.txt') as f2:
    lines1 = f1.readlines()
    lines2 = f2.readlines()

result = []

for line1 in lines1:
    for line2 in lines2:
        # compare prefixes only when line1 has at least 5 non-blank characters
        if len(line1.strip()) >= 5 and line1[:5] == line2[:5]:
            result.append(line1)
            break  # keep each line once, even if several lines of b.txt match

with open('a.txt', 'w') as f1:
    f1.writelines(result)

Note that Python's slices can be deceptive: s[:100] on a string of 100 characters or fewer returns the string unchanged instead of raising an error. You should therefore check that each line contains enough characters. In the code above this is done by the condition len(line1.strip()) >= 5, which drops lines shorter than 5 characters as well as lines that consist only of spaces, however long they are.

For example:

a.txt
---------------
abcde000
0123456xxx
xyzxyzxyz
kkkkkkkkkkk

1
# <== 10 spaces here
2
3

b.txt
---------------
012345aabbcc
kkkkkkkhhkkvv
nnnnnnnnnnn
# <== 12 spaces here

1
2
3

result (a.txt)
---------------
0123456xxx
kkkkkkkkkkk
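The slicing behaviour described in the note above is easy to reproduce. The helper same_prefix below is a hypothetical name; it only wraps the same condition used in the loop.

```python
short_a = 'ab'
short_b = 'ab'

# Slicing past the end does not raise; it just returns what is there.
print(short_a[:5])                 # ab
print(short_a[:5] == short_b[:5])  # True, although neither string has 5 characters


def same_prefix(x, y, n=5):
    # Require n non-blank characters in x before comparing the prefixes.
    return len(x.strip()) >= n and x[:n] == y[:n]


print(same_prefix('ab', 'ab'))                    # False
print(same_prefix('0123456xxx', '012345aabbcc'))  # True
print(same_prefix(' ' * 10, ' ' * 12))            # False
```

Without the length check, two short lines such as 1 in a.txt and 1 in b.txt would count as sharing their first 5 characters.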

Linux: Comparing two files but not caring what line only content

awk can also help. The following prints every line of fileB whose first field does not appear in fileA:

awk 'NR==FNR {a[$1]=$1; next} !($1 in a) {print $0}' fileA fileB

How do I compare lines in two files WITHOUT respect to their position in those files (set difference operation)

For simple line-oriented comparisons, the comm command might be all you need:

$ tail a.txt b.txt 
==> a.txt <==
a
b
c
d
f
g

==> b.txt <==
a
b
c
e
g
h
$ comm -23 <(sort a.txt) <(sort b.txt)
d
f
$ comm -13 <(sort a.txt) <(sort b.txt)
e
h

Also, it's probably worth it to enable the --unique flag on sort in order to remove duplicate lines:

comm -23 <(sort --unique a.txt) <(sort --unique b.txt)
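If a shell is not available, the same set difference can be sketched in Python. The function name only_in_first is hypothetical, and the lists stand in for the two files shown above. Converting to sets also removes duplicate lines, much like sort --unique.

```python
def only_in_first(first, second):
    """Lines that occur in first but not in second, sorted and de-duplicated."""
    return sorted(set(first) - set(second))


# Contents of a.txt and b.txt from the example above
a_lines = ['a', 'b', 'c', 'd', 'f', 'g']
b_lines = ['a', 'b', 'c', 'e', 'g', 'h']

print(only_in_first(a_lines, b_lines))  # ['d', 'f']  (like comm -23)
print(only_in_first(b_lines, a_lines))  # ['e', 'h']  (like comm -13)
```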

