How to Compare 2 Files Fast Using .NET

How to compare 2 files fast using .NET?

A checksum comparison will most likely be slower than a byte-by-byte comparison.

In order to generate a checksum, you'll need to load each byte of the file, and perform processing on it. You'll then have to do this on the second file. The processing will almost definitely be slower than the comparison check.

As for generating a checksum: You can do this easily with the cryptography classes. Here's a short example of generating an MD5 checksum with C#.
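A minimal sketch of generating an MD5 checksum with the cryptography classes might look like the following. The class and method names (ChecksumExample, ComputeMd5) are made up for illustration and are not from the original answer.

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Hypothetical helper class, named for illustration only.
public static class ChecksumExample
{
    // Returns the MD5 checksum of the file at 'path' as a lowercase hex string.
    public static string ComputeMd5(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            // ComputeHash reads the stream to the end, so the whole file is hashed.
            byte[] hash = md5.ComputeHash(stream);
            return BitConverter.ToString(hash).Replace("-", "").ToLowerInvariant();
        }
    }
}
```

Calling ChecksumExample.ComputeMd5 on each of two files and comparing the returned strings gives the checksum comparison discussed above.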

However, a checksum may be faster and make more sense if you can pre-compute the checksum of the "test" or "base" case. If you have an existing file, and you're checking to see if a new file is the same as the existing one, pre-computing the checksum on your "existing" file would mean only needing to do the DiskIO one time, on the new file. This would likely be faster than a byte-by-byte comparison.
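The pre-computation idea could be sketched roughly as follows, assuming the baseline file does not change after its hash is taken. BaselineChecker and its members are hypothetical names chosen for this example.

```csharp
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

// Hypothetical class: hashes the "existing" file once, then checks new files against it.
public sealed class BaselineChecker
{
    private readonly byte[] _baselineHash;

    public BaselineChecker(string baselinePath)
    {
        // The baseline file is read from disk only here.
        _baselineHash = HashFile(baselinePath);
    }

    // Reads only the candidate file; the baseline hash is reused.
    public bool Matches(string candidatePath)
    {
        return HashFile(candidatePath).SequenceEqual(_baselineHash);
    }

    private static byte[] HashFile(string path)
    {
        using (var md5 = MD5.Create())
        using (var stream = File.OpenRead(path))
        {
            return md5.ComputeHash(stream);
        }
    }
}
```

Each call to Matches then costs one pass over the new file, which is where the saving over a two-file byte-by-byte comparison would come from.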

Comparing two files in C#

Depending on how far you're looking to take it, you can take a look at Diff.NET.

Here's a simple file comparison function:

// Requires: using System.IO;
//
// This method accepts two strings that represent the paths of two files
// to compare. A return value of true indicates that the contents of the
// files are the same. A return value of false indicates that the
// files are not the same.
private bool FileCompare(string file1, string file2)
{
    int file1byte;
    int file2byte;

    // Determine if the same file was referenced two times.
    if (file1 == file2)
    {
        // Return true to indicate that the files are the same.
        return true;
    }

    // Open the two files. The using statements close both streams
    // on every exit path, including exceptions.
    using (FileStream fs1 = new FileStream(file1, FileMode.Open, FileAccess.Read))
    using (FileStream fs2 = new FileStream(file2, FileMode.Open, FileAccess.Read))
    {
        // Check the file sizes. If they are not the same, the files
        // are not the same.
        if (fs1.Length != fs2.Length)
        {
            // Return false to indicate files are different
            return false;
        }

        // Read and compare a byte from each file until either a
        // non-matching set of bytes is found or until the end of
        // file1 is reached.
        do
        {
            // Read one byte from each file.
            file1byte = fs1.ReadByte();
            file2byte = fs2.ReadByte();
        }
        while ((file1byte == file2byte) && (file1byte != -1));

        // Return the success of the comparison. "file1byte" is
        // equal to "file2byte" at this point only if the files are
        // the same.
        return file1byte == file2byte;
    }
}

C# - Compare Two Text Files

You would then have to compare the string content of the files. The StreamReader (which ReadLines uses) should detect the encoding.

var areEquals = System.IO.File.ReadLines("c:\\file1.txt").SequenceEqual(
System.IO.File.ReadLines("c:\\file2.txt"));

Note that ReadLines will not read the complete file into memory.

how to check if 2 files are equal using .NET?

If you want to be 100% sure of the exact bytes in the file being the same, then opening two streams and comparing each byte of the files is the only way.

If you just want to be pretty sure (99.9999%?), I would calculate an MD5 hash of each file and compare the hashes instead. Check out System.Security.Cryptography.MD5CryptoServiceProvider.

In my testing, if the files are usually equivalent then comparing MD5 hashes is about three times faster than comparing each byte of the file.

If the files are usually different then comparing byte-by-byte will be much faster, because you don't have to read in the whole file, you can stop as soon as a single byte differs.

Edit: I originally based this answer off a quick test which read from each file byte-by-byte, and compared them byte-by-byte. I falsely assumed that the buffered nature of the System.IO.FileStream would save me from worrying about hard disk block sizes and read speeds; this was not true. I retested my program that reads from each file in 4096 byte chunks and then compares the chunks - this method is slightly faster overall than MD5 even when the files are exactly the same, and will of course be much faster if they differ.
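The retested program is not shown, but a chunked comparison along those lines might look like this sketch. ChunkedFileComparer, AreEqual and FillBuffer are hypothetical names, and the 4096-byte chunk size is simply the figure mentioned above.

```csharp
using System;
using System.IO;

// Hypothetical helper class, named for illustration only.
public static class ChunkedFileComparer
{
    private const int ChunkSize = 4096;

    public static bool AreEqual(string path1, string path2)
    {
        using (var fs1 = File.OpenRead(path1))
        using (var fs2 = File.OpenRead(path2))
        {
            // Files of different length cannot be equal.
            if (fs1.Length != fs2.Length) return false;

            var buffer1 = new byte[ChunkSize];
            var buffer2 = new byte[ChunkSize];

            while (true)
            {
                int read1 = FillBuffer(fs1, buffer1);
                int read2 = FillBuffer(fs2, buffer2);

                if (read1 != read2) return false;
                if (read1 == 0) return true; // both streams ended without a mismatch

                // Stop at the first differing byte.
                for (int i = 0; i < read1; i++)
                {
                    if (buffer1[i] != buffer2[i]) return false;
                }
            }
        }
    }

    // Stream.Read may return fewer bytes than requested, so keep reading
    // until the buffer is full or the end of the stream is reached.
    private static int FillBuffer(Stream stream, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = stream.Read(buffer, total, buffer.Length - total);
            if (n == 0) break;
            total += n;
        }
        return total;
    }
}
```

Because the loop returns at the first mismatching chunk, files that differ early are rejected without reading the rest of either file.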

I'm leaving this answer as a mild warning about the FileStream class, and because I still think it has some value as an answer to "how do I calculate the MD5 of a file in .NET". Apart from that though, it's not the best way to fulfill the original request.

Example of calculating the MD5 hashes of two files (now tested!):

using (var reader1 = new System.IO.FileStream(filepath1, System.IO.FileMode.Open, System.IO.FileAccess.Read))
{
    using (var reader2 = new System.IO.FileStream(filepath2, System.IO.FileMode.Open, System.IO.FileAccess.Read))
    {
        byte[] hash1;
        byte[] hash2;

        using (var md51 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md51.ComputeHash(reader1);
            hash1 = md51.Hash;
        }

        using (var md52 = new System.Security.Cryptography.MD5CryptoServiceProvider())
        {
            md52.ComputeHash(reader2);
            hash2 = md52.Hash;
        }

        int j = 0;
        for (j = 0; j < hash1.Length; j++)
        {
            if (hash1[j] != hash2[j])
            {
                break;
            }
        }

        if (j == hash1.Length)
        {
            Console.WriteLine("The files were equal.");
        }
        else
        {
            Console.WriteLine("The files were not equal.");
        }
    }
}

Compare two files any format c#

A different approach, and a much simpler one in my opinion, would be to use an MD5 hash:

string filePath = @"C:\Users\Gabriel\Desktop\Test.txt";
string filePath2 = @"C:\Users\Gabriel\Desktop\Test2.txt";
string hash;
string hash2;

using (var md5 = MD5.Create())
{
    using (var stream = File.OpenRead(filePath))
    {
        hash = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
    }
    using (var stream = File.OpenRead(filePath2))
    {
        hash2 = BitConverter.ToString(md5.ComputeHash(stream)).Replace("-", "").ToLower();
    }
}
if (hash == hash2)
{
    // Both files are the same, so you can do your stuff here
}

Be aware that an MD5 hash is computed from the contents of the file only; it doesn't consider the file name. So if you create 2 identical text files with different names, they will be considered the same. If you need to check the names too, you could try changing the last if statement to something like this:

if (hash == hash2)
{
    FileInfo file = new FileInfo(filePath);
    FileInfo file2 = new FileInfo(filePath2);
    if (file.Name == file2.Name)
    {
        // Both files are the same, so you can do your stuff here
    }
}

.NET Fastest way to find similarity and difference of two files?

You probably want something like the similarity reported by a binary diff utility, not a dumb byte-by-byte comparison. But hey, just for fun...

unsafe static long DumbDifference(string file1Path, string file2Path)
{
    // Untested sketch! Requires System.IO, System.IO.MemoryMappedFiles
    // and compiling with /unsafe.
    // Also, map views in chunks if you plan to use it on large files.
    // Note that an empty file cannot be mapped with a capacity of 0.

    // Take the lengths from the file system: the ByteLength of a view
    // is rounded up to the system page size.
    long length1 = new FileInfo(file1Path).Length;
    long length2 = new FileInfo(file2Path).Length;
    long minLength = Math.Min(length1, length2);

    using (MemoryMappedFile file1 = MemoryMappedFile.CreateFromFile(
        file1Path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    using (MemoryMappedFile file2 = MemoryMappedFile.CreateFromFile(
        file2Path, FileMode.Open, null, 0, MemoryMappedFileAccess.Read))
    // The views must be read-only as well, because the mappings are.
    using (MemoryMappedViewAccessor view1 =
        file1.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
    using (MemoryMappedViewAccessor view2 =
        file2.CreateViewAccessor(0, 0, MemoryMappedFileAccess.Read))
    {
        byte* ptr1 = null, ptr2 = null;
        try
        {
            view1.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr1);
            view2.SafeMemoryMappedViewHandle.AcquirePointer(ref ptr2);

            // Every byte past the end of the shorter file counts as a difference.
            long differences = Math.Abs(length1 - length2);

            for (long i = 0; i < minLength; ++i)
            {
                // if you expect your files to be pretty similar,
                // you could optimize this by comparing long-sized chunks.
                if (ptr1[i] != ptr2[i]) differences++;
            }

            return differences;
        }
        finally
        {
            // Release only the pointers that were actually acquired.
            if (ptr1 != null) view1.SafeMemoryMappedViewHandle.ReleasePointer();
            if (ptr2 != null) view2.SafeMemoryMappedViewHandle.ReleasePointer();
        }
    }
}

Too bad .NET had no SIMD support built in when this was written; newer versions provide System.Numerics.Vector<T>.

comparing the contents of two huge text files quickly

  • Call File.ReadLines() (.NET 4) instead of ReadAllLines() (.NET 2.0).

    ReadAllLines needs to build an array to hold the return value, which can be extremely slow for large files.

    If you're not using .NET 4.0, replace it with a StreamReader.

  • Build a Dictionary<string, string> with the matchCollects (once), then loop through the foundList and check whether the dictionary contains matchFound.

    This allows you to replace the O(n) inner loop with an O(1) hash check.

  • Use a StreamWriter instead of calling AppendText

  • EDIT: Call Path.GetFileNameWithoutExtension and the other Path methods instead of manually manipulating strings.

For example:

// foundlist, start and end are the variables from the code in the question.
var collection = File.ReadLines(@"C:\found.txt")
    .ToDictionary(s => s.Split('\\')[3].Replace(".txt", ""));

using (var writer = new StreamWriter(@"C:\Copy.txt")) {
    foreach (string found in foundlist) {
        string[] splitFound = found.Split('|');
        string matchFound = Path.GetFileNameWithoutExtension(found);

        // TryGetValue does the lookup and fetches the line in one step.
        string collectedLine;
        if (collection.TryGetValue(matchFound, out collectedLine)) {
            end++;
            long finaldest = (start - end);
            Console.WriteLine(finaldest);
            writer.WriteLine("copy \"" + collectedLine + "\" \"C:\\OUT\\"
                + splitFound[1] + "\\" + splitFound[0] + ".txt\"");
        }
    }
}

File comparison in VB.Net

I would say hashing the file is the way to go; it's how I have done it in the past.

Use Using statements when working with Streams and such, as they clean themselves up.
Here is an example.

Public Function CompareFiles(ByVal file1FullPath As String, ByVal file2FullPath As String) As Boolean

    If Not File.Exists(file1FullPath) OrElse Not File.Exists(file2FullPath) Then
        'One or both of the files does not exist.
        Return False
    End If

    If file1FullPath = file2FullPath Then
        ' file1FullPath and file2FullPath point to the same file...
        Return True
    End If

    Try
        Dim file1Hash As String = hashFile(file1FullPath)
        Dim file2Hash As String = hashFile(file2FullPath)

        Return file1Hash = file2Hash

    Catch ex As Exception
        Return False
    End Try
End Function

Private Function hashFile(ByVal filepath As String) As String
    Using reader As New System.IO.FileStream(filepath, IO.FileMode.Open, IO.FileAccess.Read)
        Using md5 As New System.Security.Cryptography.MD5CryptoServiceProvider
            Dim hash() As Byte = md5.ComputeHash(reader)
            ' Base64 keeps every hash byte intact. Decoding the raw bytes as
            ' Unicode text can replace invalid sequences and lose information.
            Return Convert.ToBase64String(hash)
        End Using
    End Using
End Function

