Diff Files Inside of Zip Without Extracting It

How to read data from a zip file without having to unzip the entire file

DotNetZip is your friend here.

As easy as:

using (ZipFile zip = ZipFile.Read(ExistingZipFile))
{
    ZipEntry e = zip["MyReport.doc"];
    e.Extract(OutputStream);
}

(You can also extract to a file or other destinations.)

Reading the zip file's table of contents is as easy as:

using (ZipFile zip = ZipFile.Read(ExistingZipFile))
{
    bool header = true; // print the column headings only once
    foreach (ZipEntry e in zip)
    {
        if (header)
        {
            System.Console.WriteLine("Zipfile: {0}", zip.Name);
            if ((zip.Comment != null) && (zip.Comment != ""))
                System.Console.WriteLine("Comment: {0}", zip.Comment);
            System.Console.WriteLine("\n{1,-22} {2,8} {3,5} {4,8} {5,3} {0}",
                                     "Filename", "Modified", "Size", "Ratio", "Packed", "pw?");
            System.Console.WriteLine(new System.String('-', 72));
            header = false;
        }
        System.Console.WriteLine("{1,-22} {2,8} {3,5:F0}% {4,8} {5,3} {0}",
                                 e.FileName,
                                 e.LastModified.ToString("yyyy-MM-dd HH:mm:ss"),
                                 e.UncompressedSize,
                                 e.CompressionRatio,
                                 e.CompressedSize,
                                 (e.UsesEncryption) ? "Y" : "N");
    }
}

Edited to note: DotNetZip used to live at CodePlex, which has since been shut down. The old archive is still available at CodePlex. It looks like the code has migrated to GitHub:

  • https://github.com/DinoChiesa/DotNetZip. Looks to be the original author's repo.
  • https://github.com/haf/DotNetZip.Semverd. This looks to be the currently maintained version. It's also packaged up and available via NuGet at https://www.nuget.org/packages/DotNetZip/

Reading contents of zip file without extracting

You are actually reading exactly what is in the file.

The \r\n sequence is the newline on Windows. The question "Difference between \n and \r?" goes into a bit more detail, but what it comes down to is that Windows uses \r\n as its newline.

The b' prefix you are seeing is related to Python and how it reads the file. The question "What does the 'b' character do in front of a string literal?" does a good job of explaining why that happens, but the documentation quoted there is:

Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.

EDIT: I actually found a very similar answer you can draw on for reading without the extra characters: "py3k: How do you read a file inside a zip file as text, not bytes?". The basic idea is to use this:

items_file = io.TextIOWrapper(items_file, encoding='your-encoding', newline='')
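For context, here is a minimal sketch of how that wrapper might fit into a complete read. The function name read_zip_member_text and the default UTF-8 encoding are assumptions made for illustration, not part of the original answer. Note one difference from the line above: with newline='' the \r\n endings are passed through untranslated, whereas the default (newline=None) used here translates them to \n.

```python
import io
import zipfile


def read_zip_member_text(zip_path, member, encoding="utf-8"):
    """Return the decoded text of one member without extracting the archive.

    Hypothetical helper; the encoding default is an assumption.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # ZipFile.open() yields a binary stream; TextIOWrapper decodes it.
        with archive.open(member) as raw:
            # newline=None (the default) translates \r\n to \n on read.
            with io.TextIOWrapper(raw, encoding=encoding) as text:
                return text.read()
```

Pass newline='' to TextIOWrapper instead if the text is going to be handed to the csv module, which expects to see the line endings unmodified.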

Compare ZIP file with dir with shell command

You can install a command line tool called unzip, and run

$ unzip -l yourzipfile.zip

Files contained in yourzipfile.zip will be listed.
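If unzip is not available, the same listing can probably be produced with Python's standard zipfile module. This is only a sketch; the function name list_zip is made up for illustration.

```python
import zipfile


def list_zip(zip_path):
    """Return (name, uncompressed size, compressed size) for each entry."""
    with zipfile.ZipFile(zip_path) as archive:
        # infolist() reads the central directory only; no member is decompressed.
        return [(info.filename, info.file_size, info.compress_size)
                for info in archive.infolist()]


if __name__ == "__main__":
    import sys
    for name, size, packed in list_zip(sys.argv[1]):
        print(f"{size:>10} {packed:>10}  {name}")
```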

========

To verify files automatically, you can follow these steps.

If the files compressed into yourzipfile.zip are in dir1, you can first unzip yourzipfile.zip into dir2, and then compare the files in dir1 and dir2 by running

$ diff --brief -r dir1/ dir2/
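If you would rather not unpack into a second directory at all, a comparison along the following lines may work. It is a rough sketch, not the approach from the answer above: the function name diff_zip_against_dir is hypothetical, and it reads each member fully into memory, so it assumes the files are small enough for that.

```python
import os
import zipfile


def diff_zip_against_dir(zip_path, directory):
    """Compare archive members with the files under `directory`.

    Returns three sorted lists: names only in the zip, names only in the
    directory, and names present in both whose bytes differ.
    """
    with zipfile.ZipFile(zip_path) as archive:
        # Skip directory entries; zip member names always use "/".
        zip_names = {i.filename for i in archive.infolist() if not i.is_dir()}

        dir_names = set()
        for root, _dirs, files in os.walk(directory):
            for name in files:
                rel = os.path.relpath(os.path.join(root, name), directory)
                dir_names.add(rel.replace(os.sep, "/"))

        changed = []
        for name in sorted(zip_names & dir_names):
            with open(os.path.join(directory, *name.split("/")), "rb") as f:
                # archive.read() decompresses into memory, not onto disk.
                if archive.read(name) != f.read():
                    changed.append(name)

    return sorted(zip_names - dir_names), sorted(dir_names - zip_names), changed
```

Calling diff_zip_against_dir("yourzipfile.zip", "dir1") gives roughly the information that diff --brief -r reports, without creating dir2.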

Why does zipping the same content twice give two files with different SHA1?

According to Wikipedia (http://en.wikipedia.org/wiki/Zip_(file_format)), zip files seem to have header fields for the file's last modification time and last modification date, so a zip file checked into git will appear to git to have changed whenever the zip is rebuilt from the same content. And it seems that there is no flag to tell it not to set those headers.

I am resorting to just using tar; it seems to produce the same bytes for the same input when run multiple times.
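To illustrate the timestamp effect described above, here is a small sketch using Python's zipfile module. It is not the tar approach from this answer: it builds the archive in memory and stamps every entry with one fixed time, so that the same input should give the same bytes on the same machine and Python build. The names build_zip, sha1 and FIXED_TIME are made up for the example.

```python
import hashlib
import io
import zipfile

FIXED_TIME = (1980, 1, 1, 0, 0, 0)  # earliest timestamp the zip format allows


def build_zip(files, date_time=FIXED_TIME):
    """Build a zip in memory from {name: bytes}, stamping every entry with date_time."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name in sorted(files):  # fixed order, independent of dict order
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            archive.writestr(info, files[name])
    return buffer.getvalue()


def sha1(data):
    """Hex SHA1 digest of a bytes object."""
    return hashlib.sha1(data).hexdigest()


if __name__ == "__main__":
    content = {"a.txt": b"hello", "b.txt": b"world"}
    # Same content, same timestamp: the digests match.
    print(sha1(build_zip(content)) == sha1(build_zip(content)))
    # Same content, different timestamp: the digests differ.
    later = (2020, 1, 2, 3, 4, 6)
    print(sha1(build_zip(content)) == sha1(build_zip(content, date_time=later)))
```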


