Archival Filesystem or Format

virt-sparsify can be used to sparsify and (through QEMU's qcow2 compression support) compress almost any Linux filesystem or disk image. The resulting images can be mounted in a VM, or on the host through guestmount.
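For instance, a raw image can be sparsified, converted, and compressed in one pass, then mounted on the host. This is only a sketch; the file and mount-point names are assumptions:

# Sparsify and compress into qcow2 (--compress requires a qcow2 output)
virt-sparsify --compress --convert qcow2 disk.img disk-sparse.qcow2

# Inspect the result on the host without booting a VM
guestmount -a disk-sparse.qcow2 -i --ro /mnt/inspect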

There's a new nbdkit xz plugin that can be used for higher compression while still keeping good random-access performance (as long as you ask xz/pixz to reset compression at block boundaries).
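A minimal sketch of that workflow, assuming a 16 MiB block size and the plugin-style invocation (newer nbdkit releases ship xz support as a filter instead, so check your version's manual):

# Compress with independent blocks so reads can seek without
# decompressing the whole stream
xz -T0 --block-size=16MiB -k disk.img

# Serve the compressed image as a block device
nbdkit xz file=disk.img.xz        # or: nbdkit --filter=xz file disk.img.xz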

Looking for a one-file filesystem or archiving file format

If you need archiving (i.e. sequential compression and decompression, with no modifications), then ZIP is a standard, and there's probably nothing close to ZIP in popularity. However, ZIP is not efficient when you manipulate files inside it (i.e. when you need a virtual file system). In the latter case you can use CodeBase File System, Solid File System (our product) or one of the similar products.

Storing lots of small files: archive vs. filesystem

Small files don't compress especially well, so you may not gain much compression.

While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.

I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.

I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.

New images could be added at the end, and the index updated appropriately.
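A rough shell sketch of that indexed single-file layout (the image names, PNG extension, and index format are all assumptions):

#!/bin/bash
# Pack all images into one file and keep a name -> offset/size index.
offset=0
: > images.bin
: > images.idx
for f in images/*.png; do
    size=$(stat -c %s "$f")                          # GNU stat
    printf '%s\t%s\t%s\n' "$f" "$offset" "$size" >> images.idx
    cat "$f" >> images.bin
    offset=$((offset + size))
done

# Pull one image back out by name using the recorded offset and size (GNU dd).
read -r off len <<< "$(awk -F'\t' -v n="images/photo42.png" \
                           '$1 == n { print $2, $3 }' images.idx)"
dd if=images.bin bs=1 skip="$off" count="$len" of=/tmp/photo42.png status=none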

It isn't fancy, but fancy is what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility, but at the cost of efficiency. When you know what you want to do, sometimes simple is better.

I would consider implementing one solution that reads files from a single folder, and another that divides the files into subfolders and sub-subfolders so there are no more than 100 or so files in any given folder, then timing both so you have something to compare against. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images as you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.

Custom-made archive format question

Most file systems operate with clusters, pages or blocks, which have a fixed size. In many filesystems the directory (metadata) is just a special file, so it can grow the same way regular data files grow. On other filesystems some master metadata block has a fixed size which is pre-allocated during file system formatting. In this case the file system can become full before files take up all available space.
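For example, on ext4 the inode table is sized at mkfs time, so a disk packed with tiny files can report "full" while plenty of data blocks remain free. The device and mount point below are assumptions:

df -h /mnt/somedir                                    # data-block usage
df -i /mnt/somedir                                    # inode (metadata) usage
sudo tune2fs -l /dev/sdb1 | grep -E 'Block size|Inode count'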

On a side note, is there a reason to reinvent the wheel (a custom file system for private needs)? There exist implementations of in-file virtual file systems which are similar to archives but provide more functionality. One example is our SolFS.

Is there an archive file format that supports being split into multiple parts and can be unpacked natively on MS Windows?

CAB archives partly meet this purpose (see this library's page, for example; it says that through it, archives can even be extracted directly from an HTTP(S)/FTP server). Since the library relies on .NET, it could even be used on Linux through Mono/Wine, which is crucial if your servers aren't running Windows... because the archive must be created on a server, right?

Your major problem is rather that a split archive can't be created in parallel on multiple servers, at least because of LZX's dictionary. Each server would have to create the whole set of archives and send only the parts it is responsible for, and you don't have ANY guarantee that the archive sets produced on each server would be identical.

The best way is probably to create the whole archive on ONE server, then distribute each part (or the whole split archive...) to your various servers through a replication-like mechanism.

Otherwise, you can also make individual archives that each contain only a subset of the directory tree (you'll have to partition the files across servers), but that won't meet your requirements, since it would be a collection of individual archives and not one big split archive.

Some clarifications may be needed:

  • Do you absolutely need a system with no client beside the browser? Or can you use other protocols, as long as they exist natively on Windows (like the FTP and SSH clients that are now provided by default)?
  • What is the real purpose behind this request? Distributing load across all servers? Avoiding a single very large download (e.g. a 30 GB archive) in case of transfer failure? Or both?
  • In the case of a file-size problem, why not rely on resuming the download?

Random-access archive for Unix use

You can check out duplicity. It allows you to make compressed and encrypted backups and provides random access to files. You can find more info about the project here: http://duplicity.nongnu.org/new_format.html.

If you want to use it, you can also check out the duply script. It is a shell front end for duplicity. More info: http://sourceforge.net/projects/ftplicity/
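A small usage sketch, assuming a local backup target and a GPG key ID of ABCD1234 (both made up; check the duplicity manual for your version):

# Encrypted, compressed, incremental backup of a directory
duplicity /home/me/photos file:///mnt/backup/photos --encrypt-key ABCD1234

# Restore a single file without unpacking the whole backup set
duplicity restore --file-to-restore 2021/cat.jpg \
    file:///mnt/backup/photos /tmp/cat.jpg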

With which data format can I distribute a large number of small files?

I just did some benchmarks:

Experiments / Benchmarks

I used dtrx to extract each of the following archives, and timed "dtrx filename" to get the extraction time.

Format      File size    Time to extract
.7z         27.7 MB      > 1h
.tar.bz2    29.1 MB      7.18s
.tar.lzma   29.3 MB      6.43s
.xz         29.3 MB      6.56s
.tar.gz     33.3 MB      6.56s
.zip        57.2 MB      > 30min
.jar        70.8 MB      5.64s
.tar        177.9 MB     5.40s

Interesting. The extracted content is 47 MB in size. Why is the .tar more than 3 times the size of its contents? (Presumably tar's per-file headers and 512-byte block padding, which add up quickly with many small files.)

Anyway. I think tar.bz2 might be a good choice.
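For reference, creating and timing such a tar.bz2 is a one-liner each way; the directory names below are assumptions:

tar -cjf dataset.tar.bz2 dataset/
mkdir -p /tmp/extract-test
time tar -xjf dataset.tar.bz2 -C /tmp/extract-test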

Archiving files to another disk on Linux

I want to archive the files that exist in my pwd and are older than two years.

This consists of two steps:

  • finding all files older than two years
  • archiving the files from step one

To "find" files use find. The program find takes the argument -mtime. Command [ does not take -mtime argument.

"Archiving" I believe is just cp. To archive to a disc, you first have to format that disc to store files, and then you can copy files into the filesystem created on that disc.

Overall, it would be something along the lines of:

# First format and mount
# Note - this will erase all the files on that disk
sudo mkfs /dev/sdb1
sudo mount /dev/sdb1 /mnt/somedir
# Then copy everything older than ~2 years (730 days)
find . -type f -mtime +730 -print0 |
xargs -0 cp -a -t /mnt/somedir --parents

This uses GNU tools; the -t and --parents options are specific to GNU cp - see the manual. But you should most probably use rsync anyway, and research rdiff-backup and other better-suited tools for "archiving".
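A hedged sketch of that rsync variant, keeping the same two-year selection and preserving relative paths (the destination directory is an assumption):

find . -type f -mtime +730 -print0 |
  rsync -a --from0 --files-from=- . /mnt/somedir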


