Archival filesystem or format
virt-sparsify can be used to sparsify and (through qemu's qcow2 gzip support) compress almost any Linux filesystem or disk image. The resulting images can be mounted in a VM, or on the host through guestmount.
There's a new nbdkit xz plugin that can be used for higher compression while still keeping good random-access performance (as long as you ask xz/pixz to reset compression on block boundaries).
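A minimal sketch of that workflow, assuming libguestfs is installed; input.img, output.qcow2, and /mnt/guest are hypothetical names:

```shell
# Sparsify and compress a raw image into qcow2 (virt-sparsify --compress
# writes a compressed qcow2), then mount it read-only on the host.
virt-sparsify --compress input.img output.qcow2

# Inspect the result without booting a VM:
guestmount -a output.qcow2 -i --ro /mnt/guest
ls /mnt/guest
guestunmount /mnt/guest
```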
Looking for a one-file filesystem or archiving file format
If you need archiving (i.e. sequential compression and decompression, with no modifications), then ZIP is the standard, and probably nothing comes close to ZIP in popularity. However, ZIP is not effective when you manipulate files inside it (i.e. when you need a virtual file system). In the latter case you can use CodeBase File System, Solid File System (our product), or one of several similar products.
Storing lots of small files: archive vs. filesystem
Small files don't compress especially well, so you may not gain much compression.
While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.
I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.
I would consider doing something like writing them out into one file, uncompressed, that could be streamed into memory -- maybe not necessarily contiguous memory, as that might be a problem. But the idea would be to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset from which the location of the image in memory could be determined.
New images could be added at the end, and the index updated appropriately.
It isn't fancy, but that's the point: fancy is what you're trying to avoid. An archive or even a file system gives you lots of power and flexibility, but at the cost of efficiency. When you know what you want to do, sometimes simple is better.
I would consider implementing a solution that reads files from a folder, another that divides the files into subfolders and subsubfolders so there are no more than 100 or so files in any given folder, then time those solutions so you have something to compare to. I would think a simple indexed file would be fast enough that you wouldn't even need to pre-load the images like you're suggesting -- just retrieve them as you need them and keep them around once they're in memory.
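The "one big file plus index" idea above can be sketched in plain shell; the function names pack/get and all paths are made up, and the index is just "name offset size" lines:

```shell
# pack appends every file in a directory to a single blob and records
# "name offset size" lines in an index file; get pulls one file back
# out with dd. New files can be appended later the same way.
pack() {   # pack <srcdir> <blob> <index>
    : > "$2"; : > "$3"
    for f in "$1"/*; do
        off=$(wc -c < "$2")
        size=$(wc -c < "$f")
        cat "$f" >> "$2"
        printf '%s %s %s\n' "$(basename "$f")" "$off" "$size" >> "$3"
    done
}
get() {   # get <blob> <index> <name>  ->  file contents on stdout
    line=$(grep "^$3 " "$2") || return 1
    off=${line#* }; size=${off#* }; off=${off%% *}
    dd if="$1" bs=1 skip="$off" count="$size" 2>/dev/null
}
```

Lookups scan the index linearly, but retrieval avoids any per-file filesystem overhead, which is the point of the approach.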
Custom-made archive format question
Most file systems operate with clusters, pages, or blocks, which have a fixed size. In many filesystems the directory (metadata) is just a special file, so it can grow the same way regular data files grow. On other filesystems, some master metadata block has a fixed size that is pre-allocated when the file system is formatted. In that case the file system can become full before files take up all available space.
On a side note, is there a reason to reinvent the wheel (a custom file system for private needs)? There exist implementations of in-file virtual file systems which are similar to archives but provide more functionality. One example is our SolFS.
Is there an archive file format that supports being split in multiple parts and can be unpacked natively on MS Windows?
CAB archives meet this purpose reasonably well (see this library's page for example; it says that through it, archives can even be extracted directly from an HTTP(S)/FTP server). Since the library relies on .NET, it could even be used on Linux through Mono/Wine, which is a crucial point if your servers aren't running Windows... because the archive must be created on a server, right?
Your major problem is rather that a split archive can't be created in parallel on multiple servers, at least because of LZX's dictionary. Each server would have to create the whole set of archives and send only the parts assigned to it, and you have no guarantee that the sets produced on each server would be identical.
The best way is probably to create the whole archive on ONE server, then distribute each part (or the whole split archive...) to your various servers through a replication-like interface.
Otherwise, you can also make individual archives that contain only a subset of the directory tree (you'll have to partition the files across servers), but that won't meet your requirements, since it would be a collection of individual archives rather than one big split archive.
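For reference, a split CAB set is usually created with makecab and a directive (.ddf) file; a hedged sketch follows, where the file names and the size threshold are assumptions (see the Cabinet SDK documentation for the exact directives):

```text
; backup.ddf - sketch of a makecab directive file for a split CAB set
; (run as: makecab /F backup.ddf)
.Set CabinetNameTemplate=backup*.cab   ; produces backup1.cab, backup2.cab, ...
.Set MaxDiskSize=524288000             ; split into parts of ~500 MB
.Set Cabinet=ON
.Set Compress=ON
files\first.bin
files\second.bin
```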
Some clarifications may be needed:
- Do you absolutely need a system with no client besides the browser? Or can you use other protocols, as long as they exist natively on Windows (like the FTP/SSH clients that are now provided by default)?
- What is the real purpose behind this request? To distribute load across all servers? To avoid overly large single-file downloads (i.e. a 30 GB archive) in case of transfer failure? Or both?
- If file size is the problem, why not rely on resumable downloads?
Random-access archive for Unix use
You can check duplicity. It allows you to make compressed and encrypted backups and gives random access to files. You can find more info about the project's archive format here: http://duplicity.nongnu.org/new_format.html.
If you want to use it, you can also check the duply script. It is a shell front end for duplicity. More info: http://sourceforge.net/projects/ftplicity/
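A hedged sketch of typical duplicity usage; the paths and backend URL are made up, and the first run creates a full backup while later runs are incremental:

```shell
# Encrypted, compressed backup to a local target (could be sftp://, s3://...).
export PASSPHRASE=...                             # used for GPG encryption

duplicity /home/me/data file:///mnt/backup        # back up
duplicity list-current-files file:///mnt/backup   # random access: list files
duplicity restore --file-to-restore docs/a.txt \
    file:///mnt/backup /tmp/a.txt                 # restore a single file
```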
With which data format can I distribute a big number of small files?
I just did some benchmarks:
Experiments / Benchmarks
I used dtrx to extract each archive and "time dtrx filename" to measure the time.
Format      File size   Time to extract
.7z         27.7 MB     > 1 h
.tar.bz2    29.1 MB     7.18 s
.tar.lzma   29.3 MB     6.43 s
.xz         29.3 MB     6.56 s
.tar.gz     33.3 MB     6.56 s
.zip        57.2 MB     > 30 min
.jar        70.8 MB     5.64 s
.tar        177.9 MB    5.40 s
Interesting. The extracted content is 47 MB. Why is .tar more than 3 times the size of its content?
Anyway, I think tar.bz2 might be a good choice.
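The .tar blow-up is most likely tar's per-file overhead: each member gets a 512-byte header and its data is padded to a 512-byte boundary, which dominates when the archive holds many tiny files. A quick way to see this (paths here are made up):

```shell
# Create 100 one-byte files and tar them: the archive ends up around
# 1 KiB per file, not ~1 byte per file, because of tar's 512-byte
# headers and block padding.
mkdir -p /tmp/tiny
for i in $(seq 1 100); do printf 'x' > "/tmp/tiny/f$i"; done
tar cf /tmp/tiny.tar -C /tmp/tiny .
wc -c < /tmp/tiny.tar
```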
Archiving file in another disk linux
I want to archive the files that exist in my pwd and are older than two years
This consists of two steps:
- finding all files older than two years
- archiving the files from step one
To "find" files, use find. The find program takes the -mtime argument; the [ command does not take an -mtime argument.
"Archiving" here I believe is just cp. To archive to a disc, you first have to format that disc to store files, and then you can copy files into the filesystem created on that disc.
Overall, it would be something along the lines of:
# First format and mount
# Note - it will erase all the files
sudo mkfs /dev/sdb1
sudo mount /dev/sdb1 /mnt/somedir
# then copy
find . -type f -mtime +730 -print0 |
xargs -0 cp -a -t /mnt/somedir --parents
This uses GNU tools: the -t and --parents options are specific to GNU cp - see the manual. But you should most probably use rsync anyway, and research rdiff-backup and other better tools for "archiving".
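The find/cp pipeline above can also be done with rsync; a hedged sketch, where archive_old and the example paths are made-up names:

```shell
# archive_old <srcdir> <dstdir> copies files older than ~2 years from
# srcdir into dstdir, preserving relative paths. The body runs in a
# subshell so the cd doesn't leak; --from0 matches find's -print0.
archive_old() (
    cd "$1" || exit 1
    find . -type f -mtime +730 -print0 |
        rsync -a --from0 --files-from=- . "$2"
)
# e.g. archive_old /home/me /mnt/somedir
```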