Remote Linux Server to Remote Linux Server Large Sparse Files Copy - How To

Copying a 1TB sparse file

Short answer:
Use bsdtar or GNU tar (version 1.29 or later) to create archives, and GNU tar (version 1.26 or later) to extract them on another box.

Long answer:
There are some requirements for this to work.

First, the Linux kernel must be at least 3.1 (Ubuntu 12.04 or later will do), so that the SEEK_HOLE lseek functionality is available.

Then, you need a tar utility that supports this functionality. GNU tar supports it since version 1.29 (released on 2016/05/16; it is present by default since Ubuntu 18.04), and bsdtar supports it since version 3.0.4 (available since Ubuntu 12.04) - install it with sudo apt-get install bsdtar.
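
To check which versions you have installed, the usual version flags work (the exact output format varies by distribution):

$ tar --version | head -n 1
$ bsdtar --version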

While bsdtar (which uses libarchive) is awesome, it is unfortunately not very smart about untarring: it requires at least as much free space on the target drive as the apparent (untarred) file size, without regard to holes. GNU tar untars such sparse archives efficiently and does not perform this check.

This is a log from Ubuntu 12.10 (Linux kernel 3.5):

$ dd if=/dev/zero of=1tb seek=1T bs=1 count=1
1+0 records in
1+0 records out
1 byte (1 B) copied, 0.000143113 s, 7.0 kB/s
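
To confirm that the file really is sparse before archiving, you can compare its apparent size with the blocks actually allocated on disk, for example:

$ du -h --apparent-size 1tb    # apparent size, about 1T
$ du -h 1tb                    # blocks actually allocated - should be tiny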

$ time bsdtar cvfz sparse.tar.gz 1tb
a 1tb

real 0m0.362s
user 0m0.336s
sys 0m0.020s

# Or, use GNU tar if its version is 1.29 or later:
$ time tar cSvfz sparse-gnutar.tar.gz 1tb
1tb

real 0m0.005s
user 0m0.006s
sys 0m0.000s

$ ls -l
-rw-rw-r-- 1 autouser autouser 1099511627777 Nov 7 01:43 1tb
-rw-rw-r-- 1 autouser autouser 257 Nov 7 01:43 sparse.tar.gz
-rw-rw-r-- 1 autouser autouser 134 Nov 7 01:43 sparse-gnutar.tar.gz
$

As I said above, unfortunately, untarring with bsdtar will not work unless you have 1TB of free space. However, any version of GNU tar untars such a sparse archive just fine:

$ rm 1tb 
$ time tar -xvSf sparse.tar.gz
1tb

real 0m0.031s
user 0m0.016s
sys 0m0.016s
$ ls -l
total 8
-rw-rw-r-- 1 autouser autouser 1099511627777 Nov 7 01:43 1tb
-rw-rw-r-- 1 autouser autouser 257 Nov 7 01:43 sparse.tar.gz
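
If the goal is a direct server-to-server copy, you do not even have to keep the intermediate archive on disk - you can stream the tar over ssh. A minimal sketch, assuming GNU tar 1.29 or later on the sending side and GNU tar on the receiving side (host names and paths below are placeholders):

# run on the source server:
$ tar -cSzf - 1tb | ssh user@dest-host 'tar -xSzf - -C /target/dir'

# or relay between the two servers from a third machine:
$ ssh user@src-host 'tar -cSzf - -C /source/dir 1tb' | ssh user@dest-host 'tar -xSzf - -C /target/dir'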

Optimizing SCP over the network

I suggest taking a look at rsync. It is designed for exactly this kind of purpose.
Quoting from man rsync:

    Rsync is a fast and extraordinarily versatile file copying tool. It can
    copy locally, to/from another host over any remote shell, or to/from a
    remote rsync daemon. It offers a large number of options that control
    every aspect of its behavior and permit very flexible specification of
    the set of files to be copied. It is famous for its delta-transfer
    algorithm, which reduces the amount of data sent over the network by
    sending only the differences between the source files and the existing
    files in the destination. Rsync is widely used for backups and mirroring
    and as an improved copy command for everyday use.

On a server that supports scp, rsync typically works too. Try this:

sshpass -p "password" rsync keylog.txt machine@192.168.151.19:/home/machine/keylog.txt 

Even without tuning the parameters, rsync will try to minimize the amount of data transferred. When the destination file already exists, it will transfer only the necessary difference.
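
If you want to see how much literal data rsync actually sent versus how much it matched against an existing destination file, add --stats to the command and look at the "Literal data" and "Matched data" lines in the summary:

$ sshpass -p "password" rsync -av --stats keylog.txt machine@192.168.151.19:/home/machine/keylog.txt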

What can du see that rsync can't?

You probably have deleted files that can't yet be deallocated because there are open filehandles on them. (I didn't previously know that du would see the usage from those, but some testing showed that it does.) You can research this using lsof. The two main causes of this from my experience are deleting Apache logs without kicking the httpd and deleting mysql tables from the filesystem rather than by using DROP TABLE.
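
One way to find such deleted-but-open files is lsof; the +L1 option lists open files whose link count is below one, i.e. files that have been unlinked but are still held open:

$ sudo lsof +L1
$ sudo lsof | grep '(deleted)'    # cruder alternative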

rsync --sparse still transfers the whole data

Take a look at this discussion, specifically this answer.

It seems that the solution is to do an rsync --sparse pass followed by an rsync --inplace pass.

On the first (--sparse) call, also use --ignore-existing to prevent already transferred sparse files from being overwritten, and -z to save network bandwidth.

The second call, --inplace, should update only modified chunks. Here, compression is optional.
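
Put together, the two passes could look roughly like this (the host name and paths are only placeholders; -av is just a common choice of flags):

# first pass: create sparse copies of files that do not exist on the destination yet
$ rsync -av --sparse --ignore-existing -z /data/images/ user@backup-host:/data/images/

# later passes: write updates directly into the existing destination files
$ rsync -av --inplace /data/images/ user@backup-host:/data/images/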

Also see this post.

Update

I believe the suggestions above won't solve your problem. I also believe that rsync is not the right tool for the task. You should search for other tools which will give you a good balance between network and disk I/O efficiency.

Rsync was designed for efficient usage of a single resource, the network. It assumes reading and writing to the network is much more expensive than reading and writing the source and destination files.

"We assume that the two machines are connected by a low-bandwidth high-latency bi-directional communications link." (The rsync algorithm, abstract)

The algorithm, summarized in four steps:

  1. The receiving side β sends checksums of blocks of size S of the destination file B.
  2. The sending side α identifies blocks of the source file A that match, at any offset.
  3. α sends β a list of instructions consisting of either verbatim (non-matching) data or references to matching blocks.
  4. β reconstructs the whole file from those instructions.

Notice that rsync normally reconstructs the file B as a temporary file T, then replaces B with T. In this case it must write the whole file.

The --inplace option does not relieve rsync from writing the blocks matched by α, as one might imagine: they can match at different offsets, and scanning B a second time to take fresh checksums of the new data would be prohibitively expensive. A block that matches at the same offset it was read from in step one could be skipped, but rsync does not do that. In the case of a sparse file, every null block of B matches every null block of A, and still has to be rewritten.

The --inplace option just makes rsync write directly to B instead of to T. It will still rewrite the whole file.


