zip command skip errors
tar is always a good option for compression on Linux. Beware that zip may also run into file-size limits (the classic zip format tops out at 4 GB unless Zip64 is used).
tar czvf file.tar.gz folder
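If the goal is specifically to keep going past unreadable files rather than aborting, GNU tar has a flag for that (a sketch; --ignore-failed-read is GNU tar only):

```shell
# Archive the folder, warning about unreadable files instead of failing
tar czvf file.tar.gz --ignore-failed-read folder
```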
Jersey Client download ZIP file and unpack efficiently
OK, so I solved it: my request was getting a 404 because I had put the query parameter inside the path, .path("download?delete=true"). With the JAX-RS client the query string has to be set separately, e.g. .path("download").queryParam("delete", "true").
Check the total content size of a tar gz file
This will sum the total content size of the extracted files:
$ tar tzvf archive.tar.gz | sed 's/ \+/ /g' | cut -f3 -d' ' | sed '2,$s/^/+ /' | paste -sd' ' | bc
The output is given in bytes.
Explanation: tar tzvf lists the files in the archive in verbose format like ls -l. sed and cut isolate the file size field. The second sed puts a + in front of every size except the first, and paste concatenates them, giving a sum expression that is then evaluated by bc.
Note that this doesn't include metadata, so the disk space taken up by the files when you extract them is going to be larger - potentially many times larger if you have a lot of very small files.
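A shorter equivalent of the pipeline above, assuming GNU tar's verbose listing where the size is the third whitespace-separated field:

```shell
# Sum the size column (field 3) of the verbose archive listing, in bytes
tar tzvf archive.tar.gz | awk '{sum += $3} END {print sum}'
```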
Guide for installation of NVIDIA’s nvCOMP and running of its accompanying examples
I will answer my own question.
System info
Here is the system information obtained from the command line:
uname -r: 5.15.0-46-generic
lsb_release -a: Ubuntu 20.04.5 LTS
nvcc --version: Cuda compilation tools, release 10.1, V10.1.243
nvidia-smi:
- Two Tesla K80 (2-in-1 card) and one GeForce (Gigabyte RTX 3060 Vision 12G rev. 2.0)
- NVIDIA-SMI 470.82.01
- Driver Version: 470.82.01
- CUDA Version: 11.4
cmake --version: cmake version 3.22.5
make --version: GNU Make 4.2.1
lscpu: Xeon CPU E5-2680 V4 @ 2.40GHz - 56 CPU(s)
Observation
Although there are two GPUs installed in the server, nvCOMP only works with the RTX; the Kepler-era Tesla K80 presumably falls below the minimum compute capability that nvCOMP supports.
The Steps
Perhaps "installation" is a misnomer. One only needs to properly compile the downloaded nvCOMP files and run the resulting executables.
Step 1: The nvCOMP library
Download the nvCOMP library from https://developer.nvidia.com/nvcomp.
The file I downloaded was named nvcomp_install_CUDA_11.x.tgz. I left the extracted folder in the Downloads directory and renamed it nvcomp.
Step 2: The nvCOMP test package on GitHub
Download it from https://github.com/NVIDIA/nvcomp. Click the green "Code" icon, then click "Download ZIP".
By default, the downloaded zip file is called nvcomp-main.zip. I left the extracted folder, named nvcomp-main, in the Downloads directory.
Step 3: The NVIDIA CUB library on GitHub
Download it from https://github.com/nvidia/cub. Click the green "Code" icon, then click "Download ZIP".
By default, the downloaded zip file is called cub-main.zip. I left the extracted folder, named cub-main, in the Downloads directory.
There is no "installation" of the CUB library other than making the folder path known, i.e. available, to the calling program.
Comments: The nvCOMP GitHub site did not seem to explain that the CUB library was needed to run nvCOMP, and I only found that out from an error message during an attempted compilation of the test files in Step 2.
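After Steps 1-3, the layout under Downloads should look like this (folder names as used above; your paths may differ):

```shell
# The three folders the later build steps expect to find:
ls ~/Downloads
# nvcomp        <- Step 1 library (extracted from the .tgz, then renamed)
# nvcomp-main   <- Step 2 test package (from nvcomp-main.zip)
# cub-main      <- Step 3 CUB library (from cub-main.zip)
```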
Step 4: "Building CPU and GPU Examples, GPU Benchmarks provided on Github"
The nvCOMP GitHub landing page has a section with exactly this name. The instructions there could have been more detailed.
Step 4.1: cmake
- In the Downloads directory are the folders nvcomp (the Step 1 nvCOMP library), nvcomp-main (Step 2), and cub-main (Step 3).
- Start a terminal and go inside nvcomp-main, i.e. go to /your-path/Downloads/nvcomp-main
- Run cmake -DCMAKE_PREFIX_PATH=/your-path/Downloads/nvcomp -DCUB_DIR=/your-path/Downloads/cub-main
- This cmake step sets up the build files for the next make step.
- During cmake, a harmless yellow-colored cmake warning appeared.
- There was also a harmless printout "-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed" per this thread.
- The last few printout lines from cmake variously stated that it found Threads, nvcomp, and ZLIB (on my system) and that it was done with "Configuring" and "Build files have been written".
Step 4.2: make
- Run make in the same terminal as above.
- This is a screenshot of the make compilation.
- Please check the before and after folder tree to see what files have been generated.
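One generic way to compare the before and after folder tree is to snapshot the file list around the build (plain find and diff; nothing nvCOMP-specific):

```shell
# Run inside nvcomp-main: list files, build, list again, compare
find . -type f | sort > /tmp/tree-before.txt
make
find . -type f | sort > /tmp/tree-after.txt
diff /tmp/tree-before.txt /tmp/tree-after.txt   # '>' lines are newly generated files
```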
Step 5: Running the examples/benchmarks
Let's run the "built-in" example before running the benchmarks with the (now outdated) Fannie Mae single-family loan performance data from NVIDIA's RAPIDS repository.
Check if there are executables in /your-path/Downloads/nvcomp-main/bin. These are the executables created by the cmake and make steps above.
You can try running these executables on your to-be-compressed files; they are built with different compression algorithms and functionalities. The name of each executable indicates the algorithm used and/or its functionality.
Some of the executables require the files to be of a certain size, e.g., the "benchmark_cascaded_chunked" executable requires the target file's size to be a multiple of 4 bytes. I have not tested all of these executables.
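For instance, the multiple-of-4 constraint can be checked up front (a sketch; stat -c %s is GNU coreutils, and my-file.txt is a placeholder):

```shell
# Report whether the target file's size is a multiple of 4 bytes
size=$(stat -c %s my-file.txt)
if [ $((size % 4)) -eq 0 ]; then
  echo "size $size bytes: OK for benchmark_cascaded_chunked"
else
  echo "size $size bytes: not a multiple of 4"
fi
```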
Step 5.1: CPU compression examples
- Per https://github.com/NVIDIA/nvcomp
- Start a terminal (anywhere)
- Run time /your-path/Downloads/nvcomp-main/bin/gdeflate_cpu_compression -f /full-path-to-your-target/my-file.txt
- Here are the results of running gdeflate_cpu_compression on an updated Fannie Mae loan data file "2002Q1.csv" (11GB)
- Similarly, change the name of the executable to run lz4_cpu_compression or lz4_cpu_decompression
Step 5.2: The benchmarks with the Fannie Mae files from NVIDIA Rapids
Apart from following the NVIDIA instructions here, it seems the "benchmark" executables in the above "bin" directory can be run with "any" file. Just use the executable in the same way as in Step 5.1 and adhere to the particular executable specifications.
Below is one example following the NVIDIA instruction.
Long story short, the nvcomp-main (Step 2) test package contains the files to (i) extract a column of homogeneous data from an outdated Fannie Mae loan data file, (ii) save the extraction in binary format, and (iii) run the benchmark executable(s) on the binary extraction.
The Fannie Mae single-family loan performance data files, old or new, all use "|" as the delimiter. In the outdated RAPIDS version, the first column, indexed as column "0" in the code (zero-based numbering), contains the 12-digit loan IDs for the loans sampled from the (real) Fannie Mae loan portfolio. In the new Fannie Mae data files from the official Fannie Mae site, the loan IDs are in column 2 and the data files have a csv file extension.
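Because the files are "|"-delimited, a column can be eyeballed or extracted as plain text with standard tools (unlike the binary packing text_to_binary.py does); a sketch:

```shell
# Pull the first '|'-delimited field (the loan IDs in the old files) as text
cut -d'|' -f1 Performance_2000Q1.txt > 2000Q1-col0-string.txt
```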
Download the dataset "1 Year" Fannie Mae data, not the "1GB Splits*" variant, by following the link from here, or by going directly to RAPIDS
Place the downloaded mortgage_2000.tgz anywhere and unzip it with tar -xvzf mortgage_2000.tgz.
There are four txt files in /mortgage_2000/perf. I will use Performance_2000Q1.txt as an example.
Check if python is installed on the system.
Check if text_to_binary.py is in /nvcomp-main/benchmarks.
Start a terminal (anywhere).
As shown below, use the python script to extract the first column, indexed "0", with format long, from Performance_2000Q1.txt, and put the .bin output file somewhere.
- Run time python /your-path/Downloads/nvcomp-main/benchmarks/text_to_binary.py /your-other-path-to/mortgage_2000/perf/Performance_2000Q1.txt 0 long /another-path/2000Q1-col0-long.bin
- For comparison of the benchmarks, run time python /your-path/Downloads/nvcomp-main/benchmarks/text_to_binary.py /your-other-path-to/mortgage_2000/perf/Performance_2000Q1.txt 0 string /another-path/2000Q1-col0-string.bin
Run the benchmarking executables with the target bin files as shown at the bottom of the web page of the NVIDIA official guide.
- Eg, /your-path/Downloads/nvcomp-main/bin/benchmark_hlif lz4 -f /another-path/2000Q1-col0-long.bin
- Just make sure the operating system knows where the executable and the target file are.
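One way to make the executables findable by name, instead of typing absolute paths each time, is to extend PATH for the session (a sketch; adjust /your-path):

```shell
# Add the nvCOMP bin directory to PATH, then call executables by name
export PATH="$PATH:/your-path/Downloads/nvcomp-main/bin"
benchmark_hlif lz4 -f /another-path/2000Q1-col0-long.bin
```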
Step 5.3: The high_level_quickstart_example and low_level_quickstart_example
- These two executables are in /nvcomp-main/bin
- They are completely self-contained. Just run, e.g., high_level_quickstart_example without any input arguments. Please see the corresponding C++ source code in /nvcomp-main/examples and the official nvCOMP guides on GitHub.
Observations after some experiments
This could be another long thread but let's keep it short. Note that NVIDIA used various A-series cards for its benchmarks and I used a GeForce RTX 3060.
Speed
- The python script is slow. It took 4m12.456s to extract the loan ID column from an 11.8 GB Fannie Mae data file (with 108 columns) using format "string".
- In contrast, R with data.table took 25.648 seconds to do the same.
- With the outdated "Performance_2000Q1.txt" (0.99 GB) tested above, the python script took 32.898s whereas R took 26.965s to do the same extraction.
Compression ratio
- "Bloated" python outputs.
- The R-output "string.txt" files are generally a quarter of the size of the corresponding python-output "string.bin" files.
- Applying the executables to the R-output files achieved much better compression ratios and throughputs than to the python-output files.
- Eg, running benchmark_hlif lz4 -f 2000Q1-col0-string.bin with the python output vs running benchmark_hlif lz4 -f 2000Q1-col0-string.txt with the R output:
  - Uncompressed size: 436,544,592 vs 118,230,827 bytes
  - Compressed size: 233,026,108 vs 4,154,261 bytes
  - Compression ratio: 1.87 vs 28.46
  - Compression throughput (GB/s): 2.42 vs 18.96
  - Decompression throughput (GB/s): 8.86 vs 91.50
  - Wall time: 2.805s vs 1.281s
Overall performance: accounting for file size and memory limits
Use of the nvCOMP library is limited by GPU memory, no more than 12GB for the RTX 3060 tested. Depending on the compression algorithm, an 8GB target file can easily trigger a stop with cudaErrorMemoryAllocation: out of memory.
In both speed and compression ratio, pigz trumped the tested nvCOMP executables when the target files were the new Fannie Mae data files containing 108 columns of strings and numbers.