Parallel download using Curl command line utility
Well, curl is just a simple UNIX process. You can have as many of these curl processes running in parallel as you like, each sending its output to a different file.
curl can use the filename part of the URL to generate the local file. Just use the -O option (see man curl for details).
You could use something like the following:
urls="http://example.com/page1.html http://example.com/page2.html" # add more URLs here
for url in $urls; do
    # run the curl job in the background so we can start another job
    # and disable the progress bar (-s)
    echo "fetching $url"
    curl -s -O "$url" &
done
wait # wait for all background jobs to terminate
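The bare wait in the loop above throws away each job's exit status, so a failed download goes unnoticed. Here is a minimal sketch of one way around that, assuming bash or another POSIX-style shell: remember every PID and wait on them one at a time. The names fetch_all and download_one are hypothetical, and the -f flag (fail on HTTP errors) is my addition, not part of the loop above.

```shell
# fetch_all CMD ITEM... : run "CMD ITEM" in the background for every ITEM,
# then wait on each PID and report the items whose job failed.
# (hypothetical helper; the body runs in a subshell so its variables stay local)
fetch_all() (
    cmd=$1; shift
    pids=""
    for item in "$@"; do
        "$cmd" "$item" &
        pids="$pids $!"
    done
    failed=0
    for pid in $pids; do
        item=$1; shift          # items are consumed in the order they were started
        if ! wait "$pid"; then
            echo "failed: $item" >&2
            failed=$((failed + 1))
        fi
    done
    [ "$failed" -eq 0 ]         # non-zero exit status if any job failed
)

# hypothetical wrapper around curl: -s silent, -f fail on HTTP errors, -O keep remote name
download_one() { curl -s -f -O "$1"; }

# usage (not run here):
#   fetch_all download_one http://example.com/page1.html http://example.com/page2.html
```

Passing the command as the first argument keeps the helper independent of curl, so it can be tried out with a harmless stand-in first.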
Parallel curl (or wget) downloads
If you're okay with a (slightly) different tool, may I recommend using GNU Wget2? It is the spiritual successor to GNU Wget. It is already available in the Debian and openSUSE repositories and on the AUR.
Wget2 provides multi-threaded downloads out of the box with a nice progress bar to view the current status. It also supports HTTP/2 and many other newer features that were nearly impossible to add into Wget.
See my answer here: https://stackoverflow.com/a/49386440/952658 for some more details.
With Wget2, you can simply run
$ wget2 -i urls.txt
and it will start downloading your files in parallel.
EDIT: As mentioned in the other answer, a disclaimer: I maintain both Wget and Wget2, so I'm clearly biased towards this tool.
Pipe output of cat to cURL to download a list of files
This works for me:
$ xargs -n 1 curl -O < urls.txt
I'm in FreeBSD. Your xargs may work differently.
Note that this runs sequential curl processes, which you may view as unnecessarily heavy. If you'd like to save some of that overhead, the following may work in bash:
$ mapfile -t urls < urls.txt
$ curl ${urls[@]/#/-O }
This saves your URL list to an array, then expands the array with options to curl to cause targets to be downloaded. The curl command can take multiple URLs and fetch all of them, recycling the existing connection (HTTP/1.1), but it needs the -O option before each one in order to download and save each target. Note that characters within some URLs may need to be escaped to avoid interacting with your shell.
Or if you are using a POSIX shell rather than bash:
$ curl $(printf ' -O %s' $(cat urls.txt))
This relies on printf's behaviour of repeating the format pattern to exhaust the list of data arguments; not all stand-alone printf implementations will do this.
Note that this non-xargs method also may bump up against system limits for very large lists of URLs. Research ARG_MAX and MAX_ARG_STRLEN if this is a concern.
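If those limits are a concern, one middle ground is to let xargs hand the URLs to curl in batches: each curl process still fetches several URLs, but no single command line grows without bound. A rough sketch follows. batch_urls is a hypothetical helper name, and --remote-name-all is curl's option for applying -O to every URL on the command line. The example is a dry run with echo, so nothing is downloaded.

```shell
# batch_urls N CMD [ARGS...] : read URLs on stdin and run CMD once per batch
# of at most N URLs (hypothetical helper built on plain xargs -n).
batch_urls() (
    n=$1; shift
    xargs -n "$n" "$@"
)

# Dry run with echo, so only the command lines are shown:
printf '%s\n' http://example.com/a http://example.com/b http://example.com/c |
    batch_urls 2 echo curl --remote-name-all
# prints:
#   curl --remote-name-all http://example.com/a http://example.com/b
#   curl --remote-name-all http://example.com/c
```

Dropping the echo, as in batch_urls 20 curl --remote-name-all < urls.txt, would run the real downloads.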
Faster way to download multiple files in R
The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files, this should give you a large boost in performance. Here is a bare-bones function that does that:
# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote,
                           file_local,
                           total_con = 1000L,
                           host_con = 1000L,
                           print = TRUE) {
  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]
  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)
  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }
  # setup async calls
  invisible(
    lapply(
      file_remote, function(f)
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )
  # all created requests are performed here
  out <- curl::multi_run(pool = pool)
  if (print) print(out)
}
Now we need some test files to compare it to your baseline approach. I use COVID-19 data from the Johns Hopkins University GitHub page, as it contains many small CSV files, which should be similar to your files.
file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%d-%m-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")
We could also infer the file names from the URLs, but I assume that is not what you want. So now let's compare the approaches for these 821 files:
res <- bench::mark(
  baseline(),
  multi_download(file_remote,
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:> <bch:> <dbl>
#> 1 baseline() 2.8m 2.8m 0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE) 12.7s 12.7s 0.0789
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec`
#> <bch:expr> <dbl> <dbl> <dbl>
#> 1 baseline() 13.3 13.3 1
#> 2 multi_download(file_remote, file_local, print = FALSE) 1 1 13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>
The new approach is 13.3 times faster than the original one. I would assume that the difference will be bigger the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.
The function should also be improved in terms of error handling (currently you get a message saying how many requests succeeded and how many errored, but no indication of which files exist). My understanding is also that multi_run holds each file in memory before save_download writes it to disk. With small files this is fine, but it might be an issue with larger ones.
baseline function
baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }
  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}
Created on 2022-06-05 by the reprex package (v2.0.1)
How do I use cURL to perform multiple simultaneous requests?
While curl is a very useful and flexible tool, it isn't intended for this type of use. There are other tools available which will let you make multiple concurrent requests to the same URL.
ab is a very simple yet effective tool of this type, which works for any web server (despite the introduction focusing on Apache server).
Grinder is a more sophisticated tool, which can let you specify many different URLs to use in a load test. This lets you mix requests for cheap and expensive pages, which may more closely resemble standard load for your website.
Download multiple files simultaneously (parallel) with custom filenames
aria2c supports reading URIs from a file.
Try writing your URIs into a file and then running "aria2c -i uri-list.txt", or write them to stdout and pipe them to "aria2c -i -".
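For the custom filenames, aria2's input file format allows option lines, indented with whitespace, directly below each URI, and out= sets the local file name. A small sketch that builds such a list from "URL filename" pairs follows. make_aria2_input is a hypothetical helper, and the example URLs and names are made up.

```shell
# make_aria2_input : read "URL FILENAME" pairs on stdin and print them in
# aria2's input-file format (URI line, then an indented out= option line).
# (hypothetical helper; the subshell body keeps url/name local)
make_aria2_input() (
    while read -r url name; do
        [ -n "$url" ] || continue      # skip blank lines
        printf '%s\n  out=%s\n' "$url" "$name"
    done
)

# Preview the generated list:
printf '%s\n' \
    'http://example.com/a.pdf first.pdf' \
    'http://example.com/b.pdf second.pdf' | make_aria2_input
# prints:
#   http://example.com/a.pdf
#     out=first.pdf
#   http://example.com/b.pdf
#     out=second.pdf
```

Piping the result into "aria2c -i -" would then start the downloads; aria2c's -j option caps how many run at the same time.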
Download files in parallel in a bash script
If you do not mind using xargs, then you can run:
xargs -I xxx -P 3 sleep xxx < sleep
where sleep is a file containing:
1
2
3
4
5
6
7
8
9
If you then watch the process tree with:
watch -n 1 -exec ps --forest -g -p your-Bash-pid
(the sleep file could just as well be your list of links), you will see that 3 jobs run in parallel, and that as soon as one of the three completes, the next job is started. In fact, 3 jobs are always running until the list runs out.
Sample output of watch(1):
12260 pts/3 S+ 0:00 \_ xargs -I xxx -P 3 sleep xxx
12263 pts/3 S+ 0:00 \_ sleep 1
12265 pts/3 S+ 0:00 \_ sleep 2
12267 pts/3 S+ 0:00 \_ sleep 3
xargs starts with 3 jobs, and when one of them finishes it adds the next, which becomes:
12260 pts/3 S+ 0:00 \_ xargs -I xxx -P 3 sleep xxx
12265 pts/3 S+ 0:00 \_ sleep 2
12267 pts/3 S+ 0:00 \_ sleep 3
12269 pts/3 S+ 0:00 \_ sleep 4 # this one was added
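Swapping sleep for a real command is then only a small step. Here is a minimal sketch, assuming an xargs that supports -P (GNU and BSD xargs both do) and a hypothetical urls.txt with one URL per line. run_parallel is a made-up wrapper name.

```shell
# run_parallel JOBS CMD [ARGS...] : read one item per line on stdin and keep
# up to JOBS copies of "CMD ARGS ITEM" running at once (hypothetical wrapper).
run_parallel() (
    max_jobs=$1; shift
    xargs -n 1 -P "$max_jobs" "$@"
)

# usage (not run here), with urls.txt holding one URL per line:
#   run_parallel 3 curl -s -O < urls.txt
```

As with the sleep demo, a new job starts as soon as one of the running ones finishes, so the number of concurrent downloads never exceeds the first argument.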
How do I curl multiple resources in one command?
Do it in several processes:
for i in {1..50}
do
    curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf &
done
or as a one-liner (just a different formatting):
for i in {1..50}; do curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf & done
The & makes all processes run in parallel.
Don't be scared by the output; the shell tells you that 50 processes have been started, that's a lot of spam. Later it will tell you for each of these that they terminated. A lot of output again.
You probably don't want to run all 50 in parallel ;-)
EDIT:
Your example using {1..50} twice makes a matrix of the numbers. See for example echo {1..3}/{1..3} to see what I mean. And I guess that this way you create a lot of 404s.
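To follow up on the remark that you probably don't want all 50 running at once: a simple way to cap the loop is to start the jobs in batches and wait after each batch. This is only a sketch. run_limited and download_lecture are hypothetical names, and a batch only starts once the whole previous batch has finished, so it is cruder than a proper job pool.

```shell
# run_limited MAX CMD ITEM... : run "CMD ITEM" in the background for every
# ITEM, but never more than MAX at a time (hypothetical helper).
run_limited() (
    max=$1; cmd=$2; shift 2
    running=0
    for item in "$@"; do
        "$cmd" "$item" &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait                # block until the whole batch has finished
            running=0
        fi
    done
    wait                        # pick up the last, possibly partial, batch
)

# usage (not run here):
#   download_lecture() { curl -s -O "http://www.university.edu/~prof/lect$1/lect$1.pdf"; }
#   run_limited 5 download_lecture $(seq 1 50)
```

Because each batch waits for its slowest download, one large file can leave the other slots idle for a while; that is the price of keeping the loop this short.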