Parallel Download Using Curl Command Line Utility


Well, curl is just a simple UNIX process. You can run as many of these curl processes in parallel as you like, sending their outputs to different files.

curl can use the filename part of the URL to generate the local file name. Just use the -O option (see man curl for details).

You could use something like the following:

urls="http://example.com/?page1.html http://example.com?page2.html" # add more URLs here

for url in $urls; do
# run the curl job in the background so we can start another job
# and disable the progress bar (-s)
echo "fetching $url"
curl $url -O -s &
done
wait #wait for all background jobs to terminate
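
If you have many more URLs than you want simultaneous connections, you may prefer to cap the number of background jobs. A minimal sketch, assuming bash 4.3 or newer (for wait -n) and a hypothetical urls.txt with one URL per line:

max_jobs=4
while IFS= read -r url; do
    # if max_jobs curls are already running, wait for any one of them to finish
    while (( $(jobs -rp | wc -l) >= max_jobs )); do
        wait -n
    done
    echo "fetching $url"
    curl -O -s "$url" &
done < urls.txt
wait # wait for the remaining background jobs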

Parallel curl (or wget) downloads

If you're okay with a (slightly) different tool, may I recommend GNU Wget2? It is the spiritual successor to GNU Wget and is already available in the Debian and openSUSE repositories and on the AUR.

Wget2 provides multi-threaded downloads out of the box, with a nice progress bar to view the current status. It also supports HTTP/2 and many other newer features that were nearly impossible to add to Wget.

See my answer here: https://stackoverflow.com/a/49386440/952658 for some more details.

With Wget2, you can simply run wget2 -i urls.txt and it will start downloading your files in parallel.
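
If you want to tune how many downloads run at once, Wget2 exposes that as an option as well. A minimal sketch, assuming the --max-threads option is available in your build (check wget2 --help to be sure):

wget2 --max-threads=8 -i urls.txt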

EDIT: As mentioned in the other answer, a disclaimer: I maintain both Wget and Wget2, so I'm clearly biased towards this tool.

Pipe output of cat to cURL to download a list of files

This works for me:

$ xargs -n 1 curl -O < urls.txt

I'm on FreeBSD; your xargs may work differently.

Note that this runs sequential curls, which you may view as unnecessarily heavy. If you'd like to save some of that overhead, the following may work in bash:

$ mapfile -t urls < urls.txt
$ curl ${urls[@]/#/-O }

This saves your URL list to an array, then expands the array with options to curl to cause the targets to be downloaded. The curl command can take multiple URLs and fetch all of them, recycling the existing connection (HTTP/1.1), but it needs the -O option before each one in order to download and save each target. Note that special characters within some URLs may need to be escaped to avoid interacting with your shell.

Or if you are using a POSIX shell rather than bash:

$ curl $(printf ' -O %s' $(cat urls.txt))

This relies on printf's behaviour of repeating the format pattern to exhaust the list of data arguments; not all stand-alone printfs will do this.

Note that this non-xargs method also may bump up against system limits for very large lists of URLs. Research ARG_MAX and MAX_ARG_STRLEN if this is a concern.
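
Both of the above also have parallel counterparts. A hedged sketch, assuming an xargs that supports -P (the GNU and BSD variants both do) and, for the second line, a curl new enough to have --parallel (added around version 7.66, if I remember correctly):

$ xargs -P 4 -n 1 curl -O -s < urls.txt
$ curl --parallel --parallel-max 10 $(printf ' -O %s' $(cat urls.txt))

The first line keeps at most four curl processes running at once; the second lets a single curl process multiplex the transfers itself.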

Faster way to download multiple files in R

The curl package has a way to perform async requests, which means that downloads are performed simultaneously instead of one after another. Especially with smaller files this should give you a large boost in performance. Here is a barebones function that does that:

# total_con: max total concurrent connections.
# host_con: max concurrent connections per host.
# print: print status of requests at the end.
multi_download <- function(file_remote,
                           file_local,
                           total_con = 1000L,
                           host_con = 1000L,
                           print = TRUE) {

  # check for duplication (deactivated for testing)
  # dups <- duplicated(file_remote) | duplicated(file_local)
  # file_remote <- file_remote[!dups]
  # file_local <- file_local[!dups]

  # create pool
  pool <- curl::new_pool(total_con = total_con,
                         host_con = host_con)

  # function performed on successful request
  save_download <- function(req) {
    writeBin(req$content, file_local[file_remote == req$url])
  }

  # setup async calls
  invisible(
    lapply(
      file_remote, function(f)
        curl::curl_fetch_multi(f, done = save_download, pool = pool)
    )
  )

  # all created requests are performed here
  out <- curl::multi_run(pool = pool)

  if (print) print(out)

}

Now we need some test files to compare it to your baseline approach. I use COVID-19 data from the Johns Hopkins University GitHub page, as it contains many small CSV files which should be similar to your files.

file_remote <- paste0(
  "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/",
  format(seq(as.Date("2020-03-03"), as.Date("2022-06-01"), by = "day"), "%d-%m-%Y"),
  ".csv"
)
file_local <- paste0("/home/johannes/Downloads/test/", seq_along(file_remote), ".bin")

We could also infer the file names from the URLs, but I assume that is not what you want. So now let's compare the approaches for these 821 files:

res <- bench::mark(
  baseline(),
  multi_download(file_remote,
                 file_local,
                 print = FALSE),
  check = FALSE
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
summary(res)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec`
#> <bch:expr> <bch:> <bch:> <dbl>
#> 1 baseline() 2.8m 2.8m 0.00595
#> 2 multi_download(file_remote, file_local, print = FALSE) 12.7s 12.7s 0.0789
#> # … with 2 more variables: mem_alloc <bch:byt>, `gc/sec` <dbl>
summary(res, relative = TRUE)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec`
#> <bch:expr> <dbl> <dbl> <dbl>
#> 1 baseline() 13.3 13.3 1
#> 2 multi_download(file_remote, file_local, print = FALSE) 1 1 13.3
#> # … with 2 more variables: mem_alloc <dbl>, `gc/sec` <dbl>

The new approach is 13.3 times faster than the original one. I would assume that the difference will be bigger the more files you have. Note, though, that this benchmark is not perfect, as my internet speed fluctuates quite a bit.

The function should also be improved in terms of error handling (currently you get a message saying how many requests were successful and how many errored, but no indication of which files exist). My understanding is also that multi_run keeps the downloaded content in memory before save_download writes it to disk. With small files this is fine, but it might be an issue with larger ones.

baseline function

baseline <- function() {
  credentials <- "usr/pwd"
  downloader <- function(file_remote, file_local, credentials) {
    data_bin <- RCurl::getBinaryURL(
      file_remote,
      userpwd = credentials,
      ftp.use.epsv = FALSE,
      forbid.reuse = TRUE
    )
    writeBin(data_bin, file_local)
  }

  purrr::walk2(
    file_remote,
    file_local,
    ~ downloader(
      file_remote = .x,
      file_local = .y,
      credentials = credentials
    )
  )
}

Created on 2022-06-05 by the reprex package (v2.0.1)

How do I use cURL to perform multiple simultaneous requests?

While curl is a very useful and flexible tool, it isn't intended for this type of use. There are other tools available which will let you make multiple concurrent requests to the same URL.

ab is a very simple yet effective tool of this type, which works for any web server (despite the introduction focusing on the Apache server).

Grinder is a more sophisticated tool, which lets you specify many different URLs to use in a load test. This lets you mix requests for cheap and expensive pages, which may more closely resemble the typical load on your website.
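
For instance, a minimal ab run against a hypothetical local test server looks like the following; -n is the total number of requests and -c the number of concurrent requests:

ab -n 100 -c 10 http://localhost:8080/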

Download multiple files simultaneously (parallel) with custom filenames

Aria2c supports getting URIs from a file.

Try writing your URIs (and file names) into a file and then running aria2c -i uri-list.txt, or write them to stdout and pipe them into aria2c -i -.
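
A minimal sketch of such an input file, with hypothetical URLs; the indented out= lines set the custom local file names (see the aria2c manual for the full input-file syntax):

http://example.com/file-one.dat
  out=one.dat
http://example.com/file-two.dat
  out=two.dat

Then run aria2c -i uri-list.txt -j 4 to download up to four files at a time (-j is aria2c's limit on concurrent downloads).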

Download files in parallel in a bash script

If you do not mind using xargs then you can:

xargs -I xxx -P 3 sleep xxx < sleep

where the file sleep contains:

1
2
3
4
5
6
7
8
9

and if you watch the background jobs with:

watch -n 1 --exec ps --forest -g -p your-Bash-pid

(sleep could be your list of links) then you will see that 3 jobs run in parallel, and when one of the three completes, the next job is added. In fact, 3 jobs keep running until the end of the list.

sample output of watch(1):

12260 pts/3    S+     0:00  \_ xargs -I xxx -P 3 sleep xxx
12263 pts/3    S+     0:00      \_ sleep 1
12265 pts/3    S+     0:00      \_ sleep 2
12267 pts/3    S+     0:00      \_ sleep 3

xargs starts with 3 jobs, and when one of them finishes it adds the next, so the listing becomes:

12260 pts/3    S+     0:00  \_ xargs -I xxx -P 3 sleep xxx
12265 pts/3    S+     0:00      \_ sleep 2
12267 pts/3    S+     0:00      \_ sleep 3
12269 pts/3    S+     0:00      \_ sleep 4 # this one was added
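
The same pattern applied to downloads rather than sleep, as a hedged sketch assuming a hypothetical urls.txt with one URL per line:

xargs -I xxx -P 3 curl -O -s xxx < urls.txt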

How do I curl multiple resources in one command?

Do it in several processes:

for i in {1..50}
do
    curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf &
done

or as a one-liner (just a different formatting):

for i in {1..50}; do curl -O http://www.university.edu/~prof/lect$i/lect$i.pdf & done

The & makes all processes run in parallel.

Don't be scared by the output: the shell tells you that 50 processes have been started, which is a lot of spam, and later it tells you for each of them that it has terminated. A lot of output again.

You probably don't want to run all 50 in parallel ;-)

EDIT:

Your example uses {1..50} twice, which expands to every combination of the two ranges (a Cartesian product). Run echo {1..3}/{1..3} to see what I mean. And I guess that this way you create a lot of 404s.
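
For reference, that smaller expansion produces:

$ echo {1..3}/{1..3}
1/1 1/2 1/3 2/1 2/2 2/3 3/1 3/2 3/3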


