Wget Breaking with Content-Disposition

The & characters are interpreted by the shell as a special character that sends a command to the background (forks it). So you should escape or quote them:

curl -O -J -L 'http://waterwatch.usgs.gov/index.php?m=real&w=kml&r=us&regions=ia'

In the command above, the whole URL is quoted.

The following lines in your output mean that three commands were forked to the background:

[1] 32260
[2] 32261
[3] 32262

The numbers on the left (in brackets) are job numbers; you can bring a job to the foreground by typing fg N, where N is the job number. The numbers on the right are process IDs.
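
To see the difference, here is a minimal sketch using the URL from the question (shown with wget; the same quoting rules apply to curl):

# Unquoted: the shell cuts the command at every '&' and forks three
# background jobs; wget only ever receives 'http://...index.php?m=real'.
wget http://waterwatch.usgs.gov/index.php?m=real&w=kml&r=us&regions=ia

# Quoted: the shell hands the whole URL to wget as one argument.
wget 'http://waterwatch.usgs.gov/index.php?m=real&w=kml&r=us&regions=ia'

# Escaping each '&' individually works just as well.
wget http://waterwatch.usgs.gov/index.php?m=real\&w=kml\&r=us\&regions=ia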

PHP: Save the file returned by wget with file_put_contents

It's because exec() doesn't return the whole output. Take a look at the documentation at https://www.php.net/manual/en/function.exec.php:

Return Values: The last line from the result of the command.

But shell_exec() (or just backticks) returns the whole output: https://www.php.net/manual/en/function.shell-exec.php

Example:

<?php

$url = 'https://file-examples-com.github.io/uploads/2017/02/zip_5MB.zip';

// exec() returns only the last line of the command's output
$content = exec('wget -qO- ' . escapeshellarg($url));
var_dump(strlen($content));

// shell_exec() returns the complete output
$content2 = shell_exec('wget -qO- ' . escapeshellarg($url));
var_dump(strlen($content2));

file_put_contents('1.zip', $content);
file_put_contents('2.zip', $content2);

Output:

int(208)
int(5452018)

2.zip works (all 5 MB of data); 1.zip obviously does not (it holds just some bytes from the end of the archive).

So don't treat exec's return value as the whole output of the command.
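
exec() does accept an optional second parameter that collects every line of output into an array, but because exec() splits on newlines and drops trailing whitespace from each line, that array cannot faithfully reconstruct binary data like this zip file. A quick sketch reusing $url from above ('3.zip' is just an illustrative name):

<?php

$url = 'https://file-examples-com.github.io/uploads/2017/02/zip_5MB.zip';

// $lines receives every line of output, but the newline splitting and
// trailing-whitespace trimming make it lossy for binary content
exec('wget -qO- ' . escapeshellarg($url), $lines);
var_dump(count($lines));

// for binary data, stick with shell_exec(), or skip wget entirely:
file_put_contents('3.zip', file_get_contents($url));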

How to make wget save with the proper file name

At the HTTP level, the server sends the filename to the client in the Content-Disposition header field of the HTTP response:

HTTP/1.1 200 OK
[…]
Content-Disposition: attachment; filename="bind-9.9.4-P2.tar.gz";

See RFC 2183 for details on the Content-Disposition header field.

wget has experimental support for this feature according to its manpage:

--content-disposition
    If this is set to on, experimental (not fully-functional) support for
    "Content-Disposition" headers is enabled. This can currently result in
    extra round-trips to the server for a "HEAD" request, and is known to
    suffer from a few bugs, which is why it is not currently enabled by
    default.

So if you choose to enable it, just specify the --content-disposition option. (You could also use curl to do the job instead of wget, but the question was about wget.)
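
For example (the download URL here is hypothetical; the server is assumed to answer with a Content-Disposition header like the one above):

# saves the file as bind-9.9.4-P2.tar.gz instead of 'index.php?id=1234'
wget --content-disposition 'https://example.com/index.php?id=1234'

# the curl equivalent: -O names the file after the URL, and -J
# (--remote-header-name) overrides that with the Content-Disposition name
curl -O -J 'https://example.com/index.php?id=1234'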

Efficient parallel downloading and decompressing with a matching pattern for a list of files on a server

Can you list the URLs to download?

listurls() {
  # Do something that lists the URLs without downloading them.
  # Possibly something like:
  #   lynx -listonly -image_links -dump "$starturl"
  # or:
  #   wget --spider -r -nH -np -nv -nd -A "${filename}.bz2" "url/${run}/${1,,}/"
  # or:
  #   seq 100 | parallel echo ${url}${year}${month}${day}${run}_{}_${id}.grib2
}

get_and_extract_one() {
  url="$1"
  file="$2"
  wget -O - "$url" | bzip2 -dc > "$file"
}
export -f get_and_extract_one

# {=s:/:_:g; =} generates a file name from the URL with every / replaced by _.
# You probably want something nicer; possibly just {/.}.
listurls | parallel get_and_extract_one {} '{=s:/:_:g; =}'

This way you decompress while downloading, and everything runs in parallel.
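
As a concrete sketch of the listurls part (the URL, the *.bz2 pattern, and the -j8 job limit are all assumptions; adapt them to your server):

listurls() {
  # spider the directory without downloading; wget logs each URL it
  # visits to stderr, so filter the .bz2 links out of that log
  wget --spider -r -l1 -np -nv -nd -A '*.bz2' 'https://example.com/data/' 2>&1 \
    | grep -oE 'https://[^ ]+\.bz2' | sort -u
}

# {/.} is the basename of the URL with the final extension removed, so
# each archive is written out under its own (decompressed) name;
# -j8 caps the number of simultaneous downloads
listurls | parallel -j8 get_and_extract_one {} {/.}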


