Skip Successfully Downloaded Files Using Wget

You can take advantage of Wget's --continue option (to resume broken downloads) and --timestamping (to overwrite a previously downloaded file only when its Last-Modified header has changed, and otherwise skip the download):

wget "--continue ‐‐timestamping --timeout=180" "--tries=5" "$paramURL" "-O" "${scratch_dir}$paramFilename"

Another option is to use --no-clobber instead of --timestamping; it skips files that have already been downloaded without checking the Last-Modified header:

 wget "--continue ‐‐no-clobber --timeout=180" "--tries=5" "$paramURL" "-O" "${scratch_dir}$paramFilename"

How to ignore specific types of files when downloading with wget?

Use the

 --reject jpg,png  --accept html

options to exclude or include files with certain extensions; see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.

Put patterns with wildcard characters in quotes, otherwise your shell will expand them; see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files.
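
For example, a minimal sketch (the URL https://example.com/docs/ is hypothetical) that recursively fetches only HTML pages while skipping images might look like this:

 wget -r -np --accept "*.html" --reject "*.jpg,*.png" https://example.com/docs/

The quotes keep your shell from expanding the wildcard patterns before wget sees them.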

wget: download linked files with a specific ending

This works for me:

 wget -e robots=off -r -np -nH --accept "*.bz2"  http://downloads.skullsecurity.org/passwords/

Read about Robot Exclusion

If you know what you are doing and really, really wish to turn off the robot exclusion, set the robots variable to ‘off’.

The site http://downloads.skullsecurity.org/ contains a robots.txt with the following content:

User-agent: *
Disallow: /

Explanation

The Disallow: / tells the robot that it should not visit any pages on the site.

Python Wget: Check for duplicate files and skip if they exist?

wget.download() doesn't have any such option. The following workaround should do the trick for you:

import subprocess

url = "https://url/to/index.html"
path = "/path/to/save/your/files"

# -r: recurse into linked files, -nc: don't clobber (skip files that already
# exist locally), -P: directory prefix to save the downloads into
subprocess.run(["wget", "-r", "-nc", "-P", path, url])

If the file is already there, you will get the following message:

File ‘index.html’ already there; not retrieving.

EDIT:
If you are running this on Windows, you'd also have to include shell=True:

subprocess.run(["wget", "-r", "-nc", "-P", path, url], shell=True)

wget: delete incomplete files

I made a surprising discovery when attempting to implement tvm's suggestion.

It turns out (and this is something I didn't realize) that when you run wget -N, wget actually checks file sizes and verifies they are the same. If they are not, the file is deleted and then downloaded again.

So here's a cool tip if you're having the same issue I am!
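
As a quick illustration (reusing the $paramURL and $scratch_dir variables from the earlier snippets, which is an assumption), re-running with timestamping enabled re-fetches the file whenever its size or timestamp no longer matches the server's copy:

 wget -N --timeout=180 --tries=5 -P "$scratch_dir" "$paramURL"

-N is the short form of --timestamping; with -P the file keeps its remote name under $scratch_dir.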

wget reject still downloads file

That appears to be how wget was designed to work. When performing recursive downloads, non-leaf files that match the reject list are still downloaded so they can be harvested for links, then deleted.
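
As a hypothetical illustration (the URL and pattern below are made up), a recursive crawl like this may still request pages matching the rejected pattern so their links can be harvested, and only afterwards delete the local copies:

 wget -r -np --reject "logout*" https://example.com/wiki/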

From the in-code comments (recur.c):

Either --delete-after was specified, or we loaded this otherwise rejected (e.g. by -R) HTML file just so we could harvest its hyperlinks -- in either case, delete the local file.

We had a run-in with this in a past project where we had to mirror an authenticated site, and wget kept hitting the logout pages even when it was meant to reject those URLs. We could not find any option to change the behaviour of wget.

The solution we ended up with was to download, hack and build our own version of wget. There's probably a more elegant approach to this, but the quick fix we used was to add the following rules to the end of the download_child_p() routine (modified to match your requirements):

  /* Extra rules */
  if (match_tail(url, ".pdf", 0)) goto out;
  if (match_tail(url, ".css", 0)) goto out;
  if (match_tail(url, ".gif", 0)) goto out;
  if (match_tail(url, ".txt", 0)) goto out;
  if (match_tail(url, ".png", 0)) goto out;
  /* --- end extra rules --- */

  /* The URL has passed all the tests. It can be placed in the
     download queue. */
  DEBUGP (("Decided to load it.\n"));

  return 1;

 out:
  DEBUGP (("Decided NOT to load it.\n"));

  return 0;
}

