Skip successfully downloaded files using Wget
You can take advantage of wget's --continue
(to resume broken downloads) and --timestamping
(to re-download a successfully downloaded file only when the Last-Modified
header has changed, and skip the download otherwise):
wget --continue --timestamping --timeout=180 --tries=5 "$paramURL" -O "${scratch_dir}${paramFilename}"
Another option is to use --no-clobber
instead of --timestamping
; it skips files that have already been downloaded without checking the Last-Modified
header:
wget --continue --no-clobber --timeout=180 --tries=5 "$paramURL" -O "${scratch_dir}${paramFilename}"
How to ignore specific types of files when downloading with wget?
Use the
--reject jpg,png --accept html
options to exclude or include files with certain extensions; see http://www.gnu.org/software/wget/manual/wget.html#Recursive-Accept_002fReject-Options.
Put patterns containing wildcard characters in quotes, otherwise your shell will expand them; see http://www.gnu.org/software/wget/manual/wget.html#Types-of-Files
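To see why the quotes matter, compare what the shell actually passes along with and without them. A minimal demonstration, using echo in place of wget so nothing is downloaded:

```shell
# Work in a scratch directory so only our test files exist.
cd "$(mktemp -d)"
touch a.jpg b.jpg

# Unquoted: the shell expands *.jpg before wget ever sees it.
echo --reject *.jpg      # prints: --reject a.jpg b.jpg

# Quoted: the literal pattern reaches wget, which does its own matching.
echo --reject "*.jpg"    # prints: --reject *.jpg
```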
wget: download linked files with a specific ending
This works for me:
wget -e robots=off -r -np -nH --accept "*.bz2" http://downloads.skullsecurity.org/passwords/
Read about Robot Exclusion
If you know what you are doing and really really wish to turn off the robot exclusion, set the robots variable to ‘off’
The site http://downloads.skullsecurity.org/ contains a robots.txt file with the content:
User-agent: *
Disallow: /
Explanation
The
Disallow: /
tells the robot that it should not visit any pages on the site.
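You can confirm what a robots.txt like the one above means with a quick local check. The file content is inlined here rather than fetched, and the grep pattern is a simplification that ignores per-agent sections:

```shell
cd "$(mktemp -d)"
# Reproduce the robots.txt shown above.
printf 'User-agent: *\nDisallow: /\n' > robots.txt

# A blanket "Disallow: /" means no page on the site may be crawled.
if grep -qx 'Disallow: /' robots.txt; then
  echo "all crawling disallowed"   # prints: all crawling disallowed
fi
```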
Python Wget: Check for duplicate files and skip if it exists?
wget.download()
doesn't have any such option. The following workaround should do the trick for you:
import subprocess
url = "https://url/to/index.html"
path = "/path/to/save/your/files"
subprocess.run(["wget", "-r", "-nc", "-P", path, url])
If the file is already there, you will get the following message:
File ‘index.html’ already there; not retrieving.
EDIT:
If you are running this on Windows, you'd also have to include shell=True
:
subprocess.run(["wget", "-r", "-nc", "-P", path, url], shell=True)
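If you need the skip logic outside of wget itself, the -nc check can also be reproduced by hand. A minimal sketch in shell, where url and dest are placeholder variables:

```shell
url="https://url/to/index.html"   # placeholder URL
dest="index.html"                 # placeholder destination file

# Mimic --no-clobber: only fetch when the destination is absent.
if [ -e "$dest" ]; then
  echo "File '$dest' already there; not retrieving."
else
  wget -O "$dest" "$url"
fi
```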
wget: delete incomplete files
I made a surprising discovery when attempting to implement tvm's suggestion.
It turns out, and this is something I didn't realize, that when you run wget -N
, wget actually checks file sizes and verifies that they are the same. If they are not, the file is deleted and then downloaded again.
A cool tip if you're having the same issue I was!
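The rule described above can be sketched as a small shell check. The sizes and timestamps here are made-up placeholder values; wget gets the real ones from the HTTP response headers:

```shell
# Placeholder metadata for a local and a remote copy of the same file.
local_size=1024;  local_mtime=1700000000
remote_size=2048; remote_mtime=1700000000

# Re-download when the remote copy is newer or the sizes differ,
# which is roughly the decision wget -N makes.
if [ "$remote_mtime" -gt "$local_mtime" ] || [ "$remote_size" -ne "$local_size" ]; then
  echo "re-download"       # prints: re-download (the sizes differ)
else
  echo "keep local copy"
fi
```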
wget reject still downloads file
That appears to be how wget
was designed to work. When performing recursive downloads, non-leaf files that match the reject list are still downloaded so they can be harvested for links, then deleted.
From the in-code comments (recur.c):
Either --delete-after was specified, or we loaded this
otherwise rejected (e.g. by -R) HTML file just so we
could harvest its hyperlinks -- in either case, delete
the local file.
We had a run-in with this in a past project where we had to mirror an authenticated site, and wget
kept hitting the logout pages even when it was meant to reject those URLs. We could not find any option to change this behaviour of wget
.
The solution we ended up with was to download, hack, and build our own version of wget
. There's probably a more elegant approach, but the quick fix we used was to add the following rules to the end of the download_child_p()
routine (modified to match your requirements):
/* Extra rules */
if (match_tail(url, ".pdf", 0)) goto out;
if (match_tail(url, ".css", 0)) goto out;
if (match_tail(url, ".gif", 0)) goto out;
if (match_tail(url, ".txt", 0)) goto out;
if (match_tail(url, ".png", 0)) goto out;
/* --- end extra rules --- */
/* The URL has passed all the tests. It can be placed in the
download queue. */
DEBUGP (("Decided to load it.\n"));
return 1;
out:
DEBUGP (("Decided NOT to load it.\n"));
return 0;
}