Download Large File in Python With Requests

With the following streaming code, the Python memory usage is restricted regardless of the size of the downloaded file:

import requests

def download_file(url):
    local_filename = url.split('/')[-1]
    # NOTE the stream=True parameter below
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(local_filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                # If you have a chunk-encoded response, uncomment the
                # if statement below and set chunk_size to None.
                # if chunk:
                f.write(chunk)
    return local_filename
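
For instance, a hedged usage sketch (the URL is a placeholder, not from the original answer):

path = download_file("https://example.com/files/archive.zip")
print(f"Saved to {path}")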

Note that the number of bytes returned by each iteration of iter_content is not necessarily chunk_size; it can differ from one iteration to the next, and when the body is decoded on the fly (for example from gzip) a chunk can be larger than chunk_size.

See body-content-workflow and Response.iter_content in the requests documentation for further reference.
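
As a quick illustration of the varying chunk sizes (a sketch; the URL is a placeholder):

import requests

with requests.get("https://example.com/large-file.bin", stream=True) as r:
    r.raise_for_status()
    for i, chunk in enumerate(r.iter_content(chunk_size=8192)):
        print(f"chunk {i}: {len(chunk)} bytes")  # sizes vary per iteration
        if i >= 4:
            break  # inspect only the first few chunks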

Problems downloading large files with requests?

Thanks to SilentGhost on the #python IRC channel, who pointed this out and suggested I upgrade my requests package, which solved it (from 2.22.0 to 2.24.0).

Upgrading the package is done like this:

pip install requests --upgrade
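
To verify which version is now installed:

python -c "import requests; print(requests.__version__)"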

Another option that may help someone looking at this question is pycurl; a good starting point is https://github.com/rajatkhanduja/PyCurl-Downloader

You can also pass --libcurl to your curl command to get a good indication of how to use pycurl; a minimal download sketch follows.
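
A minimal pycurl download might look like this (a sketch, assuming pycurl is installed; the URL and filename are placeholders):

import pycurl

c = pycurl.Curl()
c.setopt(pycurl.URL, "https://example.com/large-file.bin")
with open("large-file.bin", "wb") as f:
    c.setopt(pycurl.WRITEDATA, f)  # stream the response body straight to the file
    c.perform()
c.close()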

Save a large file using the Python requests library

Oddly enough, requests doesn't have anything simple for this. You'll have to iterate over the response and write those chunks to a file:

import requests

response = requests.get('http://www.example.com/image.jpg', stream=True)

# Throw an error for bad status codes
response.raise_for_status()

with open('output.jpg', 'wb') as handle:
    for block in response.iter_content(1024):
        handle.write(block)

I usually just use urllib.request.urlretrieve() (urllib.urlretrieve() in Python 2). It works, but if you need to use a session or some sort of authentication, the code above works as well.
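
For completeness, the urlretrieve equivalent of the snippet above:

from urllib.request import urlretrieve

urlretrieve('http://www.example.com/image.jpg', 'output.jpg')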

How to download a file using Python requests, when that file is being served with redirect?

  1. Pass cookies={"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"} instead of headers={"cookie": "PHPSESSID=3r7ql7poiparp92ia7ltv8nai5"}.

    This is because the requests library does headers.pop('Cookie', None) upon redirect.
  2. Retry if resp.url is not f"https://www.fadedpage.com/books/{bookID}/{fileType}.php".

    This is because the server first redirects link.php with a different bookID to showbook.php.
  3. A download of downloadFile("20170817", "html") should contain the text "The First Part of this book is intended for pupils", while a download of downloadFile("20130603", "html") should contain "woodland slope behind St. Pierre-les-Bains". The code below checks for exactly these markers.

import requests

def downloadFile(bookID, fileType, retry=1):
    cookies = {"PHPSESSID": "3r7ql7poiparp92ia7ltv8nai5"}
    url = f"https://www.fadedpage.com/link.php?file={bookID}.{fileType}"
    print("Getting ", url)
    with requests.get(url, cookies=cookies) as resp:
        if resp.url != f"https://www.fadedpage.com/books/{bookID}/{fileType}.php":
            if retry:
                return downloadFile(bookID, fileType, retry=retry - 1)
            else:
                raise Exception(f"Unexpected redirect for {url}")
        with open(f"{bookID}.{fileType}", 'wb') as f:
            f.write(resp.content)

def isValidDownload(bookID, fileType="html"):
    """
    A download of `downloadFile("20170817", "html")` should produce
    a file 20170817.html which contains the text "The First Part of
    this book is intended for pupils". If it doesn't, it isn't getting
    the full text file.
    """
    with open(f"{bookID}.{fileType}") as f:
        raw = f.read()
    test = ""
    if bookID == "20130603":
        test = "woodland slope behind St. Pierre-les-Bains"
    if bookID == "20170817":
        test = "The First Part of this book is intended for pupils"
    return test in raw
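
A hedged usage sketch (book IDs taken from the question above):

downloadFile("20170817", "html")
print(isValidDownload("20170817"))  # True only if the full text was fetched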

Streaming download of a large file with python-requests keeps getting interrupted

There are several issues that can cause a download to be interrupted, network problems among them. But the file size is known before the download starts, so you can check whether you have downloaded the whole file. Using urllib (Python 3):

from urllib.request import urlopen

site = urlopen("http://python.org")
print(site.headers.get("Content-Length"))

Using requests:

import requests

# stream=True defers the body download, so only the headers are fetched here
with requests.get("http://python.org", stream=True) as r:
    print(r.headers["Content-Length"])
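
Building on that, a small sketch (names are hypothetical, not from the original answer) that compares a partially downloaded file against the server's Content-Length:

import os
import requests

def is_complete(url, local_path):
    # Assumes the server sends a Content-Length header.
    head = requests.head(url, allow_redirects=True)
    expected = int(head.headers["Content-Length"])
    return os.path.exists(local_path) and os.path.getsize(local_path) == expected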

