Download and Save PDF File with Python Requests Module

Download and save PDF file with Python requests module

You should use response.content in this case:

with open('/tmp/metadata.pdf', 'wb') as f:
    f.write(response.content)

From the documentation:

You can also access the response body as bytes, for non-text requests:

>>> r.content
b'[{"repository":{"open_issues":0,"url":"https://github.com/...

So that means: response.text returns the output as a string object; use it when you're downloading a text file, such as an HTML file.

And response.content returns the output as a bytes object; use it when you're downloading a binary file, such as a PDF, audio file, or image.
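
A quick way to see the difference (example.com here is just a placeholder URL):

import requests

r = requests.get('https://example.com/')
print(type(r.text))     # <class 'str'>  - decoded text, for HTML and other text responses
print(type(r.content))  # <class 'bytes'> - raw bytes, for PDFs, images, audio, etc.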


You can also use response.raw instead; however, use it when the file you're about to download is large. Below is a basic example, which you can also find in the documentation:

import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)

with open('/tmp/metadata.pdf', 'wb') as fd:
    for chunk in r.iter_content(chunk_size=2000):
        fd.write(chunk)

chunk_size is the chunk size you want to use. If you set it to 2000, requests will download the file 2000 bytes at a time, writing each chunk to the file until the download is finished.

This can save RAM. But I'd prefer to use response.content in this case, since your file is small; as you can see, using response.raw is more involved.
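
Since the paragraph above mentions response.raw, here is a minimal sketch of that approach as well, using the same URL as the example above. Note that r.raw does not decode gzip/deflate content encodings unless you set decode_content = True:

import shutil
import requests

url = 'http://www.hrecos.org//images/Data/forweb/HRTVBSH.Metadata.pdf'
r = requests.get(url, stream=True)
r.raw.decode_content = True  # decode gzip/deflate if the server compressed the response

with open('/tmp/metadata.pdf', 'wb') as fd:
    # stream the raw response to disk in chunks instead of loading it all into memory
    shutil.copyfileobj(r.raw, fd)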


Related:

  • How to download large file in python with requests.py?

  • How to download image using requests

How to download PDF file from web using python requests library

Sometimes I need to download things programmatically too. I just use this:

import requests

response = requests.get("https://link_to_thing.pdf")
file = open("myfile.pdf", "wb")
file.write(response.content)
file.close()
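
The same snippet as a sketch using a context manager, so the file is closed even if the write fails:

import requests

response = requests.get("https://link_to_thing.pdf")
with open("myfile.pdf", "wb") as file:
    file.write(response.content)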

You can also use the os package to download with wget:

import os

url = 'https://link_to_pdf.pdf'
name = 'myfile.pdf'

os.system('wget {} -O {}'.format(url, name))
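
If you go the wget route, a sketch using subprocess.run avoids building a shell command string by hand, so URLs containing shell metacharacters can't break the command:

import subprocess

url = 'https://link_to_pdf.pdf'
name = 'myfile.pdf'

# pass arguments as a list; no shell string formatting is involved
subprocess.run(['wget', url, '-O', name], check=True)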

How do I download a PDF file over https with python

I think this will work

import requests
import shutil
url="https://Hostname/saveReport/file_name.pdf" #Note: It's https
r = requests.get(url, auth=('usrname', 'password'), verify=False,stream=True)
r.raw.decode_content = True
with open("file_name.pdf", 'wb') as f:
shutil.copyfileobj(r.raw, f)

Python - Save A PDF file into Disk

If pdf is of type <class 'bytes'>, then you can just do:

with open('yourfile.pdf', 'wb') as f:
    f.write(pdf)

Corrupted pdf when using requests (python)

Add headers to your requests; some sites block or serve an error page to clients without a browser-like User-Agent, which leaves you with a corrupted "PDF":

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_8_7 rv:5.0; en-US) AppleWebKit/533.31.5 (KHTML, like Gecko) Version/4.0 Safari/533.31.5',
}

url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
links = soup.find_all('a')
i = 0

for link in links:
    if '.pdf' in link.get('href', []):
        i += 1
        print("Downloading file: ", i)

        response = requests.get(link.get('href'), headers=headers)

        pdf = open("pdf" + str(i) + ".pdf", 'wb')
        pdf.write(response.content)
        pdf.close()
        print("File ", i, " downloaded")

print("All PDF files downloaded")

How to prevent downloading an empty pdf file while using get and requests in Python?

Thanks to the user for the suggestion.

As per @Nicolas,

Save as PDF only if the response returns 200:

if response.status_code == 200:

In the previous version, an empty file was created regardless of the response, because with open(filename, 'wb') as f: ran before the status_code check.

To mitigate this, with open(filename, 'wb') as f: should only run when the status check passes.

The complete code is then as follows:

import requests
filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
my_req = requests.get(url_to_download_pdf)
if my_req.status_code == 200:
    with open(filename, 'wb') as f:
        f.write(my_req.content)
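
An equivalent sketch using raise_for_status(), which raises requests.HTTPError for any 4xx/5xx response instead of silently skipping the write:

import requests

filename = 'new_name.pdf'
url_to_download_pdf = 'https://bradscholars.brad.ac.uk/https://www.brad.ac.uk/library/additional-help/bradford-scholars-faqs/digital_preservation_policy.pdf'
my_req = requests.get(url_to_download_pdf)
my_req.raise_for_status()  # error out on 4xx/5xx instead of writing a bad file
with open(filename, 'wb') as f:
    f.write(my_req.content)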

Downloading .pdf using requests results in corrupted file

Actually, you haven't passed the parameters required to start the download. If you navigate to the URL, you will see that you need to click Continue in order to start the download. What happens in the background is a GET request to the back end with the parameters ?switchLocale=y&siteEntryPassthrough=true.

You can see this under the Network tab in your browser's developer tools.

import requests

params = {
    'switchLocale': 'y',
    'siteEntryPassthrough': 'true'
}

def main(url, params):
    r = requests.get(url, params=params)
    with open("test.pdf", 'wb') as f:
        f.write(r.content)

main("https://www.blackrock.com/uk/individual/literature/annual-report/blackrock-index-selection-fund-en-gb-annual-report-2019.pdf", params)

Download a pdf embedded in webpage using python2.7

It's much easier with requests:

import requests 

url = 'https://ascopubs.org/doi/pdfdirect/10.1200/JCO.2018.77.8738'
pdfName = "./jco.2018.77.8738.pdf"
r = requests.get(url)

with open(pdfName, 'wb') as f:
    f.write(r.content)

