How to Convert a Webpage into a PDF Using Python

Thanks to the posts below, I was able to print the webpage's URL and the current time on every page of the generated PDF, no matter how many pages it has.

Add text to Existing PDF using Python

https://github.com/disflux/django-mtr/blob/master/pdfgen/doc_overlay.py

Here is the script (written for Python 2, using PyQt4 and pyPdf):

import sys
import time
import StringIO

from pyPdf import PdfFileWriter, PdfFileReader
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from PyQt4.QtCore import *
from PyQt4.QtGui import *
from PyQt4.QtWebKit import *

url = 'http://www.yahoo.com'
tem_pdf = "c:\\tem_pdf.pdf"
final_file = "c:\\younameit.pdf"

app = QApplication(sys.argv)
web = QWebView()
#Read the URL given
web.load(QUrl(url))
printer = QPrinter()
#setting format
printer.setPageSize(QPrinter.A4)
printer.setOrientation(QPrinter.Landscape)
printer.setOutputFormat(QPrinter.PdfFormat)
#export file as c:\tem_pdf.pdf
printer.setOutputFileName(tem_pdf)

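def convertIt():
    # Print the loaded page to the PDF file, then leave the Qt event loop
    web.print_(printer)
    QApplication.exit()

QObject.connect(web, SIGNAL("loadFinished(bool)"), convertIt)

# Blocks until convertIt() calls QApplication.exit()
app.exec_()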

# The code below adds the URL and the current date and time to the generated PDF

packet = StringIO.StringIO()
# Create a new PDF with ReportLab
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 9)
# Write the URL and the timestamp along the bottom edge
oknow = time.strftime("%a, %d %b %Y %H:%M")
can.drawString(5, 2, url)
can.drawString(605, 2, oknow)
can.save()

# Move to the beginning of the StringIO buffer
packet.seek(0)
new_pdf = PdfFileReader(packet)
# Read the existing PDF produced by the printer
existing_pdf = PdfFileReader(file(tem_pdf, "rb"))
pages = existing_pdf.getNumPages()
output = PdfFileWriter()
# Add the "watermark" (which is the new PDF) on every existing page
for x in range(0, pages):
    page = existing_pdf.getPage(x)
    page.mergePage(new_pdf.getPage(0))
    output.addPage(page)
# Finally, write "output" to a real file
outputStream = file(final_file, "wb")
output.write(outputStream)
outputStream.close()

print final_file, 'is ready.'
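
The overlay step above uses Python 2 syntax (the print statement, StringIO, file()). Here is a rough Python 3 sketch of just that step, assuming the pypdf and reportlab packages are installed. The function names footer_text and stamp_pdf are made up for this example, and sizing the overlay to each page is my own choice rather than what the original script does. It ignores page rotation, so treat it as a starting point only.

```python
import io
import time


def footer_text(url, when=None):
    """Return the two strings stamped on each page: the URL and a timestamp."""
    stamp = time.strftime("%a, %d %b %Y %H:%M", when or time.localtime())
    return url, stamp


def stamp_pdf(src_path, dst_path, url):
    """Copy src_path to dst_path with the URL and timestamp on every page."""
    # Third-party packages (assumed installed), imported here so that
    # footer_text can be used without them.
    from pypdf import PdfReader, PdfWriter
    from reportlab.pdfgen import canvas

    left, right = footer_text(url)
    reader = PdfReader(src_path)
    writer = PdfWriter()
    for page in reader.pages:
        # Build a one-page overlay with the same size as the current page
        width = float(page.mediabox.width)
        height = float(page.mediabox.height)
        packet = io.BytesIO()
        can = canvas.Canvas(packet, pagesize=(width, height))
        can.setFont("Helvetica", 9)
        can.drawString(5, 2, left)                # URL, bottom left
        can.drawRightString(width - 5, 2, right)  # timestamp, bottom right
        can.save()
        packet.seek(0)
        # Merge the overlay onto the page and keep the result
        page.merge_page(PdfReader(packet).pages[0])
        writer.add_page(page)
    with open(dst_path, "wb") as f:
        writer.write(f)
```

Usage would look like stamp_pdf("tem_pdf.pdf", "younameit.pdf", "http://www.yahoo.com"), with the first file being whatever the printing step produced.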

How to download a web page as a PDF using Python?

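First of all, the from_url method of the pdfkit module returns True when it is called with an output path and succeeds.

After executing pdf = pdfkit.from_url(url, "file.pdf"), the value of pdf is therefore True, not the PDF's contents or a URL. (When the download or conversion fails, pdfkit appears to raise an IOError rather than return False.)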

So this line
r = requests.get(pdf)
is evaluated as
r = requests.get(True)
which cannot work, because True is not a URL.

Basically, you only need to ask the user for the URL and the path to the output file:

import os
import pdfkit

url = input("Please enter the URL of the page you want to download: ")
path = input(r"Please enter the folder path, e.g. C:\Jim\Desktop: ")
file_name = input("Please enter the file name, e.g. page.pdf: ")
# from_url returns True once the PDF has been written
if pdfkit.from_url(url, os.path.join(path, file_name)):
    print("Successfully created PDF from URL")
else:
    print("Something went wrong")
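
Since a failed conversion seems to surface as an exception, a slightly more defensive variant could catch it. This is only a sketch: output_path and save_page are names invented here, the automatic .pdf extension is my own addition, and it assumes pdfkit and the wkhtmltopdf binary are installed.

```python
import os


def output_path(folder, file_name):
    """Join folder and file name, adding a .pdf extension if it is missing."""
    if not file_name.lower().endswith(".pdf"):
        file_name += ".pdf"
    return os.path.join(folder, file_name)


def save_page(url, folder, file_name):
    """Render url to a PDF file; return the file path, or None on failure."""
    # Third-party package (assumed installed); it also needs wkhtmltopdf.
    import pdfkit

    target = output_path(folder, file_name)
    try:
        pdfkit.from_url(url, target)
    except (IOError, OSError) as exc:
        # Assumption: pdfkit reports wkhtmltopdf problems as IOError/OSError
        print("Could not create PDF:", exc)
        return None
    return target
```

Using os.path.join also avoids the missing separator you get from path + file_name when the folder does not end with a slash.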

Is it possible to save an HTML page as PDF using Python?

Have you tried pdfkit?

It is easy to use (it wraps the wkhtmltopdf tool, which must be installed separately):

import pdfkit
pdfkit.from_file('test.html', 'out.pdf')

Creating PDFs from HTML/Javascript in Python with no OS dependencies

So, to clarify and formalize what others have said:

  • If you want to create PDF documents from HTML/CSS/JavaScript content, you will necessarily need a JavaScript engine (because the JavaScript has to be executed if it affects the visuals of the document). This is the most complex component that you need.

  • As of now, there is no well-maintained, ECMAScript-compliant engine written in pure Python (that would be a huge project). There will probably never be one, since compilers and VMs for languages need to be fast and are thus usually written in a low-level language.

  • So you will always need compiled binaries for that. The same goes for HTML renderers: they are less complex, but they also need to be fast when used in browsers, so they are usually written in C++ or the like as well.

  • The JavaScript engine and the HTML renderer are the major parts of a browser, so a headless browser is a good solution to this requirement.
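
To make the last point concrete, here is a minimal sketch that drives a headless Chrome or Chromium from Python through its --headless and --print-to-pdf command-line flags. It assumes such a browser is installed and on PATH. The executable names in CANDIDATES are guesses that vary by system, and build_command and print_to_pdf are names made up for this example.

```python
import shutil
import subprocess

# Assumed executable names; adjust for your system.
CANDIDATES = ("google-chrome", "chromium", "chromium-browser", "chrome")


def build_command(browser, url, out_path):
    """Return the argument list that asks the browser to print url to a PDF."""
    return [browser, "--headless", f"--print-to-pdf={out_path}", url]


def print_to_pdf(url, out_path):
    """Run the first browser found on PATH and let it write out_path."""
    browser = next((p for p in map(shutil.which, CANDIDATES) if p), None)
    if browser is None:
        raise RuntimeError("No Chrome/Chromium executable found on PATH")
    # check=True raises CalledProcessError if the browser exits with an error
    subprocess.run(build_command(browser, url, out_path), check=True)
```

This still depends on a compiled browser binary, which is exactly the point made above: Python only orchestrates, the browser does the rendering.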

How to Convert HTML Pages to Pdf in Ubuntu using Python?

You can use WeasyPrint

Install:

pip install weasyprint

Code:

import weasyprint
pdf = weasyprint.HTML('http://www.google.com').write_pdf()
# Use a context manager so the file is closed after writing
with open('google.pdf', 'wb') as f:
    f.write(pdf)

From: https://stackoverflow.com/a/34438445/13710015
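
WeasyPrint can also render markup you already have in memory, using the string keyword of weasyprint.HTML instead of a URL. The sketch below picks the keyword based on whether the input looks like an http(s) URL. The helper names html_kwargs and render_pdf are hypothetical, and the scheme check is a simplification I chose for the example.

```python
from urllib.parse import urlparse


def html_kwargs(source):
    """Choose the weasyprint.HTML keyword: 'url' for http(s), else 'string'."""
    if urlparse(source).scheme in ("http", "https"):
        return {"url": source}
    return {"string": source}


def render_pdf(source, out_path):
    """Render a URL or a raw HTML string to out_path."""
    # Third-party package (assumed installed), imported lazily so that
    # html_kwargs can be used on its own.
    import weasyprint

    weasyprint.HTML(**html_kwargs(source)).write_pdf(out_path)
```

For example, render_pdf("<h1>Hello</h1>", "hello.pdf") would render the heading, while render_pdf("http://www.google.com", "google.pdf") fetches the page first.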

Downloading a pdf based webpage as pdf using Python

A more-or-less basic use of the requests package will help you out here. (The only slightly fancy part is writing the result in chunks.)

import requests
outpath = './out.pdf'
url = r"""http://curia.europa.eu/juris/showPdf.jsf;jsessionid=03B8AD93D8D1B1FBB33A15FDA3774709?text=&docid=62809&pageIndex=0&doclang=EN&mode=lst&dir=&occ=first&part=1&cid=2874259"""
r = requests.get(url, stream=True)
if r.status_code == 200:
    # Stream the body to disk in 1 KB chunks
    with open(outpath, 'wb') as f:
        for chunk in r.iter_content(1024):
            f.write(chunk)
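
A status code of 200 does not guarantee that the body is a PDF; an expired session link may well return an HTML error page instead. One possible safeguard, sketched below, is to check for the "%PDF-" signature that PDF files start with before writing anything. The helper save_if_pdf is a name invented for this example, and it works on any iterable of byte chunks, so it does not depend on requests itself.

```python
PDF_MAGIC = b"%PDF-"


def save_if_pdf(chunks, outpath):
    """Write chunks to outpath only if the data starts with the PDF signature.

    Returns True if the file was written, False otherwise.
    """
    chunks = iter(chunks)
    head = b""
    # Collect just enough bytes to inspect the signature
    for chunk in chunks:
        head += chunk
        if len(head) >= len(PDF_MAGIC):
            break
    if not head.startswith(PDF_MAGIC):
        return False
    with open(outpath, "wb") as f:
        f.write(head)
        # The iterator resumes where the first loop stopped
        for chunk in chunks:
            f.write(chunk)
    return True
```

With the code above, this would be called as save_if_pdf(r.iter_content(1024), outpath) in place of the inner with block. Files with stray bytes before the signature would be rejected by this strict check.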

For more fun with requests, see: https://2.python-requests.org//en/master/


