Save Embedded Pdf from Website

Python Download PDF Embedded in a Page

Using Selenium with a specific ChromeProfile you can download embedded pdfs using the following code:

Code:

def download_pdf(lnk):

from selenium import webdriver
from time import sleep

options = webdriver.ChromeOptions()

download_folder = "C:\\"

profile = {"plugins.plugins_list": [{"enabled": False,
"name": "Chrome PDF Viewer"}],
"download.default_directory": download_folder,
"download.extensions_to_open": ""}

options.add_experimental_option("prefs", profile)

print("Downloading file from link: {}".format(lnk))

driver = webdriver.Chrome(chrome_options = options)
driver.get(lnk)

filename = lnk.split("/")[4].split(".cfm")[0]
print("File: {}".format(filename))

print("Status: Download Complete.")
print("Folder: {}".format(download_folder))

driver.close()

And when I call this function:

download_pdf("http://www.equibase.com/premium/eqbPDFChartPlus.cfm?RACE=1&BorP=P&TID=ALB&CTRY=USA&DT=06/17/2002&DAY=D&STYLE=EQB")

Thats the output:

>>> Downloading file from link: http://www.equibase.com/premium/eqbPDFChartPlus.cfm?RACE=1&BorP=P&TID=ALB&CTRY=USA&DT=06/17/2002&DAY=D&STYLE=EQB
>>> File: eqbPDFChartPlus
>>> Status: Download Complete.
>>> Folder: C:\

Sample Image


Take a look at the specific profile:

profile = {"plugins.plugins_list": [{"enabled": False,
"name": "Chrome PDF Viewer"}],
"download.default_directory": download_folder,
"download.extensions_to_open": ""}

It disables the Chrome PDF Viewer plugin (that embedds the pdf at the webpage), set the default download folder to the folder defined at download_folder variable and sets that Chrome isn't allowed to open any extensions automatically.

After that, when you open the so called "Internal link" your webdriver will automatically download the .pdf file to the download_folder.

How to download embedded PDF from webpage using selenium?

Here You go, description in code:

=^..^=

from selenium import webdriver
import os

# initialise browser
browser = webdriver.Chrome(os.getcwd()+'/chromedriver')
# load page with iframe
browser.get('https://www.sebi.gov.in/enforcement/orders/jun-2019/adjudication-order-in-respect-of-three-entities-in-the-matter-of-prism-medico-and-pharmacy-ltd-_43323.html')

# find pdf url
pdf_url = browser.find_element_by_tag_name('iframe').get_attribute("src")
# load page with pdf
browser.get(pdf_url)
# download file
download = browser.find_element_by_xpath('//*[@id="download"]')
download.click()

R: download pdf embedded in a webpage

It is not correct that IE is needed to get effective interaction with that page. Using Firefox or Chrome on a Mac the small printer icon above and to the left of the data columns offers to "Print this report" when moused-over and when clicked causes a file named "CrystalReportViewer1.pdf" to be downloaded. If one then uses the cross-platform browser-plugin application named Tabula, you can extract the data in csv form. The top of the extracted data (on April 1, 2016) looks like:

Syndication ,Gilt Name ,Amount Sold ,Issue ,Issue ,Announcement ,Results 
Date ,"",(£ million ,Price (£) ,Yield ,Press Release ,Press Release
"","",nominal),"","","",""
23 Feb 2016 ,0 1/8% Index-linked Treasury Gilt 2065 ," 2,750.0 ", 163.73 ,-0.8905% ,Announcement ,Results
01 Dec 2015 ,0 1/8% Index-linked Treasury Gilt 2046 ," 3,250.0 ", 129.74 ,-0.7475% ,Announcement ,Results
20 Oct 2015 ,2½% Treasury Gilt 2065 ," 4,750.0 ", 98.40 , 2.5570% ,Announcement ,Results
22 Sep 2015 ,0 1/8% Index-linked Treasury Gilt 2068 ," 2,500.0 ", 166.00 ,-0.8655% ,Announcement ,Results
21 Jul 2015 ,3½% Treasury Gilt 2068 ," 4,000.0 ", 121.31 , 2.7360% ,Announcement ,Results

Instead of trying to extract pdf from that page (which is not in pdf form as far as I can determine) you should instead use RSelenium to download the pdf file to a local drive and process it from there.

This is the button ID:

{'id':'CrystalReportViewer1_toptoolbar_print'}

There are demos in the RSelenium help pages. One is entitled: selDownloadZip.R. It shows how to execute a "click" on a "page Element":

webElem <- remDr$findElement("id", "CrystalReportViewer1_toptoolbar_print")
webElem$clickElement()

Then looking at the "element Inspector" in Firefox's ViewSource panel I see the name of the button ("id", "theBttnbobjid_1459536946505_dialog_submitBtn"), so a further click is needed. However that number changes with each page access, so use webElem <- remDr$findElement("link text", "Export")

 webElem <- remDr$findElement("link text", "Export")
webElem$clickElement()

It would be a good idea to review the webElement-class help page.

JavaScript Sample Image

how to save/download pdf embedded in web page without a pdf filename

The common method in CF for streaming a PDF to the browser is using this method:

<cfheader name="Content-Disposition" value="attachment;filename=#PDFFileName#">
<cfcontent type="application/pdf" reset="true" variable="#toBinary(PDFinMemory)#">

Use a C# WebRequest to get the URL of the PDf. Then check the response header for a 'Content-Type of 'application/pdf'. If so, save the binary stream to a PDF file on disk.

Edit pdf embedded in the browser and save the pdf directly to server

My assumption: you have an interactive PDF document with AcroForm fields similar to submit_me.pdf:

Sample Image

The main difference, is that I have different buttons on the form:

  • POST will post the data I fill out as if the PDF was an HTML form,
  • FDF will post the data to the server in the Forms Data Format, a format that is very similar to PDF, but that only contains the data pairs, not the actual form,
  • XFDF will post the data to the server in the XML Forms Data Format, which is the XML version of the Forms Data Format.
  • RESET will reset all the fields to their initial value.

The SubmitForm example shows how the buttons were added to an existing form. Note that there are more options than the ones described in this example.

For instance:

  • Instead of using the POST method, you can also use GET by tweaking some of the parameters,
  • You can also submit the complete PDF (data + form) to the server, but this will only work if the end user has Adobe Acrobat as plug-in. This won't work if the plug-in is merely Adobe Reader,
  • ...

To show you what to expect when you commit a form, I wrote the ShowData servlet. This servlet returns the bytes that are sent to the server.

In case of a POST:

personal.loginname=jdoe&personal.name=John+Doe&personal.password=test&personal.reason=reason&post.x=29&post.y=7

Note that I also defined the button in such a way that the coordinate of my click is passed to the server. You probably don't need this.

In case of FDF:

%FDF-1.2
%âãÏÓ
1 0 obj
<</FDF<</Fields[<</T(FDF)>><</Kids[<</T(loginname)/V(jdoe)>><</T(name)/V(John Doe)>><</T(password)/V(test)>><</T(reason)/V(Reason)>>]/T(personal)>>]/ID[<EF0089E16ED50F11CB6057A700B9046E><1205D069D1D6AE37665B6FF7EEA65414>]>>/Type/Catalog>>
endobj
trailer
<</Root 1 0 R>>
%%EOF

In case of XFDF:

<?xml version="1.0" encoding="UTF-8"?>
<xfdf xmlns="http://ns.adobe.com/xfdf/" xml:space="preserve"
><f href="http://itextpdf.com:8180/book/submit_me.pdf"
/><fields
><field name="XFDF"
/><field name="personal"
><field name="loginname"
><value
>jdoe</value
></field
><field name="name"
><value
>John Doe</value
></field
><field name="password"
><value
>test</value
></field
><field name="reason"
><value
>Reason</value
></field
></field
></fields
><ids original="EF0089E16ED50F11CB6057A700B9046E" modified="1205D069D1D6AE37665B6FF7EEA65414"
/></xfdf
>

In an ideal world, this would be your solution. It is described in ISO-32000-1 which is the world-wide standard for PDF. However: many people started using crappy PDF viewers that do not support this functionality, so if you want to use this solution, you'll have to make sure that people use a decent PDF viewer as their browser plug-in.

how to access embedded pdf file in web page

This ones really tricky since it isn't actually a pdf file that is being downloaded. Your code would work if it was a normal pdf. This is a webpage that runs some javascript that posts back to itself to generate the pdf.

I have the answer to your immediate problem but you may have a long road ahead of you if you need to do this for many files. To get this to work I ran the page through Fiddler to get the post string that it was posting back to itself and then I used C# to emulate that process and save the result as a pdf.

This works great but the problem is that without that manual step of getting the post string via Fiddler you would basically have to create your own web browser that understands all the javascript codes and execute them to find out how the string is generated.

    using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;

namespace WindowsFormsApplication1
{
public partial class Form1 : Form
{
public Form1()
{
InitializeComponent();
}

private void Form1_Load(object sender, EventArgs e)
{
string sURL = "http://emaps.emapsplus.com/rdl/MadisonCoAl/MadisonCoAl.aspx?showimg=yes&pid=1701013003029000";
string sSource = "";

byte[] buffer = new byte[4096];
WebRequest wr = WebRequest.Create(sURL);
using (WebResponse response = wr.GetResponse())
{
using (Stream responseStream = response.GetResponseStream())
{
using (MemoryStream memoryStream = new MemoryStream())
{
int count = 0;
do
{
count = responseStream.Read(buffer, 0, buffer.Length);
memoryStream.Write(buffer, 0, count);

} while (count != 0);

sSource = System.Text.Encoding.UTF8.GetString(memoryStream.ToArray());
}
}
}

if (!string.IsNullOrEmpty(sSource))
{
const string sQuoteString = "\""; // If the values are not being found it could be because the markup is being output with single quotes change this variable from "\"" to "'" in that case
const string sViewStateString = "__VIEWSTATE";
const string sEventValidationString = "__EVENTVALIDATION";
const string sValueString = "value=" + sQuoteString;

Int32 nIndex1 = sSource.IndexOf(sViewStateString);
Int32 nIndex2 = default(Int32);
bool bFoundValues = false;
string sViewState = "";
string sEventValidation = "";

// Look for the view state and event validation tags and grab the values
// Without these values we cannot continue

if (nIndex1 > -1)
{
nIndex2 = sSource.IndexOf(sValueString, nIndex1);

if (nIndex2 > -1)
{
nIndex1 = sSource.IndexOf(sQuoteString, nIndex2 + sValueString.Length);

if (nIndex1 > -1)
{
sViewState = sSource.Substring(nIndex2 + sValueString.Length, nIndex1 - nIndex2 - sValueString.Length);
nIndex1 = sSource.IndexOf(sEventValidationString);

if (nIndex1 > -1)
{
nIndex2 = sSource.IndexOf(sValueString, nIndex1);

if (nIndex2 > -1)
{
nIndex1 = sSource.IndexOf(sQuoteString, nIndex2 + sValueString.Length);

if (nIndex1 > -1)
{
sEventValidation = sSource.Substring(nIndex2 + sValueString.Length, nIndex1 - nIndex2 - sValueString.Length);
bFoundValues = true;
}
}
}
}
}
}

if (bFoundValues == true)
{
Int32 nTimeout = 30;
HttpWebRequest oRequest = HttpWebRequest.Create(new Uri(sURL).AbsoluteUri) as HttpWebRequest;
string sPostData = "__EVENTTARGET=btnPageLoad&__EVENTARGUMENT=&__VIEWSTATE=" + System.Web.HttpUtility.UrlEncode(sViewState) + "&__EVENTVALIDATION=" + System.Web.HttpUtility.UrlEncode(sEventValidation) + "&hdnLoaded=false&ReportViewer1%24ctl03%24ctl00=&ReportViewer1%24ctl03%24ctl01=&ReportViewer1%24ctl11=&ReportViewer1%24ctl12=standards&ReportViewer1%24AsyncWait%24HiddenCancelField=False&ReportViewer1%24ToggleParam%24store=&ReportViewer1%24ToggleParam%24collapse=false&ReportViewer1%24ctl09%24ClientClickedId=&ReportViewer1%24ctl08%24store=&ReportViewer1%24ctl08%24collapse=false&ReportViewer1%24ctl10%24VisibilityState%24ctl00=Error&ReportViewer1%24ctl10%24ScrollPosition=&ReportViewer1%24ctl10%24ReportControl%24ctl02=&ReportViewer1%24ctl10%24ReportControl%24ctl03=&ReportViewer1%24ctl10%24ReportControl%24ctl04=100";
byte[] oPostDataBuffer = System.Text.Encoding.ASCII.GetBytes(sPostData);

oRequest.UserAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10; rv:33.0) Gecko/20100101 Firefox/33.0";
oRequest.Timeout = nTimeout * 1000;
oRequest.Method = "POST";
oRequest.ContentType = "application/x-www-form-urlencoded";
oRequest.ContentLength = oPostDataBuffer.Length;

using (Stream oRequestStream = oRequest.GetRequestStream())
{
oRequestStream.Write(oPostDataBuffer, 0, oPostDataBuffer.Length);
oRequestStream.Close();
}

HttpWebResponse oResponse = oRequest.GetResponse() as HttpWebResponse;

if (oResponse.StatusCode != HttpStatusCode.OK)
{
// Error

MessageBox.Show(oResponse.StatusCode.ToString());
}
else
{
// Status is OK

byte[] oBuffer = null;
byte[] oFile = null;

using (BinaryReader reader = new BinaryReader(oResponse.GetResponseStream()))
{
using (MemoryStream oMemoryStream = new MemoryStream())
{

oBuffer = reader.ReadBytes(1024);

while (oBuffer.Length > 0)
{
oMemoryStream.Write(oBuffer, 0, oBuffer.Length);
oBuffer = reader.ReadBytes(1024);
}

oFile = new byte[Convert.ToInt32(Math.Floor(Convert.ToDouble(oMemoryStream.Length)))];
oMemoryStream.Position = 0;
oMemoryStream.Read(oFile, 0, oFile.Length);
}
}

using (FileStream oFileStream = new FileStream("C:\\testpdf.pdf", FileMode.Create))
{
oFileStream.Write(oFile, 0, oFile.Length);
}
}

MessageBox.Show("PDF downloaded to C:\\testpdf.pdf");
}
else
{
MessageBox.Show("Cannot find the __VIEWSTATE and/or __EVENTVALIDATION variables.");
}
}
else
{
MessageBox.Show("Cannot find source code for original url.");
}
}
}
}

Update:

There is most likely some kind of session involved with the post data and it already timed out before you were able to test it. Therefore this means we have to get even more creative. I will admit this has been a bit of trial and error and this code is really tailored to this one url and may or may not work for other pdfs on the same website. By comparing the sPostData string I posted earlier with a new one I just grabbed now by running the website through the Fiddler proxy, I discovered that only 2 of the many variables that are posted have changed. Both of these 2 variables are available in the html source code that can be generated from your original C# code. So all we do is a little string manipulation and grab a copy of those variables and THEN do the original code that I posted. See? We are working together now! This updated code should now work every time without giving the 500 internal server error message.

NOTE: Because we are posting data that has not been properly encoded for the web, we must include a reference to system.web to gain access to the urlencode method. To do this you need to:

  1. Right click on "References" in the "Solution Explorer" and choose "Add Reference"
  2. Click into "Assemblies" on the left and find "System.Web" in the "Framework" section or use the search box on the far right
  3. Put a check next to "System.Web" and click "OK"

Here's code to streamline capturing the postback data

        var postData = new System.Collections.Generic.List<string>();
var document = new HtmlWeb().Load(url);
foreach (var input in document.DocumentNode.SelectNodes("//input[@type='hidden']"))
{
var name = input.GetAttributeValue("name", "");
name = Uri.EscapeDataString(name);
var value = input.GetAttributeValue("value", "");
value = Uri.EscapeDataString(value);
if (name == "__EVENTTARGET")
{
value = "btnPageLoad";
}

postData.Add(string.Format("{0}={1}", name, value));
}

// Use StringBuilder for concatenation
System.Text.StringBuilder sb = new System.Text.StringBuilder(postData[0]);
for (int i = 1; i < postData.Count; i++)
{
sb.Append("&");
sb.Append(postData[i]);
}
var postBody = sb.ToString();

Download Embedded PDF File

Note: This will only work for pdf files. If there is a mix of embedded files then this will not work.

Basic Preparations:

  1. Let's say our Excel File C:\Users\routs\Desktop\Sample.xlsx has 2 Pdf Files embedded as shown below.

    Sample Image

  2. For testing purpose, we will create a temp folder on our desktop C:\Users\routs\Desktop\Temp.

Logic:

  1. The Excel file is essentially just a .zip file
  2. Excel saves the oleObjects in the \xl\embeddings\ folder. If you rename the Excel file to zip and open it in say Winzip, you can see the following

    Sample Image

  3. If you extract the bin files and rename it to pdf then you will be able to open the pdf in Microsoft Edge but not in any other pdf viewer. To make it compatible with any other pdf viewer, we will have to do some Binary reading and editing.

  4. If you open the bin file in any Hex Editor, you will see the below. I used the online hex editor https://hexed.it/

    JavaScript Sample Image 7

    We have to delete everything before the word %PDF

    We will try and find the 8 bit unsigned values of %PDF... Or more specifically of %, P, D and F

    If you scroll down in the hex editor, you will get those four values

    Value of %
    JavaScript Sample Image 8

    Value of P
    JavaScript Sample Image 9

    Value of D
    JavaScript Sample Image 10

    Value of F
    JavaScript Sample Image 11

    Now all we have to do is read the binary file and delete everything before %PDF and save the file with a .Pdf extention.

Code:

Option Explicit

Const TmpPath As String = "C:\Users\routs\Desktop\Temp"
Const ExcelFile As String = "C:\Users\routs\Desktop\Sample.xlsx"
Const ZipName As String = "C:\Users\routs\Desktop\Sample.zip"

Sub ExtractPDF()
Dim tmpPdf As String
Dim oApp As Object
Dim i As Long

'~~> Deleting any previously created files. This is
'~~> usually helpful from 2nd run onwards
On Error Resume Next
Kill ZipName
Kill TmpPath & "\*.*"
On Error GoTo 0

'~~> Copy and rename the Excel file as zip file
FileCopy ExcelFile, ZipName

Set oApp = CreateObject("Shell.Application")

'~~> Extract the bin file from xl\embeddings\
For i = 1 To oApp.Namespace(ZipName).items.Count
oApp.Namespace(TmpPath).CopyHere oApp.Namespace(ZipName).items.Item("xl\embeddings\oleObject" & i & ".bin")

tmpPdf = TmpPath & "\oleObject" & i & ".bin"

'~~> Read and Edit the Bin File
If Dir(tmpPdf) <> "" Then ReadAndWriteExtractedBinFile tmpPdf
Next i

MsgBox "Done"
End Sub

'~~> Read and ReWrite Bin File
Sub ReadAndWriteExtractedBinFile(s As String)
Dim intFileNum As Long, bytTemp As Byte
Dim MyAr() As Long, NewAr() As Long
Dim fileName As String
Dim i As Long, j As Long, k As Long

j = 1

intFileNum = FreeFile

'~~> Open the bing file
Open s For Binary Access Read As intFileNum
'~~> Get the number of lines in the bin file
Do While Not EOF(intFileNum)
Get intFileNum, , bytTemp
j = j + 1
Loop

'~~> Create an array to store the filtered results of the bin file
'~~> We will use this to recreate the bin file
ReDim MyAr(1 To j)
j = 1

'~~> Go to first record
If EOF(intFileNum) Then Seek intFileNum, 1

'~~> Store the contents of bin file in an array
Do While Not EOF(intFileNum)
Get intFileNum, , bytTemp
MyAr(j) = bytTemp
j = j + 1
Loop
Close intFileNum

'~~> Check for the #PDF and Filter out rest of the data
For i = LBound(MyAr) To UBound(MyAr)
If i = UBound(MyAr) - 4 Then Exit For
If Val(MyAr(i)) = 37 And Val(MyAr(i + 1)) = 80 And _
Val(MyAr(i + 2)) = 68 And Val(MyAr(i + 3)) = 70 Then
ReDim NewAr(1 To j - i + 2)

k = 1
For j = i To UBound(MyAr)
NewAr(k) = MyAr(j)
k = k + 1
Next j

Exit For
End If
Next i

intFileNum = FreeFile

'~~> Decide on the new name of the pdf file
'~~> Format(Now, "ddmmyyhhmmss") This method will awlays ensure that
'~~> you will get a unique filename
fileName = TmpPath & "\" & Format(Now, "ddmmyyhhmmss") & ".pdf"

'~~> Write the new binary file
Open fileName For Binary Lock Read Write As #intFileNum
For i = LBound(NewAr) To UBound(NewAr)
Put #intFileNum, , CByte(NewAr(i))
Next i

Close #intFileNum
End Sub

Output

JavaScript Sample Image 12

How can I download a PDF file from an URL where the PDF is embedded into the HTML?

You can download pdf using requests and BeautifulSoup libraries. In code below replace /Users/../aaa.pdf with full path where document will be downloaded:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='

response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")

VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]

data = {
'__VIEWSTATE': VIEWSTATE,
'__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
'__EVENTVALIDATION': EVENTVALIDATION,
'btnDocument': btnDocument
}
response = requests.post(url, data=data)
with open('/Users/../aaa.pdf', 'wb') as f:
f.write(response.content)


Related Topics



Leave a reply



Submit