Python BeautifulSoup Iframe Document HTML Extract

python beautifulsoup iframe document html extract

Browsers load the iframe content in a separate request. You'll have to do the same:

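from urllib.request import urlopen  # Python 3 replacement for urllib2

iframexx = soup.find_all('iframe')  # soup is the already parsed parent page
for iframe in iframexx:
    response = urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response, 'html.parser')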

Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
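One practical detail: an iframe's src is often relative or protocol-relative, so it may need to be resolved against the parent page's URL before it can be fetched. Here is a minimal sketch of that step, assuming a made-up page URL and inline markup (both hypothetical) and leaving the actual request out:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical parent page URL and markup, used only for illustration.
page_url = "https://example.com/articles/post.html"
html = """
<html><body>
  <iframe src="/embed/player.html"></iframe>
  <iframe src="//cdn.example.com/widget.html"></iframe>
  <iframe></iframe>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

iframe_urls = []
for iframe in soup.find_all("iframe"):
    src = iframe.get("src")  # None when the attribute is missing
    if src:
        # urljoin resolves relative and protocol-relative URLs against page_url.
        iframe_urls.append(urljoin(page_url, src))

print(iframe_urls)
```

Each entry in iframe_urls can then be requested and parsed with its own BeautifulSoup object, as in the loop above.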

Extract iframes using BeautifulSoup with Python

I found out how to fix the problem.

I changed the code from:

iframe = soup.find('iframe')

to:

iframes = soup.find_all('iframe')

Then, instead of getting None as a response, I began to receive [], an empty list.

I tested it using:

if iframes:
    print(iframes[0]['src'])

I got the content of src using iframes[0]['src'].
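The difference between the two calls can be checked offline. This is a small sketch on made-up markup (the HTML strings and the example.com URL are assumptions), showing that find returns None and find_all returns an empty list when nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one document without iframes, one with a single iframe.
empty = BeautifulSoup("<p>no frames here</p>", "html.parser")
framed = BeautifulSoup(
    '<iframe src="https://example.com/a.html"></iframe>', "html.parser"
)

no_match_single = empty.find("iframe")    # None when nothing matches
no_match_all = empty.find_all("iframe")   # empty list-like ResultSet

# An empty result is falsy, so a plain truth test guards the index access.
first_src = None
iframes = framed.find_all("iframe")
if iframes:
    first_src = iframes[0]["src"]

print(no_match_single, no_match_all, first_src)
```

If find_all still returns an empty list on the real page, the iframe is probably added by JavaScript after the initial HTML is served.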

Python beautifulsoup iframe text extract

There is no need to use regex here.

A much easier way is to use the attrs property of BeautifulSoup's elements:

from urllib.request import urlopen
from bs4 import BeautifulSoup
path='https://www.esquire.com/entertainment/tv/g28380481/best-anime-2019/'
f = urlopen(path)
html = f.read()  # pass the raw bytes; str() would produce the "b'...'" repr instead
soup = BeautifulSoup(html, 'html.parser')
txt = soup.find_all('iframe')

for element in txt:
    print(element.attrs["data-src"][2:])

Which produces the same results:

www.youtube.com/embed/6M7f41OJfcM?enablejsapi=1
www.youtube.com/embed/0glqBjvku84?enablejsapi=1
www.youtube.com/embed/YKJf876thxw?enablejsapi=1
www.youtube.com/embed/SdFgPGSmy0Y?enablejsapi=1
www.youtube.com/embed/Ie-bo3IulmY?enablejsapi=1
www.youtube.com/embed/ApLudqucq-s?enablejsapi=1
www.youtube.com/embed/FpRk3m3Y-Zg?enablejsapi=1
www.youtube.com/embed/J9tu253SOas?enablejsapi=1
www.youtube.com/embed/lCPf9SA4mgU?enablejsapi=1
www.youtube.com/embed/neqxQdpTyXE?enablejsapi=1

You can read more about how to process attributes here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
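Indexing attrs directly raises KeyError as soon as one iframe lacks data-src. A more defensive variant might look like the sketch below, which assumes made-up markup (the video ID and example.com URL are placeholders) and falls back to src when data-src is absent:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing lazy-loaded (data-src) and regular (src) iframes.
html = """
<iframe data-src="//www.youtube.com/embed/VIDEO_ID_1?enablejsapi=1"></iframe>
<iframe src="https://example.com/frame.html"></iframe>
<iframe></iframe>
"""
soup = BeautifulSoup(html, "html.parser")

sources = []
for element in soup.find_all("iframe"):
    # Tag.get returns None instead of raising KeyError for a missing attribute.
    source = element.get("data-src") or element.get("src")
    if not source:
        continue  # skip iframes that carry neither attribute
    # Strip the leading "//" only when it is actually there.
    if source.startswith("//"):
        source = source[2:]
    sources.append(source)

print(sources)
```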

How to extract the following src (iframe) from the code using python (BeautifulSoup)

To simulate the POST request this site sends, you can use this example:

import requests
from bs4 import BeautifulSoup

url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = {}
for inp in soup.select("input[value]"):
    data[inp["name"]] = inp["value"]

data["ctl00$MainContent$txtNumero"] = "3001" # <-- this is your number
data["ctl00$MainContent$ddlEspecie"] = ""
data["ctl00$MainContent$ddlAno"] = ""
data["ctl00$MainContent$txtConteudo"] = ""
data["ctl00$MainContent$txtEmenta"] = ""
data["ctl00$MainContent$imgBuscar.x"] = "1"
data["ctl00$MainContent$imgBuscar.y"] = "9"

soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
print(soup.iframe["src"])

Prints:

../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf
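The printed src is relative to the page and contains a backslash, so it cannot be requested as-is. One way to turn it into a full URL is sketched below. It assumes the backslash is meant as a path separator, the way a browser would treat it. That is an assumption about this server, not something the page confirms.

```python
from urllib.parse import urljoin

url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
src = r"../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf"

# Assumption: the backslash stands for "/", as browsers interpret it in http URLs.
normalized = src.replace("\\", "/")

# urljoin resolves the leading "../" against the directory of the form page.
pdf_url = urljoin(url, normalized)
print(pdf_url)
```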

EDIT: To get multiple pages:

import requests
from bs4 import BeautifulSoup

url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

data = {}
for inp in soup.select("input[value]"):
    data[inp["name"]] = inp["value"]

data["ctl00$MainContent$ddlEspecie"] = ""
data["ctl00$MainContent$ddlAno"] = ""
data["ctl00$MainContent$txtConteudo"] = ""
data["ctl00$MainContent$txtEmenta"] = ""
data["ctl00$MainContent$imgBuscar.x"] = "1"
data["ctl00$MainContent$imgBuscar.y"] = "9"

for i in range(3000, 3010):
    data["ctl00$MainContent$txtNumero"] = i

    s = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    if s.find("iframe"):
        print(i, s.iframe["src"])
    else:
        print(i, "Not Found")

Prints:

3000 Not Found
3001 ../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf
3002 Not Found
3003 ../procuradoriacg/Leis\1994/8279_LEI30031994pag0001_strDocumentoOficial.pdf
3004 Not Found
3005 Not Found
3006 ../procuradoriacg/Leis\1994/8282_LEI30061994pag0001_strDocumentoOficial.pdf
3007 Not Found
3008 Not Found
3009 Not Found
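The form-field collection step used in both snippets can be tried offline. The sketch below runs the same input[value] selector on made-up form markup (the field values are hypothetical) and additionally skips inputs that have no name, which would otherwise raise KeyError:

```python
from bs4 import BeautifulSoup

# Made-up form markup; real ASP.NET pages carry much longer hidden values.
html = """
<form>
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="text" name="ctl00$MainContent$txtNumero" value="">
  <input type="text" name="noValueAttr">
  <input type="submit" value="Go">
</form>
"""
soup = BeautifulSoup(html, "html.parser")

data = {}
# input[value] matches every input that has a value attribute, even an empty one.
for inp in soup.select("input[value]"):
    name = inp.get("name")  # unnamed inputs are not part of a submitted form
    if name:
        data[name] = inp["value"]

print(data)
```

Sending the hidden fields back unchanged is what lets the server accept the POST as if it came from the rendered form.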

Python BeautifulSoup - Scrape Web Content Inside Iframes

You just need to obtain the src attribute of the iframe, and then request and parse its content:

import requests
from bs4 import BeautifulSoup

s = requests.Session()
r = s.get("https://www.aliexpress.com/store/feedback-score/1665279.html")

soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]

r = s.get(f"https:{iframe_src}")

soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select(".history-tb tr"):
    print("\t".join([e.text for e in row.select("th, td")]))

Result:

Feedback                  1 Month   3 Months   6 Months
Positive (4-5 Stars)      154       562        1,550
Neutral (3 Stars)         8         19         65
Negative (1-2 Stars)      8         20         57
Positive feedback rate    95.1%     96.6%      96.5%
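If the rows are needed as data rather than printed text, the same selectors can build a list of lists. This is only a sketch on made-up markup that imitates the table's class name, not the real AliExpress page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating the structure selected above.
html = """
<table class="history-tb">
  <tr><th>Feedback</th><th>1 Month</th></tr>
  <tr><td> Neutral (3 Stars) </td><td>8</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

rows = []
for row in soup.select(".history-tb tr"):
    # get_text(strip=True) trims the whitespace that .text would keep.
    rows.append([cell.get_text(strip=True) for cell in row.select("th, td")])

print(rows)
```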

extract iFrame content using BeautifulSoup

Browsers load the iframe content in a separate request, so you'll need to fetch the URL given in the iframe's src attribute. You can use Selenium if you want, or request the data directly.
Here is an example:

import requests
import re

url = 'https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false'

response = requests.get(url)

Artist = re.search(b'(?<=artist":")(.*?)(?=")', response.content).group(0).decode("utf-8")
Song = re.search(b'(?<=title":")(.*?)(?=")', response.content).group(0).decode("utf-8")

print("%s - %s" % (Artist, Song))

Prints:

Private Life - Lost Boy
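Note that re.search returns None when a field is missing, so calling .group(0) directly would raise AttributeError. A guarded version of the same lookbehind pattern could look like this sketch, where the sample payload and the extract helper are hypothetical and only illustrate the idea:

```python
import re

# Hypothetical fragment shaped like the data the patterns above expect.
content = b'{"artist":"Example Artist","title":"Example Song"}'


def extract(field: bytes, payload: bytes):
    """Return the quoted value that follows field":" or None if it is absent."""
    pattern = b'(?<=' + re.escape(field) + b'":")(.*?)(?=")'
    match = re.search(pattern, payload)
    return match.group(0).decode("utf-8") if match else None


artist = extract(b"artist", content)
song = extract(b"title", content)
missing = extract(b"album", content)  # not in the payload

print(artist, song, missing)
```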

Scraping #document from an iframe tag using beautifulsoup

However, The Guardian serves the entire data set as a plain .csv file, which you can see if you look at the requests in the browser's developer tools.

Here's how to grab the data for COVID-19 global deaths:

import shutil

import requests

url = "https://interactive.guim.co.uk/2020/coronavirus-jh-timeline-data/time_series_covid19_deaths_global.csv"
data = requests.get(url, stream=True)
if data.status_code == 200:
    with open("covid19_data.csv", 'wb') as f:
        data.raw.decode_content = True
        shutil.copyfileobj(data.raw, f)

If you swap the last part of the URL for time_series_covid19_confirmed_global.csv, you get the confirmed-cases data back as a .csv file instead.

How to get html in python inside #document tag?

I usually use Selenium to handle these situations.
Basically, you have to switch into the iframe to get its content.



