python beautifulsoup iframe document html extract
Browsers load the iframe content in a separate request. You'll have to do the same:
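from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen
for iframe in iframexx:  # iframexx: the collection of <iframe> tags found earlier
    response = urlopen(iframe.attrs['src'])
    iframe_soup = BeautifulSoup(response, 'html.parser')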
Remember: BeautifulSoup is not a browser; it won't fetch images, CSS and JavaScript resources for you either.
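The src of an iframe is not always an absolute URL, so it may need resolving before it can be fetched. A minimal sketch, assuming the page URL is known: the helper name iframe_urls, the sample HTML and the example.com addresses are made up for illustration, and urllib.parse.urljoin does the resolving.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def iframe_urls(html, base_url):
    """Return absolute URLs for every <iframe> in html that has a src."""
    soup = BeautifulSoup(html, "html.parser")
    urls = []
    for iframe in soup.find_all("iframe"):
        src = iframe.get("src")  # None when the attribute is missing
        if src:
            # urljoin handles relative paths and scheme-relative "//host/..." values
            urls.append(urljoin(base_url, src))
    return urls


# Made-up sample page, for illustration only.
sample_html = """
<iframe src="/embed/player.html"></iframe>
<iframe src="//cdn.example.com/widget.html"></iframe>
<iframe></iframe>
"""
print(iframe_urls(sample_html, "https://example.com/articles/1"))
```

Each returned URL can then be requested and parsed separately, exactly as in the loop above.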
Extract iframes using BeautifulSoup with Python
I found out how to fix the problem.
I changed the code from:
iframe = soup.find('iframe')
to:
iframes = soup.find_all('iframe')
Then, instead of getting None as a response, I began to receive [], an empty list.
I tested it using:
if iframes:
    print(iframes[0]['src'])
I got the content of src using iframes[0]['src'].
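The difference between the two calls can be checked on a small document. This is only a sketch on made-up HTML that contains no iframe: find returns None when nothing matches, while find_all returns an empty, list-like result, which is falsy.

```python
from bs4 import BeautifulSoup

# Made-up document without any iframe, for illustration only.
soup = BeautifulSoup("<p>No frames here</p>", "html.parser")

first = soup.find("iframe")            # a single Tag, or None if nothing matches
all_frames = soup.find_all("iframe")   # a list-like result, possibly empty

# An empty result is falsy, so a plain truth test guards the index access.
if all_frames:
    print(all_frames[0].get("src"))
else:
    print("no iframe found in this HTML")
```

An empty result often means the iframe is not part of the HTML that was downloaded, for example because a script inserts it later.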
Python beautifulsoup iframe text extract
There is no need to use regex here.
A much easier way could be to use the attrs property of BeautifulSoup's elements, like:
from urllib.request import urlopen
from bs4 import BeautifulSoup
path='https://www.esquire.com/entertainment/tv/g28380481/best-anime-2019/'
f = urlopen(path)
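html = f.read()  # pass the raw bytes; str() on bytes would give the "b'...'" repr
soup = BeautifulSoup(html, 'html.parser')
txt = soup.find_all('iframe')
for element in txt:
    # [2:] drops the first two characters (the leading "//" of a scheme-relative URL)
    print(element.attrs["data-src"][2:])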
Which produces the same results:
www.youtube.com/embed/6M7f41OJfcM?enablejsapi=1
www.youtube.com/embed/0glqBjvku84?enablejsapi=1
www.youtube.com/embed/YKJf876thxw?enablejsapi=1
www.youtube.com/embed/SdFgPGSmy0Y?enablejsapi=1
www.youtube.com/embed/Ie-bo3IulmY?enablejsapi=1
www.youtube.com/embed/ApLudqucq-s?enablejsapi=1
www.youtube.com/embed/FpRk3m3Y-Zg?enablejsapi=1
www.youtube.com/embed/J9tu253SOas?enablejsapi=1
www.youtube.com/embed/lCPf9SA4mgU?enablejsapi=1
www.youtube.com/embed/neqxQdpTyXE?enablejsapi=1
You can read more about how to process attributes here: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#attributes
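Indexing attrs raises a KeyError when an iframe has no data-src attribute. As a hedged sketch, the hypothetical helper lazy_src below uses Tag.get instead, which returns None for a missing attribute, and falls back to src. The sample markup is made up, apart from the first embed URL, which is taken from the output above.

```python
from bs4 import BeautifulSoup


def lazy_src(tag):
    """Return data-src if present, else src, else None."""
    return tag.get("data-src") or tag.get("src")


# Made-up markup for illustration only.
sample = """
<iframe data-src="//www.youtube.com/embed/6M7f41OJfcM?enablejsapi=1"></iframe>
<iframe src="https://example.com/frame.html"></iframe>
<iframe></iframe>
"""
soup = BeautifulSoup(sample, "html.parser")

sources = [lazy_src(tag) for tag in soup.find_all("iframe")]
for value in sources:
    if value is not None:  # skip iframes without any usable attribute
        print(value)
```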
How to extract the following src (iframe) from the code using python (BeautifulSoup)
To simulate the POST request that this site sends, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = {}
for inp in soup.select("input[value]"):
    data[inp["name"]] = inp["value"]
data["ctl00$MainContent$txtNumero"] = "3001" # <-- this is your number
data["ctl00$MainContent$ddlEspecie"] = ""
data["ctl00$MainContent$ddlAno"] = ""
data["ctl00$MainContent$txtConteudo"] = ""
data["ctl00$MainContent$txtEmenta"] = ""
data["ctl00$MainContent$imgBuscar.x"] = "1"
data["ctl00$MainContent$imgBuscar.y"] = "9"
soup = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
print(soup.iframe["src"])
Prints:
../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf
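The field-collecting loop can be tried offline. The sketch below runs it on a made-up, heavily shortened form; the hidden field names __VIEWSTATE and __EVENTVALIDATION are an assumption based on the page looking like an ASP.NET form, and their values are placeholders. It also skips inputs that have no name, which the loop above would fail on.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for the real form; names and values are placeholders.
form_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789">
  <input type="text" name="ctl00$MainContent$txtNumero">
  <input type="submit" value="Go">
</form>
"""
soup = BeautifulSoup(form_html, "html.parser")

data = {}
for inp in soup.select("input[value]"):  # only inputs that carry a value
    name = inp.get("name")
    if name:  # inputs without a name are not submitted
        data[name] = inp["value"]

data["ctl00$MainContent$txtNumero"] = "3001"  # the search field filled in by hand
print(data)
```

Copying the hidden fields from the GET response into the POST body is what lets the server accept the simulated form submission.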
EDIT: To get multiple pages:
import requests
from bs4 import BeautifulSoup
url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
data = {}
for inp in soup.select("input[value]"):
    data[inp["name"]] = inp["value"]
data["ctl00$MainContent$ddlEspecie"] = ""
data["ctl00$MainContent$ddlAno"] = ""
data["ctl00$MainContent$txtConteudo"] = ""
data["ctl00$MainContent$txtEmenta"] = ""
data["ctl00$MainContent$imgBuscar.x"] = "1"
data["ctl00$MainContent$imgBuscar.y"] = "9"
for i in range(3000, 3010):
    data["ctl00$MainContent$txtNumero"] = i
    s = BeautifulSoup(requests.post(url, data=data).content, "html.parser")
    if s.find("iframe"):
        print(i, s.iframe["src"])
    else:
        print(i, "Not Found")
Prints:
3000 Not Found
3001 ../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf
3002 Not Found
3003 ../procuradoriacg/Leis\1994/8279_LEI30031994pag0001_strDocumentoOficial.pdf
3004 Not Found
3005 Not Found
3006 ../procuradoriacg/Leis\1994/8282_LEI30061994pag0001_strDocumentoOficial.pdf
3007 Not Found
3008 Not Found
3009 Not Found
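The printed src values are relative and contain a backslash, so they cannot be requested as they are. A minimal sketch of turning one into an absolute URL, assuming the backslash is a Windows-style path separator that the server would also accept as a forward slash (this is not confirmed by the source):

```python
from urllib.parse import urljoin

page_url = "http://191.253.16.180:8080/ConsultaLei/Default.aspx"
src = r"../procuradoriacg/Leis\1994/8277_LEI30011994pag0001_strDocumentoOficial.pdf"

# Assumption: "\" stands for "/" in this path.
normalized = src.replace("\\", "/")

# urljoin resolves "../" against the directory of the page URL.
pdf_url = urljoin(page_url, normalized)
print(pdf_url)
```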
Python BeautifulSoup - Scrape Web Content Inside Iframes
You just need to obtain the src attribute of the iframe, and then request and parse its content:
import requests
from bs4 import BeautifulSoup
s = requests.Session()
r = s.get("https://www.aliexpress.com/store/feedback-score/1665279.html")
soup = BeautifulSoup(r.content, "html.parser")
iframe_src = soup.select_one("#detail-displayer").attrs["src"]
r = s.get(f"https:{iframe_src}")
soup = BeautifulSoup(r.content, "html.parser")
for row in soup.select(".history-tb tr"):
    print("\t".join([e.text for e in row.select("th, td")]))
Result:
Feedback 1 Month 3 Months 6 Months
Positive (4-5 Stars) 154 562 1,550
Neutral (3 Stars) 8 19 65
Negative (1-2 Stars) 8 20 57
Positive feedback rate 95.1% 96.6% 96.5%
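If the rows are needed as data rather than printed text, the same selectors can feed a list of dictionaries. This is a sketch on a made-up two-row table: the markup is an assumption about how the iframe document is structured, and only the cell values are taken from the result above.

```python
from bs4 import BeautifulSoup

# Assumed markup; only the cell values come from the printed result.
table_html = """
<table class="history-tb">
  <tr><th>Feedback</th><th>1 Month</th><th>3 Months</th><th>6 Months</th></tr>
  <tr><td>Neutral (3 Stars)</td><td>8</td><td>19</td><td>65</td></tr>
</table>
"""
soup = BeautifulSoup(table_html, "html.parser")

rows = [
    [cell.get_text(strip=True) for cell in tr.select("th, td")]
    for tr in soup.select(".history-tb tr")
]
header, *body = rows  # first row holds the column names
records = [dict(zip(header, row)) for row in body]
print(records)
```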
extract iFrame content using BeautifulSoup
Browsers will load the iframe content in a separate request, so you'll need to fetch the URL that is present in the iframe src. You can use Selenium if you want, or scrape the data itself directly.
Here is an example:
import requests
import re
url = 'https://w.soundcloud.com/player/?url=https%3A//api.soundcloud.com/tracks/310079005&color=ff5500&auto_play=false&hide_related=false&show_comments=true&show_user=true&show_reposts=false'
response = requests.get(url)
Artist = re.search(b'(?<=artist":")(.*?)(?=")', response.content).group(0).decode("utf-8")
Song = re.search(b'(?<=title":")(.*?)(?=")', response.content).group(0).decode("utf-8")
print("%s - %s" % (Artist, Song))
Prints:
Private Life - Lost Boy
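Calling .group(0) directly fails with an AttributeError when the pattern does not match. A hedged variant of the same lookbehind approach is sketched below: extract_field is a hypothetical helper that returns None instead, and the payload is a made-up stand-in for the widget response that reuses the two values printed above. Unlike the original patterns, this one also requires the opening quote before the key.

```python
import re


def extract_field(payload, key):
    """Return the string value of the first "key":"value" pair, or None."""
    pattern = b'(?<="' + re.escape(key.encode("utf-8")) + b'":")(.*?)(?=")'
    match = re.search(pattern, payload)
    return match.group(0).decode("utf-8") if match else None


# Made-up payload for illustration; the real response is much larger.
sample = b'{"artist":"Private Life","title":"Lost Boy"}'

print(extract_field(sample, "artist"))
print(extract_field(sample, "title"))
print(extract_field(sample, "genre"))  # key not present
```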
Scraping #document from an iframe tag using beautifulsoup
The Guardian offers the entire data set as a .csv file, which you can spot by watching the network requests in the browser's developer tools.
Here's how to grab the data for Covid-19 global deaths:
import shutil
import requests
url = "https://interactive.guim.co.uk/2020/coronavirus-jh-timeline-data/time_series_covid19_deaths_global.csv"
data = requests.get(url, stream=True)
if data.status_code == 200:
    with open("covid19_data.csv", 'wb') as f:
        data.raw.decode_content = True
        shutil.copyfileobj(data.raw, f)
And if you swap the last part of the URL with time_series_covid19_confirmed_global.csv, that data set is what you're going to get back as a .csv file.
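Swapping the last part of the URL can be done in code rather than by hand. A small sketch, where swap_filename is a hypothetical helper name:

```python
deaths_url = (
    "https://interactive.guim.co.uk/2020/coronavirus-jh-timeline-data/"
    "time_series_covid19_deaths_global.csv"
)


def swap_filename(url, filename):
    """Replace everything after the last "/" in url with filename."""
    return url.rsplit("/", 1)[0] + "/" + filename


confirmed_url = swap_filename(deaths_url, "time_series_covid19_confirmed_global.csv")
print(confirmed_url)
```

The resulting URL can be passed to the same requests.get call as before.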
How to get html in python inside #document tag?
I usually use Selenium to handle these situations.
Basically, you have to switch into the iframe to get its content.
See this question.
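Switching into the iframe could look roughly like this. It is a sketch, not the linked answer's code: iframe_page_source is a hypothetical helper, driver is assumed to be an already started Selenium 4 WebDriver that has loaded the page, and the default selector simply picks the first iframe.

```python
def iframe_page_source(driver, css_selector="iframe"):
    """Return the HTML of the document inside the first matching iframe."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR
    frame = driver.find_element("css selector", css_selector)
    driver.switch_to.frame(frame)  # later lookups now target the iframe document
    try:
        return driver.page_source
    finally:
        # always go back to the top-level document
        driver.switch_to.default_content()
```

The returned string can be handed to BeautifulSoup(html, "html.parser") like any other page source.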