Using Beautiful Soup to Convert CSS Attributes to Individual HTML Attributes

Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.

For example:

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
'150px'
>>> s.height
'50px'
>>> s.keys()
['width', 'height', 'float']
>>> s.cssText
'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
'height: 50px;\nfloat: right'

So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.

How to pull out CSS attributes from inline styles with BeautifulSoup

You've got a couple of options: quick and dirty, or the Right Way. The quick and dirty way (which will break easily if the markup changes) looks like this:

>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('<html><body><img style="background:url(/theRealImage.jpg) no-repeat 0 0; height:90px; width:92px;" src="notTheRealImage.jpg"/></body></html>', 'html.parser')
>>> style = soup.find('img')['style']
>>> urls = re.findall(r'url\((.*?)\)', style)
>>> urls
['/theRealImage.jpg']

Obviously, you'll have to play with that to get it to work with multiple img tags.
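For what it's worth, here is one way the quick and dirty approach might be stretched over several img tags. It is only a sketch: the markup is made up, and the regex is still as fragile as the one above (it will not cope with nested parentheses, for instance).

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with several images; the second has no background URL.
html = """
<img style="background:url(/first.jpg) no-repeat 0 0; height:90px;" src="a.jpg"/>
<img style="height:10px;" src="b.jpg"/>
<img style="background:url('/third.png')" src="c.jpg"/>
"""

URL_RE = re.compile(r"url\((.*?)\)")

soup = BeautifulSoup(html, "html.parser")
urls = []
for img in soup.find_all("img", style=True):   # only tags that carry a style attribute
    for match in URL_RE.findall(img["style"]):
        urls.append(match.strip("'\" "))        # drop optional quotes around the URL

print(urls)  # ['/first.jpg', '/third.png']
```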

The Right Way, since I'd feel awful suggesting someone use regex on a CSS string :), uses a CSS parser. cssutils, a library I just found on Google that is available on PyPI, looks like it might do the job.

BeautifulSoup: get css classes from html

BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for Python. I'd pick cssutils (requires Python 2.5 or later, including Python 3); it is the most complete in its support, and it handles inline styles too.

Other options are css-py and tinycss.

To grab and parse all such style sections (example with cssutils, where tree is your parsed BeautifulSoup document):

import cssutils

sheets = []
for styletag in tree.find_all('style', type='text/css'):
    if not styletag.string:  # empty tag, probably an external sheet
        continue
    # a <style> tag holds a whole stylesheet, so use parseString rather than parseStyle
    sheets.append(cssutils.parseString(styletag.string))

With cssutils you can then combine these, resolve imports, and even have it fetch external stylesheets.

parse embedded css beautifulsoup

You can use a CSS parser like [cssutils][1]. I don't know whether the package itself has a function for this (can someone comment on that?), but I wrote a custom function to do it.

from bs4 import BeautifulSoup
import cssutils
html='''
<html>
<head>
<style type="text/css">
* {margin:0; padding:0; text-indent:0; }
.s5 {color: #000; font-family:Verdana, sans-serif;
font-style: normal; font-weight: normal;
text-decoration: none; font-size: 17.5pt;
vertical-align: 10pt;}
</style>
</head>

<body>
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span class="s5"> 1</span>
</p>
</body>
</html>
'''
def get_property(class_name, property_name):
    for rule in sheet:
        if rule.selectorText == '.' + class_name:
            for property in rule.style:
                if property.name == property_name:
                    return property.value

soup = BeautifulSoup(html, 'html.parser')
sheet = cssutils.parseString(soup.find('style').text)
vl = get_property('s5', 'vertical-align')
print(vl)

Output

10pt

This is not perfect but maybe you can improve upon it.
[1]: https://pypi.org/project/cssutils/

Custom attributes in BeautifulSoup?

Use a CSS selector to get that.

from bs4 import BeautifulSoup
html = '''
<div data-asin="099655596X" data-index="1" class="sg-col-20-of-24 s-result-item sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-cel widget="search_result_1">
'''
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('div[data-asin="099655596X"]')
for item in items:
    print(item['data-asin'])

Output:

099655596X

Or, to match any data-asin value that ends with "X":

from bs4 import BeautifulSoup
html = '''
<div data-asin="099655596X" data-index="1" class="sg-col-20-of-24 s-result-item sg-col-0-of-12 sg-col-28-of-32 sg-col-16-of-20 sg-col sg-col-32-of-36 sg-col-12-of-16 sg-col-24-of-28" data-cel widget="search_result_1">
'''
soup = BeautifulSoup(html, 'html.parser')
items = soup.select('div[data-asin$="X"]')
for item in items:
    print(item['data-asin'])

Python BeautifulSoup get attribute values from any element containing an attribute

If you have bs4 4.7.1 or above, you can use the following CSS attribute selector, which matches every element that has a src attribute, whatever the tag:

for item in soup.select('[src]'):
    print(item)
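To get at the attribute values themselves rather than the whole elements, something along these lines should work. The markup is a made-up sample; the second selector is just one more attribute operator from the same family.

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing tags with and without a src attribute.
html = """
<img src="/a.png" alt="A">
<script src="/app.js"></script>
<a href="/home">Home</a>
<iframe src="https://example.com/embed"></iframe>
"""

soup = BeautifulSoup(html, "html.parser")

# Every element that has a src attribute, in document order.
sources = [(tag.name, tag["src"]) for tag in soup.select("[src]")]
print(sources)
# [('img', '/a.png'), ('script', '/app.js'), ('iframe', 'https://example.com/embed')]

# Only the values that start with "https://".
absolute = [tag["src"] for tag in soup.select('[src^="https://"]')]
print(absolute)  # ['https://example.com/embed']
```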

Python - beautifulsoup changes attribute positioning

From the documentation, you can use a custom HTMLFormatter:

from bs4 import BeautifulSoup
from bs4.formatter import HTMLFormatter

txt = '''<link rel="stylesheet" href="assets/css/fontawesome-min.css">
<link rel="stylesheet" href="assets/css/bootstrap.min.css">
<link rel="stylesheet" href="assets/css/xsIcon.css">'''

class UnsortedAttributes(HTMLFormatter):
    def attributes(self, tag):
        for k, v in tag.attrs.items():
            yield k, v

soup = BeautifulSoup(txt, 'html.parser')

#before HTMLFormatter
print( soup )

print('-' * 80)

#after HTMLFormatter
print( soup.encode(formatter=UnsortedAttributes()).decode('utf-8') )

Prints:

<link href="assets/css/fontawesome-min.css" rel="stylesheet"/>
<link href="assets/css/bootstrap.min.css" rel="stylesheet"/>
<link href="assets/css/xsIcon.css" rel="stylesheet"/>
--------------------------------------------------------------------------------
<link rel="stylesheet" href="assets/css/fontawesome-min.css"/>
<link rel="stylesheet" href="assets/css/bootstrap.min.css"/>
<link rel="stylesheet" href="assets/css/xsIcon.css"/>

Is there a way to extract CSS from a webpage using BeautifulSoup?

If your goal is to truly parse the CSS:

  • There are various methods here: Prev Question w/ Answers
  • I have also used a nice example from this site: Python Code Article

requests downloads the entire page and Beautiful Soup parses all of it, including the header, styles, scripts, and linked CSS and JS. I have used the method from the Python Code article before and retested it against the link you provided.

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"

# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)

Looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.

If you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the Python Code article mentioned above):

# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)

The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.
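One caveat: the loop above keeps the href of every link tag, so favicons and similar entries end up in the list too. A possible refinement, sketched here on a made-up page and base URL so it runs without a network request, is to keep only links whose rel includes "stylesheet":

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical base URL and head section, used instead of a live request.
base_url = "https://example.com/blog/post.html"
html = """
<head>
<link rel="icon" href="/favicon.ico">
<link rel="stylesheet" href="css/main.css">
<link rel="alternate stylesheet" href="/css/dark.css">
<link rel="stylesheet">
</head>
"""

soup = BeautifulSoup(html, "html.parser")

css_files = []
for link in soup.find_all("link", href=True):    # skip link tags without an href
    rel = link.get("rel") or []                  # bs4 returns rel as a list of tokens
    if "stylesheet" in [token.lower() for token in rel]:
        css_files.append(urljoin(base_url, link["href"]))

print(css_files)
# ['https://example.com/blog/css/main.css', 'https://example.com/css/dark.css']
```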

NOTE: this particular site has a mix of styles inline with the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content).
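If you also want those in-page styles, a rough sketch could look like the following. It collects the text of every style tag and the style attribute of every element that has one; the sample page is invented for illustration, and the collected strings would still need a CSS parser if you want individual properties.

```python
from bs4 import BeautifulSoup

# Hypothetical page with one embedded stylesheet and two inline styles.
html = """
<html><head><style>.cke{visibility:hidden;}</style></head>
<body>
<div style="color:#ffffff;" id="box"><p style="text-align:center">Hi</p></div>
<p>No style here</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Contents of <style> blocks (empty tags are skipped).
embedded = [tag.string for tag in soup.find_all("style") if tag.string]

# (tag name, style attribute) for every element styled inline.
inline = [(tag.name, tag["style"]) for tag in soup.find_all(style=True)]

print(embedded)  # ['.cke{visibility:hidden;}']
print(inline)    # [('div', 'color:#ffffff;'), ('p', 'text-align:center')]
```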

This should get you started.

How to get the full html with Beautiful Soup?

The page is rendered with JavaScript, so you need to use Selenium for it.

Code:

from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

url = 'http://opendatadpc.maps.arcgis.com/apps/opsdashboard/index.html#/b0c68bce2cce478eaac82fe38d4138b1'

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.set_window_size(1024, 600)
driver.maximize_window()
driver.get(url)
sleep(10)  # <--- waits for 10 seconds so that the page can get rendered
# action = webdriver.ActionChains(driver)
print(driver.page_source)  # <--- this will give you the source code

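You can execute JavaScript using:

driver.execute_script(script, *args)

Instead of a fixed sleep, you can create an explicit wait like this (By, WebDriverWait and EC are imported from selenium.webdriver.common.by, selenium.webdriver.support.ui and selenium.webdriver.support.expected_conditions respectively):

# Waits up to 10 seconds until the element is located. Other wait conditions
# exist, such as visibility_of_element_located or text_to_be_present_in_element.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

Output of print(driver.page_source) (truncated):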

<html dir="ltr" class="en-gb en dj_webkit dj_chrome dj_contentbox"><head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<title>COVID-19 ITALIA - Desktop</title>
<meta name="description" content="">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link rel="icon" href="assets/images/favicon.ico?" type="image/x-icon">
<link href="https://js.arcgis.com/3.32/dijit/themes/claro/claro.css" rel="stylesheet" type="text/css">
<link href="https://js.arcgis.com/3.32/esri/css/esri.css" rel="stylesheet" type="text/css">
<link rel="stylesheet" href="assets/vendor-ff6a5e0c0264e398e1ffaeb015926635.css">
<link rel="stylesheet" href="assets/app-dark-a8116e0262a64a5113c183f5acb0a03b.css">
<script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/nls/jsapi_en-gb.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/ColorPicker.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/ColorPicker/HexPalette.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/DateTextBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/TimeTextBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/color.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/Legend.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/Scalebar.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/BasemapGallery.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/LayerList.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/Search.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/locator.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/toolbars/draw.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/plugins/FeatureLayerStatistics.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/geometry/geometryEngineAsync.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/geometry/geometryEngine.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojo/fx/easing.js"></script><script type="text/javascript" 
charset="utf-8" src="https://js.arcgis.com/3.32/esri/arcgis/Portal.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/styles/colors.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/moment/locale/en-gb.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/gfx/svg.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/Calendar.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/_DateTimeTextBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/_Tooltip.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/ColorPicker/colorUtil.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/dijit/HorizontalSlider.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/RadioButton.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/_TimePicker.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/color/_base.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/layers/VectorTileLayerImpl.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/AddressCandidate.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/CalendarLite.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/RangeBoundTextBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/toolbars/_toolbar.js"></script><script type="text/javascript" charset="utf-8" 
src="https://js.arcgis.com/3.32/esri/workers/WorkerClient.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/styles/basic.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/GenerateRendererTask.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/UniqueValueDefinition.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/ClassBreaksDefinition.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/GenerateRendererParameters.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/generateRenderer.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/ProjectParameters.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/workers/heatmapCalculator.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/gfx/filters.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/gfx/svgext.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/HorizontalRuleLabels.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/HorizontalSlider.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/CheckBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/_RadioButtonMixin.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/_ListMouseMixin.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojox/main.js"></script><script type="text/javascript" 
charset="utf-8" src="https://js.arcgis.com/3.32/dojo/colors.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/layers/nls/VectorTileLayerImpl_en-gb.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/MappedTextBox.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/ClassificationDefinition.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/HorizontalRule.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojo/dnd/move.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/_ListBase.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dijit/form/_CheckBoxMixin.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/dojo/selector/lite.js"></script><script type="text/javascript" charset="utf-8" src="assets/vendor-557b494b34c1b4f592d5f2948d530f35.js"></script><script type="text/javascript" charset="utf-8" src="assets/nickel-122f2be932fe8e42c7401c4190951f4c.js"></script><script type="text/javascript" charset="utf-8" src="assets/moment-timezone-with-data.min-f71eb5eba513b3ab182b567941a82ef5.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/layers/LabelLayer.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/support/pbfDeps.js"></script><script type="text/javascript" charset="utf-8" src="https://js.arcgis.com/3.32/esri/tasks/support/nls/pbfDeps_en-gb.js"></script><script src="assets/amd-config-7e9801fc9c916a27bb75c6f356e09e0d.js"></script>
<style>.cke{visibility:hidden;}</style></head>

<body class="claro ember-application">
<script src="https://js.arcgis.com/3.32/init.js" data-amd="true"></script>
<script src="assets/amd-loading-d8029d0343fa400ebae9865c42984750.js" data-amd-loading="true"></script>


<!---->
<div id="ember6" class="dashboard-page flex-vertical full panel panel-no-border panel-no-padding position-relative ember-view">
<!---->
<!---->


<!---->
<div style="color:#ffffff;" id="ember8" class="flex-fluid flex-vertical overflow-hidden dashboard-container ember-view">
<div id="ember9" class="flex-fix panel-container flex-vertical top-panel-container ember-view"><div class="margin-container" style="">
<!---->
<div class="full-container">
<div style="" id="ember10" class="header-panel flex-horizontal large ember-view"> <div class="flex-fix flex-align-center margin-left-1">
<a target="_blank" class="logo-img-btn no-pointer-events">
<img src="http://opendatadpc.maps.arcgis.com/sharing/rest/content/items/d97ea2b03e824d5ca261998c15204745/data">
</a>
</div>

<div class="flex-fix flex-align-center allow-shrink margin-left-1 flex-vertical">
<div class="title no-pointer-events text-ellipsis">Dipartimento della Protezione Civile</div>
<div class="subtitle text-ellipsis no-pointer-events">Aggiornamento casi COVID-19</div>
</div>

<div class="selectors-container flex-fluid flex-align-center flex-horizontal flex-justify-end">
<!----></div>

<div id="ember11" class="margin-left-1 flex-fix flex-align-center menu-links dropdown ember-view"><button aria-expanded="false" aria-haspopup="true" tabindex="0" id="ember12" class="btn btn-large dropdown-btn ember-view"> <span id="ember13" class="icon-element ember-view"><svg xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://www.w3.org/2000/svg" height="24px" width="24px" viewBox="0 0 24 24" id="ember14" class="ember-view"><path d="M21 6H3V4h18zm0 5H3v2h18zm0 7H3v2h18z"></path></svg></span>

</button>
<nav role="menu" id="ember15" class="dropdown-right dropdown-menu ember-view">
<!---->
<a target="_blank" href="http://www.governo.it/" role="menu-item" id="ember17" class="dropdown-link dropdown-menu-item ember-view"> <div class="flex-horizontal flex-align-items-center">
<!----> <div class="flex-fluid text-ellipsis ">Presidenza del Consiglio dei Ministri</div>
<!----> </div>

</a>
<a target="_blank" href="http://www.protezionecivile.it" role="menu-item" id="ember19" class="dropdown-link dropdown-menu-item ember-view"> <div class="flex-horizontal flex-align-items-center">
<!----> <div class="flex-fluid text-ellipsis ">Dipartimento della Protezione Civile</div>
<!----> </div>

</a>
<a target="_blank" href="http://www.salute.gov.it" role="menu-item" id="ember21" class="dropdown-link dropdown-menu-item ember-view"> <div class="flex-horizontal flex-align-items-center">
<!----> <div class="flex-fluid text-ellipsis ">Ministero della Salute</div>
<!----> </div>

</a>
<a target="_blank" href="http://arcg.is/081a51" role="menu-item" id="ember23" class="dropdown-link dropdown-menu-item ember-view"> <div class="flex-horizontal flex-align-items-center">
<!----> <div class="flex-fluid text-ellipsis ">Versione Mobile</div>
<!----> </div>

</a>
<a target="_blank" href="https://github.com/pcm-dpc/COVID-19" role="menu-item" id="ember25" class="dropdown-link dropdown-menu-item ember-view"> <div class="flex-horizontal flex-align-items-center">
<!----> <div class="flex-fluid text-ellipsis ">Repository dei dati</div>
<!----> </div>

</a>

<!---->
</nav>
</div></div>

</div>

<!---->
<!----></div>
</div>
<div class="flex-fluid flex-horizontal position-relative overflow-hidden">

<div id="ember26" class="flex-fluid panel-container flex-vertical left-panel-container slide-over ember-view"><div class="margin-container" style="">
<!---->
<div class="full-container">
<div id="ember27" class="full-height left-panel flex-vertical ember-view"> <div class="caption margin-right-1 flex-fix">
<table border="0" cellpadding="1" cellspacing="1" style="width:100%">
<tbody>
<tr>
<td style="text-align:center"><img alt="" src="http://opendatadpc.maps.arcgis.com/sharing/rest/content/items/b5176eff01df4ff798be038b1dabb09a/data" style="width:200px"></td>
</tr>
</tbody>
</table>

<p style="text-align:center"><span style="font-size:14px"><strong>Informazioni</strong></span></p>

<p style="text-align:center"> </p>

</div>

<div class="selectors-container flex-fluid flex-vertical overflow-y-auto">
<!----></div>

<div class="flex-fix description">
<p><span style="color:#ffffff"><span style="font-size:14px">Il 31 gennaio 2020, il Consiglio dei Ministri dichiara lo stato di emergenza, per la durata di sei mesi, in conseguenza del rischio sanitario connesso all'infezione da Coronavirus.</span></span></p>

<p><span style="color:#ffffff"><span style="font-size:14px">Al Capo del Dipartimento della Protezione Civile, Angelo Borrelli, è affidato il coordinamento degli interventi necessari a fronteggiare l'emergenza sul territorio nazionale.</span></span></p>
.
.
.
.

How to scrape a single element out of 2 elements having same set of attributes and same hierarchy in html source code (using python's beautiful soup)

One approach is to iterate over all the children of <p class="sort-num_votes-visible">: if you find a <span name="nv"> that's surrounded by a <span class="text-muted"> and a <span class="ghost">, then it must be the span you're looking for. This of course assumes that the structure of this snippet of HTML is always the same. If one of those spans could be missing, this method obviously fails.

If it's guaranteed that those two spans are always there and in that exact order, you could do something like this (your souped HTML is in html_soup):

votes = html_soup.find("p", {"class": "sort-num_votes-visible"}).find_all("span", {"name": "nv"})[0]

EDIT:

According to your comment, you could do the following to parse the votes for multiple movies:

for p in html_soup.find_all("p", {"class": "sort-num_votes-visible"}):
    votes = p.find_all("span", {"name": "nv"})[0]
    # put whatever code you need for each of your movies here
    ...
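If you would rather not depend on the order of the spans, an alternative sketch is to look up the label span first and then take the name="nv" span that follows it. The markup below is a guess at the structure described in the question (label spans with class text-muted reading "Votes:" and "Gross:"), and labelled_value is a hypothetical helper, so adjust both to the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the structure described above.
html = """
<div class="lister-item">
  <p class="sort-num_votes-visible">
    <span class="text-muted">Votes:</span>
    <span name="nv" data-value="2345678">2,345,678</span>
    <span class="ghost">|</span>
    <span class="text-muted">Gross:</span>
    <span name="nv" data-value="28,340,000">$28.34M</span>
  </p>
</div>
<div class="lister-item">
  <p class="sort-num_votes-visible">
    <span class="text-muted">Votes:</span>
    <span name="nv" data-value="1000">1,000</span>
  </p>
</div>
"""

html_soup = BeautifulSoup(html, "html.parser")

def labelled_value(p, label):
    """Return the text of the name="nv" span that follows the given label span."""
    label_span = p.find("span", class_="text-muted", string=label)
    if label_span is None:
        return None
    value_span = label_span.find_next_sibling("span", attrs={"name": "nv"})
    return value_span.get_text(strip=True) if value_span else None

rows = [
    (labelled_value(p, "Votes:"), labelled_value(p, "Gross:"))
    for p in html_soup.find_all("p", class_="sort-num_votes-visible")
]
print(rows)  # [('2,345,678', '$28.34M'), ('1,000', None)]
```

A movie without a "Gross:" entry simply yields None for that field instead of picking up the wrong span.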

