How to Pull Out CSS Attributes from Inline Styles with Beautifulsoup

How to pull out CSS attributes from inline styles with BeautifulSoup

You've got a couple options- quick and dirty or the Right Way. The quick and dirty way (which will break easily if the markup is changed) looks like

>>> from BeautifulSoup import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('<html><body><img style="background:url(/theRealImage.jpg) no-repate 0 0; height:90px; width:92px;") src="notTheRealImage.jpg"/></body></html>')
>>> style = soup.find('img')['style']
>>> urls = re.findall('url\((.*?)\)', style)
>>> urls
[u'/theRealImage.jpg']

Obviously, you'll have to play with that to get it to work with multiple img tags.

The Right Way, since I'd feel awful suggesting someone use regex on a CSS string :), uses a CSS parser. cssutils, a library I just found on Google and available on PyPi, looks like it might do the job.

Remove all inline styles using BeautifulSoup

You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:

for tag in soup():
for attribute in ["class", "id", "name", "style"]:
del tag[attribute]

Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():

[tag.decompose() for tag in soup("script")]

Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.

how to extract inline CSS style without style key in XHTML using beautiful soup

Use BeautifulSoup soup and css selector.

from bs4 import BeautifulSoup
html='''<p height="1em" width="0" align="justify"><span><i>
There’s the feather bed element here brother, ach! and not only that! There’s an attraction here—here you have the end of the world, an anchorage, a quiet haven, the navel of the earth, the three fishes that are the foundation of the world, the essence of pancakes, of savoury fish-pies, of the evening samovar, of soft sighs and warm shawls, and hot stoves to sleep on—as snug as though you were dead, and yet you’re alive—the advantages of both at once.
</i></span></p>'''
soup=BeautifulSoup(html,'html.parser')
print(soup.select_one('p[height]')['height'])
print(soup.select_one('p[width]')['width'])
print(soup.select_one('p[align]')['align'])

Output:

1em
0
justify

Using Beautiful Soup to convert CSS attributes to individual HTML attributes?

For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.

For example:

>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'

So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is more designed for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.

parse embedded css beautifulsoup

You can use a css parser like [cssutils][1]. I don't know if there is a function in the package itself to do something like this (can someone comment regarding this?), but i made a custom function to get it.

from bs4 import BeautifulSoup
import cssutils
html='''
<html>
<head>
<style type="text/css">
* {margin:0; padding:0; text-indent:0; }
.s5 {color: #000; font-family:Verdana, sans-serif;
font-style: normal; font-weight: normal;
text-decoration: none; font-size: 17.5pt;
vertical-align: 10pt;}
</style>
</head>

<body>
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span class="s5"> 1</span>
</p>
</body>
</html>
'''
def get_property(class_name,property_name):
for rule in sheet:
if rule.selectorText=='.'+class_name:
for property in rule.style:
if property.name==property_name:
return property.value
soup=BeautifulSoup(html,'html.parser')
sheet=cssutils.parseString(soup.find('style').text)
vl=get_property('s5','vertical-align')
print(vl)

Output

10pt

This is not perfect but maybe you can improve upon it.
[1]: https://pypi.org/project/cssutils/

Is there a way to extract CSS from a webpage using BeautifulSoup?

If your goal is to truly parse the css:

  • There are some various methods here: Prev Question w/ Answers
  • I also have used a nice example from this site: Python Code Article

Beautiful soup will pull the entire page - and it does include the header, styles, scripts, linked in css and js, etc. I have used the method in the pythonCodeArticle before and retested it for the link you provided.

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"

# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)

By looking at the soup output (It is very long, I will not paste here).. you can see it is a complete page. Just make sure to paste in your specific link

NOW If you wanted to parse the result to pick up all css urls.... you can add this: (I am still using parts of the code from the very well described python Code article link above)

# get the CSS files
css_files = []
for css in soup.find_all("link"):
if css.attrs.get("href"):
# if the link tag has the 'href' attribute
css_url = urljoin(url, css.attrs.get("href"))
css_files.append(css_url)
print(css_files)

The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.

NOTE:this particular site has a mix of styles inline with the html (i.e. they did not always use css to set the style properties... sometimes the styles are inside the html content.)

This should get you started.

Getting style attribute using BeautifulSoup

Just access the attribute using tag["attribute"]:

In [28]: soup = BeautifulSoup('<tr style="pretty"></tr>', 'html.parser')

In [29]: print(soup.find("tr")["style"])
pretty

If you only want the tr tags with style attributes an to get them all:

trs = s.find("table", class_="example-table").find_all("tr", style=True)

for tr in trs:
print(tr["style"])

Or using a css selector:

trs = s.select("table.example-table tr[style]")

for tr in trs:
print(tr["style"])

Using your actual url:

In [41]: r = requests.get("http://lol.esportswikis.com/wiki/G2_Esports/Match_History")

In [42]: s = BeautifulSoup(r.content, "lxml")

In [43]: trs = s.select("table.wikitable.sortable tr[style]")

In [44]:

In [44]: for tr in trs:
....: print(tr["style"])
....:
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE

BeautifulSoup: get css classes from html

BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.

Depending on your needs, there are several CSS parsers available for python; I'd pick cssutils (requires python 2.5 or up (including python 3)), it is the most complete in it's support, and supports inline styles too.

Other options are css-py and tinycss.

To grab and parse such all style sections (example with cssutils):

import cssutils
sheets = []
for styletag in tree.findAll('style', type='text/css')
if not styletag.string: # probably an external sheet
continue
sheets.append(cssutils.parseStyle(styletag.string))

With cssutil you can then combine these, resolve imports, and even have it fetch external stylesheets.

Remove height and width from inline styles

A full walk-through would be:

from bs4 import BeautifulSoup
import re

string = """
<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">
<p>Some line here</p>
<hr/>
<p>Some other beautiful text over here</p>
</div>
"""

# look for width or height, followed by not a ;
rx = re.compile(r'(?:width|height):[^;]+;?')

soup = BeautifulSoup(string, "html5lib")

for div in soup.findAll('div'):
div['style'] = rx.sub("", string)

As stated by others, using regular expressions on the actual value is not a problem.



Related Topics



Leave a reply



Submit