How to pull out CSS attributes from inline styles with BeautifulSoup
You've got a couple of options: quick and dirty, or the Right Way. The quick and dirty way (which will break easily if the markup changes) looks like this:
>>> from bs4 import BeautifulSoup
>>> import re
>>> soup = BeautifulSoup('<html><body><img style="background:url(/theRealImage.jpg) no-repeat 0 0; height:90px; width:92px;" src="notTheRealImage.jpg"/></body></html>', 'html.parser')
>>> style = soup.find('img')['style']
>>> urls = re.findall(r'url\((.*?)\)', style)
>>> urls
['/theRealImage.jpg']
Obviously, you'll have to play with that to get it to work with multiple img tags.
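One way to extend the quick approach to several img tags is to loop over every tag that carries a style attribute and collect the matches. This is only a sketch: the markup and file names below are made up for illustration, and the same regex caveats apply.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup: two styled images and one without a style attribute.
html = (
    '<html><body>'
    '<img style="background:url(/first.jpg) no-repeat 0 0;" src="a.jpg"/>'
    '<img style="background:url(/second.jpg) no-repeat 0 0;" src="b.jpg"/>'
    '<img src="c.jpg"/>'
    '</body></html>'
)

soup = BeautifulSoup(html, 'html.parser')
url_pattern = re.compile(r'url\((.*?)\)')

urls = []
# style=True keeps only the tags that actually have a style attribute,
# so img['style'] cannot raise a KeyError inside the loop.
for img in soup.find_all('img', style=True):
    urls.extend(url_pattern.findall(img['style']))

print(urls)  # ['/first.jpg', '/second.jpg']
```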
The Right Way, since I'd feel awful suggesting someone use regex on a CSS string :), uses a CSS parser. cssutils, a library I just found on Google that is available on PyPI, looks like it might do the job.
Remove all inline styles using BeautifulSoup
You don't need to parse any CSS if you just want to remove it all. BeautifulSoup provides a way to remove entire attributes like so:
for tag in soup():
    for attribute in ["class", "id", "name", "style"]:
        del tag[attribute]
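If only the inline styles should go while class and id stay, a narrower variant might look like the following. The sample markup is an assumption made for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with inline styles on nested tags.
html = '<div id="main" style="color:red"><p class="note" style="margin:0">Hi</p></div>'
soup = BeautifulSoup(html, 'html.parser')

# find_all(True) matches every tag; pop() with a default is a no-op
# on tags that have no style attribute.
for tag in soup.find_all(True):
    tag.attrs.pop('style', None)

print(soup)
```

The id and class attributes are left untouched, so stylesheet rules that target them still apply.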
Also, if you just want to delete entire tags (and their contents), you don't need extract(), which returns the tag. You just need decompose():
[tag.decompose() for tag in soup("script")]
Not a big difference, but just something else I found while looking at the docs. You can find more details about the API in the BeautifulSoup documentation, with many examples.
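To make the difference concrete, here is a small sketch that uses both calls on made-up markup: decompose() throws the tag away, while extract() removes it from the tree and hands it back.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration.
html = '<body><script>var a = 1;</script><p>Text</p><div id="ad">Ad</div></body>'
soup = BeautifulSoup(html, 'html.parser')

# decompose() destroys each script tag together with its contents.
for script in soup.find_all('script'):
    script.decompose()

# extract() also detaches the tag, but returns it so it can be reused.
removed = soup.find('div', id='ad').extract()

print(soup)     # <body><p>Text</p></body>
print(removed)  # <div id="ad">Ad</div>
```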
how to extract inline CSS style without style key in XHTML using beautiful soup
Use BeautifulSoup with a CSS selector:
from bs4 import BeautifulSoup
html='''<p height="1em" width="0" align="justify"><span><i>
There’s the feather bed element here brother, ach! and not only that! There’s an attraction here—here you have the end of the world, an anchorage, a quiet haven, the navel of the earth, the three fishes that are the foundation of the world, the essence of pancakes, of savoury fish-pies, of the evening samovar, of soft sighs and warm shawls, and hot stoves to sleep on—as snug as though you were dead, and yet you’re alive—the advantages of both at once.
</i></span></p>'''
soup=BeautifulSoup(html,'html.parser')
print(soup.select_one('p[height]')['height'])
print(soup.select_one('p[width]')['width'])
print(soup.select_one('p[align]')['align'])
Output:
1em
0
justify
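If several of these attributes are needed at once, reading them from a single tag lookup may be simpler than running one selector per attribute. A minimal sketch, with the paragraph text shortened for the example:

```python
from bs4 import BeautifulSoup

html = '<p height="1em" width="0" align="justify"><span><i>Sample text</i></span></p>'
soup = BeautifulSoup(html, 'html.parser')

p = soup.select_one('p')
wanted = ('height', 'width', 'align')

# Tag.get() returns None for a missing attribute, so absent names are skipped.
presentational = {name: p.get(name) for name in wanted if p.get(name) is not None}

print(presentational)  # {'height': '1em', 'width': '0', 'align': 'justify'}
```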
Using Beautiful Soup to convert CSS attributes to individual HTML attributes?
For this type of thing, I'd recommend an HTML parser (like BeautifulSoup or lxml) in conjunction with a specialized CSS parser. I've had success with the cssutils package. You'll have a much easier time than trying to come up with regular expressions to match any possible CSS you might find in the wild.
For example:
>>> import cssutils
>>> css = 'width:150px;height:50px;float:right;'
>>> s = cssutils.parseStyle(css)
>>> s.width
u'150px'
>>> s.height
u'50px'
>>> s.keys()
[u'width', u'height', u'float']
>>> s.cssText
u'width: 150px;\nheight: 50px;\nfloat: right'
>>> del s['width']
>>> s.cssText
u'height: 50px;\nfloat: right'
So, using this you can pretty easily extract and manipulate the CSS properties you want and plug them into the HTML directly with BeautifulSoup. Be a little careful of the newline characters that pop up in the cssText attribute, though. I think cssutils is designed more for formatting things as standalone CSS files, but it's flexible enough to mostly work for what you're doing here.
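The "plug them into the HTML" step could look roughly like this. To keep the sketch dependency-free it uses a naive splitter, parse_inline_style (a hypothetical helper written for this example, not part of cssutils or BeautifulSoup), in place of a real CSS parser; it will mishandle values that contain semicolons, such as data URLs, which is exactly where cssutils would be the safer choice.

```python
from bs4 import BeautifulSoup

def parse_inline_style(style):
    """Naively split 'a:b; c:d' into a dict. Not a real CSS parser."""
    declarations = {}
    for chunk in style.split(';'):
        if ':' not in chunk:
            continue  # skips empty chunks such as the one after a trailing ';'
        name, value = chunk.split(':', 1)
        declarations[name.strip().lower()] = value.strip()
    return declarations

# Hypothetical markup reusing the CSS string from the session above.
html = '<img src="logo.png" style="width:150px;height:50px;float:right;"/>'
soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img')

style = parse_inline_style(img['style'])
for name in ('width', 'height'):
    if name in style:
        # Move the property out of the style string into its own attribute.
        img[name] = style.pop(name)

# Rebuild what is left on a single line, which sidesteps the newline issue.
if style:
    img['style'] = '; '.join(f'{k}: {v}' for k, v in style.items())
else:
    del img['style']

print(img)
```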
parse embedded css beautifulsoup
You can use a CSS parser like [cssutils][1]. I don't know whether the package itself has a function for this (can someone comment on this?), but I wrote a custom function to do it.
from bs4 import BeautifulSoup
import cssutils
html='''
<html>
<head>
<style type="text/css">
* {margin:0; padding:0; text-indent:0; }
.s5 {color: #000; font-family:Verdana, sans-serif;
font-style: normal; font-weight: normal;
text-decoration: none; font-size: 17.5pt;
vertical-align: 10pt;}
</style>
</head>
<body>
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span class="s5"> 1</span>
</p>
</body>
</html>
'''
def get_property(class_name,property_name):
    for rule in sheet:
        if rule.selectorText=='.'+class_name:
            for property in rule.style:
                if property.name==property_name:
                    return property.value
soup=BeautifulSoup(html,'html.parser')
sheet=cssutils.parseString(soup.find('style').text)
vl=get_property('s5','vertical-align')
print(vl)
Output:
10pt
This is not perfect, but maybe you can improve upon it.
[1]: https://pypi.org/project/cssutils/
Is there a way to extract CSS from a webpage using BeautifulSoup?
If your goal is to truly parse the css:
- There are some various methods here: Prev Question w/ Answers
- I also have used a nice example from this site: Python Code Article
Fetching the page with requests returns the whole document, and BeautifulSoup parses all of it, including the head, styles, scripts, and linked CSS and JS. I have used the method from the Python Code article before, and I retested it with the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
If you look at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all the CSS URLs, you can add the following (still using parts of the code from the well-described Python Code article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
The output css_files will be a list of all css files. You can now go visit those separately and see the styles that are being imported.
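Note that link tags are also used for icons, canonical URLs, and similar, so the loop above can pick up non-CSS hrefs. A possible refinement is to keep only links whose rel includes "stylesheet". The sketch below runs on inline markup rather than a live request; base_url and the file names are placeholders.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/blog/post.html"  # placeholder page URL
html = '''
<html><head>
<link rel="stylesheet" href="/static/site.css">
<link rel="icon" href="/favicon.ico">
<link rel="stylesheet" href="theme.css">
</head><body></body></html>
'''

soup = BeautifulSoup(html, "html.parser")

css_files = []
for link in soup.find_all("link", href=True):
    # BeautifulSoup exposes rel as a list of values, e.g. ['stylesheet'].
    rel = link.get("rel") or []
    if "stylesheet" in rel:
        css_files.append(urljoin(base_url, link["href"]))

print(css_files)
# ['https://example.com/static/site.css', 'https://example.com/blog/theme.css']
```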
NOTE: this particular site has a mix of styles inline with the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content).
This should get you started.
Getting style attribute using BeautifulSoup
Just access the attribute using tag["attribute"]:
In [28]: soup = BeautifulSoup('<tr style="pretty"></tr>', 'html.parser')
In [29]: print(soup.find("tr")["style"])
pretty
If you only want the tr tags that have a style attribute, and want to get them all:
trs = s.find("table", class_="example-table").find_all("tr", style=True)
for tr in trs:
    print(tr["style"])
Or using a css selector:
trs = s.select("table.example-table tr[style]")
for tr in trs:
    print(tr["style"])
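For a version that runs without a network request, both approaches can be tried on a small inline table. The table contents here are invented for the example; only the class name example-table comes from the snippets above.

```python
from bs4 import BeautifulSoup

# Hypothetical table: the header row has no style attribute.
html = '''
<table class="example-table">
<tr><th>Result</th></tr>
<tr style="background-color:#C6EFCE"><td>Win</td></tr>
<tr style="background-color:#FFC7CE"><td>Loss</td></tr>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keyword filter: style=True matches any tr that has a style attribute.
table = soup.find('table', class_='example-table')
via_find_all = [tr['style'] for tr in table.find_all('tr', style=True)]

# Equivalent CSS selector.
via_select = [tr['style'] for tr in soup.select('table.example-table tr[style]')]

print(via_find_all)  # ['background-color:#C6EFCE', 'background-color:#FFC7CE']
print(via_find_all == via_select)  # True
```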
Using your actual url:
In [41]: r = requests.get("http://lol.esportswikis.com/wiki/G2_Esports/Match_History")
In [42]: s = BeautifulSoup(r.content, "lxml")
In [43]: trs = s.select("table.wikitable.sortable tr[style]")
In [44]: for tr in trs:
   ....:     print(tr["style"])
   ....:
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
background-color:#FFC7CE
background-color:#FFC7CE
background-color:#C6EFCE
BeautifulSoup: get css classes from html
BeautifulSoup itself doesn't parse CSS style declarations at all, but you can extract such sections then parse them with a dedicated CSS parser.
Depending on your needs, there are several CSS parsers available for Python; I'd pick cssutils (requires Python 2.5 or up, including Python 3). It is the most complete in its support, and it supports inline styles too.
Other options are css-py and tinycss.
To grab and parse all such style sections (example with cssutils):
import cssutils
sheets = []
for styletag in tree.findAll('style', type='text/css'):
    if not styletag.string:  # probably an external sheet
        continue
    # parseString handles a whole stylesheet; parseStyle only parses a bare declaration block
    sheets.append(cssutils.parseString(styletag.string))
With cssutils you can then combine these, resolve imports, and even have it fetch external stylesheets.
Remove height and width from inline styles
A full walk-through would be:
from bs4 import BeautifulSoup
import re
string = """
<div id="attachment_9565" class="wp-caption aligncenter" style="width: 2010px;background-color:red">
<p>Some line here</p>
<hr/>
<p>Some other beautiful text over here</p>
</div>
"""
# look for width or height, followed by not a ;
rx = re.compile(r'(?:width|height):[^;]+;?')
soup = BeautifulSoup(string, "html5lib")
for div in soup.findAll('div', style=True):
    # substitute within the div's own style value, not the whole HTML string
    div['style'] = rx.sub("", div['style'])
As stated by others, using regular expressions on the actual value is not a problem.
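One caveat with a bare width|height pattern is that it also matches inside longer property names such as max-width or line-height. A slightly stricter variant, sketched below on made-up markup, adds a lookbehind so that only the standalone properties are removed.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup mixing width/height with similarly named properties.
html = ('<div style="width: 2010px; max-width: 100%; line-height: 1.5; '
        'height: 40px; color: red">x</div>')

# (?<![\w-]) rejects matches preceded by a letter, digit, underscore or hyphen,
# so "max-width" and "line-height" are left alone.
rx = re.compile(r'(?<![\w-])(?:width|height)\s*:[^;]*;?\s*')

soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div', style=True):
    cleaned = rx.sub('', div['style']).strip()
    if cleaned:
        div['style'] = cleaned
    else:
        # Nothing left: drop the now-empty attribute entirely.
        del div['style']

print(soup.div['style'])  # max-width: 100%; line-height: 1.5; color: red
```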