python how to parse css file as key value
I would suggest using the cssutils module.
import cssutils
from pprint import pprint
css = u'''
body, html { color: blue }
h1, h2 { font-size: 1.5em; color: red}
h3, h4, h5 { font-size: small; }
'''
dct = {}
sheet = cssutils.parseString(css)
for rule in sheet:
    selector = rule.selectorText
    styles = rule.style.cssText
    dct[selector] = styles
pprint(dct)
Output:
{u'body, html': u'color: blue',
u'h1, h2': u'font-size: 1.5em;\ncolor: red',
u'h3, h4, h5': u'font-size: small'}
In your question you asked for a key/value representation. But if you want to access the individual selectors or properties, use rule.selectorList for the selectors and iterate over rule.style for the properties:
for property in rule.style:
    name = property.name
    value = property.value
Parse CSS in Python
Here's where I landed. I used BadKarma's strategy of cracking the string with a split.
from bs4 import BeautifulSoup
import re
class RichText(BeautifulSoup):
    """
    subclass BeautifulSoup
    add behavior for generating selectors and declaration_blocks from <style>
    """
    def __init__(self, html_page):
        super().__init__(html_page)

    @property
    def rules_as_str(self):
        return str(self.style.string)

    def rules(self):
        # raw string so that \. reaches the regex engine as an escaped dot
        split_rules = re.split(r'(\.c[0-9]*)', self.rules_as_str)
        # side effect of split, first element is null
        assert(split_rules[0] == '')
        # enforce that it MUST be null, then pass over it
        for i in range(1, len(split_rules), 2):
            yield (split_rules[i].strip(), split_rules[i+1].strip())

if __name__ == '__main__':
    with open('rich-text.html', 'r') as f:
        html_file = f.read()
    rich_text = RichText(html_file)
    for selector, declaration_block in rich_text.rules():
        print(selector)
        print(declaration_block)
>>> with open("test.py") as f:
... code = compile(f.read(), "test.py", 'exec')
... exec(code)
...
.c0
{ padding: 1px 0px 0px; font-size: 11px }
.c1
{ margin: 0px; font-size: 11px }
.c2
{ font-size: 11px }
.c3
{ font-size: 11px; font-style: italic; font-weight: bold }
>>>
parse embedded css beautifulsoup
You can use a CSS parser like [cssutils][1]. I don't know whether the package itself has a function for this (can someone comment on that?), but I wrote a custom function to do it.
from bs4 import BeautifulSoup
import cssutils
html='''
<html>
<head>
<style type="text/css">
* {margin:0; padding:0; text-indent:0; }
.s5 {color: #000; font-family:Verdana, sans-serif;
font-style: normal; font-weight: normal;
text-decoration: none; font-size: 17.5pt;
vertical-align: 10pt;}
</style>
</head>
<body>
<p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
This is a sample sentence. <span class="s5"> 1</span>
</p>
</body>
</html>
'''
def get_property(class_name,property_name):
    for rule in sheet:
        if rule.selectorText=='.'+class_name:
            for property in rule.style:
                if property.name==property_name:
                    return property.value
soup=BeautifulSoup(html,'html.parser')
sheet=cssutils.parseString(soup.find('style').text)
vl=get_property('s5','vertical-align')
print(vl)
Output
10pt
This is not perfect but maybe you can improve upon it.
[1]: https://pypi.org/project/cssutils/
Python CSS Parser
I solved this problem with regular expressions, so I more or less ended up writing my own parser. I built a regular expression that searches for color patterns such as #XXX, #XXXXXX, rgb(X,X,X), and hsl(X,X,X) in the CSS file, and kept a list of the positions where they occur. Then I rewrote the colors at the positions in that list. This is the best summary I can give of what I did. Please add a comment if you need a more detailed explanation. Thank you.
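Since the answer gives no code, here is a rough sketch of what such an approach could look like using only the standard library. The pattern, the function names find_colors and replace_colors, and the transform callback are all my assumptions, not the original author's code. Note that a regex like this can also match an ID selector that happens to look like a hex color (for example #fab), so it is not a substitute for a real parser.

```python
import re

# Assumed pattern: 3- or 6-digit hex colors, plus rgb()/hsl() functional notation.
COLOR_RE = re.compile(
    r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b"
    r"|\b(?:rgb|hsl)\(\s*[^)]*\)"
)

def find_colors(css_text):
    """Return a list of (start, end, text) for every color literal found."""
    return [(m.start(), m.end(), m.group(0)) for m in COLOR_RE.finditer(css_text)]

def replace_colors(css_text, transform):
    """Rebuild the CSS, passing each color through transform (a hypothetical callback)."""
    pieces = []
    last = 0
    for start, end, text in find_colors(css_text):
        pieces.append(css_text[last:start])  # untouched text before the color
        pieces.append(transform(text))       # rewritten color
        last = end
    pieces.append(css_text[last:])           # trailing text after the last color
    return "".join(pieces)

sample_css = (
    "a { color: #fff; background: rgb(0, 128, 255); } "
    "b { color: #A1B2C3; border-color: hsl(120, 50%, 25%); }"
)
colors = find_colors(sample_css)
upper = replace_colors(sample_css, str.upper)
print(colors)
print(upper)
```

Keeping the positions in a list, as described above, means the rewrite step never has to search the text a second time.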
Is there a way to extract CSS from a webpage using BeautifulSoup?
If your goal is to truly parse the css:
- There are various methods here: Prev Question w/ Answers
- I have also used a nice example from this site: Python Code Article
Beautiful Soup will parse the entire page, including the header, styles, scripts, and the tags that link in external CSS and JS. I have used the method from the Python Code article before and retested it against the link you provided.
import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin
# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"
# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"
# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)
Looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.
Now, if you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the well-described Python Code article linked above):
# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)
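The output css_files will be a list of the URLs from every link tag that has an href. That covers the CSS files, but it can also include other linked resources such as icons. You can now visit the CSS URLs separately and see the styles that are being imported.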
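Because the loop above collects the href of every link tag, you may want to keep only the ones whose rel attribute marks them as stylesheets. Here is a small sketch of that filter; the HTML snippet, base_url, and the name stylesheet_urls are made up for illustration, and it runs without any network access.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://example.com/articles/"  # hypothetical page URL
html = """
<head>
<link rel="icon" href="/favicon.ico">
<link rel="stylesheet" href="css/site.css">
<link rel="stylesheet" href="https://cdn.example.com/theme.css">
<link rel="stylesheet">
</head>
"""

soup = BeautifulSoup(html, "html.parser")
stylesheet_urls = []
for link in soup.find_all("link"):
    rel = link.get("rel") or []   # Beautiful Soup returns rel as a list of values
    href = link.get("href")
    # keep only stylesheet links that actually carry an href
    if "stylesheet" in rel and href:
        stylesheet_urls.append(urljoin(base_url, href))

print(stylesheet_urls)
```

Relative hrefs are resolved against base_url by urljoin, the same way the loop above resolves them against url.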
NOTE: this particular site mixes inline styles into the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content).
This should get you started.
Parse rendered HTML for CSS property value in Python
index.html:
<!DOCTYPE html>
<html>
<head>
<!-- For html5 (default is UTF-8) -->
<meta charset="UTF-8">
<title>Phantom JS Example</title>
<style>
img#red {
position: absolute;
left: 100px;
top: 100px;
z-index: 5; /* z-index set by CSS */
}
img#black {
position: absolute;
left: 100px;
top: 100px;
z-index: 2; /* z-index set by CSS */
}
</style>
</head>
<body>
<div>Hello</div>
<img src="4row_red.png" id="red" width="40" height="40">
<img src="4row_black.png" id="black" width="40" height="40">
<script>
window.onload = function() {
var red = document.getElementById('red');
red.style.zIndex = "0"; // z-index set by JavaScript
};
</script>
</body>
</html>
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550) #For bug
driver.get("http://localhost:8000")
png = driver.find_element_by_id('red')
#print(png.css_value('zIndex')) <--AttributeError: 'WebElement' object has no attribute 'css_value'
print("{} id={}-> {}".format(
    png.tag_name,
    png.get_attribute('id'),
    png.value_of_css_property('zIndex')
))
#print(png.style('zIndex')) <--AttributeError: 'WebElement' object has no attribute 'style'
print("get_attribute('zIndex') -> {}".format(
    png.get_attribute('zIndex')
))
print('-' * 20)
imgs = driver.find_elements_by_tag_name('img')
for img in imgs:
    print("{} id={}-> {}".format(
        img.tag_name,
        img.get_attribute('id'),
        img.value_of_css_property('zIndex')
    ))
print("get_attribute('zIndex') -> {}".format(
    imgs[-1].get_attribute('zIndex')
))
print('-' * 20)
all_tags = driver.find_elements_by_tag_name('*')
for tag in all_tags:
    print("{} --> {}".format(
        tag.tag_name,
        tag.value_of_css_property('zIndex')
    ))
driver.quit()
--output:--
img id=red-> 1 #Found the z-index set by the js.
get_attribute('zIndex') -> None #Didn't find the z-index set by the js
--------------------
img id=red-> 1
img id=black-> 3 #Found the z-index set by the css stylesheet
get_attribute('zIndex') -> None #Didn't find the z-index set by the css stylesheet
--------------------
html --> 0
head --> auto
meta --> auto
title --> auto
style --> auto
body --> auto
div --> auto
img --> 1
img --> 3
script --> auto
Setup:
$ pip install selenium
$ brew install phantomjs
https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/
value_of_css_property(property_name):
Returns the value of a CSS property
get_attribute(name):
Gets the given attribute or property of the element.
This method will return the value of the given property if this is set, otherwise it returns the value of the attribute with the same name if that exists, or None.
Values which are considered truthy, that is equals “true” or “false”, are returned as booleans. All other non-None values are returned as strings. For attributes or properties which does not exist, None is returned.
Args:
name - Name of the attribute/property to retrieve.
http://selenium-python.readthedocs.org/en/latest/api.html#selenium.webdriver.remote.webelement.WebElement.value_of_css_property
Not very good docs.
Parse CSS for url() values with Python 2.7
cssutils.getUrls