Python How to Parse CSS File as Key Value

I would suggest using the cssutils module.

import cssutils
from pprint import pprint

css = u'''
body, html { color: blue }
h1, h2 { font-size: 1.5em; color: red}
h3, h4, h5 { font-size: small; }
'''

dct = {}
sheet = cssutils.parseString(css)

for rule in sheet:
    selector = rule.selectorText
    styles = rule.style.cssText
    dct[selector] = styles

pprint(dct)

Output:

{u'body, html': u'color: blue',
 u'h1, h2': u'font-size: 1.5em;\ncolor: red',
 u'h3, h4, h5': u'font-size: small'}

In your question you asked for a key/value representation. If you also want to access the individual selectors or properties, iterate over rule.selectorList for the selectors and over rule.style for the properties:

for property in rule.style:
    name = property.name    
    value = property.value

Parse CSS in Python

Here's where I landed. I used BadKarma's strategy of cracking the string apart with a split.

from bs4 import BeautifulSoup
import re

class RichText(BeautifulSoup):
    """
    subclass BeautifulSoup
    add behavior for generating selectors and declaration_blocks from <style>
    """

    def __init__(self, html_page):
        super().__init__(html_page, 'html.parser')

    @property
    def rules_as_str(self):
        return str(self.style.string)

    def rules(self):
        # the capture group makes re.split keep each selector in the result
        split_rules = re.split(r'(\.c[0-9]*)', self.rules_as_str)
        # side effect of split: the first element is an empty string
        assert split_rules[0] == ''
        # enforce that it MUST be empty, then pass over it
        for i in range(1, len(split_rules), 2):
            yield (split_rules[i].strip(), split_rules[i+1].strip())

if __name__ == '__main__':

    with open('rich-text.html', 'r') as f:
        html_file = f.read()

    rich_text = RichText(html_file)
    for selector, declaration_block in rich_text.rules():
        print(selector)
        print(declaration_block)

>>> with open("test.py") as f:
...     code = compile(f.read(), "test.py", 'exec')
...     exec(code)
... 
.c0
{ padding: 1px 0px 0px; font-size: 11px }
.c1
{ margin: 0px; font-size: 11px }
.c2
{ font-size: 11px }
.c3
{ font-size: 11px; font-style: italic; font-weight: bold }
>>>
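
The split itself can be tried without an HTML file. Below is a minimal sketch of the same idea on a plain string, assuming (as in the answer above) that every selector looks like .c0, .c1 and so on. split_rules is an illustrative name, and the pattern uses [0-9]+ so that at least one digit is required.

```python
import re


def split_rules(css_text):
    """Split generated CSS into {selector: declaration_block} (illustrative helper)."""
    # With a capture group, re.split keeps the matched selectors, so the
    # result alternates: text before, selector, block, selector, block, ...
    parts = re.split(r"(\.c[0-9]+)", css_text)
    # parts[0] is whatever precedes the first selector, so start at index 1
    return {
        parts[i].strip(): parts[i + 1].strip()
        for i in range(1, len(parts), 2)
    }


css = ".c0 { padding: 1px 0px 0px; font-size: 11px } .c1 { margin: 0px; font-size: 11px }"
for selector, block in split_rules(css).items():
    print(selector, block)
```

This only works because the selectors follow one rigid naming scheme. For arbitrary CSS a real parser is the safer choice.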

parse embedded css beautifulsoup

You can use a CSS parser like [cssutils][1]. I don't know whether the package itself has a function for this (can someone comment on that?), but I wrote a custom function to do it.

from bs4 import BeautifulSoup
import cssutils
html='''
<html>
    <head>
        <style type="text/css">
        * {margin:0; padding:0; text-indent:0; }
        .s5 {color: #000; font-family:Verdana, sans-serif;
             font-style: normal; font-weight: normal;
             text-decoration: none; font-size: 17.5pt;
             vertical-align: 10pt;}
        </style>
    </head>

    <body>
        <p class="s1" style="padding-left: 7pt; text-indent: 0pt; text-align:left;">
        This is a sample sentence. <span class="s5"> 1</span>
        </p>
    </body>
</html>
'''
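def get_property(sheet, class_name, property_name):
    """Return the value of property_name in the rule whose selector is .class_name."""
    for rule in sheet:
        # comments and at-rules have no selectorText, so use getattr
        if getattr(rule, 'selectorText', None) == '.' + class_name:
            for prop in rule.style:
                if prop.name == property_name:
                    return prop.value
    return None

soup = BeautifulSoup(html, 'html.parser')
sheet = cssutils.parseString(soup.find('style').text)
vl = get_property(sheet, 's5', 'vertical-align')
print(vl)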

Output

10pt

This is not perfect but maybe you can improve upon it.
[1]: https://pypi.org/project/cssutils/

Python CSS Parser

I solved this problem with regular expressions, so I more or less ended up writing my own parser. I built a regular expression that matches color patterns such as #XXX, #XXXXXX, rgb(X,X,X) and hsl(X,X,X) in the CSS file, and kept a list of the positions where they occur. Then I rewrote the colors at the positions stored in that list. This is the best summary I can give of what I did; please add a comment if you need a more detailed explanation. Thank you.
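
A rough sketch of that approach, since the original code was not posted. The pattern, the function names find_colors and rewrite_colors, and the sample CSS are all assumptions made for illustration, not the answerer's actual code.

```python
import re

# Hex colors (#XXXXXX or #XXX) and rgb()/rgba()/hsl()/hsla() functions.
COLOR_RE = re.compile(
    r"#[0-9a-fA-F]{6}\b|#[0-9a-fA-F]{3}\b|(?:rgb|hsl)a?\([^)]*\)"
)


def find_colors(css_text):
    """Return a list of (start, end, color_text) for every color found."""
    return [(m.start(), m.end(), m.group()) for m in COLOR_RE.finditer(css_text)]


def rewrite_colors(css_text, transform):
    """Rebuild the CSS, replacing each color with transform(color)."""
    pieces = []
    last = 0
    for start, end, color in find_colors(css_text):
        pieces.append(css_text[last:start])  # untouched text before the color
        pieces.append(transform(color))      # rewritten color
        last = end
    pieces.append(css_text[last:])           # tail after the last color
    return "".join(pieces)


css = "a { color: #fff; background: rgb(0, 0, 0) } b { color: #A1B2C3 }"
print(find_colors(css))
print(rewrite_colors(css, str.upper))
```

One known limitation of this sketch: a three or six character ID selector made only of hex digits, such as #fab, also matches the pattern, so selectors would need to be excluded in real use.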

Is there a way to extract CSS from a webpage using BeautifulSoup?

If your goal is to truly parse the css:

There are various methods here: Prev Question w/ Answers
I have also used a nice example from this site: Python Code Article

The page that requests fetches and Beautiful Soup parses is the entire page: it includes the head, styles, scripts, and the tags that link in external CSS and JS. I have used the method from the Python Code Article before and retested it for the link you provided.

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "ENTER YOUR LINK HERE"

# initialize a session & set User-Agent as a regular browser
session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

# get the HTML content
html = session.get(url).content

# parse HTML using beautiful soup
soup = bs(html, "html.parser")
print(soup)

Looking at the soup output (it is very long, so I will not paste it here), you can see it is a complete page. Just make sure to paste in your specific link.

Now, if you want to parse the result to pick up all the CSS URLs, you can add this (I am still using parts of the code from the Python Code Article linked above):

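# get the CSS files (only <link> tags with rel="stylesheet",
# so favicons and other non-CSS links are skipped)
css_files = []
for css in soup.find_all("link", rel="stylesheet"):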
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)
print(css_files)

The resulting css_files list holds the URLs of the linked CSS files. You can now fetch those separately and see the styles that are being imported.

NOTE: this particular site mixes styles inline with the HTML (i.e. they did not always use CSS files to set the style properties; sometimes the styles are inside the HTML content).
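
To pick up those inline styles as well, something like the following might work. It is a sketch that uses the standard library's html.parser instead of BeautifulSoup so it runs without extra installs. InlineStyleCollector and parse_declarations are made-up names, and the naive split on ";" will break on values that themselves contain a semicolon (for example data: URLs).

```python
from html.parser import HTMLParser


def parse_declarations(style_text):
    """Turn 'a: b; c: d' into {'a': 'b', 'c': 'd'} (naive, illustrative)."""
    declarations = {}
    for chunk in style_text.split(";"):
        name, sep, value = chunk.partition(":")
        if sep and name.strip():
            declarations[name.strip().lower()] = value.strip()
    return declarations


class InlineStyleCollector(HTMLParser):
    """Collect style="..." attributes and the text of <style> blocks."""

    def __init__(self):
        super().__init__()
        self.inline_styles = []  # list of (tag, {property: value})
        self.style_blocks = []   # raw CSS text of each <style> element
        self._in_style = False

    def handle_starttag(self, tag, attrs):
        if tag == "style":
            self._in_style = True
            self.style_blocks.append("")
        style_attr = dict(attrs).get("style")
        if style_attr:
            self.inline_styles.append((tag, parse_declarations(style_attr)))

    def handle_endtag(self, tag):
        if tag == "style":
            self._in_style = False

    def handle_data(self, data):
        if self._in_style:
            self.style_blocks[-1] += data


page = '<style>.s5 { color: #000 }</style><p style="padding-left: 7pt; text-align:left;">Hi</p>'
collector = InlineStyleCollector()
collector.feed(page)
collector.close()
print(collector.inline_styles)
print(collector.style_blocks)
```

The text collected in style_blocks can then be handed to a CSS parser, and the inline declarations are already key/value pairs.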

This should get you started.

Parse rendered HTML for CSS property value in Python

index.html:

<!DOCTYPE html>
<html>
<head>
  <!-- For html5 (default is UTF-8) -->
  <meta charset="UTF-8">
  <title>Phantom JS Example</title>

  <style>
    img#red {
      position: absolute;
      left: 100px;
      top: 100px;
      z-index: 5;    /* z-index set by CSS */
    }

    img#black {
      position: absolute;
      left: 100px;
      top: 100px;
      z-index: 3;    /* z-index set by CSS */
    }
  </style>
</head>

<body>
  <div>Hello</div>

  <img src="4row_red.png" id="red" width="40" height="40">
  <img src="4row_black.png" id="black" width="40" height="40">

<script>
window.onload = function() {
  var red = document.getElementById('red');
  red.style.zIndex = "1";    // z-index set by JAVASCRIPT
};
</script>
</body>
</html>

from selenium import webdriver

driver = webdriver.PhantomJS()
driver.set_window_size(1120, 550) #For bug
driver.get("http://localhost:8000")

png = driver.find_element_by_id('red')
#print(png.css_value('zIndex'))  <--AttributeError: 'WebElement' object has no attribute 'css_value'
print("{} id={}-> {}".format(
    png.tag_name,
    png.get_attribute('id'),
    png.value_of_css_property('zIndex')
))
#print(png.style('zIndex'))  <--AttributeError: 'WebElement' object has no attribute 'style'
print("get_attribute('zIndex') -> {}".format(
    png.get_attribute('zIndex')
))

print('-' * 20)

imgs = driver.find_elements_by_tag_name('img')

for img in imgs:
    print("{} id={}-> {}".format(
        img.tag_name,
        img.get_attribute('id'),
        img.value_of_css_property('zIndex')
    ))

print("get_attribute('zIndex') -> {}".format(
    imgs[-1].get_attribute('zIndex')
))

print('-' * 20)

all_tags = driver.find_elements_by_tag_name('*')

for tag in all_tags:
    print("{} --> {}".format(
        tag.tag_name,
        tag.value_of_css_property('zIndex')
    ))

driver.quit()

--output:--
img id=red-> 1    #Found the z-index set by the js.
get_attribute('zIndex') -> None  #Didn't find the z-index set by the js
--------------------
img id=red-> 1
img id=black-> 3  #Found the z-index set by the css stylesheet
get_attribute('zIndex') -> None  #Didn't find the z-index set by the css stylesheet
--------------------
html --> 0
head --> auto
meta --> auto
title --> auto
style --> auto
body --> auto
div --> auto
img --> 1
img --> 3
script --> auto

Setup:

$ pip install selenium
$ brew install phantomjs

https://realpython.com/blog/python/headless-selenium-testing-with-python-and-phantomjs/

value_of_css_property(property_name):
Returns the value of a CSS property

get_attribute(name):
Gets the given attribute or property of the element.

This method will return the value of the given property if this is set, otherwise it returns the value of the attribute with the same name if that exists, or None.

Values which are considered truthy, that is equals “true” or “false”, are returned as booleans. All other non-None values are returned as strings. For attributes or properties which does not exist, None is returned.

Args:

name - Name of the attribute/property to retrieve.

http://selenium-python.readthedocs.org/en/latest/api.html#selenium.webdriver.remote.webelement.WebElement.value_of_css_property

Not very good docs.

Parse CSS for url() values with Python 2.7

cssutils.getUrls
