Parsing CSS by Regex

Parsing CSS by regex

That seems too convoluted for a single regular expression. With the right extensions, an advanced user could probably write such a regex, but then you'd need an even more advanced user to debug it.

Instead, I'd suggest using a regex to pull out the pieces, and then tokenising each piece separately. e.g.,

/([^{]+?)\s*\{\s*([^}]*?)\s*\}/

Then you end up with the selector and the attributes (the declarations) in separate groups, which you can split up further. (Even the selector will be fun to parse.) Note that even this breaks if a } can appear inside a quoted string or something similar. You could, again, convolute the heck out of the regex to avoid that, but it's probably better to avoid regexes altogether here and parse one field at a time, perhaps with a recursive-descent parser, yacc/bison, or whatever.
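
To make the two-step idea concrete, here is a minimal Python sketch. The names RULE and parse_rules are made up for illustration, and it assumes flat CSS: no nested blocks such as @media, and no braces or semicolons inside quoted strings or url(...) values.

```python
import re

# Step 1: a deliberately simple pattern that only isolates "selector { body }".
# Assumption: no nested blocks and no braces inside strings.
RULE = re.compile(r'([^{}]+?)\s*\{\s*([^}]*?)\s*\}')

def parse_rules(css):
    """Return a list of (selector, {property: value}) pairs."""
    rules = []
    for match in RULE.finditer(css):
        selector = match.group(1).strip()
        declarations = {}
        # Step 2: tokenise the body on its own, one declaration at a time.
        for chunk in match.group(2).split(';'):
            if ':' in chunk:
                name, value = chunk.split(':', 1)
                declarations[name.strip()] = value.strip()
        rules.append((selector, declarations))
    return rules

sample = "h1 { color: red; margin: 0 } .note{font-weight:bold;}"
print(parse_rules(sample))
```

The regex only finds the block boundaries; everything inside a block is handled by ordinary string operations, which is much easier to debug than one large pattern.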

Javascript Regular Expression to Parse CSS

Don't try to match it with a complex regex. Instead, split it using a simpler one.

For splitting, we can use String.split and pass the regex /[{}]/ into it. Then we use Array.map to trim each string so that only the content remains, and finally Array.filter to remove the empty strings that are left over.

var arr = str.split(/[{}]/).map(function (s) {
    return s.trim();
}).filter(Boolean);

It also has the advantage of working on all CSS rules and not just classes, provided the CSS is valid and has no nested blocks such as @media.
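
For comparison, the same split-then-trim idea can be sketched in Python, with the pieces paired up as (selector, body). The function name split_rules is hypothetical, and the pairing assumes flat CSS with no empty rule bodies, since an empty body would be filtered out and shift the pairs.

```python
import re

def split_rules(css):
    # Split on either brace, trim each piece, and drop pieces that are empty.
    parts = [p.strip() for p in re.split(r'[{}]', css)]
    parts = [p for p in parts if p]
    # The remaining pieces alternate: selector, body, selector, body, ...
    return list(zip(parts[0::2], parts[1::2]))

css = "a { color: blue } #main > p { margin: 0; }"
print(split_rules(css))
```

Note that the child combinator in "#main > p" survives untouched, because only braces are treated as delimiters.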

Parse CSS expression with regex

\s means whitespace, and \s* means zero or more occurrences of whitespace.

This is what you are looking for:

((?:\.|#)?[a-zA-Z0-9_-]+\s*\{[^}]*\})

Demo: https://regex101.com/r/lR4aW1/3

Regular expression to parse css

You can use pyparsing to parse such nested braces.

import pyparsing as pp

string = '@keyframes mymove {from {top: 0px;} to {top: 200px;}}'
pattern = pp.Regex(r'^.*?(?= \{)') + pp.original_text_for(pp.nested_expr('{', '}'))
selector, rules = pattern.parse_string(string)

# Tests
assert selector == '@keyframes mymove'
assert rules == '{from {top: 0px;} to {top: 200px;}}'

* pyparsing can be installed by pip install pyparsing

See also this post: Python: How to match nested parentheses with regex?
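
If pulling in a library is not an option, nested braces can also be matched by hand with a depth counter. This is a different technique from the pyparsing answer, not a description of how pyparsing works internally; the function name split_top_level_block is made up, and the sketch assumes braces never occur inside quoted strings or comments.

```python
def split_top_level_block(css):
    """Return (prelude, block) for the first top-level {...} in css."""
    start = css.index('{')  # raises ValueError if there is no block at all
    depth = 0
    for i in range(start, len(css)):
        if css[i] == '{':
            depth += 1
        elif css[i] == '}':
            depth -= 1
            if depth == 0:
                # Depth is back to zero: this brace closes the outer block.
                return css[:start].strip(), css[start:i + 1]
    raise ValueError('unbalanced braces')

string = '@keyframes mymove {from {top: 0px;} to {top: 200px;}}'
selector, rules = split_top_level_block(string)
print(selector)
print(rules)
```

A counter is needed because a regular expression without recursion support cannot keep track of arbitrary nesting depth.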

Regex to parse CSS selector

Thanks all very much for your suggestions and help. I tied it all together into the following two Regex Patterns:

This one parses the CSS selector string (e.g. div#myid.myclass[attr=1,fred=3]) http://www.rubular.com/r/2L0N5iWPEJ

import re
cssSelector = re.compile(r'^(?P<type>[*\w-]+)?(?P<id>#[\w-]+)?(?P<classes>\.[\w.-]+)*(?P<data>\[.+\])*$')

>>> cssSelector.match("table#john.test.test2[hello]").groups()
('table', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("table").groups()
('table', None, None, None)
>>> cssSelector.match("table#john").groups()
('table', '#john', None, None)
>>> cssSelector.match("table.test.test2[hello]").groups()
('table', None, '.test.test2', '[hello]')
>>> cssSelector.match("table#john.test.test2").groups()
('table', '#john', '.test.test2', None)
>>> cssSelector.match("*#john.test.test2[hello]").groups()
('*', '#john', '.test.test2', '[hello]')
>>> cssSelector.match("*").groups()
('*', None, None, None)

And this one does the attributes (e.g. [link,key~=value]) http://www.rubular.com/r/2L0N5iWPEJ:

attribSelector = re.compile(r'(?P<word>\w+)\s*(?P<operator>[^\w,]{0,2})\s*(?P<value>\w+)?\s*[,\]]')

>>> a = attribSelector.findall("[link, ds9 != test, bsdfsdf]")
>>> for x in a: print(x)
('link', '', '')
('ds9', '!=', 'test')
('bsdfsdf', '', '')

A couple of things to note:
1) This parses attributes using comma delimitation (since I am not using strict CSS).
2) This requires that the parts of a selector appear in this order: tag, id, classes, attributes

The first regex works on one token at a time, i.e. on each of the whitespace- and '>'-separated parts of a selector string. This is because I wanted to use it to check against my own object graph :)

Thanks again!
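
A rough sketch of how the token-by-token use described above might look. CSS_SELECTOR is a lightly tidied copy of the selector pattern from this answer, and parse_selector is a made-up helper name. It assumes that attribute brackets contain no whitespace or '>', because the naive split would otherwise cut them apart.

```python
import re

# Lightly tidied copy of the selector pattern above (assumed to have the same intent).
CSS_SELECTOR = re.compile(
    r'^(?P<type>[*\w-]+)?(?P<id>#[\w-]+)?'
    r'(?P<classes>\.[\w.-]+)*(?P<data>\[.+\])*$'
)

def parse_selector(selector):
    """Split on whitespace and '>' and match each token separately."""
    tokens = [t for t in re.split(r'[\s>]+', selector.strip()) if t]
    parsed = []
    for token in tokens:
        match = CSS_SELECTOR.match(token)
        # None marks a token the pattern could not make sense of.
        parsed.append(match.groupdict() if match else None)
    return parsed

for part in parse_selector("div#main > ul.nav li.item.active"):
    print(part)
```

The split throws away which combinator sat between two tokens; keeping that information would need a capturing split or a small tokenizer.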

Parsing css background url and selector using regex

Update

After a closer look, I offer 2 solutions that mitigate the backtracking issues to some degree.

Before looking at them, I want to point out that there are only a very few delimiters associated with CSS syntax.

Moreover, CSS syntax is defined more by the order and content of the allowed characters than by those delimiters.

The cure for backtracking is to restrict the regex engine to fewer allowable characters to match, and to do so at strategic positions.

If you look at the CSS specification here -> https://www.w3.org/TR/CSS21/syndata.html you'll notice that its tokens are defined by regular expressions.

That suggests CSS parsers are largely built from small, chopped-up pieces of regex.

However, while it would be an interesting exercise to put it all into one all-encompassing regex, I will decline that challenge, because there is nothing in it for me.

Instead, I offer these 2 regexes tailored to your request.

First one:

  • Matches only the first url() block within the <style> element

<style[^>]*?>(?:[^{}:]*{[^{}]*?:[^{}()]*?})*?(?:([^{}:]*){[^{}]*?:\s*url\s*\(\s*([^{}()]*?)\s*\)\s*})

see -> https://regex101.com/r/2SNIks/1


Second one:

  • Matches all the url() blocks within the <style> element

(?:<style[^>]*?>|(?!^)\G)(?:(?:(?!</style)[^{}:])*{[^{}]*?:[^{}()]*?})*?(?:([^{}:]*){[^{}]*?:\s*url\s*\(\s*([^{}()]*?)\s*\)\s*})

see -> https://regex101.com/r/d8q6LH/1


For both regexes,

  • The selector is in group 1
  • The url is in group 2
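
For readers working in Python, whose built-in re module has no \G anchor, the second regex will not port directly. A simpler multi-step alternative is sketched below. It is a different technique from the two regexes above, and the names STYLE, RULE, URL and background_urls are invented for this example. It assumes flat rules inside <style> and no parentheses or quotes inside the URL itself.

```python
import re

# Each pattern is kept small and restricted to a narrow set of characters.
STYLE = re.compile(r'<style[^>]*>(.*?)</style\s*>', re.S | re.I)
RULE = re.compile(r'([^{}]+)\{([^{}]*)\}')
URL = re.compile(r'url\(\s*["\']?([^"\')]*?)["\']?\s*\)', re.I)

def background_urls(html):
    """Yield (selector, url) for every url(...) inside <style> elements."""
    for style in STYLE.finditer(html):
        for rule in RULE.finditer(style.group(1)):
            selector = rule.group(1).strip()
            for url in URL.finditer(rule.group(2)):
                yield selector, url.group(1)

html = """<style type="text/css">
.hero { background: url( img/hero.png ) no-repeat; }
.plain { color: red; }
#logo { background-image: url('logo.svg'); }
</style>"""
print(list(background_urls(html)))
```

Splitting the work into three narrow patterns follows the same principle as above: each one allows only a small set of characters, so there is little room for runaway backtracking.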

regex to parse CSS from HTML string fails when child combinator is used

The safer regex is this

/(?:<(style)(?:\s+(?=((?:"[\S\s]*?"|'[\S\s]*?'|(?:(?!\/>)[^>])?)+))\2)?\s*>)([\S\s]*?)<\/\1\s*>/

https://regex101.com/r/sx2YPf/1

I recommend using this one. The content is in group 3.

If you want to match all invisible content, put this in place of style: script|style|object|embed|applet|noframes|noscript|noembed

Expanded for readability:

(?:
    <
    ( style )                     # (1), Invisible content; end tag req'd
    (?:
        \s+
        (?=
            (                     # (2 start)
                (?:
                    " [\S\s]*? "
                  | ' [\S\s]*? '
                  | (?:
                        (?! /> )
                        [^>]
                    )?
                )+
            )                     # (2 end)
        )
        \2
    )?
    \s* >
)
( [\S\s]*? )                      # (3)
</ \1 \s* >

If anybody is curious, the lookahead assertion that matches the rest of the style tag's attributes and values not only does that validation, but also ensures the style tag is not self-closing (even if only by a typo).

The contents of the assertion are matched passively and are immune to backtracking. The matched text is captured and then consumed just past the assertion by the backreference \2. That position is subject to backtracking, but by then the backreference is just a literal.

In a non-JS environment like PHP, the same thing is accomplished by using an atomic group (?>...) instead of the assertion.
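
The lookahead-plus-backreference trick is easier to see on a toy pattern than on the full style-tag regex. The Python sketch below uses made-up patterns on runs of the letter a. It relies on Python's re treating a lookahead as atomic once it has succeeded, which is my understanding of the engine.

```python
import re

# Plain greedy quantifier: a+ can give back one "a" so the final "a" matches.
plain = re.compile(r'a+a')

# Emulated atomic group: the lookahead captures the whole run of "a",
# then \1 consumes exactly that text and has nothing to give back.
atomic_like = re.compile(r'(?=(a+))\1a')

print(plain.search('aaa'))        # a match object for 'aaa'
print(atomic_like.search('aaa'))  # None, because \1 already swallowed every "a"

# When no give-back is needed, the emulated form matches normally.
print(re.search(r'(?=(a+))\1b', 'aaab').group())  # aaab
```

In the style-tag regex, the same mechanism stops the engine from re-trying different ways of carving up the attribute text. As far as I know, Python 3.11 and later also accept (?>...) directly.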

Parse inline CSS values with Regex?

Another way, using a regex:

$css = "color:#777;font-size:16px;font-weight:bold;left:214px;position:relative;top:   70px";

$results = array();
preg_match_all("/([\w-]+)\s*:\s*([^;]+)\s*;?/", $css, $matches, PREG_SET_ORDER);
foreach ($matches as $match) {
    $results[$match[1]] = $match[2];
}

print_r($results);

Outputs:


Array
(
[color] => #777
[font-size] => 16px
[font-weight] => bold
[left] => 214px
[position] => relative
[top] => 70px
)
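
For comparison, a rough Python port of the same idea. The function name parse_inline_style is made up, and the sketch assumes that no value contains a semicolon, as one inside a quoted string or a data: URL would.

```python
import re

def parse_inline_style(css):
    # Property name, then a colon, then everything up to the next semicolon.
    pairs = re.findall(r'([\w-]+)\s*:\s*([^;]+)', css)
    # Strip each value so that trailing whitespace before ";" is not kept.
    return {name: value.strip() for name, value in pairs}

css = "color:#777;font-size:16px;font-weight:bold;left:214px;position:relative;top:   70px"
print(parse_inline_style(css))
```

If a property appears twice, the later value wins, the same as in the PHP loop.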

