W3C CSS Grammar, Syntax Oddities

CSS, while similar to many programming languages, does have some rare instances where whitespace can be important.


Say we have the following base markup:

<html>
<head>
  <style type="text/css">
    .blueborder { width:200px; height:200px; border: solid 2px #00f; }
    .redborder { width:100px; height:100px; margin:50px; border: solid 2px #f00; }
  </style>
</head>
<body>
  <div class="blueborder">
    <div class="redborder"></div>
  </div>
</body>
</html>

There's nothing special here: just a div inside a div, with some styles on them so that you can see the difference between the two.

Now let's add another class and an ID to the outer div:

<div class="blueborder MyClass" id="MyDiv">
<div class="redborder"></div>
</div>

If I want to give a background to the outer div in the following manner:

.MyClass#MyDiv { background: #ccc; }

...then the whitespace becomes important. The rule above does style the outer div, because it is parsed differently from the following:

.MyClass #MyDiv { background: #ccc; }

...which does NOT style the outer div.

To see how these are parsed differently, you can look at how the selectors are tokenized:

Example 1:

.MyClass#MyDiv -> DELIM IDENT HASH

Example 2:

.MyClass #MyDiv -> DELIM IDENT S HASH

If we were to blindly ignore whitespace (as compilers usually do), we would miss this difference.
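
To make those token streams concrete, here is a minimal tokenizer sketch in C#. This is only an illustration, not the W3C scanner: the patterns are simplified (no escape sequences, strings, or numbers), and it recognizes just the four token types used above.

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class SelectorTokenizer
{
    // Simplified token patterns; the real scanner handles escapes,
    // strings, numbers, etc. \G anchors each match at the current position.
    static readonly (string Name, Regex Pattern)[] TokenDefs =
    {
        ("HASH",  new Regex(@"\G#[A-Za-z_][A-Za-z0-9_-]*")),
        ("IDENT", new Regex(@"\G[A-Za-z_][A-Za-z0-9_-]*")),
        ("S",     new Regex(@"\G\s+")),
        ("DELIM", new Regex(@"\G.")),  // any other single character, e.g. '.'
    };

    static IEnumerable<string> Tokenize(string selector)
    {
        int pos = 0;
        while (pos < selector.Length)
        {
            foreach (var (name, pattern) in TokenDefs)
            {
                Match m = pattern.Match(selector, pos);
                if (!m.Success) continue;
                pos += m.Length;
                yield return name;
                break;
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Tokenize(".MyClass#MyDiv")));  // DELIM IDENT HASH
        Console.WriteLine(string.Join(" ", Tokenize(".MyClass #MyDiv"))); // DELIM IDENT S HASH
    }
}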


That being said, I am not implying that this grammar is good. I have a fair amount of experience writing grammars, and I cringe when looking at this one. The easiest solution would have been to fold the # and . symbols into the IDENT token, after which everything else becomes much easier.

However, they chose not to do this, and the significance of whitespace is an artifact of that decision.
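
To illustrate the suggested alternative (a hypothetical sketch, not the actual W3C grammar), fold the punctuation into the tokens themselves. Applied to the toy tokenizer above, only the token table changes, and the scanner output becomes self-describing, with no DELIM+IDENT pairs for the parser to reassemble:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class FoldedTokenizer
{
    // Hypothetical alternative: '.' and '#' are folded into the tokens,
    // so a selector never starts with a bare DELIM.
    static readonly (string Name, Regex Pattern)[] TokenDefs =
    {
        ("CLASS", new Regex(@"\G\.[A-Za-z_][A-Za-z0-9_-]*")),  // .MyClass
        ("HASH",  new Regex(@"\G#[A-Za-z_][A-Za-z0-9_-]*")),   // #MyDiv
        ("IDENT", new Regex(@"\G[A-Za-z_][A-Za-z0-9_-]*")),
        ("S",     new Regex(@"\G\s+")),
    };

    static IEnumerable<string> Tokenize(string selector)
    {
        int pos = 0;
        while (pos < selector.Length)
        {
            foreach (var (name, pattern) in TokenDefs)
            {
                Match m = pattern.Match(selector, pos);
                if (!m.Success) continue;
                pos += m.Length;
                yield return name;
                break;
            }
        }
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Tokenize(".MyClass#MyDiv")));  // CLASS HASH
        Console.WriteLine(string.Join(" ", Tokenize(".MyClass #MyDiv"))); // CLASS S HASH
    }
}

Note that the S token still distinguishes the two selectors; what the folding buys is a simpler grammar on top of the token stream.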

Exercise 4.2.8 from Compilers - Principles, Techniques, & Tools (a.k.a. Dragon Book)

Instead of trying to write a grammar, we could just create a finite-state machine (which, as we know, is equivalent to a regular grammar). It should be clear that each state in this finite-state machine corresponds to some subset of the n attribute types, and that all of its valid transitions go to states representing a subset with exactly one more attribute type. (All of the states are accepting, since there is no minimum number of attributes.) So for the grammar with four attribute types (the original grammar), we end up with 16 states:

[Figure: the 16-state machine; each state is a subset of the four attribute types]

To convert that to a grammar, we only need to make each state a non-terminal and each edge a production. Since there are never more than n transitions out of any state, the total number of edges (and hence productions) is less than n·2^n. (In fact, there are exactly n·2^(n-1) transitions in the state machine.) Furthermore, every production right-hand side has at most two symbols.
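
To make the construction concrete, here is a small C# sketch (my illustration, not from the book; the attribute names are hypothetical) that represents each state as a bitmask over the n attribute types and prints one production per edge, plus the eps-production that makes every state accepting:

using System;

class AttributeGrammar
{
    static void Main()
    {
        // Hypothetical attribute types; n = 4 yields the 16 states above.
        string[] attrs = { "public", "static", "final", "synchronized" };
        int n = attrs.Length;
        int edges = 0;

        // Each state is a bitmask recording which attributes have appeared.
        for (int state = 0; state < (1 << n); state++)
        {
            for (int bit = 0; bit < n; bit++)
            {
                if ((state & (1 << bit)) != 0) continue; // attribute already used
                int next = state | (1 << bit);
                Console.WriteLine($"S{state} -> {attrs[bit]} S{next}"); // one production per edge
                edges++;
            }
            Console.WriteLine($"S{state} -> eps"); // every state is accepting
        }

        // Confirms the n·2^(n-1) edge count from the text: 32 edges for n = 4.
        Console.WriteLine($"{edges} edges");
    }
}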

Parsing CSS in C#: extracting all URLs

I finally got Alba.CsCss, my port of the CSS parser from Mozilla Firefox, working.

First and foremost, the question contains two errors:

  1. The url (img) syntax is incorrect, because whitespace is not allowed between url and ( in the CSS grammar. Therefore, "img6", "img7" and "img8" should not be returned as URLs.

  2. An unclosed quote in a url function (url('img)) is a serious syntax error; web browsers, including Firefox, do not seem to recover from it and simply skip the rest of the CSS file. Therefore, the parser should not be required to return "img9" and "img10" (though it should return them if the two problematic lines are removed).

With CsCss, there are two solutions.

The first solution is to rely just on the tokenizer CssScanner.

List<string> uris = new CssLoader().GetUris(source).ToList();

This will return all "img" URLs (except those mentioned in error #1 above), but it will also include "noimg3", since property names are not checked.

The second solution is to properly parse the CSS file. This will most closely mimic the behavior of browsers (including stopping parsing after an unclosed quote).

var css = new CssLoader().ParseSheet(source, SheetUri, BaseUri);
List<string> uris = css.AllStyleRules
    .SelectMany(styleRule => styleRule.Declaration.AllData)
    .SelectMany(prop => prop.Value.Unit == CssUnit.List
        ? prop.Value.List
        : new[] { prop.Value })
    .Where(value => value.Unit == CssUnit.Url)
    .Select(value => value.OriginalUri)
    .ToList();

If the two problematic lines are removed, this will return all correct "img" URLs.

(The LINQ query is complex because the background-image property in CSS3 can contain a list of URLs.)

Grammar restrictions on token look ahead

The second restriction can be relaxed somewhat by requiring that the parser be able to decide which production to use based on the first k tokens (rather than on a single token). This allows for efficient (i.e. linear-time) parsing algorithms for this class of grammars (see Recursive descent parser).

The main reason for choosing k=1 in practice seems to be that parsers for LL(1) grammars are easier to construct. Apparently many computer languages are designed to be generated by an LL(1) grammar. See LL parser.
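
As an illustration, here is a minimal recursive-descent sketch in C# for the toy LL(1) grammar S -> '(' S ')' S | eps (a hypothetical example, not taken from the question). Every parsing decision inspects exactly one token of lookahead, which is what makes the parser run in linear time. For contrast, the grammar in the next paragraph cannot be parsed this way.

using System;

// A minimal LL(1) recursive-descent sketch for the toy grammar
//   S -> '(' S ')' S | eps
// Each decision inspects exactly one token (here, one character).
class Ll1Parser
{
    readonly string input;
    int pos;

    Ll1Parser(string input) { this.input = input; }

    char Lookahead => pos < input.Length ? input[pos] : '$'; // '$' marks end of input

    void Expect(char c)
    {
        if (Lookahead != c) throw new Exception($"expected '{c}' at position {pos}");
        pos++;
    }

    void ParseS()
    {
        if (Lookahead == '(')  // FIRST('(' S ')' S) = { '(' }
        {
            Expect('(');
            ParseS();
            Expect(')');
            ParseS();
        }
        // otherwise: S -> eps, consume nothing
    }

    static bool Parses(string s)
    {
        var parser = new Ll1Parser(s);
        try { parser.ParseS(); return parser.pos == s.Length; }
        catch { return false; }
    }

    static void Main()
    {
        Console.WriteLine(Parses("(()())")); // True
        Console.WriteLine(Parses("(()"));    // False
    }
}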

The grammar consisting of the productions S -> A | B, A -> a A b | eps, and B -> a B b b | eps is an example of an unambiguous non-LL(1) grammar: strings derived from both A and B begin with a, so a single token of lookahead cannot tell the parser whether to use S -> A or S -> B; only the number of trailing b's reveals that. (Taken from here.)

How to store URLs in a table?

While browsers support URLs with thousands of characters, that is rare, and you said yourself that you want to be able to store an average URL. The length of your field would depend on what you expect to be average. Are you primarily going to be storing "home pages" (as in user profiles)? Or would it be URLs that deep link to content on a site with long paths and/or query strings -- basically any URL?

If it's the latter (any URL), then go with the 2083 number mentioned in the answers to this question. Keep in mind that there is a maximum row size in SQL Server of roughly 8000 bytes, though. If this is the same table as a bunch of other information (again, such as user profiles), you may hit that limit.

For my own projects, there is typically some more context around which URLs will be stored. I come up with a list of potential URLs, decide on a length that would accommodate all of them, and then double it. I typically end up with a field length of 300 or 500 for URLs, and I don't remember ever having a problem.


