Regular Expressions vs XPath when parsing HTML text I think XPath is the primary option for traversing XML-like documents. With regexes, it is up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, on one line, across multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
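As a quick sketch of that transparency, here is an illustrative example using Python's standard-library ElementTree (whose XPath support is limited but enough for this point; the sample document and element names are mine, not from the original posting):

```python
import xml.etree.ElementTree as ET

# The query operates on the parsed tree, not the raw text, so it is
# unaffected by whatever spacing/quoting the serializer happened to use.
doc = ET.fromstring('<ul><li class="a">one</li><li class="b">two</li></ul>')

# Select by attribute value -- no need to anticipate tag-syntax variants.
items = doc.findall('.//li[@class="b"]')
assert [li.text for li in items] == ['two']

# Access a node by index (XPath positions are 1-based).
assert doc.find('li[1]').text == 'one'
```

The same selections done with a regex would have to defend against every serialization variant by hand.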
EDIT: See also How do HTML parsers work if they're not using regexp?
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms Concentrate on this one:
A finite automaton (which is the data structure underlying a regular expression) does not have memory apart from the state it's in, and if you have arbitrarily deep nesting, you need an arbitrarily large automaton, which collides with the notion of a finite automaton.
The definition of regular expressions is equivalent to the fact that a test of whether a string matches the pattern can be performed by a finite automaton (one different automaton for each pattern). A finite automaton has no memory - no stack, no heap, no infinite tape to scribble on. All it has is a finite number of internal states, each of which can read a unit of input from the string being tested, and use that to decide which state to move to next. As special cases, it has two termination states: "yes, that matched", and "no, that didn't match".
HTML, on the other hand, has structures that can nest arbitrarily deep. To determine whether a file is valid HTML or not, you need to check that all the closing tags match a previous opening tag. To understand it, you need to know which element is being closed. Without any means to "remember" what opening tags you've seen, no chance.
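That missing "memory" is exactly what a pushdown stack supplies. A minimal sketch (the function and its token format are my own, for illustration only):

```python
def is_balanced(tags):
    # A stack gives the memory a finite automaton lacks: each closing
    # tag must match the most recently opened, still-unclosed tag.
    stack = []
    for kind, name in tags:  # e.g. ('open', 'div') or ('close', 'div')
        if kind == 'open':
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False
    return not stack  # every opened tag was closed

assert is_balanced([('open', 'div'), ('open', 'p'),
                    ('close', 'p'), ('close', 'div')])
assert not is_balanced([('open', 'div'), ('close', 'p')])
```

Because the stack can grow without bound, this checker handles arbitrary nesting depth, which no fixed set of states can.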
Note however that most "regex" libraries actually permit more than just the strict definition of regular expressions. If they can match back-references, then they've gone beyond a regular language. So the reason why you shouldn't use a regex library on HTML is a little more complex than the simple fact that HTML is not regular.
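A back-reference is a simple demonstration of that extra power. The language "some word, a dash, then the same word again" is not regular, yet Python's re handles it directly (example strings are mine):

```python
import re

# (\w+)-\1 matches only strings whose two halves are identical,
# a language no finite automaton can recognize.
assert re.fullmatch(r'(\w+)-\1', 'abc-abc') is not None
assert re.fullmatch(r'(\w+)-\1', 'abc-abd') is None
```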
Can extended regex implementations parse HTML? The answer to your question is that yes, so-called “extended regexes” — which are perhaps more properly called patterns than regular expressions in the formal sense — such as those found in Perl and PCRE are indeed capable of recursive descent parsing of context-free grammars.
This posting’s pair of approaches illustrates not so much the theoretical as the practical limits of applying regexes to X/HTML. The first approach given there, the one labelled naïve, is more like the sort you are apt to find in most programs that make such an attempt. This can be made to work on well-defined, non-generic X/HTML, often with very little effort. That is its best application, just as open-ended X/HTML is its worst.
The second approach, labelled wizardly, uses an actual grammar for parsing. As such, it is fully as powerful as any other grammatical approach. However, it is also far beyond the powers of the overwhelming majority of casual programmers, and it risks re-creating a perfectly fine wheel for negative benefit. I wrote it to show what can be done, but which under virtually no circumstances ever should be done. I wanted to show people why they would want to use a parser on open-ended X/HTML, by showing them how devilishly hard it is to come even close to getting it right even with some of the most powerful pattern-matching facilities currently available.
Many have misread my posting as somehow advocating the opposite of what I am actually saying. Please make no mistake: I’m saying that it is far too complicated to use. It is a proof by counter-example. I had hoped that by showing how to do it with regexes, people would realize why they did not want to go down that road. While all things are possible, not all are expedient.
My personal rule of thumb is that if the required regex is of only the first category, I may well use it, but that if it requires the fully grammatical treatment of the second category, I use someone else’s already-written parser. So even though I can write a parser, I see no reason to do so, and plenty not to.
When carefully crafted for that explicit purpose, patterns can be more resilient to malformed X/HTML than off-the-shelf parsers tend to be, particularly if you have no real opportunity to hack on said parsers to make them more tolerant of the common failure cases that web browsers accept but validators do not. However, the grammatical patterns I provide above were designed only for well-formed but reasonably generic HTML (albeit without entity replacement, which is easily enough added). Error recovery in parsers is a separate issue altogether, and by no means a pleasant one.
Patterns, especially the far more commonplace non-grammatical ones most people are used to seeing and using, are much better suited for grabbing up discrete chunks one at a time than they are for producing a full syntactic analysis. In other words, regexes usually work better for lexing than they do for parsing. Without grammatical regexes, you should not try to parse grammars.
But don’t take that too far. I certainly do not mean to imply that you should immediately turn to a full-blown parser just because you want to tackle something that is recursively defined. The easiest and perhaps most commonly seen example of this sort of thing is a pattern to detect nested items, like parentheses. It’s extremely common for me to just plop down something simple like this in my code, and be done with it:
# delete all nested parens
s/\((?:[^()]*+|(?0))*\)//g;
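For readers without Perl's recursive (?0) construct at hand, the same effect can be emulated in stock Python by repeatedly deleting innermost paren groups until the string stops changing (a sketch of my own, not a drop-in equivalent of the Perl one-liner):

```python
import re

def delete_nested_parens(s: str) -> str:
    # Python's stdlib re lacks pattern recursion, so emulate it:
    # strip innermost (...) groups, loop until nothing changes.
    prev = None
    while prev != s:
        prev = s
        s = re.sub(r'\([^()]*\)', '', s)
    return s

assert delete_nested_parens('a(b(c)d)e') == 'ae'
```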
What to do when a regular expression pattern doesn't match anywhere in the string? Contrary to all the answers here, for what you're trying to do regex is a perfectly valid solution. This is because you are NOT trying to match balanced tags -- THAT would be impossible with regex! But you are only matching what's in one tag, and that's perfectly regular.
Here's the problem, though. You can't do it with just one regex... you need to do one match to capture an <input> tag, then do further processing on that. Note that this will only work if none of the attribute values have a > character in them, so it's not perfect, but it should suffice for sane inputs.
Here's some Perl (pseudo)code to show you what I mean:
my $html = readLargeInputFile();

my @input_tags = $html =~ m/
    (
    <input                      # Starts with "<input"
    (?=[^>]*?type="hidden")     # Use lookahead to make sure that type="hidden"
    [^>]+                       # Grab the rest of the tag...
    \/>                         # ...except for the />, which is grabbed here
    )/xgm;

# Now each member of @input_tags is something like:
# <input type="hidden" name="SaveRequired" value="False" />

foreach my $input_tag (@input_tags)
{
    my $hash_ref = {};

    # Now extract each of the fields one at a time.
    ($hash_ref->{"name"})  = $input_tag =~ /name="([^"]*)"/;
    ($hash_ref->{"value"}) = $input_tag =~ /value="([^"]*)"/;

    # Put $hash_ref in a list or something, or otherwise process it
}
The basic principle here is, don't try to do too much with one regular expression. As you noticed, regular expressions enforce a certain amount of order. So what you need to do instead is to first match the CONTEXT of what you're trying to extract, then do submatching on the data you want.
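The same two-step principle, context first and fields second, can be sketched in Python with the stdlib re module (the sample HTML is mine, shaped like the Perl example's input):

```python
import re

# Illustrative sample input.
html = '<form><input type="hidden" name="SaveRequired" value="False" /></form>'

# Step 1: match the CONTEXT -- whole <input ... type="hidden" ... /> tags.
tags = re.findall(r'<input(?=[^>]*?type="hidden")[^>]+/>', html)

# Step 2: submatch the individual fields inside each captured tag.
fields = [{'name': re.search(r'name="([^"]*)"', t).group(1),
           'value': re.search(r'value="([^"]*)"', t).group(1)}
          for t in tags]

assert fields == [{'name': 'SaveRequired', 'value': 'False'}]
```

As in the Perl version, this assumes no > appears inside attribute values.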
EDIT: However, I will agree that in general, using an HTML parser is probably easier and better and you really should consider redesigning your code or re-examining your objectives. :-) But I had to post this answer as a counter to the knee-jerk reaction that parsing any subset of HTML is impossible: HTML and XML are both irregular when you consider the entire specification, but the specification of a tag is decently regular, certainly within the power of PCRE.
What is the optimal regex for parsing HTML (even though you shouldn't)? Is there a perfect one? First of all let's get this straight:
Regex's incompatibility with HTML parsing is NOT a claim. Repeat after me: "Not a claim".
It's a scientifically proven and well-known fact. Furthermore, the world was not created in 7 days and big-foot ain't real either. End of discussion.
But let's put this question within the following scope: we have no option other than Regex to parse HTML. Why? It doesn't matter
Funny that you write "it doesn't matter". Given that the "why" is actually what makes what you're planning to do either partially possible or completely impossible, if there was one thing here that mattered, it'd be the "why".
If the "why" is "validation", then the answer is by definition: not possible. Validation requires no less than 100% language coverage, and regular expressions, describing only a proper subset of the context-free languages, therefore cannot cover 100%. By definition.
If the "why" however is "extraction", then you can get quite good results using regex. Never 100% reliable, but good enough for most cases.
Truth is, you can do some incredibly powerful things with modern regex, even if rather verbose.
The sheer length, redundancy and complexity of this pattern show that while it may not be impossible to describe valid email addresses in regex, it is at least disproportionately difficult, and the result rather resembles a brute-force dictionary list than a clean grammar. And while we're at it: date-string validation is even worse. Leap years, just to begin with.
To put my differentiation between "validation" and "extraction" into perspective:
To validate a simple email address one needs a monolithic, 6400+ character long regular expression.
To "extract" the domain name from an email address, however, the simple @([^\s]+) or (?<=@)[^\s]+ would cover pretty much (if not exactly) 100%, assuming the string is isolated and known to be a valid email address.
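Both variants, run through Python's re for concreteness (sample address is mine):

```python
import re

addr = 'user@example.com'  # assumed isolated and already known to be valid

# Capture-group variant:
m = re.search(r'@([^\s]+)', addr)
assert m.group(1) == 'example.com'

# Lookbehind variant (no capture group needed):
m2 = re.search(r'(?<=@)[^\s]+', addr)
assert m2.group(0) == 'example.com'
```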
Is it possible to generate a perfect regular expression for parsing HTML?
You basically answered this one yourself by writing "perfect": No.
If so, is the proof constructive? Do we only know we can, or has it been done?
It's not about "is it just that nobody has managed to do it yet?" but more about "it's been mathematically proven to be impossible!". QED
If it is not possible, what is the most accurate one out there?
Given that it's by definition not possible the only correct answer to this would be "none".
The best approximation of a regex for parsing all (or as much as possible) of HTML would be an infinitely long regex pattern along the lines of x|y|z|…
with x, y, z … being all (brute-forced) possible productions of the grammar of HTML chained together in an infinitely long logical OR. It would be a proper regex (even in the truest terms of regex), cover all of HTML (it lists, and hence matches, all possible strings after all), be only theoretically possible (or at least feasible, just like the Turing machine) and practically utterly useless.
Regex can describe type-3 Chomsky languages (regular languages), while HTML (and most other programming/markup languages) is a type-2 Chomsky language (context-free). The regular languages are a proper subset of the context-free languages. A grammar of type n can always cover a subset of a language of type n-x, but never all of it. Therefore regex can only describe a subset of HTML. Big-foot is a claim. This is a fact.
Being strictly left- or right-linear, regexes have NO understanding of balance or nesting (neither "S→aSa" nor the mixed-linear "S→aA, A→Sb, S→ε"). Therefore you can NOT parse HTML.
A quick example for "S→aSa" (balanced nesting):
<div>
    <div>
        ...
    </div>
</div>
Yep, the very core of what's HTML/XML is incompatible with regex. Pretty damn bad position to begin with, isn't it? HTML parsing via regex is literally rotten in its core. Broken by design. Guaranteed to fail.
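A quick Python illustration (pattern and names are mine) of how a fixed regex caps out at whatever nesting depth is hardcoded into it:

```python
import re

# A pattern hand-written for nesting depth 2. Every additional level
# would require extending the pattern itself -- there is no way to say
# "arbitrary depth" without recursion, which regular expressions lack.
depth2 = re.compile(r'<div>(?:[^<]*|<div>[^<]*</div>)*</div>')

assert depth2.fullmatch('<div><div>x</div></div>') is not None
# One level deeper and the very same pattern fails:
assert depth2.fullmatch('<div><div><div>x</div></div></div>') is None
```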
And one for "S→aA,A→Sb,S→ε" (counting):
It is impossible to validate the correct (matching) number of <td> per row:
<table>
    <tr>
        <td>1</td> <td>2</td> <td>3</td> <td>4</td>
    </tr>
    <tr>
        <td>1</td> <td>2</td> <td>3</td> <td>4</td>
    </tr>
</table>
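To see how trivially a real parser handles the counting that a regex cannot, here is a sketch using Python's stdlib html.parser (the class name and sample markup are mine):

```python
from html.parser import HTMLParser

# Counting <td> per <tr> needs memory (a counter per row), which a
# finite automaton lacks; an event-driven parser does it trivially.
class TdCounter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.rows = []  # one <td> count per <tr>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.rows.append(0)
        elif tag == 'td' and self.rows:
            self.rows[-1] += 1

p = TdCounter()
p.feed('<table><tr><td>1</td><td>2</td></tr><tr><td>1</td></tr></table>')
assert p.rows == [2, 1]  # validating equal counts is now a plain comparison
```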
Another thing to keep in mind: as soon as you reduce the scope of what you can recognize of language "X" you're NO LONGER recognizing language "X", but a NEW and self-contained subset language "Y".
In the field of languages it is either all or nothing. There is no in between.
Now to those saying PCRE can do it!: Yes, it's called a context-free grammar then.
…by which it not only would no longer be a regular expression and thus fail the test:
we have no option other than Regex
but also still be the wrong tool to begin with. There are dedicated parsers for such tasks. Use 'em.
The email matching regex (as linked by OP) is a nightmare to read, let alone maintain:
(?:(?:\r\n)?[ \t])*(?:(?:(?:[^()<>@,;:\\".\[\] \000-\031]+(?:( ?:(?:\r\n)?[ \t])+|\Z|(?=[\["()<>@,;:\\".\[\]]))|"(?:[^\"\r\\] |\\.|(?:(?:\r\n)?[ \t]))*"(?:(?:\r\n)?[ \t])*)(?:\.(?:(?:\r\n) ?[ \t])*(?:[^()<>@,;:\\".\[\] \000-\031]+(?:(?:(?:\r\n)?[ \t]) ... (6400+ chars)
While here is an excerpt of the very same specification in form of a proper context-free grammar:
address     =  mailbox                      ; one addressee
            /  group                        ; named list
group       =  phrase ":" [#mailbox] ";"
mailbox     =  addr-spec                    ; simple address
            /  phrase route-addr            ; name & addr-spec
route-addr  =  "<" [route] addr-spec ">"
route       =  1#("@" domain) ":"           ; path-relative
addr-spec   =  local-part "@" domain        ; global address
local-part  =  word *("." word)             ; uninterpreted
                                            ; case-preserved
domain      =  sub-domain *("." sub-domain)
sub-domain  =  domain-ref / domain-literal
domain-ref  =  atom                         ; symbolic reference
Some people, when confronted with HTML, think "I know, I'll use regular expressions." Now they have two metric f*cktons of problems.
So tell me. Why on earth would anyone really been far even as decided to use even go want to do look more like?
regexp works when extracting HTML, but not with file_get_contents This code should work:
$file_string = file_get_contents('http://wiki.teamliquid.net/starcraft2/ASUS_ROG_NorthCon_2013');

preg_match_all('/<th.{0,30}>.*Organizer.*?<\/a>/msi', $file_string, $organizer);
print_r($organizer);

if (empty($organizer[0])) {
    echo "Couldn't get organizer \n";
    $stats['organizer'] = 'ERROR';
} else {
    $stats['organizer'] = $organizer[0];
}
Instead of $organizer[1], use $organizer[0], because that will be your first (and only) result. You had to make .* lazy by putting a question mark after it. That means it will stop searching once it finds what it's looking for.
For example this code
<a.*>(.*)<\/a>
will search from the first <a> tag to the last one on the page (it doesn't stop when it finds the first </a>), while this pattern
<a.*?>(.*?)<\/a>
will stop searching after it finds the first </a>.
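The same greedy-versus-lazy behaviour, demonstrated in Python for brevity (sample HTML is mine; the patterns mirror the PHP ones above, minus the PCRE delimiters):

```python
import re

html = '<a href="1">first</a> middle <a href="2">second</a>'

# Greedy .* overshoots to the LAST </a>, swallowing everything between:
assert re.findall(r'<a.*>(.*)</a>', html) == ['second']

# Lazy .*? stops at the FIRST </a>, yielding each link separately:
assert re.findall(r'<a.*?>(.*?)</a>', html) == ['first', 'second']
```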
Check the source code once you echo it. This will be the result (I assume you wanted it like this, with the HTML included):
<th valign="top"> Organizer: </th>
<td style="width:55%;">
<a rel="nofollow" target="_blank" class="external text" href="http://www.northcon.de/">NorthCon</a>