Regular Expression to Extract HTML Body Content

Would this work?

((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)

Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:

((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)

On second thought, I am not sure why I needed a negative lookahead. This should also work (for a well-formed XHTML document):

(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
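
A minimal Python sketch of how the last pattern might be applied. The pattern matches everything up to and including the opening body tag, or the closing tag and everything after it, so deleting the matches leaves the body content. The sample document and the helper name extract_body are hypothetical, and the DOTALL and IGNORECASE flags are assumptions added so that the dot spans newlines and BODY matches in any case.

```python
import re

# The simplified pattern from above: either everything through the opening
# <body ...> tag, or the closing </body> tag plus whatever follows it.
# (The redundant backslash before ">" is dropped here.)
OUTSIDE_BODY = re.compile(
    r"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*>.+)",
    re.DOTALL | re.IGNORECASE,
)


def extract_body(html: str) -> str:
    """Delete everything outside <body>...</body> (hypothetical helper)."""
    return OUTSIDE_BODY.sub("", html)


sample = (
    "<html><head><title>t</title></head>\n"
    '<body class="x">\n<p>Hello</p>\n</body>\n</html>'
)
print(repr(extract_body(sample)))  # '\n<p>Hello</p>\n'
```

Note that the second alternative ends in .+, so it only matches when at least one character follows the closing tag.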

regular expression to extract text from HTML

You can't really parse HTML with regular expressions. It's too complex. REs won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text&gt; will work in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.


Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with REs. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
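
As a rough illustration of the parser route, here is a sketch that uses html.parser from the Python standard library rather than Beautiful Soup (a swapped-in choice, so that nothing needs installing). The class name BodyTextExtractor and the helper body_text are hypothetical. It collects text inside the body element, skips script and style content, and decodes entities such as &lt;text&gt; the way a browser would.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect text found inside <body>, ignoring <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode &lt; etc. for us
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html: str) -> str:
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace so the result reads as plain text.
    return " ".join("".join(parser.chunks).split())


doc = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><p>Hello &lt;text&gt; <b>world</b></p>"
    "<script>var x = '<p>no</p>';</script></body></html>"
)
print(body_text(doc))  # Hello <text> world
```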

Regex Extract html Body

Don't use a regular expression for this - use something like the Html Agility Pack.

This is an agile HTML parser that
builds a read/write DOM and supports
plain XPATH or XSLT (you actually
don't HAVE to understand XPATH nor
XSLT to use it, don't worry...). It is
a .NET code library that allows you to
parse "out of the web" HTML files. The
parser is very tolerant with "real
world" malformed HTML. The object
model is very similar to what proposes
System.Xml, but for HTML documents (or
streams).

Then you can extract the body with an XPath expression such as //body.
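
For comparison only, the same idea can be sketched in Python with the standard library's xml.etree.ElementTree, which supports a limited XPath subset. This is not the Html Agility Pack API, and unlike that library it only accepts well-formed XHTML. The helper name body_inner and the sample markup are hypothetical.

```python
import xml.etree.ElementTree as ET


def body_inner(xhtml: str):
    """Return (inner markup, plain text) of the first <body> element.

    Assumes well-formed XHTML without a default namespace; a namespaced
    document would need the namespace in the path.
    """
    root = ET.fromstring(xhtml)
    body = root.find(".//body")  # limited XPath: first body descendant
    if body is None:
        return None
    inner = (body.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in body
    )
    text = "".join(body.itertext())
    return inner, text


sample = (
    "<html><head><title>A Sample Document</title></head>"
    "<body><p>Some <strong>strong</strong> and <em>emphasized</em> text</p></body>"
    "</html>"
)
inner, text = body_inner(sample)
print(inner)
print(text)
```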

Extracting BODY text of HTML document in node js using REGEX

No need to overthink it. Where a DOM is available (in a browser, for example; plain Node.js has no global document), you can just read document.body.innerText

A Sample Document
Some strong and emphasized text

Regex to match content of HTML body in PHP

You simply have to add the s modifier so that the dot matches all characters, including newlines:

preg_match("/<body.*\/body>/s", $content, $matches);

as explained in the docs: http://nl2.php.net/manual/en/reference.pcre.pattern.modifiers.php
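
The effect of that modifier can be sketched in Python, where the rough counterpart of PCRE's s modifier is the re.DOTALL flag. The sample content string is hypothetical.

```python
import re

content = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>"
pattern = r"<body.*/body>"

# Without the flag, "." stops at newlines, so a multi-line body never matches.
without_flag = re.search(pattern, content)

# With DOTALL, "." also matches newlines and the whole element is found.
with_flag = re.search(pattern, content, re.DOTALL)

print(without_flag)        # None
print(with_flag.group(0))
```

Because .* is greedy, the match runs to the last /body> in the input, which is usually what is wanted for a single body element.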

Python Regex to extract content of src of an html tag?

I'm not good at regex, so my answer may not be the best.

Try this.

x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

Then x looks like this:

['/pic/earth.jpg', '/pic/redrose.jpg']

Regex explanation:

(?=src) : positive lookahead --> only matches where the word src follows (redundant here, since the pattern goes on to match src literally)

src=\" : must include the literal text src="

(?P<src>something) : groups something under the name src

[^\"]+ : one or more characters other than "
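
A runnable version of the above, as a sketch. The html sample string and the helper name find_srcs are hypothetical, and the redundant lookahead is left out.

```python
import re

# Capture whatever sits between src=" and the next double quote.
SRC_RE = re.compile(r'src="(?P<src>[^"]+)')


def find_srcs(html: str) -> list:
    # With a single group in the pattern, findall returns that group's text.
    return SRC_RE.findall(html)


html = (
    '<img src="/pic/earth.jpg" alt="earth">'
    '<img src="/pic/redrose.jpg" alt="rose">'
)
print(find_srcs(html))  # ['/pic/earth.jpg', '/pic/redrose.jpg']
```

The pattern is not anchored to a tag or attribute boundary, so it would also pick up attributes such as data-src="...".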

Regex to extract pure text within specific HTML tag

How real software engineers solve this problem: use the right tool for the right job, i.e. don't use regexes to parse HTML.

The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.


If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:

1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression)

2) using a simple Regex.Replace to strip out all tag content

open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"

// Step 1: capture the inner HTML of each <p>; step 2: strip every tag from it.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))

(This is some strong text)
(This is some reallystrong text)
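
For readers more at home in Python, the same two-step idea might look like the sketch below. The helper name paragraph_texts is hypothetical, and the same caveat applies: it only handles non-nested <p> elements that sit on a single line.

```python
import re

html = """<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"""


def paragraph_texts(html: str) -> list:
    # Step 1: capture the inner HTML of each <p>...</p> (non-greedy).
    inners = re.findall(r"<p>(.*?)</p>", html)
    # Step 2: strip anything that looks like a tag from each capture.
    return [re.sub(r"<.*?>", "", inner) for inner in inners]


for text in paragraph_texts(html):
    print(f"({text})")
```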

If you are constrained to a single "Regex.Matches" call, and you're okay with ignoring the possibility of nested <p> tags, you should be able to do it with a non-greedy match of a text part and a tag part, wrapped up inside a <p>...</p> pattern. (As luck would have it, conformant HTML doesn't allow nesting <p> elements, but this solution wouldn't work for a containing element like <div>.)

Note 1: this is F#, but it should be trivial to convert to C#.
Note 2: this relies on .NET-flavored regex-isms like stackable group names and multiple captures per group.

let rx = @"
<p>
(?<p_text>
    (?:
        (?<text>[^<>]+)
        (?:<.*?>)+
    )*?
    (?<text>[^<>]+)?
)</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture

p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text


Remember that neither of the above examples works well on malformed HTML or in cases where the same tag is nested in itself.


