Regular Expression to Extract HTML Body Content
Would this work?
((?:.(?!<body[^>]*>))+.<body[^>]*>)|(</body\>.+)
Of course, you need to add the necessary \s in order to take into account < body ...> (element with spaces), as in:
((?:.(?!<\s*body[^>]*>))+.<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
On second thought, I am not sure why I needed a negative look-ahead... This should also work (for a well-formed xhtml document):
(.*<\s*body[^>]*>)|(<\s*/\s*body\s*\>.+)
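As a rough illustration of how the last pattern could be applied, here is a minimal Python sketch. It assumes a well-formed document with a single body element; the names BODY_WRAPPER and extract_body are made up for this example, and the flags (dot matches newlines, case-insensitive) plus the change from .+ to .* at the end are my additions, not part of the answer above.

```python
import re

# Everything up to and including the opening <body ...> tag, or the
# closing </body> tag and everything after it.
# Assumptions: re.DOTALL so "." also matches newlines, re.IGNORECASE so
# <BODY> is accepted, and ".*" so a document ending right at </body> works.
BODY_WRAPPER = re.compile(
    r"(.*<\s*body[^>]*>)|(<\s*/\s*body\s*>.*)",
    re.DOTALL | re.IGNORECASE,
)

def extract_body(html: str) -> str:
    """Return what is left after deleting both wrapper parts."""
    return BODY_WRAPPER.sub("", html)

sample = (
    "<html><head><title>T</title></head>\n"
    "< body class='x'>\n<p>Hi</p>\n</ body >\n</html>"
)
print(extract_body(sample))
```

The pattern matches the parts to throw away rather than the part to keep, so the body content is whatever remains after the substitution.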
regular expression to extract text from HTML
You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text> will work in a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
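To make the parser route concrete, here is a small sketch that uses Python's standard-library html.parser instead of Beautiful Soup (which is a third-party package), so it runs without installing anything. The class name TextExtractor and the helper extract_text are made up for this example, and skipping only script and style is an assumption about what counts as "text".

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True turns entities such as &lt; into characters
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-blank text that is not inside a skipped element
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)

page = (
    "<html><body><p>Hello <b>world</b></p>"
    "<script>var x = 1;</script>"
    "<p>&lt;text&gt; stays</p></body></html>"
)
print(extract_text(page))
```

The entity-encoded text comes out as literal characters and the script body is dropped, which is the kind of case a naive regex tends to get wrong.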
Regex Extract html Body
Don't use a regular expression for this - use something like the Html Agility Pack.
This is an agile HTML parser that builds a read/write DOM and supports plain XPATH or XSLT (you actually don't HAVE to understand XPATH nor XSLT to use it, don't worry...). It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML. The object model is very similar to what proposes System.Xml, but for HTML documents (or streams).
Then you can extract the body with an XPATH.
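The Html Agility Pack itself is a .NET library, so the following is only an analogous sketch of the "parse, then select the body with a path expression" idea in Python. It assumes the input is well-formed XHTML, because the standard-library xml.etree.ElementTree is a strict XML parser with limited XPath support and will not tolerate malformed HTML the way the Html Agility Pack does. The sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: well-formed XHTML without namespaces
xhtml = (
    "<html><head><title>A Sample Document</title></head>"
    "<body><p>Some <strong>strong</strong> text</p></body></html>"
)

root = ET.fromstring(xhtml)

# "./body" is a simple path expression: the body child of the root element
body = root.find("./body")

# Text only, with all markup dropped
body_text = "".join(body.itertext())

# Inner markup of the body, serialised child by child
body_markup = "".join(ET.tostring(child, encoding="unicode") for child in body)

print(body_text)
print(body_markup)
```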
Extracting BODY text of HTML document in node js using REGEX
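No need to overthink it; in a browser you can just read document.body.innerText. (Plain Node.js has no document object, so there the HTML first has to be loaded into a DOM implementation such as jsdom.)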
Sample page text: "A Sample Document" and "Some strong and emphasized text".
Regex to match content of HTML body in PHP
You simply have to add the s modifier to have the dot match all characters, including new lines:
preg_match("/<body.*\/body>/s", $content, $matches);
as explained in the doc: http://nl2.php.net/manual/en/reference.pcre.pattern.modifiers.php
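The same "dot matches newline" switch exists in other regex engines. As a sketch in Python, where the equivalent of the s modifier is the re.DOTALL flag, the difference looks like this (the sample content is invented for the example):

```python
import re

content = '<html>\n<body class="main">\n<p>Hi</p>\n</body>\n</html>'

# Without the flag, "." stops at the first newline, so a body that
# spans several lines is never matched.
without_flag = re.search(r"<body.*/body>", content)

# With re.DOTALL, "." also matches newlines and the whole element is found.
with_flag = re.search(r"<body.*/body>", content, re.DOTALL)

print(without_flag)
print(with_flag.group(0))
```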
Python Regex to extract content of src of an html tag?
I'm not good at regex, so my answer may not be the best.
Try this.
x = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)
Then x looks like this:
['/pic/earth.jpg', '/pic/redrose.jpg']
RegEx explanation :
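(?=src) : positive lookahead --> only matches at positions that are followed by the word src (redundant here, because the literal src comes next anyway)
src=\" : must be followed by this specific text src="
(?P<src>something) : a named group; whatever something matches is captured under the name src
[^\"]+ : one or more characters other than the " character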
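For completeness, a self-contained version of the snippet with its import and some sample input (the html string is invented for the example). It also shows that the pattern gives the same result without the lookahead:

```python
import re

html = '<img src="/pic/earth.jpg" alt="earth"><img src="/pic/redrose.jpg">'

# Pattern from the answer, including the lookahead
with_lookahead = re.findall(r'(?=src)src=\"(?P<src>[^\"]+)', html)

# Same pattern with the lookahead removed
SRC_PATTERN = re.compile(r'src="(?P<src>[^"]+)')
without_lookahead = SRC_PATTERN.findall(html)

print(with_lookahead)
print(without_lookahead)
```

Because the pattern has exactly one group, findall returns the group contents rather than the whole match.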
Regex to extract pure text within specific HTML tag
How real software engineers solve this problem: Use the right tool for the right job, i.e. don't use regexes to parse HTML.
The most straightforward way is to use an HTML parsing library, since parsing even purely conforming XML with regex is extremely non-trivial, and handling all HTML edge cases is an inhumanly difficult task.
If your requirements are "you must use a regex library to pull innerHTML from a <p> element", I'd much prefer to split it into two tasks:
1) using regex to pull out the container element with its innerHTML. (I'm showing an example that only works for getting the outermost element of a known tag. To extract an arbitrary nested item you'd have to use some trick like https://blogs.msdn.microsoft.com/bclteam/2005/03/15/net-regular-expressions-regex-and-balanced-matching-ryan-byington/ to match the balanced expression.)
2) using a simple Regex.Replace to strip out all tag content
open System.Text.RegularExpressions

let html = @"<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"
// Step 1: capture the inner HTML of each <p>; step 2: strip every tag from it.
for m in Regex.Matches(html, @"<p>(.*?)</p>") do
    printfn "(%O)" (Regex.Replace(m.Groups.[1].Value, "<.*?>", ""))
Output:
(This is some strong text)
(This is some reallystrong text)
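For readers who don't use .NET, the same two-step idea can be sketched in Python with the standard re module. The names P_CONTENT and TAG are made up for this example, and re.DOTALL is my addition so that a paragraph spanning several lines would still match.

```python
import re

html = """<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>"""

# Step 1: capture the inner HTML of each (non-nested) <p> element
P_CONTENT = re.compile(r"<p>(.*?)</p>", re.DOTALL)

# Step 2: anything that looks like a tag gets removed
TAG = re.compile(r"<.*?>")

texts = [TAG.sub("", inner) for inner in P_CONTENT.findall(html)]

for text in texts:
    print(f"({text})")
```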
If you are constrained to a single "Regex.Matches" call, and you're okay with ignoring the possibility of nested <p> tags (as luck would have it, in conformant HTML you can't nest <p>s, but this solution wouldn't work for a containing element like <div>), you should be able to do it with a nongreedy matching of a text part and a tag part wrapped up inside a <p>...</p> pattern. (Note 1: this is F#, but it should be trivial to convert to C#.) (Note 2: this relies on .NET-flavored regex-isms like stackable group names and multiple captures per group.)
let rx = @"
    <p>
    (?<p_text>
        (?:
            (?<text>[^<>]+)
            (?:<.*?>)+
        )*?
        (?<text>[^<>]+)?
    )</p>
"
let regex = new Regex(rx, RegexOptions.IgnorePatternWhitespace)
for m in regex.Matches(@"
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
") do
    printfn "p content: %O" m
    for capture in m.Groups.["text"].Captures do
        printfn "text: %O" capture
p content: <p>This is some <strong>strong</strong> text</p>
text: This is some
text: strong
text: text
p content: <p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
text: This is some
text: really
text: strong
text: text
Remember that both the above examples don't work that well on malformed HTML or cases where the same tag is nested in itself.
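The single-regex version depends on .NET keeping every capture of a repeated group. Python's re module keeps only the last capture of a repeated group, so a rough Python equivalent needs a second pass. The sketch below is a swapped-in technique, not a translation of the pattern above: it splits each paragraph's inner HTML on tags and keeps the non-empty text runs. The names P_CONTENT, TAG and text_runs are made up for the example, and it shares the same limits (no nested <p>, well-formed input).

```python
import re

# Inner HTML of each <p> element (assumes no nested <p>)
P_CONTENT = re.compile(r"<p>(.*?)</p>", re.DOTALL)

# A single tag: "<", anything that is not an angle bracket, ">"
TAG = re.compile(r"<[^<>]*>")

def text_runs(html: str) -> list[list[str]]:
    """For every <p>, return the text pieces found between its tags."""
    result = []
    for inner in P_CONTENT.findall(html):
        # Splitting on tags leaves empty strings between adjacent tags
        result.append([piece for piece in TAG.split(inner) if piece])
    return result

sample = """
<p>This is some <strong>strong</strong> text</p>
<p>This is some <b><em>really<strong>strong</strong><em></b> text</p>
"""

for runs in text_runs(sample):
    for run in runs:
        print(f"text: {run}")
```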