Regular Expression to Extract Text from HTML

regular expression to extract text from HTML

You can't really parse HTML with regular expressions; the language is too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some common HTML constructs, such as &lt;text&gt;, render as proper text in a browser but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.
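Beautiful Soup is a third-party package, so as a dependency-free illustration of the same idea, here is a minimal sketch using the standard library's html.parser. The class name TextExtractor, the helper extract_text, and the sample markup are made up for this example; a real page may need more care (comments, whitespace, block-level spacing).

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # stripped text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(markup):
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><script>var x = '<b>';</script></head>"
    "<body><p>Hello, <b>world</b> &amp; friends</p></body></html>"
)
print(extract_text(sample))
```

The parser tracks whether it is inside a script or style element, which is exactly the kind of state a single regular expression cannot keep.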


Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.
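As a rough illustration of how tolerated-but-improper markup trips a naive pattern, consider an unescaped "<" inside a paragraph. A browser shows the text as written, but a simple tag-stripping regex treats everything from that "<" to the next ">" as a tag. The pattern and sample string below are assumptions chosen for the demonstration.

```python
import re

# A common naive "strip all tags" pattern.
NAIVE_TAG = re.compile(r"<[^>]+>")

# Improper HTML (the "<" and ">" should be escaped), yet browsers display it as text.
sample = "<p>1 < 2 and 3 > 2</p>"

stripped = NAIVE_TAG.sub("", sample)
print(repr(stripped))
```

The comparison in the middle of the sentence is silently deleted, because "< 2 and 3 >" looks like a tag to the pattern.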

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.

RegEx to extract text between a HTML tag

Your comment shows that you have neglected to escape the backslashes in your regex string.

And if you want to match lowercase letters, add a-z to the character classes, use Pattern.CASE_INSENSITIVE, or add (?i) to the beginning of the regex:

"<([A-Za-z][A-Za-z0-9]*)\\b[^>]*>(.*?)</\\1>"

If the tag contents may contain newlines, then use Pattern.DOTALL or add (?s) to the beginning of the regex to turn on dotall/singleline mode.
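For comparison, here is a sketch of the same pattern in Python rather than Java, where a raw string avoids the double-backslash problem and re.DOTALL and re.IGNORECASE play the roles of Pattern.DOTALL and Pattern.CASE_INSENSITIVE. The sample input is invented, and the usual caveat applies: this does not cope with nested tags of the same name.

```python
import re

# Group 1 is the tag name; \1 requires the closing tag to use the same name.
TAG_PATTERN = re.compile(
    r"<([A-Za-z][A-Za-z0-9]*)\b[^>]*>(.*?)</\1>",
    re.DOTALL | re.IGNORECASE,
)

sample = '<div class="note">first\nsecond</div><P>third</p>'

# Collect (tag name, inner text) pairs.
matches = [(m.group(1), m.group(2)) for m in TAG_PATTERN.finditer(sample)]
print(matches)
```

Without re.DOTALL the first element would not match at all, since its content contains a newline.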

Regular expression to extract text from a string in html format

First, you can do it without regex, by creating a dummy element to inject the HTML:

var s = "your_html_string";
var dummy = document.createElement("div");
dummy.innerHTML = s;
var title = dummy.getElementsByTagName("title")[0].innerText;

But if you really insist on using regex:

var s = "your_html_string";
var title = s.match(/<title>([^<]+)<\/title>/)[1];

Here's a DEMO illustrating both approaches.
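One thing to watch with the regex version: indexing the match result directly fails when there is no title element. A small Python sketch of the same pattern with that case guarded follows; the function name extract_title is made up for this example, and the pattern is loosened slightly to allow attributes and any letter case.

```python
import re


def extract_title(markup):
    """Return the text of the first <title> element, or None if there is none."""
    match = re.search(r"<title[^>]*>([^<]+)</title>", markup, re.IGNORECASE)
    return match.group(1).strip() if match else None


print(extract_title("<html><head><title> My Page </title></head></html>"))
print(extract_title("<p>no title here</p>"))
```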

How to extract text from between html tag using Regular Expressions?

You can try,

>>> print([x.strip() for x in re.findall(r'<textarea.*?>(.*?)</textarea>', content, re.DOTALL)])
['abc_text\n #include<abc>\n xyz']
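Whether the capture group is greedy or lazy matters as soon as the input holds more than one textarea. The sketch below uses an invented content string to show the difference; with re.DOTALL a greedy (.*) runs on to the last closing tag.

```python
import re

content = (
    "<textarea id='a'>one</textarea>"
    "<p>x</p>"
    "<textarea id='b'>two\nlines</textarea>"
)

# Greedy capture: swallows everything up to the LAST </textarea>.
greedy = re.findall(r"<textarea.*?>(.*)</textarea>", content, re.DOTALL)

# Lazy capture: stops at the FIRST </textarea> after each opening tag.
lazy = re.findall(r"<textarea.*?>(.*?)</textarea>", content, re.DOTALL)

print(greedy)
print(lazy)
```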

extract text from html tags using regex

You might be better off using a parser here:

import html, xml.etree.ElementTree as ET

# decode
string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

# construct the dom
root = ET.fromstring(html.unescape(string))

# search it
for p in root.findall("*"):
    print(p.text)

This yields

Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.

You will probably want to adjust the XPath expression for your own markup; note that ElementTree supports only a limited subset of XPath.
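As a sketch of what a different path expression might look like, the example below selects every p element at any depth and joins all of the text inside it, including text in nested tags. The markup is invented for the illustration and must be well-formed XML for ElementTree to accept it.

```python
import xml.etree.ElementTree as ET

markup = "<div><p>First <b>bold</b> tail</p><p>Second</p></div>"
root = ET.fromstring(markup)

# ".//p" finds <p> elements at any depth below the root.
# itertext() walks the element's own text, its children's text, and their tails.
paragraphs = ["".join(p.itertext()) for p in root.findall(".//p")]
print(paragraphs)
```

Using itertext() rather than .text avoids losing everything after the first child element.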


Addendum:

It is possible to use a regular expression here, but this approach is really error-prone and not advisable:

import re

string = """<p style="text-align: justify;"><span style="font-size: small; font-family: lato, arial, helvetica, sans-serif;">
Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.
</span></p>"""

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

print(rx.findall(string))
# ['Irrespective of the kind of small business you own, using traditional sales and marketing tactics can prove to be expensive.']

The idea is to look for an uppercase letter and then match word characters, whitespace and commas up to a dot. See a demo on regex101.com.
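To see why this is fragile, here is a short sketch with an invented input: any character outside the class, such as an apostrophe, makes the pattern drop the whole sentence.

```python
import re

rx = re.compile(r'(\b[A-Z][\w\s,]+\.)')

# The apostrophe in "It's" is not a word character, whitespace or comma,
# so the first sentence can never reach its closing dot.
tricky = "<p>It's cheap. Really.</p>"

found = rx.findall(tricky)
print(found)
```

Only the second sentence survives, and nothing signals that text was lost.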


