A Regular Expression to Remove a Given (X)HTML Tag from a String

Attempting to parse HTML with regular expressions is generally an extremely bad idea. Use a parser instead; there should be one available for your chosen language.

You might be able to get away with something like this:

</?tag[^>]*?>

But it depends on exactly what you're doing. For example, that won't remove the tag's content, and it may leave your HTML in an invalid state, depending on which tag you're trying to remove. It also copes badly with invalid HTML (and there's a lot of that about).
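
To make those caveats concrete, here is a minimal Python sketch of the same pattern. The helper name remove_tag and the sample markup are made up for this illustration, and a \b word boundary is added (it is not in the pattern above) so that removing "b" does not also hit "<body>".

```python
import re

def remove_tag(html: str, tag: str) -> str:
    """Remove the opening and closing markers of `tag`, keeping what they enclose."""
    # re.escape guards against regex metacharacters in the tag name;
    # \b (an addition to the original pattern) stops "b" from matching "<body>".
    pattern = re.compile(r"</?" + re.escape(tag) + r"\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)

# Hypothetical sample input
sample = '<p>Keep <b class="x">this</b> text</p><script>alert(1)</script>'

without_b = remove_tag(sample, "b")
without_script = remove_tag(sample, "script")

print(without_b)       # <p>Keep this text</p><script>alert(1)</script>
print(without_script)  # <p>Keep <b class="x">this</b> text</p>alert(1)
```

Note how the script's content ("alert(1)") survives as plain text in the second result, which is exactly the "won't remove the tag's content" problem.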

Use a parser instead :)
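
As a rough sketch of what "use a parser" can look like, the following uses Python's standard-library html.parser module to drop one element together with its content, including nested occurrences. The class name TagRemover, the helper remove_element and the sample inputs are assumptions made for this example; comments, doctype declarations and processing instructions are silently dropped, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not open a "removed" region
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}

class TagRemover(HTMLParser):
    """Re-emit the markup, dropping one element and everything inside it."""

    def __init__(self, tag: str):
        super().__init__(convert_charrefs=False)  # keep &amp; etc. as written
        self.tag = tag.lower()
        self.depth = 0   # > 0 while inside the element being removed
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if tag not in VOID_ELEMENTS:
                self.depth += 1
        elif self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing form such as <br/>: emit it once, or drop it if targeted
        if tag != self.tag and self.depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == self.tag:
            self.depth = max(0, self.depth - 1)
        elif self.depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.depth == 0:
            self.out.append(data)

    def handle_entityref(self, name):
        if self.depth == 0:
            self.out.append(f"&{name};")

    def handle_charref(self, name):
        if self.depth == 0:
            self.out.append(f"&#{name};")

def remove_element(html: str, tag: str) -> str:
    parser = TagRemover(tag)
    parser.feed(html)
    parser.close()
    return "".join(parser.out)

print(remove_element("<p>a</p><div>x<div>y</div>z</div><p>b</p>", "div"))
```

Unlike the regex, this tracks nesting depth, so the inner div does not end the removal early.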

How to remove HTML tags from an HTML string using RegEx?

You can use

.replace(/<br>(?=(?:\s*<[^>]*>)*$)|(<br>)|<[^>]*>/gi, (x,y) => y ? ' & ' : '')

See the JavaScript demo:

const text = '<div class="ExternalClassBE95E28C1751447DB985774141C7FE9C"><p>Tina Schmelz<br></p><p>Sascha Balke<br></p></div>';
const regex = /<br>(?=(?:\s*<[^>]*>)*$)|(<br>)|<[^>]*>/gi;
console.log(
  text.replace(regex, (x,y) => y ? ' & ' : '')
);
// => Tina Schmelz & Sascha Balke
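
For readers who want to see how the three alternatives of that pattern divide the work, here is a hedged Python port with each branch commented. The names pattern, join_lines and result are chosen for this sketch; the sample string is the one from the demo above.

```python
import re

text = ('<div class="ExternalClassBE95E28C1751447DB985774141C7FE9C">'
        '<p>Tina Schmelz<br></p><p>Sascha Balke<br></p></div>')

pattern = re.compile(
    r"<br>(?=(?:\s*<[^>]*>)*$)"  # a <br> followed only by tags up to the end: drop it
    r"|(<br>)"                    # any other <br>: captured in group 1
    r"|<[^>]*>",                  # every remaining tag: drop it
    re.IGNORECASE,
)

def join_lines(match: re.Match) -> str:
    # Group 1 is set only for a <br> that still has text after it
    return " & " if match.group(1) else ""

result = pattern.sub(join_lines, text)
print(result)  # Tina Schmelz & Sascha Balke
```

The order of the alternatives matters: the trailing-<br> branch has to be tried before the capturing branch, otherwise the last <br> would also become " & ".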

Regular expression to remove HTML tags except the <br/> tag from a string

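</?([a-z]+)> should do. The optional slash is only allowed before the tag name, so a tag with the slash after the letters, such as <br/>, will not match and is left in place.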
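
A quick Python check of that behaviour, with sample strings invented for the illustration. It also shows two limits of the pattern: tags with attributes and uppercase tags are not matched, while a plain <br> (no slash) is removed like any other tag.

```python
import re

simple_tag = re.compile(r"</?([a-z]+)>")

# <p> and </p> are removed, <br/> is left alone
stripped = simple_tag.sub("", "<p>line one<br/>line two</p>")
print(stripped)  # line one<br/>line two

# Tags with attributes and uppercase tags slip through untouched
with_attrs = simple_tag.sub("", '<p class="x">text</p><BR>')
print(with_attrs)  # <p class="x">text<BR>

# A <br> written without the slash is removed
plain_br = simple_tag.sub("", "a<br>b")
print(plain_br)  # ab
```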

Regular expression to remove HTML tags

Using a regular expression to parse HTML is fraught with pitfalls. HTML is not a regular language and hence can't be 100% correctly parsed with a regex. This is just one of many problems you will run into. The best approach is to use an HTML / XML parser to do this for you.

Here is a link to a blog post I wrote a while back which goes into more detail about this problem.

  • http://blogs.msdn.com/b/jaredpar/archive/2008/10/15/regular-expression-limitations.aspx

That being said, here's a solution that should fix this particular problem. It is by no means a perfect solution, though.

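// Requires: using System.Text.RegularExpressions;
// Captures the text between the first <img ...> or <a ...> tag and the next "<".
var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<";
var regex = new Regex(pattern);
var m = regex.Match(sSummary);
if ( m.Success ) {
    sResult = m.Groups["content"].Value;
}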
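
The same idea can be sketched in Python, where named groups are written (?P<name>...). The input string below is an assumption, since the original question's data is not shown here, and the variable names simply mirror the C# ones.

```python
import re

# Same pattern as above; only the named-group syntax differs in Python
pattern = re.compile(r"<(img|a)[^>]*>(?P<content>[^<]*)<")

# Hypothetical input
s_summary = '<a href="https://example.com">Read more</a> and other text'

m = pattern.search(s_summary)
s_result = m.group("content") if m else None
print(s_result)  # Read more
```

Because the pattern only checks that the tag name starts with "img" or "a", it would also match tags such as <abbr> or <article>, which is one of the ways it falls short of a parser.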

How can I remove HTML tags from a string with a regex?

Try this:

// Erase HTML tags from a string (requires: using System.Text.RegularExpressions;)
public static string StripHtml(string target)
{
    // Matches "<", one non-whitespace character, then anything up to the next ">"
    Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    return StripHTMLExpression.Replace(target, string.Empty);
}

Call it like this:

string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
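
To show why the pattern requires a non-whitespace character right after "<", here is a small Python equivalent. The names TAG_RE and strip_html and the second and third sample strings are assumptions made for this sketch.

```python
import re

# "<" followed by a non-whitespace character, then anything except "<" or ">" up to ">"
TAG_RE = re.compile(r"<\S[^><]*>")

def strip_html(target: str) -> str:
    return TAG_RE.sub("", target)

stripped = strip_html("<div><span>hello world!</span></div>")
print(stripped)  # hello world!

# "< b" is not treated as a tag because of the space after "<"
comparison = strip_html("if a < b and b > c")
print(comparison)  # if a < b and b > c

# Without the space, the text in between is mistaken for a tag
mistaken = strip_html("if a <b and b> c")
print(mistaken)  # if a  c
```

The last case shows that the pattern still cannot tell markup from text that merely looks like markup.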

Removing certain HTML tags using Regex

function stripHTML(html) {
  return html.replace(/<(\/?|\!?)(DOCTYPE html|html|head|body)>/g, "");
}

You need the global modifier (g) to replace every occurrence rather than only the first.
http://regex101.com/r/aA1vL0
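
For comparison, a Python sketch of the same replacement. Python's re.sub replaces every match by default, so there is no counterpart to the g flag; passing count=1 reproduces the "first match only" behaviour. The names STRUCTURAL and strip_structure and the sample page are invented for this example.

```python
import re

STRUCTURAL = re.compile(r"<(/?|!?)(DOCTYPE html|html|head|body)>")

def strip_structure(html: str) -> str:
    # count defaults to 0, meaning "replace all matches"
    return STRUCTURAL.sub("", html)

page = "<!DOCTYPE html><html><head></head><body><p>Hi</p></body></html>"

all_removed = strip_structure(page)
print(all_removed)  # <p>Hi</p>

# What a replace without the global modifier would do
first_only = STRUCTURAL.sub("", page, count=1)
print(first_only)  # <html><head></head><body><p>Hi</p></body></html>
```

The pattern is case-sensitive and expects tags without attributes, so <HTML> or <body class="x"> would be left in place.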

How to use regex to remove string within certain HTML tag and string must contain empty space

Using regex with HTML is fraught with issues, so you should be aware of the possible consequences. Your <code>.+?</code> regex will only work if the <code> and </code> tags are on one line and there are no nested <code> tags inside them.

Assuming there are no nested code tags, you might extend your current approach:

import re
inputString = "I want to remove <code>tag with space</code> not sole <code>word</code>"
outputString = re.sub("<code>(.+?)</code>", lambda m: " " if " " in m.group(1) else m.group(), inputString, flags=re.S)
print(outputString)

The re.S flag lets . match line breaks, and the lambda checks each match: any code element whose content contains a space is replaced with a single space; otherwise the match is kept unchanged.

A more common way to handle HTML in Python is to use BeautifulSoup. First parse the HTML, then find all the code tags, and replace each code tag whose content contains a space:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('I want to remove <code>tag with space</code> not sole <code>word</code>', "html.parser")
>>> for p in soup.find_all('code'):
...     if p.string and " " in p.string:
...         p.replace_with(" ")
...
>>> print(soup)
I want to remove not sole <code>word</code>

