Remove HTML Tags from String Including &nbsp; in C#

Remove HTML tags from string including &nbsp; in C#

If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it:

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, you should make another pass with a regex that collapses runs of whitespace into a single space:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
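The two passes above can be combined into one helper. This is only a minimal sketch: the class name HtmlText and the method name ToPlainText are made up for illustration, and it inherits every limitation of regex-based stripping.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the names are illustrative and not part of any library.
public static class HtmlText
{
    // Removes tag-shaped text and literal "&nbsp;" entities, then collapses whitespace runs.
    public static string ToPlainText(string inputHtml)
    {
        if (string.IsNullOrEmpty(inputHtml)) return string.Empty;

        // Pass 1: drop anything shaped like <...> and every "&nbsp;" entity, then trim the ends.
        string noHtml = Regex.Replace(inputHtml, @"<[^>]+>|&nbsp;", "").Trim();

        // Pass 2: collapse two or more whitespace characters into a single space.
        return Regex.Replace(noHtml, @"\s{2,}", " ");
    }
}
```

For example, ToPlainText("<p>Hello&nbsp;   <b>world</b></p>") gives "Hello world". Note that &nbsp; is deleted rather than turned into a space, so "a&nbsp;b" becomes "ab"; replace it with " " in a separate step if that is not what you want.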

How do I remove all HTML tags from a string without knowing which tags are in it?

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

Be aware that this solution has its own flaws. See "Remove HTML tags in String" for more information (especially the comments by 'Mark E. Haase'/@mehaase).
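To make the flaw concrete, here is a small sketch of two inputs that trip up the lazy pattern. The class name StripHtmlPitfalls and the Demo method are hypothetical and the inputs are made up; the linked discussion may cover other cases.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class; only StripHTML mirrors the answer above.
public static class StripHtmlPitfalls
{
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    public static void Demo()
    {
        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the result.
        Console.WriteLine(StripHTML("<a title=\"a > b\">link</a>"));

        // Plain text that merely contains '<' and '>' is treated as a tag
        // and the text between them is lost.
        Console.WriteLine(StripHTML("if x < 3 and y > 1 then"));
    }
}
```

The first call leaves the fragment  b">link (with a leading space) instead of link, and the second returns "if x  1 then", silently dropping "< 3 and y >".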

Another solution would be to use the HTML Agility Pack.

You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?

How to remove html tags from string in view page C#

If you want to show your content without any formatting, you can use Regex.Replace(input, "<.*?>", String.Empty) to strip all HTML tags from your string.

1) Add the code below to the top of the view (.cshtml).

@using System.Text.RegularExpressions;

@helper StripHTML(string input)
{
    if (!string.IsNullOrEmpty(input))
    {
        input = Regex.Replace(input, "<.*?>", String.Empty);
        <span>@input</span>
    }
}

2) Use the helper like this:

<td>@StripHTML(item.Message)</td>

Correctly removing html entities from a string

Well, &nbsp; is not a 'regular' space. When you use System.Net.WebUtility.HtmlDecode, it returns the character that the named HTML entity stands for, which is the no-break space (U+00A0). It looks like regular whitespace, but it has a different meaning. The decimal code of nbsp is 160, which is A0 in hex, so your unit test and decoding are working correctly.

If you want to replace nbsp with regular whitespace, you have several options. The easiest is a simple replace before decoding:

// the second argument is the ordinary space character (decimal 32)
text = text.Replace("&nbsp;", " ");
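A variant worth considering is to decode first and then map the decoded character U+00A0 to a plain space, which also covers numeric forms such as &#160;. The sketch below is an illustration rather than the approach from the answer; the class and method names are hypothetical.

```csharp
using System;
using System.Net;

// Hypothetical helper; only WebUtility.HtmlDecode is a real framework API here.
public static class NbspDecoding
{
    // Decodes HTML entities, then maps the no-break space (U+00A0, decimal 160)
    // to the ordinary space (U+0020, decimal 32).
    public static string DecodeWithPlainSpaces(string text)
    {
        if (string.IsNullOrEmpty(text)) return text;

        string decoded = WebUtility.HtmlDecode(text);
        return decoded.Replace('\u00A0', ' ');
    }
}
```

Decoding "a&nbsp;b" on its own yields a string whose middle character has code 160, which is why it does not compare equal to "a b" until the replace has run.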

About the initial run:
The hex value 2C is 44 in decimal, which is the symbol ',' (comma). Is it possible that you simply looked at the wrong character?

About SQL collation: the Latin general collation is capable of storing nbsp characters, so I don't think this is a problem.

How can i remove HTML Tags from String by REGEX?

Try this:

// erase html tags from a string
public static string StripHtml(string target)
{
    // Regular expression for html tags
    Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);

    return StripHTMLExpression.Replace(target, string.Empty);
}

Call it like this:

string htmlString = "<div><span>hello world!</span></div>";
string strippedString = StripHtml(htmlString);
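One practical difference of this pattern is the \S right after the opening bracket: a '<' followed by whitespace is not treated as the start of a tag. The sketch below compares it with the looser <[^>]+> pattern; the class and method names are hypothetical, and holding the Regex in a static field is my own choice to avoid rebuilding a compiled regex on every call.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical comparison class; the two patterns are the ones quoted on this page.
public static class TagPatternComparison
{
    // '<' must be followed by a non-whitespace character.
    private static readonly Regex StrictTag = new Regex(@"<\S[^><]*>", RegexOptions.Compiled);

    // Any run of non-'>' characters between '<' and '>'.
    private static readonly Regex LooseTag = new Regex(@"<[^>]+>", RegexOptions.Compiled);

    public static string StripStrict(string s)
    {
        return StrictTag.Replace(s, string.Empty);
    }

    public static string StripLoose(string s)
    {
        return LooseTag.Replace(s, string.Empty);
    }
}
```

With the input "1 < 2 and 3 > 2", StripStrict returns the text unchanged, while StripLoose returns "1  2". Both return "hello world!" for "<div><span>hello world!</span></div>". The stricter pattern is still fooled by text such as "1 <2 and 3> 2", so it reduces the problem without removing it.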

splice html tags in html string

I'm not sure if your HTML is always within a <p> element or if the number of <br> elements differs from case to case. If it doesn't and you can depend on the outer element being the same, you can use this to get the first and last <br> elements and remove them.

Option #1 - When the parent element (p in this case) and the number of br elements (3 in this case) are known.

string html = "<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>";
string outHtml = string.Empty;

var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var rootNode = document.DocumentNode;
var firstBrNode = rootNode.SelectSingleNode("//p/br[1]");
var lastBrNode = rootNode.SelectSingleNode("//p/br[last()]");

firstBrNode?.Remove();
lastBrNode?.Remove();
outHtml = document.DocumentNode.OuterHtml;

output:

MERV 9 Cartridge Prefilters 

Option #2 - When neither the parent element nor the number of br elements is known. It is assumed that a single br element, if present, should be retained in the HTML.

string html = "<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>";
// string html = "<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>";
string outHtml = string.Empty;
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var rootNode = document.DocumentNode;
// count all br nodes so we can skip removal if there is only one in the HTML
var brNodeCount = rootNode.SelectNodes("//br")?.Count ?? 0;
// get the parent node of the first br element to use in the XPath when we remove
// the br elements; this allows for parent elements other than the `p` element
var parentNode = rootNode.SelectSingleNode("//br/parent::*");
// only remove br elements if there is more than one in the HTML; a single br element
// is assumed to sit in the middle and is left in place
if (brNodeCount > 1)
{ 
    var firstBrNode = rootNode.SelectSingleNode($"//{parentNode.Name}/br[1]");
    var lastBrNode = rootNode.SelectSingleNode($"//{parentNode.Name}/br[last()]");
    firstBrNode?.Remove();
    lastBrNode?.Remove();
}
outHtml = document.DocumentNode.OuterHtml;

output:

MERV 9 Cartridge Prefilters 

Option #3 - Takes into account the index of the first and last text nodes and removes all br elements that sit 'outside' them. Text nodes that contain an empty or an all white-space value are ignored.

// removes all br tags with an index before the first text node and
// all br tags with an index after the end of the last text node,
// any br tags between are not removed
private string RemoveStartAndEndBrTags(string html)
{
    if (string.IsNullOrEmpty(html)) return html;
    var document = new HtmlAgilityPack.HtmlDocument();
    document.LoadHtml(html);
    var rootNode = document.DocumentNode;
    // get first and last text nodes, excluding any only containing white-space
    var allNonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text()[not(normalize-space())])]");
    if (allNonEmptyTextNodes == null || allNonEmptyTextNodes.Count == 0) return html;
    var firstTextNode = allNonEmptyTextNodes[0];
    var lastTextNode = allNonEmptyTextNodes[allNonEmptyTextNodes.Count - 1];
    // get the parent node of the first br element, it will be used when we remove the br elements,
    // this will allow for different parent elements other than the `p` element
    var parentNode = rootNode.SelectSingleNode("//br/parent::*");
    if (parentNode == null) return html;
    var allBrNodes = rootNode.SelectNodes($"//{parentNode.Name}/br");
    foreach (var brNode in allBrNodes)
    {
        if (brNode == null) continue;
        // check index of br nodes against first and last text nodes
        // and remove br nodes that sit outside text nodes
        if (brNode.OuterStartIndex <= firstTextNode.OuterStartIndex
            || brNode.OuterStartIndex >= lastTextNode.OuterStartIndex + lastTextNode.OuterLength)
        { 
            brNode.Remove();
        }
    }
    return document.DocumentNode.OuterHtml;
}

Test HTML Input:

<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 <br>Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters<br> </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters<br></span></p>

Test HTML Output:

<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 <br>Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters</span></p>

How to replace &nbsp; with a space?

Can you try searching for

(?<=<[^>]*)&nbsp;

and replacing it with a single space?

This looks for &nbsp; inside tags (preceded by a < and possibly other characters except >).

This is extremely brittle, though. For example, it will fail if you have < or > characters in strings/attributes. It is better to avoid getting &nbsp; into the wrong locations in the first place.
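In C# the search-and-replace could look like the sketch below; the class and method names are hypothetical. It relies on .NET allowing variable-length lookbehind, which many other regex engines do not support.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; the pattern is the one from the answer above.
public static class NbspInsideTags
{
    // Replaces "&nbsp;" with a space only when a '<' precedes it
    // with no '>' in between, i.e. when it appears to sit inside a tag.
    public static string ReplaceInsideTags(string html)
    {
        return Regex.Replace(html, @"(?<=<[^>]*)&nbsp;", " ");
    }
}
```

For the input <a title="x&nbsp;y">a&nbsp;b</a>, only the entity inside the attribute is replaced and the one in the link text is kept. The brittleness shows up in both directions: in <a title="1 > 0&nbsp;"> the '>' inside the attribute hides the entity from the lookbehind, and in plain text such as "a < b&nbsp;c" the entity is replaced even though it is not inside a tag.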