Remove HTML tags from string including &nbsp; in C#
If you can't use an HTML-parser-based solution to filter out the tags, here's a simple regex for it.
string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();
Ideally, you should make a second pass with a regex that collapses runs of whitespace into a single space:
string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
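The two passes above can be combined into one helper. This is only a sketch: the `HtmlText` class and `Strip` method names are made up for illustration, and unlike the snippet above it substitutes a space (rather than an empty string) for each tag or `&nbsp;` so that neighbouring words do not fuse together.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlText
{
    // Hypothetical helper: removes tags and literal &nbsp; entities,
    // then collapses runs of whitespace.
    public static string Strip(string inputHtml)
    {
        if (string.IsNullOrEmpty(inputHtml)) return string.Empty;

        // Assumption: replace each tag or &nbsp; with a space instead of "",
        // so "Hello&nbsp;world" does not become "Helloworld".
        string noHtml = Regex.Replace(inputHtml, @"<[^>]+>|&nbsp;", " ");

        // Collapse any run of whitespace into one space and trim the ends.
        return Regex.Replace(noHtml, @"\s+", " ").Trim();
    }

    public static void Main()
    {
        // prints: Hello world again
        Console.WriteLine(Strip("<p>Hello&nbsp;&nbsp;<b>world</b></p>\n<p>again</p>"));
    }
}
```

Like any regex approach, this only handles reasonably well-formed markup; it does not decode other entities such as `&amp;`.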
How do I remove all HTML tags from a string without knowing which tags are in it?
You can use a simple regex like this:
public static string StripHTML(string input)
{
return Regex.Replace(input, "<.*?>", String.Empty);
}
Be aware that this solution has its own flaw. See Remove HTML tags in String for more information (especially the comments of 'Mark E. Haase'/@mehaase)
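To make that flaw concrete, here is a small sketch with two inputs chosen for illustration (they are not from the linked discussion). The `StripDemo` class name is hypothetical; `StripHTML` is the method shown above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class StripDemo
{
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    public static void Main()
    {
        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        // prints: [ y">link]
        Console.WriteLine("[" + StripHTML("<a title=\"x > y\">link</a>") + "]");

        // A '<' in plain text that is later followed by '>' is treated
        // as a tag, so real content is deleted.
        // prints: [if a  d]
        Console.WriteLine("[" + StripHTML("if a < b and c > d") + "]");
    }
}
```

Both cases are reasons to prefer a real HTML parser when the input is not under your control.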
Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?
How to remove html tags from string in view page C#
If you want to show your content without any formatting, you can use Regex.Replace(input, "<.*?>", String.Empty) to strip all HTML tags from your string.
1) Add the code below to the top of the view (.cshtml).
@using System.Text.RegularExpressions;
@helper StripHTML(string input)
{
if (!string.IsNullOrEmpty(input))
{
input = Regex.Replace(input, "<.*?>", String.Empty);
<span>@input</span>
}
}
2) Use the helper above like this:
<td>@StripHTML(item.Message)</td>
Correctly removing html entities from a string
Well, &nbsp; is not a 'regular' space. When you use System.Net.WebUtility.HtmlDecode, it returns the character that the named HTML entity stands for, which is the non-breaking space. It looks like regular whitespace, but it has a different meaning. The decimal code of the nbsp character is 160, which is A0 in hex, so your unit test and the decoding are working correctly.
If you want to replace &nbsp; with a regular space you have several options. The easiest is a simple replace before the decoding:
// where the second argument is the space character with decimal representation 32
text = text.Replace("&nbsp;", " ");
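A short sketch of the difference, using `System.Net.WebUtility.HtmlDecode` as mentioned above. The `NbspDemo` class and `DecodeWithPlainSpaces` method are hypothetical names; the extra replace of U+00A0 after decoding is my own addition, to catch numeric forms such as `&#160;`.

```csharp
using System;
using System.Net;

public static class NbspDemo
{
    // Hypothetical helper: decodes HTML entities but yields ordinary
    // spaces (U+0020) wherever a non-breaking space would appear.
    public static string DecodeWithPlainSpaces(string text)
    {
        // Replace the named entity before decoding, then replace any raw
        // U+00A0 left over afterwards (assumption: e.g. from &#160;).
        return WebUtility.HtmlDecode(text.Replace("&nbsp;", " "))
                         .Replace('\u00A0', ' ');
    }

    public static void Main()
    {
        string decoded = WebUtility.HtmlDecode("a&nbsp;b");
        Console.WriteLine((int)decoded[1]);   // prints: 160 (non-breaking space)

        string plain = DecodeWithPlainSpaces("a&nbsp;b &amp; c");
        Console.WriteLine((int)plain[1]);     // prints: 32 (regular space)
        Console.WriteLine(plain);             // prints: a b & c
    }
}
```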
About the initial run:
The hex value 2C is 44 in decimal, which is the ',' (comma) character. Is it possible that you simply looked at the wrong character?
About SQL collation: the Latin general collation is capable of storing nbsp characters, so I don't think this is the problem.
How can i remove HTML Tags from String by REGEX?
Try this:
// erase html tags from a string
public static string StripHtml(string target)
{
//Regular expression for html tags
Regex StripHTMLExpression = new Regex("<\\S[^><]*>", RegexOptions.IgnoreCase | RegexOptions.Singleline | RegexOptions.Multiline | RegexOptions.CultureInvariant | RegexOptions.Compiled);
return StripHTMLExpression.Replace(target, string.Empty);
}
Call it like this:
string htmlString="<div><span>hello world!</span></div>";
string strippedString=StripHtml(htmlString);
splice html tags in html string
I'm not sure if your HTML is always within a <p> element or if the number of <br /> elements differs from case to case. If it doesn't differ and you can depend on the outer element being the same, you can use this to get the first and last <br/> elements.
Option #1 - When the parent element (p in this case) is known and the number of br elements is known (3 in this case).
string html = "<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>";
string outHtml = string.Empty;
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var rootNode = document.DocumentNode;
var firstBrNode = rootNode.SelectSingleNode("//p/br[1]");
var lastBrNode = rootNode.SelectSingleNode("//p/br[last()]");
firstBrNode?.Remove();
lastBrNode?.Remove();
outHtml = document.DocumentNode.OuterHtml;
output:
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
Option #2 - When the parent element is not known and the number of br tags is not known, and it's assumed that if only one br element is present it will be retained in the HTML.
string html = "<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>";
// string html = "<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>";
string outHtml = string.Empty;
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var rootNode = document.DocumentNode;
// count all br nodes so we can bypass removal of br if there is only one in HTML
var brNodes = rootNode.SelectNodes("//br");
var brNodeCount = brNodes?.Count ?? 0;
// get the parent node of the br element to be used in the xpath when we remove
// the br elements this will allow for different parent elements other than the `p` element
var parentNode = rootNode.SelectSingleNode("//br/parent::*");
// only removes br elements if more than one in HTML, assumes if 1 br element is present it's in the middle and will not be removed
if (brNodeCount > 1)
{
var firstBrNode = rootNode.SelectSingleNode($"//{parentNode.Name}/br[1]");
var lastBrNode = rootNode.SelectSingleNode($"//{parentNode.Name}/br[last()]");
firstBrNode?.Remove();
lastBrNode?.Remove();
}
outHtml = document.DocumentNode.OuterHtml;
output:
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
Option #3 - Takes into account the index of the first and last text nodes and removes all br
elements that sit 'outside' them. Text nodes that contain an empty or an all white-space value are ignored.
// removes all br tags with an index before the first text node and
// all br tags with an index after the end of the last text node,
// any br tags between are not removed
private string RemoveStartAndEndBrTags(string html)
{
if (string.IsNullOrEmpty(html)) return html;
var document = new HtmlAgilityPack.HtmlDocument();
document.LoadHtml(html);
var rootNode = document.DocumentNode;
// get first and last text nodes, excluding any only containing white-space
var allNonEmptyTextNodes = rootNode.SelectNodes("//text()[not(self::text()[not(normalize-space())])]");
if (allNonEmptyTextNodes == null || allNonEmptyTextNodes.Count == 0) return html;
var firstTextNode = allNonEmptyTextNodes[0];
var lastTextNode = allNonEmptyTextNodes[allNonEmptyTextNodes.Count - 1];
// get the parent node of the first br element, it will be used when we remove the br elements,
// this will allow for different parent elements other than the `p` element
var parentNode = rootNode.SelectSingleNode("//br/parent::*");
if (parentNode == null) return html;
var allBrNodes = rootNode.SelectNodes($"//{parentNode.Name}/br");
foreach (var brNode in allBrNodes)
{
if (brNode == null) continue;
// check index of br nodes against first and last text nodes
// and remove br nodes that sit outside text nodes
if (brNode.OuterStartIndex <= firstTextNode.OuterStartIndex
|| brNode.OuterStartIndex >= lastTextNode.OuterStartIndex + lastTextNode.OuterLength)
{
brNode.Remove();
}
}
return document.DocumentNode.OuterHtml;
}
Test HTML Input:
<p><br><span>MERV 9 Cartridge<b><br> </b>Prefilters </span><br></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 <br>Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters<br> </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters<br></span></p>
Test HTML Output:
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 <br>Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters </span></p>
<p><span>MERV 9 Cartridge<b><br> </b>Prefilters</span></p>
How to replace &nbsp; with a space?
Can you try searching for
(?<=<[^>]*)&nbsp;
and replacing it with a single space?
This looks for &nbsp; inside tags (preceded by a < and possibly other characters except >).
This is extremely brittle, though. For example, it will fail if you have < or > symbols in strings/attributes. Better to avoid getting those &nbsp; into the wrong locations in the first place.
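In C#, the search-and-replace described above might look like the sketch below. .NET regular expressions allow variable-length lookbehind, which this pattern relies on. The `NbspInTags` class and `FixTags` method are hypothetical names used only for this example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NbspInTags
{
    // Replaces &nbsp; with a space only where it follows a '<'
    // with no '>' in between, i.e. inside a tag.
    public static string FixTags(string html)
    {
        return Regex.Replace(html, @"(?<=<[^>]*)&nbsp;", " ");
    }

    public static void Main()
    {
        // The &nbsp; inside the tag is replaced; the one in the text is kept.
        // prints: <a href="x">a&nbsp;b</a>
        Console.WriteLine(FixTags("<a&nbsp;href=\"x\">a&nbsp;b</a>"));
    }
}
```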