How to Clean HTML Tags Using C#

How do I remove all HTML tags from a string without knowing which tags are in it?

You can use a simple regex like this:

// Requires: using System.Text.RegularExpressions;
public static string StripHTML(string input)
{
    return Regex.Replace(input, "<.*?>", String.Empty);
}

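Be aware that this solution has flaws. The pattern stops at the first > it finds, so a > inside an attribute value cuts the match short; a tag that spans several lines is not matched at all, because . does not match a newline by default; and the text inside script and style elements is left in the output. See "Remove HTML tags in String" for more information (especially the comments by Mark E. Haase, @mehaase).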
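These limits are easy to reproduce. The sketch below runs the same pattern over three inputs; the class name RegexStripDemo and the sample strings are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as above: lazily match from '<' to the next '>'.
    public static string StripHtml(string input)
    {
        return Regex.Replace(input, "<.*?>", string.Empty);
    }

    public static void PrintSamples()
    {
        // Simple markup is handled as expected.
        Console.WriteLine(StripHtml("<p>Hello <b>world</b></p>"));    // Hello world

        // A '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the result.
        Console.WriteLine(StripHtml("<a title=\"a > b\">link</a>"));  //  b">link

        // Only the tags are removed; the script body stays behind as text.
        Console.WriteLine(StripHtml("<script>alert(1)</script>Hi"));  // alert(1)Hi
    }
}
```

If inputs like the second and third one can occur, an HTML parser is the safer choice.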

Another solution would be to use the HTML Agility Pack.

You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?

How to clean HTML tags using C#

HTML Agility Pack:

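// Requires: using HtmlAgilityPack;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
// SelectSingleNode returns null when the input has no <body> element,
// so fall back to the whole document in that case.
HtmlNode body = doc.DocumentNode.SelectSingleNode("//body") ?? doc.DocumentNode;
string s = body.InnerText;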

How to remove html tags from string in view page C#

If you want to show your content without any formatting, you can use Regex.Replace(input, "<.*?>", String.Empty) to strip all HTML tags from your string.

1) Add the code below to the top of the view (.cshtml). The @helper directive is available in ASP.NET MVC and Web Pages Razor views; ASP.NET Core does not support it.

@using System.Text.RegularExpressions;

@helper StripHTML(string input)
{
    if (!string.IsNullOrEmpty(input))
    {
        input = Regex.Replace(input, "<.*?>", String.Empty);
        <span>@input</span>
    }
}

2) Use the helper like this:

<td>@StripHTML(item.Message)</td>

How to remove start and end html tag using C#?

Use a regex:

var item = "<p>Some text</p><p>More text</p>";
item = Regex.Replace(item, @"^<[^<>]*>", "");
item = Regex.Replace(item, @"<[^<>]*>$", "");
Console.WriteLine(item); // Prints: Some text</p><p>More text

Regex breakdown:

^: matches the start of the string

<: matches the literal < that opens the tag

[^<>]*: matches any character except < and >, as many times as possible, so the match cannot run past the end of the tag

>: matches the literal > that closes the tag

The second call does the same at the other end of the string: the $ at the end of the expression anchors the match to the end of the string.
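The two anchored replacements can also be folded into one pattern with an alternation. This is a minimal sketch, not the original answer's code; the class name OuterTags and the method name Trim are hypothetical. Inside the character class only < and > are excluded, because a class such as [^>^<.] would also exclude the literal characters ^ and ., and a tag like <a href="page.html"> would then not match.

```csharp
using System;
using System.Text.RegularExpressions;

public static class OuterTags
{
    // Either a tag anchored at the start of the string (^)
    // or a tag anchored at the end of the string ($).
    private static readonly Regex Outer = new Regex(@"^<[^<>]*>|<[^<>]*>$");

    // Removes at most one leading and one trailing tag; inner tags are kept.
    public static string Trim(string html)
    {
        return Outer.Replace(html, string.Empty);
    }
}
```

With this sketch, OuterTags.Trim("<p>Some text</p><p>More text</p>") returns Some text</p><p>More text, and OuterTags.Trim("<a href=\"page.html\">link</a>") returns link.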

Remove HTML tags from string including &nbsp; in C#

If you can't use an HTML parser oriented solution to filter out the tags, here's a simple regex for it.

string noHTML = Regex.Replace(inputHTML, @"<[^>]+>|&nbsp;", "").Trim();

Ideally, make a second pass with another regex that collapses runs of whitespace into a single space:

string noHTMLNormalised = Regex.Replace(noHTML, @"\s{2,}", " ");
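Both passes can be wrapped in one helper. The sketch below is one possible arrangement and makes two assumptions that go beyond the snippets above: tags and &nbsp; are replaced with a space rather than an empty string, so that text from adjacent elements does not run together, and the remaining entities are decoded with System.Net.WebUtility.HtmlDecode. The names PlainText and FromHtml are hypothetical.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class PlainText
{
    public static string FromHtml(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Pass 1: replace every tag and every &nbsp; entity with a space.
        string text = Regex.Replace(html, @"<[^>]+>|&nbsp;", " ");

        // Decode the remaining entities, e.g. &amp; becomes &.
        text = WebUtility.HtmlDecode(text);

        // Pass 2: collapse runs of whitespace and trim both ends.
        return Regex.Replace(text, @"\s+", " ").Trim();
    }
}
```

For example, PlainText.FromHtml("<p>Tom&nbsp;&amp;&nbsp;Jerry</p>  <p>run</p>") returns Tom & Jerry run.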

Remove Certain HTML tags in C#

As far as I can see, you want to remove the HTML elements that have a style attribute, along with their closing tags. Unfortunately, there is no good way to do that with regexes, because a regex cannot reliably pair a start tag with its matching end tag when elements nest. Without the requirement to remove the closing tags as well, an approximate regex would be possible.

On the other hand, XSLT is the right tool for this, because it can handle the recursive nature of XML:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*|*[not(@style)]">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>

What's happening here? The <xsl:template match="@*|*[not(@style)]"> part matches attributes and every element that does not have a style attribute, and the <xsl:copy>...</xsl:copy> part copies them. Elements that do have a style attribute match no template, so the built-in rules apply: their start and end tags are not copied, but their children are still processed. The @* in the match pattern matters, because attributes that match no template are written out as plain text by the built-in rule.

For the record, this is a slight variant of the XSLT identity transformation:

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>
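To run such a stylesheet from C#, System.Xml.Xsl.XslCompiledTransform can be used. The sketch below is an illustration under assumptions, not the original answer's code: it embeds its own stylesheet (the identity transformation plus one extra template for elements with a style attribute), it expects well-formed XML/XHTML as input, and the names StyleStripper and Apply are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Xsl;

public static class StyleStripper
{
    // Identity transformation plus one extra template: elements that carry a
    // style attribute are not copied, but their children are still processed.
    private const string Stylesheet = @"
<xsl:stylesheet version=""1.0"" xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""xml"" omit-xml-declaration=""yes""/>
  <xsl:template match=""@*|node()"">
    <xsl:copy>
      <xsl:apply-templates select=""@*|node()""/>
    </xsl:copy>
  </xsl:template>
  <xsl:template match=""*[@style]"">
    <xsl:apply-templates select=""node()""/>
  </xsl:template>
</xsl:stylesheet>";

    public static string Apply(string xml)
    {
        var xslt = new XslCompiledTransform();
        using (var sheet = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(sheet);
        }

        var output = new StringBuilder();
        using (var input = XmlReader.Create(new StringReader(xml)))
        using (var writer = XmlWriter.Create(output, xslt.OutputSettings))
        {
            xslt.Transform(input, writer);
        }
        return output.ToString();
    }
}
```

Calling StyleStripper.Apply("<div><p style=\"color:red\">keep <b>me</b></p><span>ok</span></div>") should drop the p wrapper while keeping its text and the b and span elements. Real-world HTML is often not well-formed XML, in which case it has to be converted first or handled with an HTML parser.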

Trying to remove empty HTML tags in C#

You can just check Value. An element's Value is the concatenated text of all its descendants, so it is also empty when the element has child elements that are themselves empty. Also, your code checks for attributes and skips elements that have them, but judging by your example you want to remove empty tags even when they carry attributes.

string src = @"
<html><body>
<div class=""sfd"">test</div>
<p dir = ""rtl"" style=""margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;""><span style = ""font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"" > </span ></p >
<p dir=""rtl"" style=""font-family: David;font-size: 11pt;line-height: 115.0%;margin-top: 0;""><span style = ""font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"" > </span ></p >
<div class=""sfd"">test</div>
<p dir = ""rtl"" style=""font-family: David;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;""><span style = ""font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"" > </span ></p >
<p dir=""rtl"" style=""font-family: David;font-size: 11pt;line-height: 115.0%;margin-bottom: 0;margin-left: 0;margin-right: 0;margin-top: 0;""><span style = ""font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"" > </span ></p >
<div class=""sfd"">test</div>
<p dir = ""rtl"" style=""font-family: David;font-size: 11pt;line-height: 115.0%;margin-right: 0;margin-top: 0;""><span style = ""font-size: 11pt;font-style: normal;font-weight: normal;margin: 0;padding: 0;"" > </span ></p >
</body></html>
";

XDocument xDoc = XDocument.Parse(src);

xDoc.Descendants().Where(node => string.IsNullOrWhiteSpace(node.Value)).Remove();

MessageBox.Show(xDoc.ToString());

To keep <br/>, just exclude it explicitly. Replace the query in the code above with:

xDoc.Descendants().Where(node => string.IsNullOrWhiteSpace(node.Value) && node.Name != "br").Remove();
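The same query can be wrapped in a small reusable method. This is a sketch under assumptions: the input must be well-formed XML, and the names EmptyElements, Strip and keep are hypothetical. One caveat applies to this version as well as to the query above: an element whose only content is a kept element, such as a p that contains nothing but a br, still has an empty Value and is removed together with that child.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class EmptyElements
{
    // Removes every element whose text content is empty or whitespace,
    // except elements whose local name is listed in 'keep'.
    public static string Strip(string xml, params string[] keep)
    {
        XDocument doc = XDocument.Parse(xml);

        doc.Descendants()
           .Where(e => string.IsNullOrWhiteSpace(e.Value)
                       && !keep.Contains(e.Name.LocalName))
           .Remove();

        // DisableFormatting keeps the output on one line without added indentation.
        return doc.ToString(SaveOptions.DisableFormatting);
    }
}
```

For example, EmptyElements.Strip("<body><div>test</div><p dir=\"rtl\"><span> </span></p><br/></body>", "br") keeps the div and the br and drops the empty p together with its span.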

