HTML Agility Pack - Removing Unwanted Tags Without Removing Content

HTML agility pack - removing unwanted tags without removing content?

I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.

It removes every tag except strong, em and u, while keeping the content (child elements and raw text) of the removed tags in place.

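// Requires: using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    var acceptableTags = new[] { "strong", "em", "u" };

    // SelectNodes returns null when nothing matches (for example, when the input is only a comment).
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        if (node.Name == "#text") continue; // raw text is always kept

        var parentNode = node.ParentNode;
        var keep = acceptableTags.Contains(node.Name);
        var childNodes = node.SelectNodes("./*|./text()");

        if (childNodes != null)
        {
            foreach (var child in childNodes)
            {
                // Queue every child, so unwanted tags nested inside an accepted tag are visited too.
                nodes.Enqueue(child);

                // Move the children of an unwanted tag up, in front of the tag itself.
                if (!keep) parentNode.InsertBefore(child, node);
            }
        }

        if (!keep) parentNode.RemoveChild(node);
    }

    return document.DocumentNode.InnerHtml;
}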

HtmlAgilityPack is not removing all HTML tags. How can I solve this efficiently?

Try HttpUtility.HtmlDecode on the document's InnerText:

public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
}

HtmlDecode will convert […] to […]

How to remove a link tag (a href) without removing the link text in Html Agility Pack?

I wrote this function, which takes an HTML string as input and replaces every link with a span holding the same content.

public string CleanLinks(string input)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(input);
    var links = doc.DocumentNode.SelectNodes("//a");
    if (links == null) return input;
    foreach (HtmlNode tb in links)
    {
        HtmlNode lbl = doc.CreateElement("span");
        lbl.InnerHtml = tb.InnerHtml;

        tb.ParentNode.ReplaceChild(lbl, tb);
    }

    return doc.DocumentNode.OuterHtml;
}

HTMLAgilityPack - Remove Node but retain its value

Use RemoveChild with its second argument (keepGrandChildren) set to true:

foreach (var item in anchorsSpan.ToArray())
{
    item.ParentNode.RemoveChild(item, true);
}

The Descendants function returns a lazily evaluated sequence of descendant nodes that is built while you traverse it, so changing the document during the traversal is not allowed. The solution is to make a static copy of the sequence beforehand (using ToArray) and traverse that array instead.
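To make the snapshot idea concrete, here is a minimal, self-contained sketch. The class name, the method name and the choice of span as the tag to unwrap are only assumptions for illustration; the question's anchorsSpan collection is not shown in the answer, so the sketch builds its own sequence with Descendants("span").

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class UnwrapExample
{
    // Hypothetical helper: removes every <span> element but keeps its children in place.
    public static string UnwrapSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() copies the lazy sequence first, so the tree can be edited inside the loop.
        foreach (var span in doc.DocumentNode.Descendants("span").ToArray())
        {
            // The second argument (keepGrandChildren) re-parents the span's children to its parent.
            span.ParentNode.RemoveChild(span, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling UnwrapExample.UnwrapSpans("<p>a <span>b</span> c</p>") should give back the paragraph with the text still there and the span gone. Without the ToArray call, the loop would be editing the tree it is still enumerating.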

How to make HtmlAgilityPack stop automatically removing slash of Singleton tags in html file?

Set the OptionWriteEmptyNodes property to true on your HtmlDocument.

string htmltext = File.ReadAllText("test.html");

HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;

doc.LoadHtml(htmltext);

See also:
https://html-agility-pack.net/knowledge-base/11047739/optionwriteemptynodes-break-xml-declaration-using-htmlagilitypack
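The snippet above stops after loading, so here is a small sketch that also writes the document back out, which is where the option takes effect. The class and method names are made up for the example, and it saves to a StringWriter instead of a file only to keep it self-contained.

```csharp
using System.IO;
using HtmlAgilityPack;

public static class EmptyNodeExample
{
    // Hypothetical helper: loads the markup and writes it back with OptionWriteEmptyNodes enabled.
    public static string SaveWithClosedEmptyNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true; // ask the writer to close empty elements such as <br>
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            doc.Save(writer); // Save regenerates the markup from the node tree
            return writer.ToString();
        }
    }
}
```

With an input such as "<p>one<br>two</p>", the br element in the saved output should keep (or gain) its closing slash instead of being written as a bare <br>.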

Html Agility Pack - Remove Tags by ID Or Class

The following code is adapted from an Html Agility Pack forum page to fit your needs. Essentially, we grab all divs, loop through them, and check each one's class or id for a match. If either matches, we remove the div.

var divs = htmldoc.DocumentNode.SelectNodes("//div");
if (divs != null)
{
    foreach (var tag in divs)
    {
        if (tag.Attributes["class"] != null && string.Compare(tag.Attributes["class"].Value, "divToRemove", StringComparison.InvariantCultureIgnoreCase) == 0)
        {
            tag.Remove();
        }
        else if (tag.Attributes["id"] != null && string.Compare(tag.Attributes["id"].Value, "divToRemove", StringComparison.InvariantCultureIgnoreCase) == 0)
        {
            tag.Remove();
        }
    }
}

You can also combine these if statements into one large if statement, but I thought this read better for the answer.
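For reference, the combined version could look roughly like this. It is only a sketch: the DivRemover class, the RemoveDivs method and its name parameter are invented for the example, and it reads the attributes with GetAttributeValue so that a missing attribute simply yields an empty string.

```csharp
using System;
using HtmlAgilityPack;

public static class DivRemover
{
    // Hypothetical helper: removes every <div> whose class or id equals the given name.
    public static void RemoveDivs(HtmlDocument htmldoc, string name)
    {
        var divs = htmldoc.DocumentNode.SelectNodes("//div");
        if (divs == null) return; // SelectNodes returns null when nothing matches

        foreach (var tag in divs)
        {
            // GetAttributeValue returns the fallback ("") when the attribute is missing.
            var cls = tag.GetAttributeValue("class", "");
            var id = tag.GetAttributeValue("id", "");

            if (string.Equals(cls, name, StringComparison.InvariantCultureIgnoreCase) ||
                string.Equals(id, name, StringComparison.InvariantCultureIgnoreCase))
            {
                tag.Remove();
            }
        }
    }
}
```

Like the two separate checks, this compares the whole class attribute, so a div with class="divToRemove other" is not matched. Pass a non-empty name, otherwise divs that have no class or id at all would match the empty fallback.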

Finally, select the node you were looking for...

var mainDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='mainDiv']");

How to strip comments from HTML using Agility Pack without losing DOCTYPE

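Html Agility Pack also reports the DOCTYPE declaration as a comment node, so check that the comment does not start with <!DOCTYPE before removing it:

foreach (var comment in nodes)
{
    // For a parsed comment node, InnerText is the raw markup, including the leading "<!".
    if (!comment.InnerText.StartsWith("<!DOCTYPE", StringComparison.OrdinalIgnoreCase))
        comment.ParentNode.RemoveChild(comment);
}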
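The loop relies on a nodes collection that comes from the question. A complete sketch, assuming the comments are selected with the XPath //comment(), might look like this. The class and method names are hypothetical, and the check uses OuterHtml, which for a parsed node is the markup exactly as it appeared in the input.

```csharp
using System;
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper: removes every comment node except the DOCTYPE declaration.
    public static string StripComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var nodes = doc.DocumentNode.SelectNodes("//comment()");
        if (nodes != null) // SelectNodes returns null when there are no comments
        {
            foreach (var comment in nodes)
            {
                // Skip the DOCTYPE, which the parser also exposes as a comment node.
                if (!comment.OuterHtml.StartsWith("<!DOCTYPE", StringComparison.OrdinalIgnoreCase))
                    comment.ParentNode.RemoveChild(comment);
            }
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

The case-insensitive comparison is a deliberate choice here, since many pages write the declaration in lower case (<!doctype html>).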

htmlagilitypack - remove script and style?

HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);

doc.DocumentNode.Descendants()
                .Where(n => n.Name == "script" || n.Name == "style")
                .ToList()
                .ForEach(n => n.Remove());

How do I remove all HTML tags from a string without knowing which tags are in it?

You can use a simple regex like this:

public static string StripHTML(string input)
{
   return Regex.Replace(input, "<.*?>", String.Empty);
}

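Be aware that this solution has its own flaws: for example, the pattern does not match a tag that spans several lines, and it leaves entities such as &amp; untouched. See Remove HTML tags in String for more information (especially the comments of 'Mark E. Haase'/@mehaase).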

Another solution would be to use the HTML Agility Pack.

You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?
