HTML agility pack - removing unwanted tags without removing content?
I wrote an algorithm based on Oded's suggestions. Here it is. Works like a charm.
It removes all tags except strong, em, u and raw text nodes.
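// Requires: using System; using System.Collections.Generic; using System.Linq; using HtmlAgilityPack;
internal static string RemoveUnwantedTags(string data)
{
    if (string.IsNullOrEmpty(data)) return string.Empty;

    var document = new HtmlDocument();
    document.LoadHtml(data);

    var acceptableTags = new String[] { "strong", "em", "u" };

    // SelectNodes returns null when nothing matches, so guard before building the queue.
    var topLevelNodes = document.DocumentNode.SelectNodes("./*|./text()");
    if (topLevelNodes == null) return document.DocumentNode.InnerHtml;

    var nodes = new Queue<HtmlNode>(topLevelNodes);
    while (nodes.Count > 0)
    {
        var node = nodes.Dequeue();
        var parentNode = node.ParentNode;

        if (!acceptableTags.Contains(node.Name) && node.Name != "#text")
        {
            // Move the children up in front of the unwanted element and queue them for inspection.
            var childNodes = node.SelectNodes("./*|./text()");
            if (childNodes != null)
            {
                foreach (var child in childNodes)
                {
                    nodes.Enqueue(child);
                    parentNode.InsertBefore(child, node);
                }
            }
            // The element is now redundant; its content lives on in the parent.
            parentNode.RemoveChild(node);
        }
    }
    return document.DocumentNode.InnerHtml;
}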
HtmlAgilityPack is not removing all HTML tags. How can I solve this efficiently?
Try HttpUtility.HtmlDecode
public static string StripHtmlTags(string html)
{
    if (String.IsNullOrEmpty(html)) return "";
    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(html);
    return HttpUtility.HtmlDecode(doc.DocumentNode.InnerText);
}
HtmlDecode will convert […] to […]
How to remove a tag link a href without removing the link text in Html Agility Pack?
I wrote this function, which takes an HTML string as input and replaces each a element with a span that keeps the link's inner HTML.
public string CleanLinks(string input)
{
    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(input);
    var links = doc.DocumentNode.SelectNodes("//a");
    if (links == null) return input;
    foreach (HtmlNode tb in links)
    {
        HtmlNode lbl = doc.CreateElement("span");
        lbl.InnerHtml = tb.InnerHtml;
        tb.ParentNode.ReplaceChild(lbl, tb);
    }
    return doc.DocumentNode.OuterHtml;
}
HTMLAgilityPack - Remove Node but retain its value
Use
foreach (var item in anchorsSpan.ToArray())
{
    item.ParentNode.RemoveChild(item, true);
}
The Descendants function returns a lazily evaluated sequence of child elements that is built while you iterate over it, so the document must not be changed during the traversal. The solution is to make a static copy of the sequence beforehand (using ToArray) and iterate over that array instead.
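As a rough sketch of the snapshot approach described above, the helper below unwraps every span element while keeping its contents. The class name SpanUnwrapper, the method name UnwrapSpans, and the choice of span as the target are assumptions for illustration; the original answer does not show how anchorsSpan was built.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class SpanUnwrapper
{
    // Hypothetical helper: removes every <span> element but keeps its children in place.
    public static string UnwrapSpans(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // ToArray() takes a snapshot of the lazy Descendants() sequence
        // before the tree is modified.
        var spans = doc.DocumentNode.Descendants("span").ToArray();

        foreach (var span in spans)
        {
            // The second argument (keepGrandChildren) moves the span's
            // children up to its parent before the span itself is removed.
            span.ParentNode.RemoveChild(span, true);
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling SpanUnwrapper.UnwrapSpans on a fragment such as `<p>a<span>b</span>c</p>` should leave the text b in the paragraph with no span around it.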
How to make HtmlAgilityPack stop automatically removing slash of Singleton tags in html file?
Set the OptionWriteEmptyNodes property to true on your HtmlDocument.
string htmltext = File.ReadAllText("test.html");
HtmlDocument doc = new HtmlDocument();
doc.OptionWriteEmptyNodes = true;
doc.LoadHtml(htmltext);
For more detail, see:
https://html-agility-pack.net/knowledge-base/11047739/optionwriteemptynodes-break-xml-declaration-using-htmlagilitypack
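To see the effect of the option, a minimal sketch like the one below parses a string and writes it back out through Save. The class name EmptyNodeWriter and the method name RoundTrip are made up for this example, and the exact spacing of the self-closing tag in the output may vary between library versions.

```csharp
using System.IO;
using HtmlAgilityPack;

public static class EmptyNodeWriter
{
    // Hypothetical helper: parses the markup and serializes it again,
    // with OptionWriteEmptyNodes switched on or off.
    public static string RoundTrip(string html, bool writeEmptyNodes)
    {
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = writeEmptyNodes;
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            // Save re-serializes the node tree, which is where the
            // closing slash of empty elements is kept or dropped.
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```

With writeEmptyNodes set to true, an input such as `<p>a<br/>b</p>` should come back with the br element still self-closed.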
Html Agility Pack - Remove Tags by ID Or Class
The following code is adapted from this Html Agility Pack forum page to fit your needs. Essentially, we grab all divs, loop through them, and check each one's class or id for a match. If there is a match, we remove the div.
var divs = htmldoc.DocumentNode.SelectNodes("//div");
if (divs != null)
{
    foreach (var tag in divs)
    {
        if (tag.Attributes["class"] != null && string.Compare(tag.Attributes["class"].Value, "divToRemove", StringComparison.InvariantCultureIgnoreCase) == 0)
        {
            tag.Remove();
        }
        else if (tag.Attributes["id"] != null && string.Compare(tag.Attributes["id"].Value, "divToRemove", StringComparison.InvariantCultureIgnoreCase) == 0)
        {
            tag.Remove();
        }
    }
}
You can also combine these if statements into one large if statement, but I thought this read better for the answer.
Finally, select the node you were looking for...
var mainDiv = htmldoc.DocumentNode.SelectSingleNode("//div[@id='mainDiv']");
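The combined check mentioned above could look roughly like this. The class name DivFilter, the method name RemoveDivs, and the marker parameter are hypothetical, and GetAttributeValue is used here in place of the explicit null checks on the Attributes indexer.

```csharp
using System;
using HtmlAgilityPack;

public static class DivFilter
{
    // Hypothetical helper: removes every <div> whose class or id equals marker
    // (case-insensitive) and returns the remaining markup.
    public static string RemoveDivs(string html, string marker)
    {
        if (string.IsNullOrEmpty(marker)) return html;

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when the document contains no div.
        var divs = doc.DocumentNode.SelectNodes("//div");
        if (divs == null) return html;

        foreach (var tag in divs)
        {
            // GetAttributeValue falls back to "" when the attribute is absent,
            // so both conditions fit into a single expression.
            bool matches =
                string.Equals(tag.GetAttributeValue("class", ""), marker, StringComparison.InvariantCultureIgnoreCase) ||
                string.Equals(tag.GetAttributeValue("id", ""), marker, StringComparison.InvariantCultureIgnoreCase);

            if (matches) tag.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Like the original, this compares the whole class attribute value, so a div with several classes (for example class="a divToRemove") would not match.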
How to strip comments from HTML using Agility Pack without losing DOCTYPE
Check that the comment does not start with DOCTYPE:
foreach (var comment in nodes)
{
    if (!comment.InnerText.StartsWith("DOCTYPE"))
        comment.ParentNode.RemoveChild(comment);
}
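The snippet above does not show where nodes comes from. A possible complete version is sketched below, assuming the comments are selected with the XPath expression //comment() and that the DOCTYPE declaration shows up as a comment node whose markup begins with "<!DOCTYPE". The class name CommentStripper and the method name StripComments are invented for this example, and the check is done on OuterHtml rather than InnerText so that it sees the full markup of the node.

```csharp
using System;
using HtmlAgilityPack;

public static class CommentStripper
{
    // Hypothetical helper: removes HTML comments but leaves the DOCTYPE alone.
    public static string StripComments(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches.
        var nodes = doc.DocumentNode.SelectNodes("//comment()");
        if (nodes == null) return html;

        foreach (var comment in nodes)
        {
            // Assumption: a DOCTYPE is reported as a comment node whose
            // markup starts with "<!DOCTYPE" (compared case-insensitively).
            bool isDoctype = comment.OuterHtml.TrimStart()
                .StartsWith("<!DOCTYPE", StringComparison.OrdinalIgnoreCase);

            if (!isDoctype) comment.Remove();
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```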
htmlagilitypack - remove script and style?
HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(html);
doc.DocumentNode.Descendants()
    .Where(n => n.Name == "script" || n.Name == "style")
    .ToList()
    .ForEach(n => n.Remove());
How do I remove all HTML tags from a string without knowing which tags are in it?
You can use a simple regex like this:
public static string StripHTML(string input)
{
    return Regex.Replace(input, "<.*?>", String.Empty);
}
Be aware that this solution has its own flaws. See "Remove HTML tags in String" for more information (especially the comments by 'Mark E. Haase'/@mehaase).
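A few inputs that illustrate the kind of flaw meant here are sketched below. The class name RegexStripDemo and the sample strings are made up for illustration; the pattern is the same one used above.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexStripDemo
{
    // Same pattern as in the answer above.
    public static string StripHTML(string input)
    {
        return Regex.Replace(input, "<.*?>", String.Empty);
    }

    // A '>' inside an attribute value ends the match too early,
    // so part of the attribute leaks into the result.
    public static string AttributeCase()
    {
        return StripHTML("<p title=\"a > b\">x</p>");
    }

    // '.' does not match a newline by default, so a tag that spans
    // two lines is not recognized and stays in the output.
    public static string MultiLineCase()
    {
        return StripHTML("<p\nclass=\"c\">hi</p>");
    }

    // Character entities are left undecoded.
    public static string EntityCase()
    {
        return StripHTML("<b>Tom &amp; Jerry</b>");
    }
}
```

AttributeCase leaves the fragment ` b">x` behind instead of x, MultiLineCase keeps the opening tag and only strips the closing one, and EntityCase returns the text with &amp; still encoded.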
Another solution would be to use the HTML Agility Pack.
You can find an example using the library here: HTML agility pack - removing unwanted tags without removing content?