HtmlAgilityPack Drops Option End Tags

The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.

A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:

// they sometimes contain, and sometimes they don't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);

(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)

Addendum

An equivalent solution is to call HtmlNode.ElementsFlags.Remove("option"); before any use of the library. This avoids the need to modify the library source code.
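As a rough illustration of that workaround, the sketch below removes the "option" entry from the static flag table and then round-trips a small select element. The sample markup and the OptionEndTagDemo and RoundTrip names are invented for this example; the only HtmlAgilityPack members it relies on are HtmlNode.ElementsFlags, HtmlDocument.LoadHtml and DocumentNode.OuterHtml, and the exact output may differ between library versions.

```csharp
using System;
using HtmlAgilityPack;

public static class OptionEndTagDemo
{
    // Hypothetical helper: parses the markup and returns it re-serialized.
    public static string RoundTrip(string html)
    {
        // ElementsFlags is a static table, so this change applies to every
        // document parsed afterwards. Remove does nothing if the key is absent.
        HtmlNode.ElementsFlags.Remove("option");

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Sample input invented for this sketch.
        const string html =
            "<select><option value=\"1\">One</option><option value=\"2\">Two</option></select>";
        Console.WriteLine(RoundTrip(html));
    }
}
```

Because the flag table is shared, it is probably safest to make this call once at application start-up rather than before each parse.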

Why does HTMLAgilityPack remove my closing tag?

As others said in the comments, the markup is invalid HTML, which may be why the HtmlDocument class drops the final </p> when you write the document out with the Save method. As a workaround, you can write document.Text to the output location with the System.IO.File class instead.

var html = "<p><ol><li>A bunch of text</li></ol><em>some em text</em> more text here.</p>";
var document = new HtmlDocument();
document.LoadHtml(html);
File.WriteAllText("insert_your_path_here", document.Text);

HtmlAgilityPack produces missing closing tags in OuterHtml

There are several options that you can set when you are loading the document.

OptionAutoCloseOnEnd

Defines if closing for non closed nodes must be done at the end or directly in the document. Setting this to true can actually change how browsers render the page.

var document = new HtmlDocument();
document.OptionAutoCloseOnEnd = true;
document.LoadHtml(content);

Related sources worth reading:

HtmlAgilityPack Drops Option End Tags

Image tag not closing with HTMLAgilityPack

Using HtmlAgilityPack to close tags at the end of a parent tag

Get the parent tag with HtmlAgilityPack, read its text, work out the length of the parent's closing tag, insert the child's closing tag just before the parent's closing tag, and then replace the original tag with the result.

HtmlAgilityPack: Could someone please explain exactly what is the effect of setting the HtmlDocument OptionAutoCloseOnEnd to true?

The current code always closes the unclosed nodes just before the parent node is closed. So the following code

var doc = new HtmlDocument();
doc.LoadHtml("<x>hello<y>world</x>");
doc.Save(Console.Out);

will output this (the unclosed <y> is closed before the parent <x> is closed)

<x>hello<y>world</y></x>

Originally, the option, when set, was meant to be able to produce this instead (not for XML output types):

<x>hello<y>world</x></y>

with the closing <y> set at the end of the document (that's what the "end" means). Note in this case, you can still get overlapping elements.

This feature (perhaps of limited use, I admit) was broken at some point in the past; I don't know why.

Note that the <p> tag is a special case: by default it is governed by a custom HtmlElementFlag. This is how it's declared in HtmlNode.cs:

ElementsFlags.Add("p", HtmlElementFlag.Empty | HtmlElementFlag.Closed);

HtmlAgilityPack LoadHtml - Issue with empty P tags

The OptionWriteEmptyNodes flag is what you're looking for:

Defines if empty nodes must be written as closed during output.

And in your case:

doc.OptionWriteEmptyNodes = true;

Yields:

<div>something<p /></div>
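For context, a fuller version of that one-liner might look like the sketch below. The input string is a guess at the kind of markup the question involved, and the EmptyNodeDemo and Serialize names are hypothetical; the HtmlAgilityPack members used are OptionWriteEmptyNodes, LoadHtml and Save(TextWriter). What gets written for a given input can depend on the library version, so treat this as something to experiment with rather than a guaranteed result.

```csharp
using System;
using System.IO;
using HtmlAgilityPack;

public static class EmptyNodeDemo
{
    // Hypothetical helper: loads the markup and serializes it with
    // empty nodes written in their closed form.
    public static string Serialize(string html)
    {
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = true;
        doc.LoadHtml(html);

        using (var writer = new StringWriter())
        {
            // Save(TextWriter) writes the document to the supplied writer.
            doc.Save(writer);
            return writer.ToString();
        }
    }

    public static void Main()
    {
        // Assumed input: a div containing text and an empty paragraph.
        Console.WriteLine(Serialize("<div>something<p></p></div>"));
    }
}
```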

HTML Agility Pack - Issue selecting an HTML select tag with the option tags within

You need to set the ElementsFlags entry for the option tag to make it work:

HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);

which should return your original HTML code.

I believe HtmlAgilityPack behaves this way because the end tag of the <option> element is, ironically, optional in HTML.

Taken from the documentation of the HtmlNode class and its ElementsFlags field:

Gets a collection of flags that define specific behaviors for specific
element nodes. The table contains a DictionaryEntry list with the
lowercase tag name as the Key, and a combination of HtmlElementFlags
as the Value.

A further look into the HtmlElementFlag enum reveals this:

Empty - The node is empty. META or IMG are example of such nodes.
Closed - The node will automatically be closed during parsing.

You can view the source code for the class HtmlNode to see what other tags are considered 'specific'.
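If reading the source is inconvenient, the flag table can also be inspected at run time. The sketch below assumes that ElementsFlags is exposed as a Dictionary<string, HtmlElementFlag>, which is the case in recent releases as far as I know (older releases reportedly used a Hashtable, in which case TryGetValue would not compile). The ElementFlagsDemo and Describe names and the list of tags are made up for illustration.

```csharp
using System;
using HtmlAgilityPack;

public static class ElementFlagsDemo
{
    // Hypothetical helper: describes the flags registered for a tag name.
    public static string Describe(string tagName)
    {
        HtmlElementFlag flags;

        // Assumption: ElementsFlags is a Dictionary<string, HtmlElementFlag>.
        // The documentation says keys are lowercase tag names.
        if (HtmlNode.ElementsFlags.TryGetValue(tagName.ToLowerInvariant(), out flags))
        {
            return tagName + ": " + flags;
        }

        return tagName + ": (no special flags)";
    }

    public static void Main()
    {
        // Tags chosen arbitrarily for this sketch.
        foreach (var tag in new[] { "option", "p", "img", "div" })
        {
            Console.WriteLine(Describe(tag));
        }
    }
}
```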

How can I get several similar tags data with HtmlAgilityPack?

You need to anchor your second XPath to look 'below' the h4:

Dim date1 As HtmlNode = h4.ParentNode.SelectSingleNode(".//span[starts-with(@class, 'date ')]")

Note the use of ParentNode and the leading .// in the expression. The .// tells XPath to search beneath the node the expression is executed on. By calling SelectSingleNode on h4.ParentNode, you get the date located under the parent div tag of the h4.
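The same anchoring idea can be sketched in C#. The markup below is invented, since the question's page is not shown here, and the RelativeXPathDemo and DatesPerHeading names are hypothetical. The example uses HtmlNode.ParentNode to reach each heading's parent and a relative .// expression so that each lookup stays inside that parent.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class RelativeXPathDemo
{
    // Hypothetical helper: pairs each h4 with the date span found
    // under the same parent element.
    public static List<string> DatesPerHeading(string html)
    {
        var results = new List<string>();
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//h4" searches the whole document; SelectNodes returns null
        // when nothing matches.
        HtmlNodeCollection headings = doc.DocumentNode.SelectNodes("//h4");
        if (headings == null)
        {
            return results;
        }

        foreach (HtmlNode h4 in headings)
        {
            // ".//" limits the search to descendants of this h4's parent.
            HtmlNode date = h4.ParentNode.SelectSingleNode(
                ".//span[starts-with(@class, 'date ')]");
            results.Add(h4.InnerText + " -> " + (date == null ? "(none)" : date.InnerText));
        }

        return results;
    }

    public static void Main()
    {
        // Sample markup invented for this sketch.
        const string html =
            "<div><h4>First</h4><span class=\"date past\">1 Jan</span></div>" +
            "<div><h4>Second</h4><span class=\"date future\">2 Feb</span></div>";

        foreach (string line in DatesPerHeading(html))
        {
            Console.WriteLine(line);
        }
    }
}
```

Had the second expression started with // instead of .//, every lookup would search the whole document and return the first matching span each time.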


