HtmlAgilityPack Drops Option End Tags
The exact same error is reported on the HAP home page's discussion, but it looks like no meaningful fixes have been made to the project in a few years. Not encouraging.
A quick browse of the source suggests the error might be fixable by commenting out line 92 of HtmlNode.cs:
// they sometimes contain, and sometimes they don 't...
ElementsFlags.Add("option", HtmlElementFlag.Empty);
(Actually no, they always contain label text, although a blank string would also be valid text. A careless author might omit the end-tag, but then that's true of any element.)
Addendum: an equivalent solution is to call HtmlNode.ElementsFlags.Remove("option"); before any use of the library. This avoids modifying the library's source code.
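The runtime workaround above can be sketched as a small console program. This is a minimal illustration, not code from the original answer: the sample markup and the class name are made up, and it assumes a HtmlAgilityPack release in which ElementsFlags is a static dictionary keyed by lowercase tag name.

```csharp
using System;
using HtmlAgilityPack;

class OptionFlagDemo
{
    static void Main()
    {
        // Drop the library's special handling of <option> before any parsing.
        // ElementsFlags is static, so this affects every HtmlDocument in the process.
        HtmlNode.ElementsFlags.Remove("option");

        // Hypothetical sample markup, for illustration only.
        var html = "<select><option value=\"1\">One</option>" +
                   "<option value=\"2\">Two</option></select>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard before iterating.
        var options = doc.DocumentNode.SelectNodes("//option");
        if (options == null) return;

        foreach (var option in options)
        {
            // Print each option's value attribute and whatever text was parsed inside it.
            Console.WriteLine($"{option.GetAttributeValue("value", "")}: {option.InnerText}");
        }
    }
}
```

Because the flag table is shared process-wide, the call probably belongs in application start-up code rather than next to each individual parse.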
Why does HTMLAgilityPack remove my closing tag?
As others said in the comments, this is invalid HTML (a <p> element cannot contain an <ol>), which is probably why the HtmlDocument class drops the closing </p> when you write the document out with the Save method. As a workaround, you can write document.Text, which holds the text as it was loaded, to the output location using the System.IO.File class.
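using System.IO;
using HtmlAgilityPack;

var html = "<p><ol><li>A bunch of text</li></ol><em>some em text</em> more text here.</p>";
var document = new HtmlDocument();
document.LoadHtml(html);
// Write the text exactly as it was loaded, bypassing HtmlDocument.Save.
File.WriteAllText("insert_your_path_here", document.Text);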
HtmlAgilityPack produces missing closing tags in OuterHtml
There are several options that you can set when you are loading the document.
OptionAutoCloseOnEnd
Defines if closing for non closed nodes must be done at the end or directly in the document. Setting this to true can actually change how browsers render the page.
var document = new HtmlDocument();
document.OptionAutoCloseOnEnd = true;
document.LoadHtml(content);
Related sources worth reading:
HtmlAgilityPack Drops Option End Tags
Image tag not closing with HTMLAgilityPack
Using HtmlAgilityPack to close tags at the end of a parent tag
Get the parent tag with HTMLAgilityPack, get its text, get the length of the closing parent tag, add the closing child tag before the closing parent, replace tag
HtmlAgilityPack: Could someone please explain exactly what is the effect of setting the HtmlDocument OptionAutoCloseOnEnd to true?
The current code always closes the unclosed nodes just before the parent node is closed. So the following code
var doc = new HtmlDocument();
doc.LoadHtml("<x>hello<y>world</x>");
doc.Save(Console.Out);
will output this (the unclosed <y> is closed before the parent <x> is closed):
<x>hello<y>world</y></x>
Originally, the option, when set, was meant to be able to produce this instead (not for XML output types):
<x>hello<y>world</x></y>
with the closing </y> placed at the end of the document (that's what the "end" means). Note that in this case you can still get overlapping elements.
This feature (maybe useless, I admit) was broken at some point in the past; I don't know why.
Note that the <p> tag is a special case, as by default it is governed by a custom HtmlElementFlag. This is how it's declared in HtmlNode.cs:
ElementsFlags.Add("p", HtmlElementFlag.Empty | HtmlElementFlag.Closed);
HtmlAgilityPack LoadHtml - Issue with empty P tags
The OptionWriteEmptyNodes flag is what you're looking for:
Defines if empty nodes must be written as closed during output.
And in your case:
doc.OptionWriteEmptyNodes = true;
Yields:
<div>something<p /></div>
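To see the effect of the flag on your own markup, it may help to serialize the same input with the option off and on. The sketch below is only an illustration: the input string, the class name, and the Render helper are hypothetical, since the original question's markup is not reproduced here.

```csharp
using System;
using HtmlAgilityPack;

class WriteEmptyNodesDemo
{
    // Hypothetical helper: parses the markup and serializes it again
    // with OptionWriteEmptyNodes set to the given value.
    static string Render(string html, bool writeEmptyNodes)
    {
        var doc = new HtmlDocument();
        doc.OptionWriteEmptyNodes = writeEmptyNodes;
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    static void Main()
    {
        // Assumed input, chosen only to include an empty <p>.
        var html = "<div>something<p></p></div>";

        Console.WriteLine("Option off: " + Render(html, false));
        Console.WriteLine("Option on:  " + Render(html, true));
    }
}
```

Comparing the two printed lines shows how the empty paragraph is written in each mode for the library version you have installed.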
HTML Agility Pack - Issue selecting an HTML select tag with the option tags within
You need to set the ElementsFlags entry for the option tag to make it work:
HtmlNode.ElementsFlags["option"] = HtmlElementFlag.Closed;
HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(html);
which should return your original HTML code.
I believe HtmlAgilityPack behaves this way because the end tag of <option> is, ironically, optional in HTML.
Taken from the documentation of the HtmlNode class and its ElementsFlags field:
Gets a collection of flags that define specific behaviors for specific
element nodes. The table contains a DictionaryEntry list with the
lowercase tag name as the Key, and a combination of HtmlElementFlags
as the Value.
A further look at the HtmlElementFlag enum reveals this:
Empty - The node is empty. META or IMG are example of such nodes.
Closed - The node will automatically be closed during parsing.
You can view the source code for the class HtmlNode to see what other tags are considered 'specific'.
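As an alternative to reading the source, the flag table can be printed at runtime. This is a small sketch that assumes a HtmlAgilityPack release in which ElementsFlags is a Dictionary<string, HtmlElementFlag>; the class name is made up, and the entries you see depend on the installed version.

```csharp
using System;
using HtmlAgilityPack;

class ListElementFlags
{
    static void Main()
    {
        // Each entry maps a lowercase tag name to its combination of HtmlElementFlag values.
        foreach (var entry in HtmlNode.ElementsFlags)
        {
            Console.WriteLine($"{entry.Key}: {entry.Value}");
        }
    }
}
```

Running this before and after changing an entry is a quick way to confirm that the change took effect.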
How can I get several similar tags data with HtmlAgilityPack?
You need to anchor your second XPath to look 'below' the h4:
Dim date1 As HtmlNode = h4.ParentNode.SelectSingleNode(".//span[starts-with(@class, 'date ')]")
The leading .// tells XPath to search beneath the node the expression is executed on. Thus, by calling SelectSingleNode on h4.ParentNode, you get the date below the parent div tag of the h4.
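The same anchoring idea can be sketched in C#. The markup, class names, and the FindDate helper below are hypothetical and only mimic the structure described in the answer (an h4 and a date span sharing a parent div).

```csharp
using System;
using HtmlAgilityPack;

class RelativeXPathDemo
{
    // Hypothetical helper: finds the date span under the same parent as the given h4.
    // The leading ".//" keeps the search inside h4.ParentNode; a bare "//" would
    // start again from the document root and always hit the first span in the page.
    static HtmlNode FindDate(HtmlNode h4)
    {
        return h4.ParentNode.SelectSingleNode(".//span[starts-with(@class, 'date ')]");
    }

    static void Main()
    {
        // Assumed sample markup, for illustration only.
        var html =
            "<div><h4>First</h4><span class=\"date a\">2020-01-01</span></div>" +
            "<div><h4>Second</h4><span class=\"date b\">2021-06-15</span></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var headings = doc.DocumentNode.SelectNodes("//h4");
        if (headings == null) return;

        foreach (var h4 in headings)
        {
            var date = FindDate(h4);
            // date is null when the parent has no matching span.
            Console.WriteLine($"{h4.InnerText}: {date?.InnerText}");
        }
    }
}
```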