HtmlAgilityPack -- Does <form> close itself for some reason?

This is also reported in this work item, which contains a suggested workaround from DarthObiwan:

You can change this without recompiling. The ElementsFlags list is a
static property on the HtmlNode class. The entry for form can be removed with

    HtmlNode.ElementsFlags.Remove("form");

before loading the document.
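Because ElementsFlags is static, removing an entry changes parsing for every document loaded afterwards in the same process. A minimal sketch of one way to limit that is shown here: a hypothetical helper (LoadWithoutFlag is not part of the library) that removes the flag, loads the document, and then puts the flag back. It assumes ElementsFlags is a Dictionary<string, HtmlElementFlag>, as in recent releases, and it is not thread-safe.

```csharp
using System;
using HtmlAgilityPack;

public static class FlagScopedLoader
{
    // Hypothetical helper, not part of HtmlAgilityPack: parses html with the
    // special handling for tagName switched off, then restores the old flag.
    // Assumes HtmlNode.ElementsFlags is a Dictionary<string, HtmlElementFlag>.
    public static HtmlDocument LoadWithoutFlag(string html, string tagName)
    {
        bool hadFlag = HtmlNode.ElementsFlags.TryGetValue(tagName, out HtmlElementFlag saved);
        HtmlNode.ElementsFlags.Remove(tagName);
        try
        {
            var doc = new HtmlDocument();
            doc.LoadHtml(html);
            return doc;
        }
        finally
        {
            // Put the flag back only if it was there before the call.
            if (hadFlag) HtmlNode.ElementsFlags[tagName] = saved;
        }
    }

    public static void Main()
    {
        var doc = LoadWithoutFlag("<form id=\"f\"><input name=\"a\"></form>", "form");
        HtmlNode form = doc.DocumentNode.SelectSingleNode("//form");
        Console.WriteLine(form.ChildNodes.Count);
    }
}
```

Whether the flags are also consulted when the document is written back out is not confirmed here, so if OuterHtml looks wrong after the restore, the simpler route is to remove the flag once at startup and leave it removed.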

HTML Agility Pack stripping self-closing tags from input

In the end, it pains me to say, I fell back on processing the HTML with a regex to add the missing self-closing slash. I'd love a better solution, as this is hacky and not future-proof: a separate replacement has to be added for every tag that needs correcting.

sXHTML = Regex.Replace(sXHTML, "<input(.*?)>", "<input $1 />");
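One caveat with the pattern above: it also matches tags that are already self-closed, so running it over `<input name="a" />` would add a second slash, and it would match longer tag names that merely start with "input". The following is a slightly hardened sketch using only System.Text.RegularExpressions; the class and method names are made up for illustration, and it still assumes no attribute value contains a ">" character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InputTagFixer
{
    // Matches <input ...> and <input ... />, capturing the attributes without
    // any trailing whitespace or slash. \b stops tags such as <inputs> from
    // matching. Matching is case-insensitive; the tag name is emitted in
    // lower case.
    private static readonly Regex InputTag =
        new Regex(@"<input\b([^>]*?)\s*/?>", RegexOptions.IgnoreCase);

    public static string CloseInputTags(string html)
    {
        return InputTag.Replace(html, "<input$1 />");
    }

    public static void Main()
    {
        Console.WriteLine(CloseInputTags("<p><input name=\"FullName\" type=\"text\"></p>"));
        Console.WriteLine(CloseInputTags("<input name=\"btnSubmit\" />"));
    }
}
```

Because the trailing slash is optional in the pattern, the replacement is idempotent: applying it twice gives the same result as applying it once. It remains a regex over HTML, so the "hacky" label still applies.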

HtmlAgilityPack produces missing closing tags in OuterHtml

There are several options that you can set when you are loading the document.

OptionAutoCloseOnEnd

Defines if closing for non closed nodes must be done at the end or directly in the document. Setting this to true can actually change how browsers render the page.

HtmlDocument document = new HtmlDocument();
document.OptionAutoCloseOnEnd = true;
document.LoadHtml(content);

Related sources worth reading:

HtmlAgilityPack Drops Option End Tags

Image tag not closing with HTMLAgilityPack

Selecting Inner Text Using HtmlAgilityPack

HtmlAgilityPack by default leaves option tags empty (you can see the author's reason for this at HtmlAgilityPack -- Does <form> close itself for some reason?). To fix it, add this line before loading the document, since the flags are applied while the HTML is parsed:

HtmlNode.ElementsFlags.Remove("option");

Html Agility Pack xPath issue

This is because the FORM tag gets special treatment from the HTML Agility Pack. The reasons are described here: HtmlAgilityPack -- Does <form> close itself for some reason?

So, you basically need to remove that special treatment, like this (must happen before any load):

// instruct the library to treat FORM like any other tag
HtmlNode.ElementsFlags.Remove("form");

HtmlDocument l_missionsDoc = new HtmlDocument();
l_missionsDoc.Load(l_stream);

XPathNavigator l_navigator = l_missionsDoc.CreateNavigator();
XPathNodeIterator l_iterator = l_navigator.Select("//form[@id='formliste']/table");

if (l_iterator.Count <= 0) continue;

Get entire form element as string using Html Agility Pack

Seems you're looking for HtmlNode.OuterHtml:

//
// Summary:
// Gets or Sets the object and its content in HTML.
public virtual string OuterHtml { get; }

So you just have to select your form node and get its OuterHtml property:

HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;

UPDATE

It seems there's a very old bug with how HAP treats form tags. Or maybe it's a feature!

In any case, here's a workaround:

HtmlNode.ElementsFlags.Remove("form");

So this should work:

HtmlNode.ElementsFlags.Remove("form");
HtmlDocument doc = ... // load your HTML
HtmlNode formNode = doc.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
string entireElementAsString = formNode.OuterHtml;

Add form tag around a body tag using HtmlAgilityPack

The FORM element has a special treatment. See here on SO for more: HtmlAgilityPack -- Does <form> close itself for some reason?

So, you could do this:

var doc = new HtmlDocument();
HtmlNode.ElementsFlags.Remove("form"); // remove special handling for FORM
doc.LoadHtml(input);
var body = doc.DocumentNode.SelectSingleNode("//body");

if (doc.DocumentNode.SelectNodes("//form[@action]") == null)
{
    var form = doc.CreateElement("form");
    form.Attributes.Add("action", "/pages/event/10302");
    body.PrependChild(form);
}

but it will get you this:

<html>
<head>
<title></title>
</head>
<body>
<form action="/pages/event/10302"></form>
<p>Full name: <input name="FullName" type="text" value=""></p>
<p><input name="btnSubmit" type="submit" value="Submit"></p>
</body>
</html>

Which is logical: nothing was placed inside that new form. So, instead, you can do this:

var doc = new HtmlDocument();
doc.LoadHtml(input);
var body = doc.DocumentNode.SelectSingleNode("//body");

if (doc.DocumentNode.SelectNodes("//form[@action]") == null)
{
    // clone BODY (with all its children) under the new name "form"
    var form = body.CloneNode("form", true);
    form.Attributes.Add("action", "/pages/event/10302");
    body.ChildNodes.Clear();
    body.PrependChild(form);
}

which will get you this:

<html>
<head>
<title></title>
</head>
<body><form action="/pages/event/10302">
<p>Full name: <input name="FullName" type="text" value=""></p>
<p><input name="btnSubmit" type="submit" value="Submit"></p>
</form></body>
</html>

This is not the only way, but it works, and you don't necessarily have to remove the FORM special treatment.

Problem parsing children of a node with HtmlAgilityPack

Well, I've given up on HtmlAgilityPack for now. Seems like there is still more work to do in that library to get everything working. To solve this problem I've moved the code over to use the SGMLReader library from here: http://developer.mindtouch.com/SgmlReader

Using this library all my unit tests pass properly and the sample code works as expected.
