How to Read HTML as XML

How to read HTML as XML?

HTML simply isn’t the same as XML (unless the HTML actually happens to be conforming XHTML or HTML5 in XML mode). The best way is to use an HTML parser to read the HTML. Afterwards you can transform the result to LINQ to XML, or process it directly.
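As a rough illustration of "parse with an HTML parser first, then work with an XML tree", here is a minimal Python sketch built on the standard library's html.parser and xml.etree.ElementTree. The class name HtmlTreeBuilder, the function html_to_xml, and the "root" wrapper element are hypothetical names chosen for this example. html.parser is lenient but is not a conformant HTML5 parser, and this sketch does not implement implied end tags.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have end tags in HTML (void elements).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class HtmlTreeBuilder(HTMLParser):
    """Builds an ElementTree from HTML (hypothetical helper, not HTML5-conformant)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # assumed wrapper so there is one root
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as <input disabled> arrive as None.
        attrib = {name: (value if value is not None else "")
                  for name, value in attrs}
        elem = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(html_text):
    """Return a well-formed XML string for the given HTML text."""
    builder = HtmlTreeBuilder()
    builder.feed(html_text)
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

Calling html_to_xml('<p class=a>x<br>y&nbsp;z</p>') yields a string that an XML parser accepts, because the unquoted attribute, the void element, and the named entity have all been normalized by the HTML parsing step.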

Parse HTML into Clean XML

Parsing the fake Excel / HTML input may have some issues:

  1. HTML is not well-formed.
  2. HTML entities like &nbsp; will break the XML parser.

Assuming your HTML example above takes care of the first issue, you can brute force the second issue by decoding the input like this:

[xml]$html = [System.Net.WebUtility]::HtmlDecode(@'
<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY   </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>
'@);

Now it's just a matter of some simple XPath to select the nodes you want to get the desired XML you specified above (tested and working):

$xml = @'
<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>

'@;
$rows = $html.DocumentElement.SelectNodes('//tr');
foreach ($row in $rows) {
    if ($row.GetAttribute('class') -eq 'c12') {
        $xml += "`t<Cash Activity>`n";
        $spans = $row.SelectNodes('.//descendant::span[@class]');
        if ($spans.Count -eq 2) {
            $xml += "`t`t<Activity>$($spans[0].InnerText.Trim())</Activity>`n";
            $xml += "`t`t<Balance>$($spans[1].InnerText.Trim())</Balance>`n";
        }
        $xml += "`t</Cash Activity>`n";
    }
}

$xml += @'
</Cash Activities>
'@;
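For comparison, the same two steps (neutralize HTML-only entities, then select the rows of interest) can be sketched in Python with the standard library. This is only an illustration under stated assumptions: the helper names named_to_numeric and extract_activities are hypothetical, the SAMPLE markup is a shortened stand-in for the table above, and the placement of &nbsp; in it is assumed. Instead of decoding everything, the sketch rewrites HTML-only named entities as numeric references and leaves &amp; and &lt; alone, since decoding those would make the input ill-formed. It also uses CashActivities and CashActivity as element names, because XML names cannot contain spaces.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities XML itself predefines; these must stay escaped.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def named_to_numeric(text):
    """Replace HTML-only named entities (e.g. &nbsp;) with numeric references."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML entities and unknown names alone
        return "&#%d;" % name2codepoint[name]
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, text)


# Assumed, shortened sample; the &nbsp; positions are a guess for illustration.
SAMPLE = """<table class="c41">
<tr class="c5"><td><span class="c8">Cash Activity&nbsp;</span></td>
<td><span class="c2">FRIDAY&nbsp;</span></td></tr>
<tr class="c12"><td><p><span class="c14">Prior Day Available Balance</span></p></td>
<td><p><span class="c16">6,472,679.45
</span></p></td></tr>
</table>"""


def extract_activities(html_text):
    """Build a CashActivities element from rows whose class is 'c12'."""
    table = ET.fromstring(named_to_numeric(html_text))
    out = ET.Element("CashActivities")
    for row in table.iter("tr"):
        if row.get("class") != "c12":
            continue
        item = ET.SubElement(out, "CashActivity")
        spans = row.findall(".//span[@class]")
        if len(spans) == 2:
            ET.SubElement(item, "Activity").text = "".join(spans[0].itertext()).strip()
            ET.SubElement(item, "Balance").text = "".join(spans[1].itertext()).strip()
    return out
```

ET.tostring(extract_activities(SAMPLE), encoding="unicode") then gives the serialized result. Building the output with ElementTree rather than string concatenation means text such as "&" in a balance label is escaped automatically.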

C# - Parse HTML source as XML

I recommend HTML Agility Pack. You'll have to handle the GUI part yourself. It doesn't require valid HTML, but creates an HtmlDocument similar to XmlDocument.

Parsing an html document using an XML-parser

You can try parsing an HTML file using an XML parser, but it’s likely to fail. The reason is that HTML documents can have the following HTML features that XML parsers don’t understand.

  • elements that never have end tags and that don’t use XML’s so-called “self-closing tag syntax”; e.g., <br>, <meta>, <link>, and <img> (also known as void elements)
  • elements that don’t need end tags; e.g., <p> <dt> <li> (their end tags can be implied)
  • elements that can contain unescaped markup "<" characters; e.g., style, textarea, title, script; <script> if (a < b) … </script>, <title>Using the "<" operator</title>
  • attributes with unquoted values; for example, <meta charset=utf-8>
  • attributes that are empty, with no separate value given at all; e.g., <input disabled>

XML parsers will fail to parse any HTML document that uses any of those features.

HTML parsers, on the other hand, will basically never fail no matter what a document contains.
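The difference is easy to reproduce. The following Python sketch feeds one snippet that uses several of the features listed above to an XML parser and to the standard library's lenient HTML parser. The names HTML_SNIPPET, xml_parse_ok, TagCollector, and html_tags are hypothetical, and html.parser is used only because it ships with Python; it is not a conformant HTML5 parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Void elements, an empty attribute, and an unquoted attribute value.
HTML_SNIPPET = "<p>Line one<br>Line two<input disabled><meta charset=utf-8>"


def xml_parse_ok(text):
    """Return True if an XML parser accepts the text, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Records every start tag and its attributes."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, dict(attrs)))


def html_tags(text):
    """Return the (tag, attributes) pairs an HTML parser sees in the text."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags
```

Here xml_parse_ok(HTML_SNIPPET) is False, while html_tags(HTML_SNIPPET) returns all four start tags, with the empty disabled attribute reported as None.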


All that said, there’s also been work done toward developing a new type of XML parsing: so-called XML5 parsing, capable of handling things like empty/unquoted attributes even in XML documents. There is a draft XML5 specification, as well as an XML5 parser, xml5ever.



The intended use is to make an HTML parser, that is part of a web crawler application

If you’re going to create a web-crawler application, you should absolutely use an HTML parser—and ideally, an HTML parser that conforms to the parsing requirements in the HTML standard.

These days, there are such conformant HTML parsers for many (or even most) languages; e.g.:

  • parse5 (node.js/JavaScript)
  • html5lib (python)
  • html5ever (rust)
  • validator.nu html5 parser (java)
  • gumbo (c, with bindings for ruby, objective c, c++, php, c#, perl, lua, D, julia…)
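To give an idea of what the parsing step of a crawler looks like, here is a minimal Python sketch that collects absolute link URLs from a page. It uses the standard library's html.parser only to stay dependency-free; that parser is lenient but not conformant to the HTML standard, so for a real crawler one of the parsers listed above would be the better choice. LinkExtractor and extract_links are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> element, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links such as "about.html" against the page URL.
            self.links.append(urljoin(self.base_url, href))


def extract_links(html_text, base_url):
    """Return absolute URLs for all links found in html_text."""
    extractor = LinkExtractor(base_url)
    extractor.feed(html_text)
    extractor.close()
    return extractor.links
```

The extractor does not care that tags are left unclosed or that an attribute value is unquoted, which is exactly the tolerance an XML parser lacks.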


Read XML file to HTML

You can load and manipulate an XML file using jQuery and its Ajax functions.
I suggest reading this article; it's very useful.

https://blog.udemy.com/jquery-xml/

It is also possible to handle this with PHP if you are using that preprocessor.

http://www.w3schools.com/php/php_xml_simplexml_read.asp

Example with jQuery

$(document).ready(function () {
    $.ajax({
        type: "GET",
        url: "/myfile.xml",
        dataType: "xml",
        // what to do if the file is loaded
        success: function (xml) {
            var number = $(xml).find('contact > telefone_number').text();
            // etc.
        },
        // what to do if there are errors
        error: function () {
            alert("Cannot load XML file");
        }
    });
});
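The same lookup can also be done outside the browser. As a hedged sketch, the Python below parses an XML string and collects the text of every telefone_number element that is a direct child of a contact element. SAMPLE_XML and phone_numbers are hypothetical; the structure of the real /myfile.xml is not shown here, so the sample layout is an assumption.

```python
import xml.etree.ElementTree as ET

# Assumed layout of the XML file; only the element names come from the example above.
SAMPLE_XML = """<?xml version="1.0"?>
<contacts>
  <contact>
    <name>Example</name>
    <telefone_number>555-0100</telefone_number>
  </contact>
  <contact>
    <name>Other</name>
    <telefone_number>555-0101</telefone_number>
  </contact>
</contacts>"""


def phone_numbers(xml_text):
    """Return the text of each contact's telefone_number child element."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall(".//contact/telefone_number")]
```

Unlike jQuery's .text(), which concatenates the text of all matches into one string, this returns one list entry per match.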

Parsing HTML to XML

In general, you can use PHP's built-in DOM handling routines to parse HTML and output XML:

$html = <<< HEREDOC
<!DOCTYPE html>
<body>
<div>
<figure class="table ">
<figcaption>
<p class="table_number"></p>
<p class="table_title" epub:type="title"></p>
</figcaption>
<table class="code ">
<tr>
<td width="50">
<img alt="Sample Image" height="239" src="http://example.com/image.png" width="272">
</td>
</tr>
</table>
</figure>
</div>
</body>
HEREDOC;

$dom = new DOMDocument;
$dom->loadHTML($html);
echo $dom->saveXml($dom), PHP_EOL;

Unfortunately, your use of an XML prolog and attempt to extend the HTML 5 Doctype as if it were an XML/SGML Doctype prevents the DOM library from successfully parsing it.

Parse HTML to XML

I think the problem is in the HEAD of the HTML file.
From MSDN: the response should return XML ("text/xml"), but your http://www.website.com/file.asp returns HTML content with the "text/html" MIME type.
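A quick way to check this kind of mismatch is to look at the Content-Type header before handing the body to an XML parser. The Python sketch below does that with the standard library. The function names are hypothetical, and accepting "application/xml" and "+xml" types in addition to "text/xml" is an assumption of this example, not something the MSDN note above states.

```python
from email.message import Message


def media_type(content_type_header):
    """Return the lowercased type/subtype of a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type()


def is_xml_response(content_type_header):
    """True if the header names an XML media type (assumed set, see lead-in)."""
    mt = media_type(content_type_header)
    return mt in ("text/xml", "application/xml") or mt.endswith("+xml")
```

For the case described above, is_xml_response("text/html") is False, which explains why the client never gets an XML document to work with.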

Converting HTML to XML

I was successful using the tidy command-line utility. On Linux I installed it quickly with apt-get install tidy. Then the command:

tidy -q -asxml --numeric-entities yes source.html >file.xml

gave an XML file, which I was able to process with an XSLT processor. However, I needed to set up the XHTML 1 DTDs correctly.

This is their homepage: html-tidy.org (and the legacy one: HTML Tidy)

How can I read the HTML inside a XML tag?

InnerText gets only the text (for mixed content or text content). Use InnerXml instead.

Example:

<A>
Some text in mixed content
<B>OnlyText</B>
</A>

Gives the result:

  • InnerText = "Some text in mixed content\r\nOnlyText"
  • InnerXml = "Some text in mixed content\r\n<B>OnlyText</B>"
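The same distinction exists in other XML APIs. As an illustration only, here is how the two views of an element can be sketched with Python's xml.etree.ElementTree; inner_text and inner_xml are hypothetical helpers that approximate the .NET properties and are not part of any library.

```python
import xml.etree.ElementTree as ET


def inner_text(elem):
    """Concatenate all text inside elem, dropping the markup."""
    return "".join(elem.itertext())


def inner_xml(elem):
    """Serialize everything inside elem, keeping the child markup."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() includes the child's tail text, so nothing is lost.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)
```

For an element parsed from "<A>Some text in mixed content\n<B>OnlyText</B></A>", inner_text drops the B tags while inner_xml keeps them.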

