Converting HTML to XML

I had success with the tidy command-line utility. On Linux, I installed it with apt-get install tidy. Then this command:

tidy -q -asxml --numeric-entities yes source.html >file.xml

produced an XML file that I was able to process with an XSLT processor. However, I needed to set up the XHTML 1 DTDs correctly.
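
To see why the --numeric-entities option matters, here is a minimal Python sketch (standard library only). The helper names to_numeric_entities and parses_as_xml are hypothetical, and the regex rewrite is only a rough stand-in for what the option does, not tidy's actual implementation. It shows that a named HTML entity is rejected by an XML parser that has no DTD loaded, while the numeric form is accepted.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities that XML itself predefines; these must be left alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_entities(markup):
    """Rewrite named HTML entities (e.g. &nbsp;) as numeric references (&#160;)."""
    def replace(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # leave XML built-ins and unknown names untouched
        return f"&#{name2codepoint[name]};"
    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


def parses_as_xml(markup):
    """Return True if the markup is accepted by the XML parser."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


fragment = "<p>FRIDAY&nbsp;&amp;&nbsp;SATURDAY</p>"
numeric = to_numeric_entities(fragment)

print(parses_as_xml(fragment))  # the named entity is undefined without a DTD
print(parses_as_xml(numeric))
print(numeric)
```

The alternative to numeric references is to make the XHTML DTDs available to the parser, since they declare the named entities.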

The project's homepage is html-tidy.org (the legacy one is HTML Tidy).

Parse HTML into Clean XML

Parsing the fake Excel/HTML input may run into two issues:

  1. The HTML is not well-formed.
  2. HTML entities like &nbsp; will break the XML parser.

Assuming your HTML example above takes care of the first issue, you can brute force the second issue by decoding the input like this:

[xml]$html = [System.Net.WebUtility]::HtmlDecode(@'
<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity </span>
</p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">FRIDAY   </span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c5">
<td valign="top" class="c6"><p class="c11"><br/></p>
</td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span><br/><span class="c2"> </span></p>
</td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p>
</td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p>
</td>
</tr>
</table>
'@);

Now it's just a matter of some simple XPath to select the nodes you want and build the desired XML you specified above (tested and working):

$xml = @'
<?xml version="1.0" encoding="utf-8" ?>
<Cash Activities>

'@;
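# Note: names containing spaces (such as "Cash Activity") are not valid XML
# element names. Rename them (e.g. CashActivity) if the output must be well-formed.
$rows = $html.DocumentElement.SelectNodes('//tr');
foreach ($row in $rows) {
    # Only the rows with class "c12" carry an activity/balance pair
    if ($row.GetAttribute('class') -eq 'c12') {
        $xml += "`t<Cash Activity>`n";
        $spans = $row.SelectNodes('.//descendant::span[@class]');
        if ($spans.Count -eq 2) {
            $xml += "`t`t<Activity>$($spans[0].InnerText.Trim())</Activity>`n";
            $xml += "`t`t<Balance>$($spans[1].InnerText.Trim())</Balance>`n";
        }
        $xml += "`t</Cash Activity>`n";
    }
}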

$xml += @'
</Cash Activities>
'@;
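
For comparison, the same decode-then-select approach can be sketched in Python with the standard library. This is only an illustration under some assumptions: the sample is a shortened version of the table above, the &nbsp; in it is assumed (the original entities are not visible here), and the output uses names without spaces (CashActivities, CashActivity) so that the result is well-formed. Note that html.unescape, like HtmlDecode, also decodes &amp; and &lt;, so this brute-force step can itself break the markup if the text contains those.

```python
import html
import xml.etree.ElementTree as ET

# Shortened sample; the &nbsp; is an assumed stand-in for the original entities.
SOURCE = """<table class="c41">
<tr class="c5">
<td valign="top" class="c6"><p class="c7"><span class="c8">Cash Activity&nbsp;</span></p></td>
<td valign="top" class="c9"><p class="c10"><br/><span class="c2">05-JAN-18</span></p></td>
</tr>
<tr class="c12">
<td valign="top" class="c13"><p class="c7"><span class="c14">Prior Day Available Balance</span></p></td>
<td valign="top" class="c15"><p class="c10"><span class="c16">6,472,679.45
</span></p></td>
</tr>
</table>"""


def extract_cash_activities(markup):
    """Decode HTML entities, parse as XML, and collect the c12 rows."""
    table = ET.fromstring(html.unescape(markup))
    root = ET.Element("CashActivities")
    for row in table.iterfind(".//tr[@class='c12']"):
        spans = row.findall(".//span[@class]")
        if len(spans) != 2:
            continue  # unlike the script above, skip rows without exactly two spans
        item = ET.SubElement(root, "CashActivity")
        ET.SubElement(item, "Activity").text = "".join(spans[0].itertext()).strip()
        ET.SubElement(item, "Balance").text = "".join(spans[1].itertext()).strip()
    return root


xml_text = ET.tostring(extract_cash_activities(SOURCE), encoding="unicode")
print(xml_text)
```

Building the output with an element tree instead of string concatenation means any special characters in the cell text are escaped automatically.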

Convert html to xml using java

Try JTidy.

JTidy can be used as a tool for cleaning up malformed and faulty HTML.

How can I convert HTML to XML (which conforms with XML schema or DTD)

Tidy can convert HTML to XHTML (the same structure of elements and attributes but meeting the rules for XML well-formedness), but it can't convert it to meet the requirements of some arbitrary DTD.

You'll need to write an explicit mapping between the two data formats for that. XSLT is a popular language for doing that.
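
As a rough illustration of what such an explicit mapping looks like (in Python rather than XSLT), the sketch below renames elements of an already well-formed XHTML fragment into a made-up target vocabulary. TAG_MAP and the target names (article, title, para, emphasis) are hypothetical, the input is assumed to carry no XML namespace, and attributes are dropped for brevity.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from XHTML element names to a target vocabulary.
TAG_MAP = {"body": "article", "h1": "title", "p": "para", "em": "emphasis"}


def map_element(src):
    """Recursively copy an element, renaming its tag according to TAG_MAP."""
    dst = ET.Element(TAG_MAP.get(src.tag, src.tag))  # unmapped tags keep their name
    dst.text = src.text
    dst.tail = src.tail
    for child in src:
        dst.append(map_element(child))
    return dst


xhtml = "<body><h1>Report</h1><p>Cash <em>activity</em> summary.</p></body>"
mapped = ET.tostring(map_element(ET.fromstring(xhtml)), encoding="unicode")
print(mapped)
```

A real target DTD or schema usually also requires restructuring (wrapping, reordering, dropping elements), which is where XSLT templates tend to be more convenient than a plain rename table.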

Convert html list to xml through jQuery

Here is a simple way to convert the HTML list to XML. There is no need to use val(); use .text() or .html() instead. A working example is below:

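$('#go').click(function () {
  var xml = '<List>';
  $('ul#list li').each(function () {
    // Read the text of the name and value spans inside each list item
    var name = $(this).children('.name-block').text();
    var value = $(this).children('.value-block').text();
    if (name && value) {
      xml += '<Item>\n';
      xml += '<Name>' + name + '</Name>\n';
      xml += '<Value>' + value + '</Value>\n';
      xml += '</Item>\n';
    }
  });
  // XML names are case-sensitive, so the closing tag must match <List> exactly
  xml += '</List>';
  // Use .text() so the markup is displayed literally instead of being parsed as HTML
  $('.modal-body').text(xml);
  $('#myModal').modal('show');
  console.log(xml);
});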
<!DOCTYPE html>
<html>
<head>
  <title></title>
  <script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
  <link href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.2/css/bootstrap.min.css" rel="stylesheet" />
  <script src="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.2/js/bootstrap.min.js"></script>
</head>
<body>
  <div class="form-group">
    <div class="col-sm-offset-6 col-sm-3">
      <button type="button" id="go" class="btn btn-primary">Open XML Modal Box</button>
    </div>
  </div>
  <!--Modal if input is empty-->
  <div class="modal fade" id="myModal">
    <div class="modal-dialog">
      <div class="modal-content">
        <div class="modal-header">
          <button type="button" class="close" data-dismiss="modal" aria-label="Close"><span aria-hidden="true">×</span></button>
          <h4 class="modal-title">Your XML value is printed below</h4>
        </div>
        <div class="modal-body"> </div>
        <div class="modal-footer">
          <button type="button" class="btn btn-default" data-dismiss="modal">Close</button>
        </div>
      </div> <!-- /.modal-content -->
    </div> <!-- /.modal-dialog -->
  </div>
  <ul id="list">
    <li>
      <span class="name name-block">Hello</span><span>=</span><span class="name value-block">World</span> <span class="btn delete">Delete</span>
    </li>
    <li>
      <span class="name name-block">Happy</span><span>=</span><span class="name value-block">Coding</span> <span class="btn delete">Delete</span>
    </li>
  </ul>
  <!-- /.modal --> <!--End Modal-->
</body>
</html>
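
One thing string concatenation does not handle is escaping: a name or value containing & or < would produce broken XML. As a hedged illustration of the same name/value-to-XML idea with escaping handled by the serializer, here is a small Python sketch. The function name pairs_to_xml and the sample pairs are made up for the example.

```python
import xml.etree.ElementTree as ET


def pairs_to_xml(pairs):
    """Build <List><Item><Name/><Value/></Item>...</List> from (name, value) pairs."""
    root = ET.Element("List")
    for name, value in pairs:
        if not (name and value):
            continue  # same rule as the script above: skip incomplete entries
        item = ET.SubElement(root, "Item")
        ET.SubElement(item, "Name").text = name
        ET.SubElement(item, "Value").text = value
    return ET.tostring(root, encoding="unicode")


list_xml = pairs_to_xml([("Hello", "World"), ("Tom & Jerry", "a < b"), ("", "skipped")])
print(list_xml)
```

In the browser, the equivalent would be to build the nodes with a DOM API and serialize them, rather than gluing strings together.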

HTML to XML conversion using XSLT 2.0

You can try this:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="body">
    <book>
      <xsl:for-each-group select="p" group-starting-with="p[@class='h1']">
        <sectionA>
          <title>
            <xsl:value-of select="node()"/>
          </title>
          <xsl:for-each-group select="current-group() except ." group-starting-with="p[@class='h2']">
            <xsl:choose>
              <xsl:when test="self::p[@class='h2']">
                <sectionB>
                  <title>
                    <xsl:value-of select="node()"/>
                  </title>
                  <xsl:for-each-group select="current-group() except ." group-starting-with="p[@class='h3']">
                    <xsl:choose>
                      <xsl:when test="self::p[@class='h3']">
                        <sectionC>
                          <title>
                            <xsl:value-of select="node()"/>
                          </title>
                          <xsl:apply-templates select="current-group() except ."/>
                        </sectionC>
                      </xsl:when>
                      <xsl:otherwise>
                        <xsl:apply-templates select="current-group()"/>
                      </xsl:otherwise>
                    </xsl:choose>
                  </xsl:for-each-group>
                </sectionB>
              </xsl:when>
              <xsl:otherwise>
                <xsl:apply-templates select="current-group()"/>
              </xsl:otherwise>
            </xsl:choose>
          </xsl:for-each-group>
        </sectionA>
      </xsl:for-each-group>
    </book>
  </xsl:template>

  <xsl:template match="p">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:apply-templates select="node()"/>
    </xsl:copy>
  </xsl:template>
</xsl:stylesheet>
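
If the nested grouping is hard to follow, the idea can be sketched in Python: a flat run of p elements is turned into nested sections, with each heading paragraph (class h1, h2, h3) opening a new section at its level. This is a stack-based approximation, not a translation of the stylesheet. For instance, it nests an h3 directly under an h1 when no h2 comes between them, whereas the stylesheet would copy that paragraph as a plain p. The function name nest_sections is made up for the example.

```python
import copy
import xml.etree.ElementTree as ET

# Heading class -> (nesting depth, output element), mirroring the stylesheet's names.
LEVELS = {"h1": (1, "sectionA"), "h2": (2, "sectionB"), "h3": (3, "sectionC")}


def nest_sections(body):
    """Group the flat <p> children of body into nested section elements."""
    book = ET.Element("book")
    stack = [(0, book)]  # (depth, element) for every section that is still open
    for p in body.findall("p"):
        level = LEVELS.get(p.get("class"))
        if level is None:
            # Ordinary paragraph: copy it into the innermost open section.
            plain = copy.deepcopy(p)
            plain.tail = None
            stack[-1][1].append(plain)
            continue
        depth, tag = level
        # Close every open section at the same or a deeper level.
        while stack[-1][0] >= depth:
            stack.pop()
        section = ET.SubElement(stack[-1][1], tag)
        ET.SubElement(section, "title").text = "".join(p.itertext())
        stack.append((depth, section))
    return book


flat = ET.fromstring(
    '<body><p class="h1">Intro</p><p>One</p><p class="h2">Detail</p><p>Two</p>'
    '<p class="h3">Fine print</p><p>Three</p><p class="h1">Next</p></body>'
)
nested = ET.tostring(nest_sections(flat), encoding="unicode")
print(nested)
```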

How to convert HTML to XML, Import parsed XML to google sheets

Issue:

"Looking for guidance on setting the DOCTYPE, I do not have access to change the body of the emails."

[Document: No DOCTYPE declaration, Root is [Element: ]]

If you want to declare a DocType for the newly created XML document, you can do so with the appropriate methods: setDocType() and createDocType().

Once you declare the DocType, though, you'll still have to work on your parsing, because all you will append with the current code is the same string, only now with a DocType declared ;)

Here's a working example:

function createXml() {
  // Get the active sheet
  var ss = SpreadsheetApp.getActiveSheet();
  // Create an XML document with root element "threads"
  var doc = XmlService.createDocument(XmlService.createElement('threads'));
  // Declare a DocType for the XML document (the name is up to you)
  doc.setDocType(XmlService.createDocType('threads'));
  // Get the root element
  var root = doc.getRootElement();
  // Get some threads from Gmail
  var threads = GmailApp.getStarredThreads();
  // For each thread
  for (var i = 0; i < threads.length; i++) {
    // Get messages
    var messages = threads[i].getMessages();
    // And for each message
    for (var j = 0; j < messages.length; j++) {
      // Get the plain-text body (no HTML markup)
      var msg = messages[j].getPlainBody();
      // Create a child element "thread"
      var child = XmlService.createElement('thread')
        // Set "messageCount" attr
        .setAttribute('messageCount', threads[i].getMessageCount())
        // Set "isUnread" attr
        .setAttribute('isUnread', threads[i].isUnread())
        // Set the element text
        .setText(threads[i].getFirstMessageSubject() + msg);
      // Add the child element to root
      root.addContent(child);
    }
  }
  // Once all threads are processed, get the prettified XML document
  // (doing this inside the loop would append earlier threads again on every pass)
  var xml = XmlService.getPrettyFormat().format(doc);
  // Log the prettified XML doc
  Logger.log(xml);
  // Create list of parsed children to append to sheet row
  var parsed = [];
  // Add the parsed text from children elements of root
  XmlService.parse(xml).getRootElement().getChildren().forEach((child) => { parsed.push(child.getText()); });
  // Append the row with parsed data
  ss.appendRow(parsed);
}
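
The same three steps (declare a DocType, add one child per message, parse the result back into a row of text values) can be tried outside Apps Script. The sketch below uses Python's xml.dom.minidom as a stand-in for XmlService. The function name build_threads_document and the sample data are invented for illustration, and no Gmail or Sheets access is involved.

```python
from xml.dom import minidom


def build_threads_document(threads):
    """Create a <threads> document with a DOCTYPE and one <thread> per entry."""
    impl = minidom.getDOMImplementation()
    # A bare DOCTYPE with no public or system identifier
    doctype = impl.createDocumentType("threads", None, None)
    doc = impl.createDocument(None, "threads", doctype)
    root = doc.documentElement
    for subject, body, unread in threads:
        child = doc.createElement("thread")
        child.setAttribute("isUnread", "true" if unread else "false")
        child.appendChild(doc.createTextNode(subject + body))
        root.appendChild(child)
    return doc


# Made-up sample data standing in for Gmail threads.
doc = build_threads_document([
    ("Invoice: ", "total due 42", True),
    ("Hello: ", "see you soon", False),
])
serialized = doc.toxml()

# Parse the serialized document again and collect the text of each child,
# which is what would be appended as a spreadsheet row.
row = [node.firstChild.data
       for node in minidom.parseString(serialized).getElementsByTagName("thread")]
print(row)
```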

