How to Convert HTML to JSON Using PHP

How to convert HTML to JSON using PHP?

If you are able to obtain a DOMDocument object representing your HTML, then you just need to traverse it recursively and construct the data structure that you want.

Converting your HTML document into a DOMDocument should be as simple as this:

function html_to_obj($html) {
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    return element_to_obj($dom->documentElement);
}

Then, a simple traversal of $dom->documentElement which gives the kind of structure you described could look like this:

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    // Copy every attribute onto the node's array.
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            // Text content is stored under the "html" key.
            $obj["html"] = $subElement->wholeText;
        }
        else {
            // Anything else is treated as a child element and converted recursively.
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

Test case

$html = <<<EOF
<!DOCTYPE html>
<html lang="en">
<head>
<title> This is a test </title>
</head>
<body>
<h1> Is this working? </h1>
<ul>
<li> Yes </li>
<li> No </li>
</ul>
</body>
</html>

EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output

{
"tag": "html",
"lang": "en",
"children": [
{
"tag": "head",
"children": [
{
"tag": "title",
"html": " This is a test "
}
]
},
{
"tag": "body",
"html": " \n ",
"children": [
{
"tag": "h1",
"html": " Is this working? "
},
{
"tag": "ul",
"children": [
{
"tag": "li",
"html": " Yes "
},
{
"tag": "li",
"html": " No "
}
],
"html": "\n "
}
]
}
]
}

Answer to updated question

The solution proposed above does not work with the <script> element, because its content is parsed not as a plain text node (XML_TEXT_NODE) but as a CDATA section node (XML_CDATA_SECTION_NODE). This is because the DOM extension in PHP is based on libxml2, which parses your HTML as HTML 4.0, and in HTML 4.0 the content of <script> is of type CDATA and not #PCDATA.

You have two solutions for this problem.

  1. The simple but not very robust solution would be to add the LIBXML_NOCDATA flag to DOMDocument::loadHTML. (I am not actually 100% sure whether this works for the HTML parser.)

  2. The more difficult but, in my opinion, better solution is to add an additional test when you check $subElement->nodeType before the recursion. The recursive function would become:

function element_to_obj($element) {
    $obj = array( "tag" => $element->tagName );
    foreach ($element->attributes as $attribute) {
        $obj[$attribute->name] = $attribute->value;
    }
    foreach ($element->childNodes as $subElement) {
        if ($subElement->nodeType == XML_TEXT_NODE) {
            $obj["html"] = $subElement->wholeText;
        }
        elseif ($subElement->nodeType == XML_CDATA_SECTION_NODE) {
            // <script> content arrives as a CDATA section, so read its data directly.
            $obj["html"] = $subElement->data;
        }
        else {
            $obj["children"][] = element_to_obj($subElement);
        }
    }
    return $obj;
}

If you hit another bug of this type, the first thing you should do is check what type of node $subElement is, because there exist many other possibilities that my short example function does not deal with.
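
One way to do that check is to list the class and nodeType of every child before deciding how to handle it. The helper below is only a debugging sketch: the name child_node_types() and the sample markup are assumptions made for illustration, not part of the original answer.

```php
<?php
// Hypothetical debugging helper: reports the class and nodeType of each child node.
function child_node_types(DOMNode $element): array {
    $types = [];
    foreach ($element->childNodes as $child) {
        $types[] = get_class($child) . ':' . $child->nodeType;
    }
    return $types;
}

$dom = new DOMDocument();
$dom->loadHTML('<div>text<!-- note --><b>bold</b></div>');
$div = $dom->getElementsByTagName('div')->item(0);

// Shows a text node, a comment node and an element node, in that order.
print_r(child_node_types($div));
```

A comment node like the one in this sample would fall into the else branch of element_to_obj(), which is the kind of case the function would need an extra test for.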

Additionally, you will notice that libxml2 has to fix mistakes in your HTML in order to be able to build a DOM for it. This is why <html> and <head> elements will appear even if you don't specify them. You can avoid this by using the LIBXML_HTML_NOIMPLIED flag.
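
As a rough illustration of the flag, the sketch below loads the same fragment twice and compares the root element. The fragment is made up for this example, and LIBXML_HTML_NODEFDTD is added as an assumption to also suppress the default doctype. Fragments with several top-level elements may not behave as expected with this flag, so treat this as a starting point only.

```php
<?php
// Hypothetical single-root fragment used only for this comparison.
$fragment = '<div id="box"><p>Hello</p></div>';

// Default behaviour: libxml2 wraps the fragment in <html><body>.
$dom = new DOMDocument();
$dom->loadHTML($fragment);
$withImplied = $dom->documentElement->tagName;

// With the flag: the fragment's own root element becomes the document element.
$dom = new DOMDocument();
$dom->loadHTML($fragment, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
$withoutImplied = $dom->documentElement->tagName;

echo $withImplied, "\n";    // html
echo $withoutImplied, "\n"; // div
```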

Test case with script

$html = <<<EOF
<script type="text/javascript">
alert('hi');
</script>
EOF;

header("Content-Type: text/plain");
echo json_encode(html_to_obj($html), JSON_PRETTY_PRINT);

Output

{
"tag": "html",
"children": [
{
"tag": "head",
"children": [
{
"tag": "script",
"type": "text\/javascript",
"html": "\n alert('hi');\n "
}
]
}
]
}

How to Convert HTML Table to JSON in PHP

I prefer to use XPath with DOMDocument because of the utility and ease of its syntax. By targeting only the <tr> elements inside the <tbody> tag, you can access all required data.

With the exception of the href value, the final "all-letters" substring in each <td> class value represents your desired key for the associated value. For this I am using preg_match() to extract the final "word" in the class attribute.

When the $key is name, the href attribute value must be stored with the hardcoded key user_link.

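Your sample date values require some preparation to yield the desired format. Because your input data varies, you may need to modify the regular expression so that strtotime() can properly handle each date expression. In the output below, for instance, the second row's date comes out as Jan. 01st '70 because strtotime() cannot parse the trailing '18.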
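
A possible way to adapt the date preparation is sketched here. The helper name normalize_date() is hypothetical, and the rule that a trailing two-digit year such as '18 means 2018 is an assumption about the data, not something the original answer states.

```php
<?php
// Hypothetical helper: cleans a date string before handing it to strtotime().
function normalize_date(string $raw): ?string {
    // Same cleanup as the answer's code: drop dots and a leading time such as "9am ".
    $clean = preg_replace('~\.|\d+[ap]m *~', '', $raw);
    // Assumption: a trailing two-digit year like '18 refers to 20xx.
    $clean = preg_replace("~'(\d{2})$~", '20$1', $clean);
    $timestamp = strtotime($clean);
    // Return null instead of silently formatting a failed parse as a 1970 date.
    return $timestamp === false ? null : date("M. dS 'y", $timestamp);
}

var_dump(normalize_date("Mar. 14th '18")); // the year suffix is now preserved
var_dump(normalize_date("not a date"));    // NULL
```

Returning null on failure makes unparseable input visible, whereas passing false to date() produces a date at the Unix epoch.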

Code:

$html = <<<HTML
<!DOCTYPE html>
<html>
<head>
<title></title>
</head>
<body>

<table class="table-list table table-responsive table-striped" border="1">
<thead>
<tr>
<th class="coll-1 name">name</th>
<th class="coll-2">height</th>
<th class="coll-3">weight</th>
<th class="coll-date">date</th>
<th class="coll-4"><span class="info">info</span></th>
<th class="coll-5">country</th>
</tr>
</thead>
<tbody>
<tr>
<td class="coll-1 name">
<a href="/username/Jhon Doe/" class="icon"><i class="flaticon-user"></i></a>
<a href="/username/Jhon Doe/">Jhon Doe</a>
</td>
<td class="coll-2 height">45</td>
<td class="coll-3 weight">50</td>
<td class="coll-date">9am May. 16th</td>
<td class="coll-4 size mob-info">abcd</td>
<td class="coll-5 country"><a href="/country/CA/">CA</a></td>
</tr>
<tr>
<td class="coll-1 name">
<a href="/username/Kasim Shk/" class="icon"><i class="flaticon-user"></i></a>
<a href="/username/Kasim Shk/">Kasim Shk</a>
</td>
<td class="coll-2 height">33</td>
<td class="coll-3 weight">54</td>
<td class="coll-date">Mar. 14th '18</td>
<td class="coll-4 size mob-info">ijkl</td>
<td class="coll-5 country"><a href="/country/UAE/">UAE</a></td>
</tr>
</tbody>
</table>

</body>
</html>
HTML;

$dom = new DOMDocument;
$dom->loadHTML($html);
$xpath = new DOMXPath($dom);
$result = []; // collect one associative array per table row
foreach ($xpath->query('//tbody/tr') as $tr) {
    $tmp = []; // reset the temporary array so previous entries are removed
    foreach ($xpath->query("td[@class]", $tr) as $td) {
        // The final all-letters word of the class attribute becomes the key.
        $key = preg_match('~[a-z]+$~', $td->getAttribute('class'), $out) ? $out[0] : 'no_class';
        if ($key === "name") {
            $tmp['user_link'] = $xpath->query("a[@class = 'icon']", $td)[0]->getAttribute('href');
        }
        $tmp[$key] = trim($td->textContent);
    }
    // Strip dots and am/pm times before letting strtotime() parse the date.
    $tmp['date'] = date("M. dS 'y", strtotime(preg_replace('~\.|\d+[ap]m *~', '', $tmp['date'])));
    $result[] = $tmp;
}
var_export($result);
echo "\n----\n";
echo json_encode($result);

Output: (as multidim array, then json encoded string)

array (
0 =>
array (
'user_link' => '/username/Jhon Doe/',
'name' => 'Jhon Doe',
'height' => '45',
'weight' => '50',
'date' => 'May. 16th \'18',
'info' => 'abcd',
'country' => 'CA',
),
1 =>
array (
'user_link' => '/username/Kasim Shk/',
'name' => 'Kasim Shk',
'height' => '33',
'weight' => '54',
'date' => 'Jan. 01st \'70',
'info' => 'ijkl',
'country' => 'UAE',
),
)
----
[{"user_link":"\/username\/Jhon Doe\/","name":"Jhon Doe","height":"45","weight":"50","date":"May. 16th '18","info":"abcd","country":"CA"},{"user_link":"\/username\/Kasim Shk\/","name":"Kasim Shk","height":"33","weight":"54","date":"Jan. 01st '70","info":"ijkl","country":"UAE"}]

convert html to json in PHP

I completely agree with Magnus in the comments that you should contact the API providers and ask them for a JSON endpoint.

But if that is not possible, you could do something like this:

<?php 
$theFile = file_get_contents('https://www.israelpost.co.il/zip_data.nsf/SearchZip?OpenAgent&Location=%25u05EA%25u05DC%20%25u05D0%25u05D1%25u05D9%25u05D1%20-%20%25u05D9%25u05E4%25u05D5&POB=&Street=%25u05D3%25u05D9%25u05D6%25u05E0%25u05D2%25u05D5%25u05E3&House=99&Entrance=');

libxml_use_internal_errors(true); //Prevents Warnings, remove if desired
$dom = new DOMDocument();
$dom->loadHTML($theFile);
$body = "";
foreach($dom->getElementsByTagName("body")->item(0)->childNodes as $child) {
$body .= $dom->saveHTML($child);
}
echo $body;

This will get the content of the body tag for you.

This example will output RES86439611 - whatever that means to you
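
To go one step further and emit JSON, the extracted body can be placed in an array and passed to json_encode(). This is only a sketch: the inline $theFile string stands in for the fetched page, and the "html" and "text" keys are names chosen for illustration.

```php
<?php
// Assumption: a stand-in for the HTML that file_get_contents() would return.
$theFile = '<html><body><p>Example result</p></body></html>';

libxml_use_internal_errors(true); // suppress parser warnings for imperfect markup
$dom = new DOMDocument();
$dom->loadHTML($theFile);
$bodyNode = $dom->getElementsByTagName('body')->item(0);

// Hypothetical payload shape: raw inner HTML plus the plain text content.
$payload = ['html' => '', 'text' => trim($bodyNode->textContent)];
foreach ($bodyNode->childNodes as $child) {
    $payload['html'] .= $dom->saveHTML($child);
}

echo json_encode($payload, JSON_UNESCAPED_SLASHES);
```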

How to convert HTML data to json, php, mysql?

You also have an error in the loop in your code.

Try this:

$arr = [];
if (count($query) > 0) {
    foreach ($query as $queryElement) {
        $el = $queryElement;
        // Remove the HTML tags, then collapse runs of whitespace into single spaces.
        $el['description'] = trim(preg_replace('/\s+/', ' ', strip_tags($el['description'])));
        $arr[] = $el;
    }
}

Convert HTML entities in Json back to characters

Here is the solution. I needed to

  1. convert &amp;amp; to &amp; to standardize encoding systems;
  2. convert all HTML entities to their corresponding characters.

Here is the final code. Many thanks to all for your comments and suggestions.

Full code and online test here: https://www.tehplayground.com/zythX4MUdF3ric4l

array_walk_recursive($data, function(&$item, $key) {
    if (is_string($item)) {
        $item = str_replace("&amp;", "&", $item); // 1. Replace &amp; by &
        $item = html_entity_decode($item); // 2. Convert HTML entities to their corresponding characters
    }
});
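
The two steps can be tried on a small array. The sample data below is invented for this sketch, and the explicit flags passed to html_entity_decode() are an assumption added so the result does not depend on the PHP version's defaults.

```php
<?php
// Hypothetical sample data containing double-encoded and single-encoded entities.
$data = [
    'title' => 'Fran&amp;ccedil;ais &amp;amp; English',
    'tags'  => ['caf&eacute;', 42],
];

array_walk_recursive($data, function (&$item) {
    if (is_string($item)) {
        // 1. Undo one level of double encoding (&amp;ccedil; becomes &ccedil;).
        $item = str_replace('&amp;', '&', $item);
        // 2. Convert the remaining entities to their characters.
        $item = html_entity_decode($item, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    }
});

echo json_encode($data, JSON_UNESCAPED_UNICODE);
// {"title":"Français & English","tags":["café",42]}
```

Non-string values such as the integer 42 pass through untouched because of the is_string() check.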

Convert html to json in php laravel

There's a 1 at the end of the output, so you are possibly echoing something extra that you shouldn't.

I suspect you expect curl to return the actual result, but you are not using the appropriate flag. I suspect this because you assign the return value to $json, but without the CURLOPT_RETURNTRANSFER flag, curl_exec() prints the response directly and returns true rather than the JSON string.

Here's what you can try:

$url ='https://graph.facebook.com/' . $connection->provider_id . '?fields=link&access_token=' . $connection->token;
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the response instead of printing it
$json = curl_exec($ch);

$jsonArray = json_decode($json, true);
$link = $jsonArray["link"];

More information on the curl flags in the manual


