Dom Parser That Allows HTML5-Style </ in <Script> Tag

PHP Simple HTML DOM Parser - Remove Script Data

After requesting support directly from the project on Sourceforge, it turns out everything is working as expected and that regex had a limit which needed to be increased.

The code which helped me troubleshoot the issue is below:

<?php
// Normal regex doesn't work for such a large script section. The regex parser hits the backtrack limit, so we need to increase it temporarily.
ini_set('pcre.backtrack_limit', '10485760'); // 10MB

// The total file size of the webpage is also much bigger than normally allowed. Use this definition to increase the upper boundary of the parser.
define('MAX_FILE_SIZE', 10485760); // 10MB

include_once 'simple_html_dom.php';
$html = file_get_html('https://techcrunch.com/');

// This will list all headers
foreach($html->find('h2') as $h2) echo $h2->plaintext . PHP_EOL;

// Use this code to remove script tags from the DOM (i.e. if you need to save the DOM for later use. In my tests the file size went down from 1.9MB to 30KB
foreach($html->find('script') as $script) $script->remove();

Creating a new DOM element from an HTML string using built-in DOM methods or Prototype

Note: most current browsers support HTML <template> elements, which provide a more reliable way of turning creating elements from strings. See Mark Amery's answer below for details.

For older browsers, and node/jsdom: (which doesn't yet support <template> elements at the time of writing), use the following method. It's the same thing the libraries use to do to get DOM elements from an HTML string (with some extra work for IE to work around bugs with its implementation of innerHTML):

function createElementFromHTML(htmlString) {
var div = document.createElement('div');
div.innerHTML = htmlString.trim();

// Change this to div.childNodes to support multiple top-level nodes.
return div.firstChild;
}

Note that unlike HTML templates this won't work for some elements that cannot legally be children of a <div>, such as <td>s.

If you're already using a library, I would recommend you stick to the library-approved method of creating elements from HTML strings:

  • Prototype has this feature built-into its update() method.
  • jQuery has it implemented in its jQuery(html) and jQuery.parseHTML methods.

Where should I put script tags in HTML markup?

Here's what happens when a browser loads a website with a <script> tag on it:

  1. Fetch the HTML page (e.g. index.html)
  2. Begin parsing the HTML
  3. The parser encounters a <script> tag referencing an external script file.
  4. The browser requests the script file. Meanwhile, the parser blocks and stops parsing the other HTML on your page.
  5. After some time the script is downloaded and subsequently executed.
  6. The parser continues parsing the rest of the HTML document.

Step #4 causes a bad user experience. Your website basically stops loading until you've downloaded all scripts. If there's one thing that users hate it's waiting for a website to load.

Why does this even happen?

Any script can insert its own HTML via document.write() or other DOM manipulations. This implies that the parser has to wait until the script has been downloaded and executed before it can safely parse the rest of the document. After all, the script could have inserted its own HTML in the document.

However, most JavaScript developers no longer manipulate the DOM while the document is loading. Instead, they wait until the document has been loaded before modifying it. For example:

<!-- index.html -->
<html>
<head>
<title>My Page</title>
<script src="my-script.js"></script>
</head>
<body>
<div id="user-greeting">Welcome back, user</div>
</body>
</html>

JavaScript:

// my-script.js
document.addEventListener("DOMContentLoaded", function() {
// this function runs when the DOM is ready, i.e. when the document has been parsed
document.getElementById("user-greeting").textContent = "Welcome back, Bart";
});

Because your browser does not know my-script.js isn't going to modify the document until it has been downloaded and executed, the parser stops parsing.

Antiquated recommendation

The old approach to solving this problem was to put <script> tags at the bottom of your <body>, because this ensures the parser isn't blocked until the very end.

This approach has its own problem: the browser cannot start downloading the scripts until the entire document is parsed. For larger websites with large scripts and stylesheets, being able to download the script as soon as possible is very important for performance. If your website doesn't load within 2 seconds, people will go to another website.

In an optimal solution, the browser would start downloading your scripts as soon as possible, while at the same time parsing the rest of your document.

The modern approach

Today, browsers support the async and defer attributes on scripts. These attributes tell the browser it's safe to continue parsing while the scripts are being downloaded.

async

<script src="path/to/script1.js" async></script>
<script src="path/to/script2.js" async></script>

Scripts with the async attribute are executed asynchronously. This means the script is executed as soon as it's downloaded, without blocking the browser in the meantime.
This implies that it's possible that script 2 is downloaded and executed before script 1.

According to http://caniuse.com/#feat=script-async, 97.78% of all browsers support this.

defer

<script src="path/to/script1.js" defer></script>
<script src="path/to/script2.js" defer></script>

Scripts with the defer attribute are executed in order (i.e. first script 1, then script 2). This also does not block the browser.

Unlike async scripts, defer scripts are only executed after the entire document has been loaded.

According to http://caniuse.com/#feat=script-defer, 97.79% of all browsers support this. 98.06% support it at least partially.

An important note on browser compatibility: in some circumstances, Internet Explorer 9 and earlier may execute deferred scripts out of order. If you need to support those browsers, please read this first!

(To learn more and see some really helpful visual representations of the differences between async, defer and normal scripts check the first two links at the references section of this answer)

Conclusion

The current state-of-the-art is to put scripts in the <head> tag and use the async or defer attributes. This allows your scripts to be downloaded ASAP without blocking your browser.

The good thing is that your website should still load correctly on the 2% of browsers that do not support these attributes while speeding up the other 98%.

References

  • async vs defer attributes
  • Efficiently load JavaScript with defer and async
  • Remove Render-Blocking JavaScript
  • Async, Defer, Modules: A Visual Cheatsheet

Parse an HTML string with JS

Create a dummy DOM element and add the string to it. Then, you can manipulate it like any DOM element.

var el = document.createElement( 'html' );
el.innerHTML = "<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>";

el.getElementsByTagName( 'a' ); // Live NodeList of your anchor elements

Edit: adding a jQuery answer to please the fans!

var el = $( '<div></div>' );
el.html("<html><head><title>titleTest</title></head><body><a href='test0'>test01</a><a href='test1'>test02</a><a href='test2'>test03</a></body></html>");

$('a', el) // All the anchor elements

Can scripts be inserted with innerHTML?

You have to use eval() to execute any script code that you've inserted as DOM text.

MooTools will do this for you automatically, and I'm sure jQuery would as well (depending on the version. jQuery version 1.6+ uses eval). This saves a lot of hassle of parsing out <script> tags and escaping your content, as well as a bunch of other "gotchas".

Generally if you're going to eval() it yourself, you want to create/send the script code without any HTML markup such as <script>, as these will not eval() properly.

Scraping script tag with certain keyword using Simple HTML Dom Parser

I ended up using the below code to search all script tags for my keyword:

$scripts = $html->find('script');
foreach($scripts as $s) {
if(strpos($s->innertext, 'PRODUCT_METADATA') !== false) {
$script = $s;
}
}

domDocument - Identifying position of a br /

You can't get the position of an element using getElementsByTagName. You should go through childNodes of each element and process text nodes and elements separately.

In the general case you'll need recursion, like this:

function processElement(DOMNode $element){
foreach($element->childNodes as $child){
if($child instanceOf DOMText){
echo $child->nodeValue,PHP_EOL;
}elseif($child instanceOf DOMElement){
switch($child->nodeName){
case 'br':
echo 'BREAK: ',PHP_EOL;
break;
case 'p':
echo 'PARAGRAPH: ',PHP_EOL;
processElement($child);
echo 'END OF PARAGRAPH;',PHP_EOL;
break;
// etc.
// other cases:
default:
processElement($child);
}
}
}
}

$D = new DOMDocument;
$D->loadHTML('<p>Multiple-line paragraph<br />that has a close tag</p>');
processElement($D);

This will output:

PARAGRAPH: 
Multiple-line paragraph
BREAK:
that has a close tag
END OF PARAGRAPH;


Related Topics



Leave a reply



Submit