Anyone Have a Diff Algorithm for Rendered HTML

Anyone have a diff algorithm for rendered HTML?

There's another nice trick you can use to significantly improve the look of a rendered HTML diff. Although this doesn't fully solve the initial problem, it will make a significant difference in the appearance of your rendered HTML diffs.

Side-by-side rendered HTML will make it very difficult for your diff to line up vertically. Vertical alignment is crucial for comparing side-by-side diffs. In order to improve the vertical alignment of a side-by-side diff, you can insert invisible HTML elements in each version of the diff at "checkpoints" where the diff should be vertically aligned. Then you can use a bit of client-side JavaScript to add vertical spacing around checkpoint until the sides line up vertically.

Explained in a little more detail:

If you want to use this technique, run your diff algorithm and insert a bunch of visibility:hidden <span>s or tiny <div>s wherever your side-by-side versions should match up, according to the diff. Then run JavaScript that finds each checkpoint (and its side-by-side neighbor) and adds vertical spacing to the checkpoint that is higher-up (shallower) on the page. Now your rendered HTML diff will be vertically aligned up to that checkpoint, and you can continue repairing vertical alignment down the rest of your side-by-side page.

Building an HTML Diff/Patch Algorithm

If you were going to start from scratch, a useful search term would be "tree diff".

There's a pretty awesome blog post here, although I just found it by googling "daisydiff python" so I bet you've already seen it. Besides all the interesting theoretical stuff, he mentions the existence of Logilab's xmldiff, an open-source XML differ written in Python. That might be a decent starting point — maybe less correct than trying to wrap or reimplement DaisyDiff, but probably easier to get up and running quickly.

There's also html-tree-diff on pypi, which I found via this Quora link: http://www.quora.com/Is-there-any-good-Python-implementation-of-a-tree-diff-algorithm

There's some theoretical stuff about tree diffing at efficient diff algorithm for trees and Levenshtein distance on cstheory.stackexchange.

BTW, just to clarify, you are talking about diffing two DOM trees, but not necessarily rendering the diff/merge back into any particular HTML, right? (EDIT: Right.) A lot of the similarly-worded questions on here are really asking "how can I color deleted lines red and added lines green" or "how can I make matching paragraphs line up visually", skipping right over the theoretical hard part of "how do I diff two DOM trees in the first place" and the practical hard part of "how do I parse possibly malformed HTML into a DOM tree even before that". :)

Anyone know of a good algorithm for rendering an HTML table to an image?

html table rendering is non-trivial due to the various ways that the sizes of the cells may be specified, tables nested within tables, etc.

if all you want is the image, a simple solution would be the .NET browser control (which is basically the COM component for IE) and a screen-capture function

if you want to get some source to manipulate, the Mozilla source should still be available

Do you know a webpage appearance comparator?

In modern webpages the appearance is controlled by various 'things': html code, css styles and images at least (also javascript in some pages). Simple text-based diff programs are not enough because their output can be irrelevant to the webpage appearance (i.e. cleaning up css can show many differences but the rendered webpage remains the same).

For simpler pages HTML Match mentioned above could do the job. If I have to compare the design of two "complex" pages (including layout, space, image and text changes) I would do a two-step approach:

  1. Run a diff tool on the html sources to highlight the textual content differences. Then I would modify one of the pages to show the same content as the other (in order to make the next step more accurate and 'focused' to show 'real' layout changes). Of course it works only with very similar html.
  2. Load the pages in the same web browser, get some screenshots from the rendered output at fixed positions and compare the images (i.e. with ImageMagick). It should show all visual differences in the rendered output.

It is not perfect but should work.

[UPDATE] HTML Match seems dead, see this answer for an alternative solution.

Generate pretty diff html in Python

There's diff_prettyHtml() in the diff-match-patch library from Google.

track and display changes in html

You may want to use diff for that.

If you can use PHP on your server there's a handy pear package to perform the task you require. Here's an example :

https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-6174867.html

There's actually a actually Javascript implementations outhere as well, not tested though:

http://ejohn.org/projects/javascript-diff-algorithm/



Related Topics



Leave a reply



Submit