How to Validate XHTML with Nokogiri

How do I validate XHTML with Nokogiri?

It's not just you. What you're doing is supposed to be the right way to do it, but I've never had any luck with it. As far as I can tell, there's a disconnect somewhere between Nokogiri and libxml that keeps it from loading SYSTEM DTDs or recognizing PUBLIC DTDs. It will work if you define the DTD within the XML file, but good luck doing that with the XHTML DTDs.

The best thing I can recommend is to use the schemas for XHTML instead:

require 'nokogiri'
require 'open-uri'

# URI.open replaces calling Kernel#open on a URL, which Ruby 3.0 removed
doc = Nokogiri::XML(URI.open('http://www.w3.org'))
xsd = Nokogiri::XML::Schema(URI.open('http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd'))

# this is a true/false validation
xsd.valid?(doc) # => true if the document conforms

# this gives a listing of errors
xsd.validate(doc) # => [] if there are none

How do I validate XHTML, ATOM, and CSS in Ruby?

Nokogiri ( http://github.com/tenderlove/nokogiri/tree/master ) is a great tool for parsing XML/XHTML/HTML/etc., and it looks like it can validate as well:

Nokogiri::XML.parse(string_or_io, nil, nil, Nokogiri::XML::ParseOptions::DTDVALID)

At the moment, I don't believe you'll find a pure Ruby project that will validate your CSS directives, but there are many that will let you use Ruby code to generate valid CSS.

Can I validate XHTML programmatically from a PHP script?

You can use W3C's validator API. There's a PHP library available through PEAR that uses this API.

You can also install the validator on your local server, though you might not have sufficient permissions to do so if you are using shared hosting.

Unclosed tags and Nokogiri

Give this a try:

require 'nokogiri'

# Nokogiri::HTML uses the lenient HTML parser, which repairs unclosed tags
@doc = Nokogiri::HTML(File.read('t.html'))
@doc.at_css('#qcbody').to_html

In IRB:

>> @doc.at_css('#qcbody').to_html
=> "<div id=\"qcbody\"> \r\n <form method=\"post\" name=\"form\" id=\"form\" action=\"#\">\r\n <input type=\"hidden\" name=\"Search Engine\" id=\"Search Engine\"><input type=\"hidden\" name=\"Keyword\" id=\"Keyword\"><input type=\"button\" onclick=\"javascript:validate()\" name=\"sendsubmit\" id=\"sendsubmit\" class=\"submit\">\n</form>\r\n <div class=\"clear\"></div>\r\n </div>"

The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to be well-formed, and some XML parsers will reject a file that doesn't meet the standard. Nokogiri allows us to set how picky it is. (And in the case of XML, you can look at the errors array after parsing to see if there is a problem.)

For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse it has options = XML::ParseOptions::DEFAULT_HTML defined, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.

How to get Meta Keywords using Nokogiri?

Here is a simple example:

require 'nokogiri'

doc = Nokogiri::HTML("<html><head><meta name=\"Keywords\" content=\"one, two, three\"></head><body></body></html>")

doc.xpath("//meta[@name='Keywords']/@content").each do |attr|
  puts attr.value
end

Nokogiri losing attributes

I can't duplicate a problem with Nokogiri stripping an id attribute from a <body> tag in valid HTML. Here are my Nokogiri/LibXML and Ruby particulars:

nokogiri: 1.5.9
ruby:
  version: 1.9.3
  platform: x86_64-darwin10.8.0
  description: ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
  engine: ruby
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7

Here's a simple test of Nokogiri:

doc = Nokogiri::HTML('<html><body id="foo">bar</body></html>')

puts doc.to_html

Returns:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body id="foo">bar</body></html>

When I parse 'http://www.femmeactuelle.fr/', Nokogiri's errors method returns a number of errors, including in the <head> and <body>. Nokogiri tries to fix the document when it is broken, which can result in tags being moved, or, as I suspect in this case, attributes getting lost.

Validating the document returns all sorts of errors, so I think the problem lies outside of Nokogiri. If you want to try to fix it before passing it to Nokogiri, you can send the file through HTMLTidy, and then see if Nokogiri can make better sense of it. Otherwise, spend some time digging through the HTML, figure out what's broken, and write some string manipulation code to patch it up.

You can't compare the source of a page that a browser renders with what a parser, like Nokogiri, outputs. They are very different pieces of code, with very different goals. A browser wants to make the page render something, and has all sorts of fall-backs for dealing with broken HTML. A parser doesn't, because its job is to accurately translate the HTML or XML into its true structure so we can dig through it.

Target text without tags using Nokogiri

Using Nokogiri and XPath you could do something like this:

require 'nokogiri'

def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    # Collect the text of every sibling up to the next <span>
    text = ''
    node = span.next_sibling
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end

extract_span_data(html_string)
# {
# "Address" => "123 Main Street\nSometown",
# "Telephone" => "212-555-555",
# "Hours" => "M-F: 8:00-21:00\n Sat-Sun: 8:00-21:00"
# }

Using a proper parser is easier and more robust than using regular expressions (which is a well-documented Bad Idea™).

How do I pretty-print HTML with Nokogiri?

By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.

There are several projects that understand HTML well enough to be able to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I've found this post titled "Pretty printing XHTML with Nokogiri and XSLT".

It comes down to this:

xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s

It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.


