How do I validate XHTML with nokogiri?
It's not just you. What you're doing is supposed to be the right way to do it, but I've never had any luck with it. As far as I can tell, there's some disconnect somewhere between Nokogiri and libxml which causes it to not load SYSTEM DTDs, or to recognize PUBLIC DTDs. It will work if you define the DTD within the XML file, but good luck doing that with the XHTML DTDs.
The best thing I can recommend is to use the schemas for XHTML instead:
require 'nokogiri'
require 'open-uri'
# URI.open comes from open-uri; on Ruby 3.0+ a bare open() no longer fetches URLs
doc = Nokogiri::XML(URI.open('http://www.w3.org'))
xsd = Nokogiri::XML::Schema(URI.open('http://www.w3.org/2002/08/xhtml/xhtml1-strict.xsd'))
#this is a true/false validation
xsd.valid?(doc) # => true
#this gives a listing of errors
xsd.validate(doc) # => []
How do I validate XHTML, ATOM, and CSS in Ruby?
Nokogiri ( http://github.com/tenderlove/nokogiri/tree/master ) is a great tool for parsing XML/XHTML/HTML/etc and it looks like it can validate as well:
Nokogiri::XML.parse(string_or_io, nil, nil, Nokogiri::XML::ParseOptions::DTDVALID)
At the moment, I don't believe that you'll find a pure ruby project that will validate your CSS directives, but there are many that will let you use ruby code to generate valid CSS.
Can I validate XHTML programmatically from a PHP script?
You can use W3C's validator API. There's a PHP library available through PEAR which uses said API.
You can also install the validator on your local server, though you might not have sufficient permissions to do so if you are using shared hosting.
Unclosed tags and Nokogiri
Give this a try:
require 'open-uri'
require 'nokogiri'
@doc = Nokogiri::HTML(File.open('t.html', 'r'))
@doc.at_css('#qcbody').to_html
In IRB:
>> @doc.at_css('#qcbody').to_html
=> "<div id="qcbody"> \r\n <form method="post" name="form" id="form" action="#">\r\n <input type="hidden" name="Search Engine" id="Search Engine"><input type="hidden" name="Keyword" id="Keyword"><input type="button" onclick="javascript:validate()" name="sendsubmit" id="sendsubmit" class="submit">\n</form>\r\n <div class="clear"></div>\r\n </div>"
The difference between using Nokogiri::XML and Nokogiri::HTML is the leniency when parsing the document. XML is required to validate and be correct. Some XML parsers would reject an XML file that doesn't meet the standard. Nokogiri allows us to set how picky it is. (And in the case of XML, you can look at the errors array after parsing to see if there is a problem.)
For HTML, Nokogiri relaxes the parser so there's a better chance of handling real-world HTML. I've seen it handle some really ugly markup and keep on going when lesser parsers blew their lunch. If you look at Nokogiri::HTML.parse it has options = XML::ParseOptions::DEFAULT_HTML defined, which are the relaxed settings. You can override that if you want to make sure the HTML conforms.
How to get Meta Keywords using Nokogiri?
Here is a simple example:
require 'rubygems'
require 'nokogiri'
doc = Nokogiri::HTML("<html><head><meta name=\"Keywords\" content=\"one, two, three\"></head><body></body></html>")
doc.xpath("//meta[@name='Keywords']/@content").each do |attr|
  puts attr.value
end
Nokogiri losing attributes
I can't duplicate a problem with Nokogiri stripping an id attribute from a <body> tag in valid HTML. Here are my Nokogiri/LibXML and Ruby particulars:
nokogiri: 1.5.9
ruby:
  version: 1.9.3
  platform: x86_64-darwin10.8.0
  description: ruby 1.9.3p392 (2013-02-22 revision 39386) [x86_64-darwin10.8.0]
  engine: ruby
libxml:
  binding: extension
  compiled: 2.7.7
  loaded: 2.7.7
Here's a simple test of Nokogiri:
doc = Nokogiri::HTML('<html><body id="foo">bar</body></html>')
puts doc.to_html
Returns:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body id="foo">bar</body></html>
When I parse 'http://www.femmeactuelle.fr/', Nokogiri's errors method returns a number of errors, including some in the <head> and <body>. Nokogiri tries to fix the document when it is broken, which can result in tags being moved or, as I suspect in this case, attributes getting lost.
Validating the document returns all sorts of errors, so I think the problem lies outside of Nokogiri. If you want to try to fix it before passing it to Nokogiri, you can send the file through HTMLTidy, and then see if Nokogiri can make better sense of it. Otherwise, spend some time digging through the HTML, figure out what's broken, and write some string manipulation code to patch it up.
You can't compare the source of a page that a browser renders with what a parser, like Nokogiri, outputs. They are very different pieces of code, with very different goals. A browser wants to make the page render something, and has all sorts of fall-backs for dealing with broken HTML. A parser doesn't, because its job is to accurately translate the HTML or XML into its true structure so we can dig through it.
Target text without tags using Nokogiri
Using Nokogiri and XPath you could do something like this:
def extract_span_data(html)
  doc = Nokogiri::HTML(html)
  doc.xpath("//span").reduce({}) do |memo, span|
    text = ''
    node = span.next_sibling
    while node && (node.name != 'span')
      text += node.text
      node = node.next_sibling
    end
    memo[span.text] = text.strip
    memo
  end
end
extract_span_data(html_string)
# {
# "Address" => "123 Main Street\nSometown",
# "Telephone" => "212-555-555",
# "Hours" => "M-F: 8:00-21:00\n Sat-Sun: 8:00-21:00"
# }
Using a proper parser is easier and more robust than using regular expressions (which is a well-documented Bad Idea™).
How do I pretty-print HTML with Nokogiri?
By "pretty printing" an HTML page, I presume you mean reformatting the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
There are several projects that understand HTML well enough to be able to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by Googling I've found this post titled "Pretty printing XHTML with Nokogiri and XSLT".
It comes down to this:
xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s
It requires you, of course, to download the linked XSL file to your filesystem. I've tried it very quickly on my machine and it works like a charm.