How to Prevent Nokogiri from Adding <Doctype> Tags

How do I prevent Nokogiri from adding unnecessary HTML tags?

Nokogiri assumes you've already determined whether you're receiving appropriate content for parsing. It's up to you to check prior to passing it to Nokogiri.

Don't pass the response straight to the parser like this:

doc = Nokogiri::HTML(open(url))

You can look at the returned HTTP headers for the Content-Type, which should be "application/json" for a JSON response, or "text/html" for HTML. The OpenURI documentation has the following example:

require 'open-uri'

# On Ruby 3.0+, use URI.open; Kernel#open no longer handles URLs.
URI.open("http://www.ruby-lang.org/en") {|f|
  f.each_line {|line| p line}
  p f.base_uri         # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
  p f.content_type     # "text/html"
  p f.charset          # "iso-8859-1"
  p f.content_encoding # []
  p f.last_modified    # Thu Dec 05 02:45:02 UTC 2002
}

Or, you can look at the first non-whitespace character of the returned body, which will tell you whether it's HTML/XML or JSON. HTML and XML will start with <, and a JSON object or array will start with { or [.

Something like this would be a decent start:

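require 'open-uri'

# URI.open replaces the bare open(url) call, which no longer works on Ruby 3.0+.
content = URI.open('http://www.example.com').read

if content.lstrip[0] == '<'
  # it's XML/HTML so parse it with Nokogiri
else
  # it's JSON so parse it with the JSON parser
end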
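The two checks described above (header first, then first character) can be combined into one small helper. This is only a sketch: detect_format is a hypothetical name, not part of OpenURI or Nokogiri, and letting the header win over the body is an assumption you may want to reverse for servers that mislabel their responses.

```ruby
require 'json'

# Hypothetical helper: guess the payload format from the Content-Type
# header when one is given, otherwise from the first non-blank character.
def detect_format(body, content_type = nil)
  type = content_type.to_s.downcase
  return :json   if type.include?('json')
  return :markup if type.include?('html') || type.include?('xml')

  case body.lstrip[0]
  when '<'      then :markup   # HTML or XML
  when '{', '[' then :json     # JSON object or array
  else               :unknown  # empty body, bare scalar, plain text, ...
  end
end

# Usage: only hand the body to the JSON parser when it looks like JSON.
body = '{"status": "ok"}'
data = JSON.parse(body) if detect_format(body) == :json
```

With OpenURI, the second argument would come from f.content_type on the opened response.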

Nokogiri should not include DOCTYPE

To achieve this you could use document fragments and the Builder.with method, like this:

require 'nokogiri'
include Nokogiri

fragment = HTML.fragment('')

HTML::Builder.with(fragment) do |f|
  f.div('foo')
end

fragment.to_html
# => <div>foo</div>

Prevent Nokogiri from URL-encoding src attributes

I think you're expecting Nokogiri to do things it shouldn't.

<foo src="{{bar}}"></foo>

is not HTML, as <foo> is not a known HTML tag. On the other hand, it could be a valid XML tag.

Looking at what Nokogiri does with your fragment, here's what happens with it as HTML:

require 'nokogiri'
doc = Nokogiri::HTML.fragment('<a src="{{bar}}"></a>')
# => #(DocumentFragment:0x3fe6d6897ba8 {
# name = "#document-fragment",
# children = [
# #(Element:0x3fe6d6897900 {
# name = "a",
# attributes = [
# #(Attr:0x3fe6d68978d8 { name = "src", value = "{{bar}}" })]
# })]
# })
doc.to_s
# => "<a src=\"%7B%7Bbar%7D%7D\"></a>"

And what happens if it's treated correctly as XML:

doc = Nokogiri::XML.fragment('<a src="{{bar}}"></a>')
# => #(DocumentFragment:0x3fe6d68930d0 {
# name = "#document-fragment",
# children = [
# #(Element:0x3fe6d6892eb4 {
# name = "a",
# attributes = [
# #(Attr:0x3fe6d6892e8c { name = "src", value = "{{bar}}" })]
# })]
# })
doc.to_s
# => "<a src=\"{{bar}}\"/>"
doc.to_xml
# => "<a src=\"{{bar}}\"/>"
doc.to_html
# => "<a src=\"%7B%7Bbar%7D%7D\"></a>"

Nokogiri has a set of rules it uses when parsing HTML, but internally it turns the HTML DOM into an XML DOM, which is visible when you inspect the document after parsing it as HTML: the attribute value is still {{bar}}. The percent-encoding happens during output, when the document is serialized as HTML. You might be able to nudge Nokogiri using the parsing options and get it to output what you want.

If you feel this is improper behavior for Nokogiri, I'd highly recommend taking it up with the maintainers in a bug report. They occasionally drop by here to answer questions, but you'll get faster responses on the nokogiri-talk mailing list, or on the GitHub page.

If you have markup that isn't valid HTML, then Nokogiri will try to coerce it into some semblance of valid HTML. At that point you should be able to get reasonable XML, XHTML or HTML from it, where "reasonable" means it'll be semantically valid, just maybe not exactly what you hoped for.

Preventing Nokogiri from escaping characters in URLs

require 'nokogiri'

doc = Nokogiri("<a href='*|UNSUB|*'>unsubscribe</a>")

puts doc.to_html
#=> <a href="*%7CUNSUB%7C*">unsubscribe</a>

puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <a href="*|UNSUB|*">unsubscribe</a>

Alternatively:

puts doc.to_html.gsub('%7C','|')
#=> <a href="*|UNSUB|*">unsubscribe</a>
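A blanket gsub('%7C','|') also rewrites pipes that were legitimately encoded in other URLs. A narrower sketch restores only the *|NAME|* merge tags. The helper name restore_merge_tags is hypothetical, and the pattern assumes tag names are made of word characters only.

```ruby
# Hypothetical helper: undo the percent-encoding only inside *|NAME|* merge tags,
# leaving every other %7C in the document untouched.
def restore_merge_tags(html)
  html.gsub(/\*%7C(\w+)%7C\*/i) { "*|#{Regexp.last_match(1)}|*" }
end

encoded = '<a href="*%7CUNSUB%7C*">unsubscribe</a> <a href="/q?a=1%7C2">search</a>'
puts restore_merge_tags(encoded)
```

In the usage line, only the first link changes; the encoded pipe in the query string of the second link is preserved.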

Nokogiri substituting text between tags

By using Nokogiri::HTML(), you're asking Nokogiri to create a representation of a full HTML document based on what you pass in. If you pass in a fragment rather than a full HTML document, as here, Nokogiri wraps your input in its own template, and if the text has no outer tags, it adds a <p> element as well. You can see this by calling #to_s on the document:

Nokogiri::HTML('Hello world').to_s
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html><body><p>Hello world</p></body></html>"

You could try and faff around trying to find ways to get Nokogiri to use a better template construction, but really I'd say you'll get to what you want faster by one of two approaches:

  1. Consider whether using a document fragment (e.g., Nokogiri::HTML.fragment(body)) would give you what you want. This would probably require larger refactoring of your code, but what you end up with might be neater and more maintainable.
  2. You could get a quick win by wrapping your method's body input in your own HTML document template, so Nokogiri doesn't do this for you. For example:
     def html_parser(body, terms:)
       html = "<html><body>#{body}</body></html>"
       doc = Nokogiri::HTML(html)
       # etc.
     end

The latter option will fix your issue faster, but the code might not be as neat.

How can I change the text between two tags with Nokogiri?

Use Nokogiri::XML::Node#content=, which the documentation describes as follows:

Set the Node’s content to a Text node containing string. The string gets XML escaped, not interpreted as markup.

document.at_xpath('//svg/text[@id="text1"]').content = "Goodbye"

