How do I prevent Nokogiri from adding unnecessary HTML tags?
Nokogiri assumes you've already determined whether you're receiving appropriate content for parsing. It's up to you to check prior to passing it to Nokogiri.
Don't blindly pass whatever you fetch straight to Nokogiri, as in:
doc = Nokogiri::HTML(open(url))
You can look at the returned HTTP headers for the "Content-Type", which should be "application/json" for a JSON response, or "text/html" for HTML. The OpenURI documentation has the following example:
URI.open("http://www.ruby-lang.org/en") {|f|
f.each_line {|line| p line}
p f.base_uri # <URI::HTTP:0x40e6ef2 URL:http://www.ruby-lang.org/en/>
p f.content_type # "text/html"
p f.charset # "iso-8859-1"
p f.content_encoding # []
p f.last_modified # Thu Dec 05 02:45:02 UTC 2002
}
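As a rough sketch of acting on that header, the helper below maps a media type string to a parser choice. The name `parser_for` and the returned symbols are hypothetical, not part of OpenURI or Nokogiri; it assumes you pass in a value such as `f.content_type`.

```ruby
# Hypothetical helper: pick a parser based on a Content-Type value.
# Header parameters such as "; charset=utf-8" are ignored.
def parser_for(content_type)
  media_type = content_type.to_s.split(';').first.to_s.strip.downcase
  case media_type
  when 'text/html', 'application/xhtml+xml' then :html
  when 'application/xml', 'text/xml'        then :xml
  when 'application/json'                   then :json
  else :unknown
  end
end

parser_for('text/html')                       # => :html
parser_for('Application/JSON; charset=utf-8') # => :json
parser_for(nil)                               # => :unknown
```

Returning a symbol keeps the decision separate from the parsing itself, so anything unexpected can be rejected before Nokogiri ever sees it.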
Or, you can look at the first character of the returned body, which will tell you whether it's HTML/XML or JSON. The first two will start with < and JSON will start with either [ or {.
Something like this would be a decent start:
require 'open-uri'
# Kernel#open no longer fetches URLs on Ruby 3.0+, so call URI.open
content = URI.open('http://www.example.com').read
if content.lstrip[0] == '<'
# it's XML/HTML so parse it with Nokogiri
else
# it's JSON so parse it with the JSON parser
end
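The same check can be wrapped in a small helper so it can be exercised without a network call. This is only a sketch: `sniff_format` and its return symbols are hypothetical names, and the guess is a heuristic, not validation.

```ruby
require 'json'

# Hypothetical helper: guess the format from the first non-whitespace character.
def sniff_format(body)
  case body.lstrip[0]
  when '<'      then :markup # HTML or XML
  when '{', '[' then :json
  else :unknown              # empty body or something else entirely
  end
end

body = '  {"status": "ok"}'
if sniff_format(body) == :json
  data = JSON.parse(body)
  data['status'] # => "ok"
end
```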
Nokogiri should not include DOCTYPE
To achieve this you could use document fragments and the Builder.with method, like this:
require 'nokogiri'
include Nokogiri
fragment = HTML.fragment('')
HTML::Builder.with(fragment) do |f|
f.div('foo')
end
fragment.to_html
# => <div>foo</div>
Prevent Nokogiri from URL-encoding src attributes
I think you're expecting Nokogiri to do things it shouldn't.
<foo src="{{bar}}"></foo> is not HTML, as <foo> is not a known HTML tag. On the other hand, it could be a valid XML tag.
Looking at what Nokogiri does with your fragment, here's what happens with it as HTML:
require 'nokogiri'
doc = Nokogiri::HTML.fragment('<a src="{{bar}}"></a>')
# => #(DocumentFragment:0x3fe6d6897ba8 {
# name = "#document-fragment",
# children = [
# #(Element:0x3fe6d6897900 {
# name = "a",
# attributes = [
# #(Attr:0x3fe6d68978d8 { name = "src", value = "{{bar}}" })]
# })]
# })
doc.to_s
# => "<a src=\"%7B%7Bbar%7D%7D\"></a>"
And what happens if it's treated correctly as XML:
doc = Nokogiri::XML.fragment('<a src="{{bar}}"></a>')
# => #(DocumentFragment:0x3fe6d68930d0 {
# name = "#document-fragment",
# children = [
# #(Element:0x3fe6d6892eb4 {
# name = "a",
# attributes = [
# #(Attr:0x3fe6d6892e8c { name = "src", value = "{{bar}}" })]
# })]
# })
doc.to_s
# => "<a src=\"{{bar}}\"/>"
doc.to_xml
# => "<a src=\"{{bar}}\"/>"
doc.to_html
# => "<a src=\"%7B%7Bbar%7D%7D\"></a>"
Nokogiri has a set of rules it uses when parsing HTML, but it basically turns an HTML DOM into an XML DOM internally, which is visible when you look at the inspection of the document after parsing it as HTML. It's during the output of the document that the conversion happens. You might be able to nudge Nokogiri using the parsing options and get it to output what you want.
If you feel this is improper behavior for Nokogiri, I'd highly recommend taking it up with the maintainers in a bug report. They occasionally drop by here to answer questions, but you'll get faster responses on the nokogiri-talk mailing list, or on the GitHub page.
If you have markup that isn't valid HTML then Nokogiri will try to coerce it into some semblance of valid HTML. At that point you should be able to get reasonable XML, XHTML or HTML from it, where "reasonable" means it'll be semantically valid, just maybe not exactly what you hoped for.
Preventing Nokogiri from escaping characters in URLs
require 'nokogiri'
doc = Nokogiri("<a href='*|UNSUB|*'>unsubscribe</a>")
puts doc.to_html
#=> <a href="*%7CUNSUB%7C*">unsubscribe</a>
puts doc.to_xml
#=> <?xml version="1.0"?>
#=> <a href="*|UNSUB|*">unsubscribe</a>
Alternatively:
puts doc.to_html.gsub('%7C','|')
#=> <a href="*|UNSUB|*">unsubscribe</a>
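If several characters need restoring, the `gsub` approach can be generalized with a lookup table. This is only a sketch: `RESTORE` and `restore_placeholders` are hypothetical names, and it rewrites every occurrence in the string, not just those inside attributes, so it assumes these escapes do not appear legitimately elsewhere in your output.

```ruby
# Percent-escapes to turn back into literal characters (extend as needed).
RESTORE = { '%7C' => '|', '%7B' => '{', '%7D' => '}' }.freeze

# Hypothetical helper: undo the escaping applied when serializing as HTML.
def restore_placeholders(html)
  html.gsub(/%7[BCD]/i) { |escape| RESTORE.fetch(escape.upcase) }
end

restore_placeholders('<a href="*%7CUNSUB%7C*">unsubscribe</a>')
# => "<a href=\"*|UNSUB|*\">unsubscribe</a>"
restore_placeholders('<a src="%7B%7Bbar%7D%7D"></a>')
# => "<a src=\"{{bar}}\"></a>"
```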
Nokogiri substituting text between tags
By using Nokogiri::HTML(), you're asking Nokogiri to create a representation of an HTML document based on what you're passing in. If you're not passing in a full HTML document but, as here, a fragment, Nokogiri wraps your input text into its own template, and if you don't have any outer tags, it will add the <p> element. You can see this by calling #to_s on the document:
Nokogiri::HTML('Hello world').to_s
# => "<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.0 Transitional//EN\" \"http://www.w3.org/TR/REC-html40/loose.dtd\">\n" +
#    "<html><body><p>Hello world</p></body></html>"
You could try and faff around trying to find ways to get Nokogiri to use a better template construction, but really I'd say you'll get to what you want faster by one of two approaches:
- Consider whether using a document fragment (e.g., Nokogiri::HTML.fragment(body)) would give you what you want. This would probably require larger refactoring of your code, but what you end up with might be neater and more maintainable.
- You could get a quick win by wrapping your method's body input in your own HTML document template, so Nokogiri doesn't do this for you. For example:
def html_parser(body, terms:)
  html = "<html><body>#{body}</body></html>"
  doc = Nokogiri::HTML(html)
  # etc.
end
The latter option will fix your issue faster, but the code might not be as neat.
How can I change the text between two tags with Nokogiri?
Do as below using Nokogiri::XML::Node#content=:
Set the Node’s content to a Text node containing string. The string gets XML escaped, not interpreted as markup.
document.at_xpath('//svg/text[@id="text1"]').content = "Goodbye"