REXML::Document.new: How to Give Encoding Parameters on This Line

REXML is wrapping long lines. How do I switch that off?

As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to fix it by overriding the Formatters::Pretty class's write_text method in a subclass so that it uses the configurable @width attribute instead of the hard-coded 80.

require "rubygems"
require "rexml/document"
include REXML

long_xml = "<root><tag>As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to *fix* it by overwriting the Formatters::Pretty class's write_text method.</tag></root>"

xml = Document.new(long_xml)

# Fix bug in REXML::Formatters::Pretty
class MyPrecious < REXML::Formatters::Pretty
  def write_text( node, output )
    s = node.to_s()
    s.gsub!(/\s/,' ')
    s.squeeze!(" ")

    # The Pretty formatter code mistakenly used 80 instead of the @width variable
    # s = wrap(s, 80-@level)
    s = wrap(s, @width-@level)

    s = indent_text(s, @level, " ", true)
    output << (' '*@level + s)
  end
end

printer = MyPrecious.new(5)
printer.width = 1000
printer.compact = true
printer.write(xml, STDOUT)

How to specify output file encoding in Ruby?

Here's an example that outputs a file in the UTF-16LE encoding:

open("data.txt", "w:UTF-16LE")

Ruby looks at the encoding of the string you are writing and transcodes it to the file's encoding as necessary. Here's a very detailed blog post describing the mechanics with excellent examples (see the section called "The Default External and Internal Encodings").
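As a minimal sketch of that transcoding, the snippet below writes a UTF-8 string through a UTF-16LE file handle and then inspects the raw bytes. The file name and the sample string are made up for illustration, and the file goes to a temp directory so nothing in the working directory is touched.

```ruby
require "tmpdir"

# Hypothetical output path, used only for this illustration.
path = File.join(Dir.tmpdir, "data_utf16le.txt")

# A UTF-8 string containing one non-ASCII character (e with acute accent).
text = "h\u00e9llo"

# The "w:UTF-16LE" mode sets the file's external encoding, so Ruby
# transcodes the UTF-8 string to UTF-16LE as it is written.
File.open(path, "w:UTF-16LE") { |f| f.write(text) }

# Read the bytes back untouched to see what actually landed on disk.
raw = File.binread(path)
puts raw.bytesize

# Tag the raw bytes as UTF-16LE and convert them back to UTF-8.
round_trip = raw.force_encoding("UTF-16LE").encode("UTF-8")
puts round_trip == text
```

Reading with File.binread sidesteps any encoding handling on the way in, so the byte count reflects exactly what the write produced.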

Invalid characters before my XML in Ruby

To answer my own question, the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it. REXML also has trouble handling Unicode in Ruby 1.8.7.

The most attractive solution would be of course to upgrade to 1.9.3, but that's not practical for this project right now.

So what I ended up doing was avoiding StringIO and simply downloading to a file on disk, and then processing the XML with nokogiri instead of REXML.

Together, that solves all my problems.

Why can't REXML parse CDATA preceded by a line break?

Why

Any text before the <![CDATA[]]> section takes precedence over whatever is inside it, whether that is a letter, a single space, or a newline (as you've discovered). This makes sense, because your example is getting the text of the element, and whitespace counts as text. In the examples where you are able to access the <![CDATA[]]> content, it is because there is no text in front of it.


Solution

If you look at the documentation for Element, you'll see that it has a function called cdatas() that:

Get an array of all CData children. IMMUTABLE.

So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
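A minimal sketch of the difference, assuming a made-up document in which a newline sits between the opening tag and the CDATA section (the element name content and the variable names are hypothetical, not taken from the original question):

```ruby
require "rexml/document"

# Hypothetical document: a newline sits between <content> and the CDATA section.
xml = "<root><content>\n<![CDATA[hello world]]></content></root>"
doc = REXML::Document.new(xml)
content_element = doc.root.elements["content"]

# text returns the first text node, which here is only the leading whitespace.
leading_text = content_element.text

# cdatas returns the CData children; value gives each section's raw string.
cdata_values = content_element.cdatas.map { |cdata| cdata.value }

puts leading_text.inspect
puts cdata_values.inspect
```

Iterating over cdatas this way picks up every CDATA section in the element, regardless of what text surrounds it.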


