REXML is wrapping long lines. How do I switch that off?
As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to fix it by overwriting the Formatters::Pretty class's write_text method so that it uses the configurable @width attribute instead of the hard-coded 80.
require "rubygems"
require "rexml/document"
include REXML
long_xml = "<root><tag>As Ryan Calhoun said in his previous answer, REXML uses 80 as its wrap line length. I'm pretty sure this is a bug (although I couldn't find a bug report just now). I was able to *fix* it by overwriting the Formatters::Pretty class's write_text method.</tag></root>"
xml = Document.new(long_xml)
# Fix bug in REXML::Formatters::Pretty
class MyPrecious < REXML::Formatters::Pretty
  def write_text( node, output )
    s = node.to_s()
    s.gsub!(/\s/,' ')
    s.squeeze!(" ")
    # The Pretty formatter code mistakenly used 80 instead of the @width variable
    # s = wrap(s, 80-@level)
    s = wrap(s, @width-@level)
    s = indent_text(s, @level, " ", true)
    output << (' '*@level + s)
  end
end
printer = MyPrecious.new(5)
printer.width = 1000
printer.compact = true
printer.write(xml, STDOUT)
How to specify output file encoding in Ruby?
Here's an example that outputs a file in the UTF-16LE encoding:
open("data.txt", "w:UTF-16LE")
Ruby looks at the encoding of the string you are writing and transcodes it to the file's encoding as necessary. Here's a very detailed blog post describing the mechanics with excellent examples (see the section called "The Default External and Internal Encodings").
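As a minimal sketch of that transcoding, the snippet below writes a UTF-8 string through a handle opened with "w:UTF-16LE" and then reads the raw bytes back. The helper name write_utf16le and the temporary file path are made up for illustration; only File.open with an encoding in the mode string comes from the answer above.

```ruby
require "tmpdir"

# Hypothetical helper (not from the original answer): write text to a file
# opened with an external encoding of UTF-16LE. Ruby transcodes the string
# from its own encoding (UTF-8 here) on the way out.
def write_utf16le(path, text)
  File.open(path, "w:UTF-16LE") { |f| f.write(text) }
end

Dir.mktmpdir do |dir|
  path = File.join(dir, "data.txt")
  write_utf16le(path, "hi")

  # Read the file back as raw bytes to see what actually landed on disk:
  # each ASCII character now occupies two bytes.
  p File.binread(path).bytes
end
```

Reading the file back with File.binread sidesteps any decoding on the way in, so the byte layout on disk is visible directly.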
Invalid characters before my XML in Ruby
To answer my own question, the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it. REXML also has trouble handling Unicode in Ruby 1.8.7.
The most attractive solution would be of course to upgrade to 1.9.3, but that's not practical for this project right now.
So what I ended up doing was to avoid StringIO and simply download to a file on disk, and then process the XML with Nokogiri instead of REXML.
Together, that solves all my problems.
Why can't REXML parse CDATA preceded by a line break?
Why
Having anything before the <![CDATA[]]> overrides whatever is in the <![CDATA[]]>. That can be anything from a letter, to a newline (like you've discovered), to a single space. This makes sense, because your example is getting the text of the element, and whitespace counts as text. In the examples where you are able to access <![CDATA[]]>, it is because text is nil.
Solution
If you look at the documentation for Element, you'll see that it has a function called cdatas() that is described as:
"Get an array of all CData children. IMMUTABLE."
So, in your example, if you do an inner loop on contentElement.cdatas() you would see the content of all your missing tags.
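A minimal sketch of that inner loop follows. The XML sample, the element name content, and the helper cdata_values are assumptions made up for illustration, since the original question's document is not shown here. Only Element#cdatas comes from the answer above.

```ruby
require "rexml/document"

# Hypothetical sample document: the first <content> element has a line break
# before its CDATA section, the second one does not.
xml = <<~XML
  <items>
    <content>
      <![CDATA[first payload]]>
    </content>
    <content><![CDATA[second payload]]></content>
  </items>
XML

doc = REXML::Document.new(xml)

# Hypothetical helper: collect the string value of every CDATA child.
def cdata_values(element)
  element.cdatas.map(&:value)
end

doc.root.elements.each("content") do |content_element|
  # text returns the first text node, which is only whitespace when a line
  # break precedes the CDATA section; cdatas still finds the CDATA children.
  puts "text:   #{content_element.text.inspect}"
  puts "cdatas: #{cdata_values(content_element).inspect}"
end
```

Because cdatas returns a frozen array, treat the result as read-only and copy it before modifying.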