Delete Non-UTF-8 Characters from a String in Ruby

How do I remove non UTF-8 characters from a String?

We have a few problems.

The biggest is that a Ruby String stores arbitrary bytes along with a supposed encoding, with no guarantee that the bytes are valid in that encoding and with no obvious reason for that encoding to have been chosen. (I might be biased as a heavy user of Python 3. We would never speak of "changing a string from one encoding to another".)

Fortunately, the editor did not eat your post, but it's hard to see that. I'm guessing that you decoded the string as Windows-1252 in order to display it, which only obscures the issue.

Here's your string of bytes as I see it:

>> s = "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K".b
=> "\x06-~$A\xA7ruG\xF9\"\x9A\f\xB6/K"
>> s.bytes
=> [6, 45, 126, 36, 65, 167, 114, 117, 71, 249, 34, 154, 12, 182, 47, 75]

And it does contain bytes that are not valid UTF-8.

>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> String::new(s).force_encoding(Encoding::UTF_8).valid_encoding?
=> false

We can ask to decode this as UTF-8 and insert � where we encounter bytes that are not valid UTF-8:

>> s.encode('utf-8', 'binary', :undef => :replace)
=> "\u0006-~$A�ruG�\"�\f�/K"
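
If the goal is to delete the invalid bytes rather than mark them, String#scrub (Ruby 2.1+) is one option. The sketch below uses a made-up byte string, not the one from the question. It also shows why transcoding from 'binary' with an empty replacement is a blunter tool: it drops every byte above 0x7F, including bytes that form valid UTF-8 sequences.

```ruby
# Hypothetical input: "café" in valid UTF-8, followed by a stray 0xF9 byte.
raw = "caf\xC3\xA9 \xF9!".b

# Relabel a copy as UTF-8, then drop only the byte sequences that are invalid.
scrubbed = raw.dup.force_encoding(Encoding::UTF_8).scrub('')

# Transcoding from 'binary' treats every byte >= 0x80 as undefined,
# so the valid two-byte "é" is dropped as well.
ascii_only = raw.encode('UTF-8', 'binary', invalid: :replace, undef: :replace, replace: '')

p scrubbed                  # "café !"
p scrubbed.valid_encoding?  # true
p ascii_only                # "caf !"
```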

Replacing non standard characters in Ruby

If you want to remove non-ASCII chars, then

strings.map{| s | s.encode('ASCII', 'binary', invalid: :replace, undef: :replace, replace: '')}
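
As a quick check of what that one-liner does, here it is applied to a small made-up array (the sample strings are assumptions for illustration). Accented letters are dropped outright, not transliterated.

```ruby
# Made-up sample data; any array of strings would do.
strings = ["caf\u00E9", "na\u00EFve", "plain"]

ascii = strings.map do |s|
  # Treat the bytes as binary and drop everything that has no ASCII equivalent.
  s.encode('ASCII', 'binary', invalid: :replace, undef: :replace, replace: '')
end

p ascii                   # ["caf", "nave", "plain"]
p ascii.map(&:encoding)   # every element is US-ASCII
```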

Ruby: Remove invisible characters after converting string to UTF-8

Without seeing your code, it's hard to know exactly what's going on for you. I'll point out, however, that String#force_encoding doesn't transcode the String; it's a way of saying, "No, really, this is UTF-8", for example. To transcode from one encoding to another, use String#encode.
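
The difference between the two methods can be seen on a tiny example. This is only a sketch, and the input bytes are an assumption (the word "café" as a Windows-1252 source would send it):

```ruby
# Assumed input: the bytes of "café" in Windows-1252 (0xE9 is "é").
raw = "caf\xE9".b

# force_encoding only changes the label; the bytes stay the same.
relabeled = raw.dup.force_encoding('Windows-1252')
p relabeled.bytes == raw.bytes   # true

# encode rewrites the bytes so they mean the same text in the target encoding.
transcoded = relabeled.encode('UTF-8')
p transcoded.bytes               # [99, 97, 102, 195, 169]
p transcoded                     # "café"
```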

This seems to work for me:

require 'net/http'
s = Net::HTTP.get('www.eximsystems.com', '/LaVerdad/Antiguo/Gn/Genesis.htm')
s.force_encoding('windows-1252')
s.encode!('utf-8')

In general, /[[:space:]]/ should capture more kinds of whitespace than /\s/ (which is equivalent to /[ \t\r\n\f\v]/), but it doesn't appear to be necessary in this case. I can't find any abnormal whitespace in s at this point. If you're still having problems, you'll need to post your code and a more precise description of the issue.
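
To illustrate the difference between the two character classes, here is a minimal sketch using a made-up string that contains non-breaking spaces (U+00A0):

```ruby
nbsp_text = "15.\u00A0\u00A016"   # two U+00A0 NO-BREAK SPACE characters

p nbsp_text.match?(/\s/)               # false
p nbsp_text.match?(/[[:space:]]/)      # true
p nbsp_text.gsub(/[[:space:]]+/, ' ')  # "15. 16"
```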

Update: Thanks for updating your question with your code and an example of the problem. It looks like the issue is non-breaking spaces. I think it's simplest to get rid of them at the source:

require 'nokogiri'
require 'open-uri'

URL = 'http://www.eximsystems.com/LaVerdad/Antiguo/Gn/Genesis.htm'
s = URI.open(URL).read   # Separate these three lines to convert &nbsp;
s.gsub!('&nbsp;', ' ')   # to normal ' ' in source rather than after
html = Nokogiri.HTML(s)  # conversion to unicode non-breaking space

# Extract Paragraphs
text = ''
html.css('p').each do |p|
  text += p.text
end

# Clean Up Text
text.gsub!(/\s+/, ' ')

puts text

There's now just a single, normal space between the period at the end of 15 and the number 16:

15) Besó también José a todos sus hermanos, orando sobre cada uno de ellos; después de cuyas demostraciones cobraron aliento para conversar con él. 16 Al punto corrió la voz, y se divulgó generalmente esta noticia en el palacio del rey: Han venido los hermanos de José; y holgóse de ello Faraón y toda su corte.

How can I remove non-printable invisible characters from string?

First, let's figure out what the offending character is:

str = "Kanha‬"
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]

The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:

p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]

That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
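
If you'd rather not convert by hand, a short sketch like the following prints every character in the usual U+XXXX notation. The trailing character is written here as an explicit escape so that it is visible in the source.

```ruby
str = "Kanha\u202C"

# Format each character's codepoint as four (or more) uppercase hex digits.
labels = str.each_char.map { |c| format('U+%04X', c.ord) }
p labels
# ["U+004B", "U+0061", "U+006E", "U+0068", "U+0061", "U+202C"]
```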

Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:

Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters

It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:

str = "Kanha‬"
p str.length # => 6

p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5

See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
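
The same \p{Cf} property also catches other format characters. A small sketch with a made-up string:

```ruby
# Made-up sample with a soft hyphen (U+00AD), a zero width space (U+200B)
# and a zero width no-break space / byte order mark (U+FEFF).
noisy = "so\u00ADft zero\u200Bwidth\uFEFF"

clean = noisy.gsub(/\p{Cf}/, '')

p noisy.length  # 17
p clean         # "soft zerowidth"
p clean.length  # 14
```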

How to remove non-printable/invisible characters in ruby?

try this:

>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"

Remove weird invalid character in ruby

If it were byte sequences actually invalid for the encoding (UTF-8), then in ruby 2.1+, you could use the String#scrub method. It will by default replace invalid chars with the "unicode replacement character" (usually represented as a question mark in a box), but you can also use it to remove them entirely.
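
For reference, a minimal sketch of the three ways String#scrub can be called, using a made-up string with one invalid byte:

```ruby
broken = "abc\xFFdef"   # 0xFF can never appear in UTF-8

p broken.valid_encoding?         # false
p broken.scrub == "abc\uFFFDdef" # true: default replacement is U+FFFD
p broken.scrub('')               # "abcdef"

# Block form: receives each invalid byte sequence and returns its replacement.
p broken.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }   # "abc<ff>def"
```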

However, as you note, your 'weird byte' is actually valid UTF-8 representing the unicode codepoint "\u000F", the SHIFT IN control character. (Good job figuring out the actual bytes/character involved, that's the hard part!)

So we have to be clear about what we mean by "characters like that", if we want to remove them. Characters like what?

Nokogiri is complaining that it's invalid in an XML "PCDATA" (Parsed Character Data) area. Why would it be legal unicode/UTF-8, but invalid in XML PCDATA? What is legal in XML character data? I tried to figure it out, but it gets confusing, with the spec apparently saying that some characters are 'discouraged' (what?), and making what are to my eyes contradictory statements about other things.

I'm not sure exactly what characters Nokogiri will disallow from PCData, we'd have to look at the Nokogiri source (or more likely the libxml source), or try to ask a question of someone who knows more about nokogiri/libxml's source.

However, "\u000F" is a "control character", it's unlikely you want control characters in your XML character data (unless you know you do), and the XML spec seems to discourage control characters (and apparently Nokogiri/libxml actually disallows them?). So one way to interpret "characters like this" is "control characters".

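You can remove control characters from a string with a regex like this, for example. Note that this particular range also covers tab, newline and carriage return, and stops short of U+001B through U+001F:

"Some string \u000F more".gsub(/[\u0001-\u001A]/ , '') # remove control chars, unicode codepoints from 0001 to 001A
# => "Some string  more"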
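
As an alternative to spelling out a codepoint range, Ruby's [[:cntrl:]] bracket expression can be used. The sketch below (with a made-up input string) also shows one way to keep tab, newline and carriage return while removing the other control characters.

```ruby
text = "line1\n\u000Fline2\t\u001B"

# Every control character, including the newline and the tab.
p text.gsub(/[[:cntrl:]]/, '')              # "line1line2"

# Control characters except tab, newline and carriage return.
p text.gsub(/[[:cntrl:]&&[^\t\n\r]]/, '')   # "line1\nline2\t"
```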

We could instead interpret "characters like this" as any character that doesn't print. That is a wider category than "control characters", and it will include some that nokogiri has no problem with at all. We can try to remove a bit more than just control characters by using ruby's support for unicode character classes in regexes:

some_string.gsub(/[^[:print:]]/ , '')

[[:print:]] is documented rather vaguely as "excludes control characters, and similar", so that's kind of a match for our vague spec of what we want to do. :)
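
One thing to watch for: tab and newline are control characters too, so the pattern above strips them. A small sketch with a made-up string, including a variant that keeps them:

```ruby
text = "a\tb\n\u000Fc"

# Removes the tab and the newline along with the SHIFT IN character.
p text.gsub(/[^[:print:]]/, '')       # "abc"

# Keeps tab and newline by adding them to the negated class.
p text.gsub(/[^[:print:]\t\n]/, '')   # "a\tb\nc"
```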

So it really depends on what we mean by "characters like this". Really, "characters like this" for your case probably means "any char that Nokogiri/libxml will refuse to allow", and I'm afraid I haven't actually answered that question, because I'm not sure and was not able to easily figure it out. But for many cases, removing control chars, or even better removing chars that don't match [[:print:]] will probably do just fine, unless you have a reason to want control chars and similar to remain (if you knew you needed them as record separators, for instance).
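
One possible reading of "any char that the XML parser will refuse" is the Char production in the XML 1.0 specification, which allows only tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Whether Nokogiri/libxml enforces exactly this set is an assumption here, not something checked against their source, and the constant name below is made up for illustration.

```ruby
# Characters outside the XML 1.0 Char production
# (assumed, not verified, to be what the parser rejects).
XML10_INVALID = /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/

dirty = "Shift in: \u000F, tab:\t, emoji: \u{1F600}\n"
clean = dirty.gsub(XML10_INVALID, '')

# Only the SHIFT IN character is removed; tab, newline and the emoji stay.
p clean == "Shift in: , tab:\t, emoji: \u{1F600}\n"   # true
```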

If instead of removing, you wanted to replace them with the unicode replacement char, which is commonly used to stand in for "byte sequence we couldn't handle":

"Shift in: \u000F".gsub(/[^[:print:]]/, "\uFFFD")
# => "Shift in: �"

If instead of removing them you want to escape them in some way so they can be reconstructed after XML parsing.... ask again with that and I'll figure it out, but I haven't yet. :)

Welcome to dealing with character encoding issues, it sure does get confusing sometimes.


