How to remove non-printable/invisible characters in Ruby?
Try this:
>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"
How can I remove non-printable invisible characters from string?
First, let's figure out what the offending character is:
str = "Kanha"
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]
The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:
p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]
That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
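The decimal-to-hexadecimal step can also be applied to every character at once. This is only a sketch: the helper name codepoint_labels is made up for illustration, and the trailing character is written as the escape \u202C so that it stays visible in the source.

```ruby
# Hypothetical helper (not part of the original answer): label each
# character of a string with its U+XXXX codepoint.
def codepoint_labels(text)
  text.codepoints.map { |cp| format("U+%04X", cp) }
end

labels = codepoint_labels("Kanha\u202C")
p labels
# => ["U+004B", "U+0061", "U+006E", "U+0068", "U+0061", "U+202C"]
```

Anything outside the U+0020 to U+007E range in that list is a candidate for the invisible character.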
Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:
Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters
It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:
str = "Kanha\u202C" # the invisible character, written as an escape so it survives copy and paste
p str.length # => 6
p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5
See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
Alternative to Ruby String.dump that doesn't escape printable characters like double quotes (Ruby 1.8.7)
You can test for non-printable chars (or gsub them) with this regexp:
/[^[:print:]]/
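A minimal sketch of both uses, testing and replacing. The constant and method names are hypothetical, and it is written against a current Ruby; behaviour on 1.8.7 is not verified here.

```ruby
# Hypothetical names, chosen for illustration only.
NON_PRINTABLE = /[^[:print:]]/

# True when the string contains no non-printable characters.
def printable?(text)
  (text =~ NON_PRINTABLE).nil?
end

# Replace each non-printable character with a placeholder.
def mask_non_printable(text, placeholder = ".")
  text.gsub(NON_PRINTABLE, placeholder)
end

p printable?("say \"hi\"")            # => true (double quotes are printable)
p printable?("nul\x00byte")           # => false
p mask_non_printable("aaa\f\x00abcd") # => "aaa..abcd"
```

Unlike String#dump, this leaves printable characters such as double quotes untouched.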
How can I clean source code files of invisible characters?
You can't see the character because most text editors don't display it. U+FEFF is the so-called byte-order mark (BOM); read in the opposite byte order it shows up as FFFE. It sits at the start of a Unicode file to indicate the order in which multi-byte units are stored, and Windows tools in particular tend to write it.
To get rid of it, tell your editor to save the file either as ANSI/ISO-8859 or as Unicode without BOM. If your editor can't do so, you'll either have to switch editors (sadly) or use a tool such as a hex editor, which lets you see the file's actual bytes and cut the BOM off.
From a quick search, it seems that TextWrangler has a "UTF-8, no BOM" mode. Otherwise, if you're comfortable with the terminal, you can use Vim:
:set nobomb
and save the file. Presto!
The characters are always the very first in a text file. Editors with support for the BOM will not, as I mentioned, show it to you at all.
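If you would rather strip the BOM programmatically, the same idea can be sketched in Ruby. The helper name strip_bom is hypothetical, and the sketch assumes the text has already been read as UTF-8, where the BOM is the single character U+FEFF (bytes EF BB BF).

```ruby
# Hypothetical helper: remove a leading BOM (U+FEFF) from UTF-8 text.
# Only the very first character is touched; U+FEFF elsewhere is left alone.
def strip_bom(text)
  text.sub(/\A\uFEFF/, "")
end

with_bom = "\uFEFFputs 'hi'"
clean = strip_bom(with_bom)

p with_bom.bytes.first(3) # => [239, 187, 191]
p clean                   # => "puts 'hi'"
```

To clean a file, read it, pass the contents through the helper, and write the result back.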
Best practice in filtering non-printable characters
Apparently there are two sets of non-printable control characters in UTF-8, C0 (U+0000 to U+001F, plus U+007F) and C1 (U+0080 to U+009F), based on this resource:
http://www.utf8-chartable.de/
With that in mind, the array in the function would look like this:
array(
'/\x00/', '/\x01/', '/\x02/', '/\x03/', '/\x04/',
'/\x05/', '/\x06/', '/\x07/', '/\x08/', '/\x09/',
'/\x0A/', '/\x0B/', '/\x0C/', '/\x0D/', '/\x0E/', '/\x0F/', '/\x10/',
'/\x11/', '/\x12/', '/\x13/', '/\x14/', '/\x15/', '/\x16/', '/\x17/',
'/\x18/', '/\x19/', '/\x1A/', '/\x1B/', '/\x1C/', '/\x1D/', '/\x1E/',
'/\x1F/', '/\x7F/', '/\xC2\x80/', '/\xC2\x81/', '/\xC2\x82/',
'/\xC2\x83/', '/\xC2\x84/', '/\xC2\x85/', '/\xC2\x86/', '/\xC2\x87/',
'/\xC2\x88/', '/\xC2\x89/', '/\xC2\x8A/', '/\xC2\x8B/', '/\xC2\x8C/',
'/\xC2\x8D/', '/\xC2\x8E/', '/\xC2\x8F/', '/\xC2\x90/', '/\xC2\x91/',
'/\xC2\x92/', '/\xC2\x93/', '/\xC2\x94/', '/\xC2\x95/', '/\xC2\x96/',
'/\xC2\x97/', '/\xC2\x98/', '/\xC2\x99/', '/\xC2\x9A/', '/\xC2\x9B/',
'/\xC2\x9C/', '/\xC2\x9D/', '/\xC2\x9E/', '/\xC2\x9F/'
);
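For comparison, the same set of characters can be sketched in Ruby with a single pattern instead of one pattern per byte sequence. This is an illustration, not the original answer's code: it assumes the Unicode general category Cc (Control), which covers U+0000 to U+001F and U+007F to U+009F, and the helper name is hypothetical.

```ruby
# Hypothetical helper: strip every Unicode control character (category Cc).
CONTROL_CHARS = /\p{Cc}/

def strip_control_chars(text)
  text.gsub(CONTROL_CHARS, "")
end

sample = "a\u0000b\u001Fc\u007Fd\u0085e\u009Ff"
cleaned = strip_control_chars(sample)
p cleaned # => "abcdef"
```

Like the array above, this also removes tab, line feed and carriage return, so exclude those first if the text should keep its whitespace.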