How to Remove All Non-ASCII Characters from a String in Ruby

Replacing non-standard characters in Ruby

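If you want to remove non-ASCII characters from each string in an array, re-encode to ASCII and replace anything invalid or undefined with an empty string: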

strings.map{| s | s.encode('ASCII', 'binary', invalid: :replace, undef: :replace, replace: '')}
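As a minimal sketch of how that one-liner behaves, the snippet below wraps the same encode call in a helper and applies it to a small sample array. The helper name strip_non_ascii and the sample strings are assumptions made for illustration, not part of the original answer.

```ruby
# Hypothetical helper: wraps the encode call from the answer above.
def strip_non_ascii(str)
  # Reading the source as 'binary' makes every byte >= 0x80 undefined
  # in ASCII, so each such byte is replaced with the empty string.
  str.encode('ASCII', 'binary', invalid: :replace, undef: :replace, replace: '')
end

# Sample input (assumed): two strings with accented letters, one plain.
strings = ["caf\u00e9", "na\u00efve", "plain"]
cleaned = strings.map { |s| strip_non_ascii(s) }

p cleaned
```

Note that the returned strings are tagged as US-ASCII rather than UTF-8, which may matter if they are later concatenated with non-ASCII text.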

Regex to remove ASCII characters from a string

Well, just do the opposite of answers you found!

"aé".gsub(/\p{ASCII}/, '') #=> "é"
"aé".delete("\u{0000}-\u{007F}") #=> "é"

Note that the question you linked uses \P, which is the negation of \p, with String#gsub. It also uses ^a-z, which tells String#delete to delete everything except the characters from a to z.
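To make the \p versus \P and a-z versus ^a-z distinction concrete, here is a small side-by-side sketch. The sample string and variable names are assumptions chosen for illustration.

```ruby
# Sample input (assumed): "résumé 42" written with escapes.
s = "r\u00e9sum\u00e9 42"

# \p{ASCII} matches ASCII characters, so removing them keeps the rest.
only_non_ascii = s.gsub(/\p{ASCII}/, '')

# \P{ASCII} is the negation: it matches everything that is NOT ASCII.
only_ascii = s.gsub(/\P{ASCII}/, '')

# A leading ^ in String#delete negates the set in the same way.
only_lower    = s.delete('^a-z') # delete everything except a-z
without_lower = s.delete('a-z')  # delete a-z itself

p only_non_ascii, only_ascii, only_lower, without_lower
```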

How to remove non-printable/invisible characters in Ruby?

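Try this, which replaces each non-printable character with a dot (pass '' as the replacement to remove them instead):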

>>"aaa\f\d\x00abcd".gsub(/[^[:print:]]/,'.')
=>"aaa.d.abcd"
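If the goal is to drop the characters rather than mark them, the same pattern can be used with an empty replacement. This is a small sketch under that assumption; the sample string is a trimmed version of the one above and the variable names are made up for illustration.

```ruby
# Sample input (assumed): a form feed and a NUL byte between visible letters.
raw = "aaa\f\x00abcd"

# Same character class as above, but with '' so matches are removed.
removed = raw.gsub(/[^[:print:]]/, '')

# For comparison, the dot-replacement form from the answer.
marked = raw.gsub(/[^[:print:]]/, '.')

p removed, marked
```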

How can I remove non-printable invisible characters from string?

First, let's figure out what the offending character is:

str = "Kanha‬"
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]

The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:

p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]

That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
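The decimal-to-hexadecimal step can also be done for the whole string at once. The sketch below formats every codepoint in the usual U+XXXX notation and picks out the non-ASCII characters; the helper name unicode_labels is an assumption for illustration.

```ruby
# Hypothetical helper: list each codepoint of a string as "U+XXXX".
def unicode_labels(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

# Same content as the example string, with the invisible character
# written as an explicit escape so it is visible in the source.
str = "Kanha\u202C"

labels = unicode_labels(str)
# Characters outside the 0..127 range are the likely suspects.
suspects = str.each_char.reject(&:ascii_only?).map { |c| format('U+%04X', c.ord) }

p labels
p suspects
```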

Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:

Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters

It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:

str = "Kanha‬"
p str.length # => 6

p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5

See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
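For reuse, the pattern can be wrapped in a small helper. This is only a sketch: the name strip_invisible and the extra sample strings (a zero-width joiner and a soft hyphen, both in the Cf category) are assumptions added for illustration.

```ruby
# Hypothetical helper: removes non-printable characters and
# "Other, Format" (Cf) characters such as U+202C.
def strip_invisible(str)
  str.gsub(/\P{Print}|\p{Cf}/, '')
end

clean_name = strip_invisible("Kanha\u202C") # trailing POP DIRECTIONAL FORMATTING
clean_zwj  = strip_invisible("a\u200Db")    # ZERO WIDTH JOINER between letters
clean_shy  = strip_invisible("co\u00ADop")  # SOFT HYPHEN inside a word

p clean_name, clean_zwj, clean_shy
```

Because Cf also covers joiners used in some scripts and emoji sequences, removing the whole category may change how such text renders, so it is worth checking against the input you expect.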

Deleting all special characters from a string - ruby

You can do this, which removes everything except ASCII letters and digits from a in place:

a.gsub!(/[^0-9A-Za-z]/, '')
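One thing to watch with the in-place form: gsub! returns nil when nothing was substituted, so its return value should not be chained. The sketch below contrasts it with the non-destructive gsub and an equivalent String#delete call; the sample strings and variable names are assumptions for illustration.

```ruby
# Sample input (assumed).
a = "Hello, World! 123"

# Non-destructive: returns a new string and leaves a untouched.
alnum = a.gsub(/[^0-9A-Za-z]/, '')

# String#delete with a negated set does the same without a regex.
alnum_via_delete = a.delete('^0-9A-Za-z')

# gsub! returns nil when there was nothing to replace.
already_clean = "abc".dup
bang_result = already_clean.gsub!(/[^0-9A-Za-z]/, '')

p alnum, alnum_via_delete, bang_result
```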

What's the easiest way to replace all non-ASCII characters with their ASCII equivalents in Ruby?

This is called transliteration. An approximation of this (see examples) can be performed using the Iconv class.

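Try one of the following (require 'iconv' first; since Ruby 2.0, Iconv is no longer part of the standard library and has to be installed as the iconv gem):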

Iconv.iconv('ascii//ignore//translit', 'utf-8', string).to_s
Iconv.iconv('ascii//translit', 'utf-8', string).to_s

irb(main):013:0> Iconv.iconv('ascii//translit', 'utf-8', 'spaß').to_s
=> "spass"
irb(main):014:0> Iconv.iconv('ascii//translit', 'utf-8', 'crêpes').to_s
=> "crepes"
irb(main):017:0> Iconv.iconv('ascii//translit', 'utf-8', 'über').to_s
=> "uber"

There's also an iconv command line utility. More information on that and some Ruby examples (search for 'ruby') here.

An alternative to this is Unidecode, which I guess was inspired by the original Perl implementation. I haven't used it in its Ruby incarnation, but it should do multi-char expansions (which apparently you want) better.

Finally, if you're running Rails, you may find this thread interesting. It details some differences between alternative approaches to transliteration, and shows a way to do this within the Rails core (ActiveSupport::Inflector.transliterate).
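If neither Iconv nor Rails is available, a rough approximation for accented Latin letters can be sketched with String#unicode_normalize from the Ruby standard library. This is a different technique from the ones described above, not a full transliteration: it only strips combining marks after decomposition, so characters without a canonical decomposition (such as ß) pass through unchanged. The helper name strip_accents is an assumption for illustration.

```ruby
# Hypothetical helper: decompose to NFD, then drop combining marks (Mn).
# This is NOT full transliteration; it only removes diacritics.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

crepes = strip_accents("cr\u00eapes") # ê decomposes to e + combining circumflex
uber   = strip_accents("\u00fcber")   # ü decomposes to u + combining diaeresis
spass  = strip_accents("spa\u00df")   # ß has no decomposition and is kept as-is

p crepes, uber, spass
```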


