Removing Hidden Characters from Within Strings

You can remove all control characters from your input string with something like this:

// Requires: using System.Linq; (for Where() and ToArray())
string input; // this is your input string; assign it before use
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());

Here is the documentation for the IsControl() method.

Or, if you want to keep only letters and digits, you can use the IsLetter and IsDigit methods instead:

string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
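
For comparison, here is a rough Python sketch of the same two filters. It assumes that "control character" means Unicode category Cc, and it approximates IsLetter and IsDigit with str.isalpha() and str.isdecimal(). The function names are made up for illustration.

```python
import unicodedata


def remove_control_chars(text: str) -> str:
    """Drop every character in Unicode category Cc (control characters)."""
    return "".join(c for c in text if unicodedata.category(c) != "Cc")


def keep_letters_and_digits(text: str) -> str:
    """Keep only letters and decimal digits; everything else is dropped."""
    return "".join(c for c in text if c.isalpha() or c.isdecimal())


sample = "ab\x00c\t1 2\r\n"
print(repr(remove_control_chars(sample)))     # tab, NUL, CR and LF are removed
print(repr(keep_letters_and_digits(sample)))  # the space is removed as well
```

Note that the first filter also removes tabs and line breaks, since they are control characters too.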

How to remove non-printable characters

Foreword: I released this utility in my github.com/icza/gox library, see stringsx.Clean().

You could remove runes where unicode.IsGraphic() or unicode.IsPrint() reports false. To remove certain runes from a string, you may use strings.Map().

For example:

invisibleChars := "Douglas\u200b" // ends with U+200B ZERO WIDTH SPACE
fmt.Printf("%q\n", invisibleChars)
fmt.Println(len(invisibleChars))

clean := strings.Map(func(r rune) rune {
    if unicode.IsGraphic(r) {
        return r
    }
    return -1
}, invisibleChars)

fmt.Printf("%q\n", clean)
fmt.Println(len(clean))

clean = strings.Map(func(r rune) rune {
    if unicode.IsPrint(r) {
        return r
    }
    return -1
}, invisibleChars)

fmt.Printf("%q\n", clean)
fmt.Println(len(clean))

This outputs (try it on the Go Playground):

"Douglas\u200b"
10
"Douglas"
7
"Douglas"
7
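
A similar filter can be sketched in Python with str.isprintable(), which plays roughly the role of unicode.IsPrint() here. This is only an approximation of the Go code above, and remove_non_printable is a hypothetical helper name.

```python
def remove_non_printable(text: str) -> str:
    """Keep only characters that Python considers printable."""
    return "".join(c for c in text if c.isprintable())


invisible = "Douglas\u200b"  # ends with U+200B ZERO WIDTH SPACE
clean = remove_non_printable(invisible)

# len() counts code points in Python; encode to count bytes as Go's len() does.
print(ascii(invisible), len(invisible), len(invisible.encode("utf-8")))
print(ascii(clean), len(clean))
```

The original string has 8 code points but 10 UTF-8 bytes, which matches the 10 that Go reports. Be aware that isprintable() is also false for tabs and line breaks, so this filter removes them as well.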

Can't remove hidden characters from string

The HH in the format string means the hour on a 24-hour clock. Combining it with an AM/PM designator in the same format string does not work for PM times.

Change HH to hh.
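
The same pitfall exists in other languages. As an analogous illustration only (Python spells the directives %H and %I rather than HH and hh, and the sample time is made up), strptime() ignores the AM/PM marker unless the hour is parsed with the 12-hour directive:

```python
from datetime import datetime

text = "03:15 PM"

# %H is the 24-hour directive: the PM marker is parsed but does not shift the hour.
wrong = datetime.strptime(text, "%H:%M %p")

# %I is the 12-hour directive: the PM marker is applied to the hour.
right = datetime.strptime(text, "%I:%M %p")

print(wrong.hour, right.hour)
```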

How to remove hidden characters in R from string imported from Excel?

There may be some non-ASCII characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):

vec <- c("Check Outside", "Check Plot", "Check Plot ", 
         "Check Plot  (between treatments)")
unique(vec) 
# [1] "Check Outside"        "Check Plot"  
# [3] "Check Plot "          "Check Plot  (between treatments)"

library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside"    "Check Plot"  "Check Plot (between treatments)"

So, tl;dr, this should do what you're after:

library(textclean)
library(dplyr)    # for %>% and mutate()
library(stringr)  # for str_trim()

moths <- moths %>%
  mutate(Details = replace_non_ascii(str_trim(Details)))
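
If the data ends up in Python instead, a comparable cleanup can be sketched as follows. It assumes the hidden character is a non-breaking space (U+00A0), which is a guess, and strip_non_ascii is a hypothetical helper. It simply deletes non-ASCII characters, so it is cruder than replace_non_ascii, which may substitute ASCII equivalents.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Delete non-ASCII characters, collapse runs of white space, and trim."""
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    return re.sub(r"\s+", " ", ascii_only).strip()


values = [
    "Check Outside",
    "Check Plot",
    "Check Plot\u00a0",                       # trailing non-breaking space
    "Check Plot \u00a0(between treatments)",  # space followed by a non-breaking space
]
cleaned = sorted({strip_non_ascii(v) for v in values})
print(cleaned)
```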

delete weird hidden characters from string in Python

Both are UTF-8, but there are different ways of rendering the same visual character. The first string contains U+00E4 — LATIN SMALL LETTER A WITH DIAERESIS. Your second string contains “a” followed by U+0308 — COMBINING DIAERESIS ( ̈ ), which, in combination, is rendered as “ä”.

You can inspect the strings yourself using unicodedata:

import unicodedata

for c in string:
    print(unicodedata.name(c))

Both of the above are valid ways of representing “ä”, and they count as equivalent under a suitable Unicode normalisation. You can use unicodedata.normalize to normalise different representations. For instance, you could transform both strings into normal form C (though the first one already happens to be in NFC):

a = 'kommunikationsf\u00e4higkeit'   # precomposed "ä" (U+00E4)
b = 'kommunikationsfa\u0308higkeit'  # "a" followed by U+0308 COMBINING DIAERESIS
print(f'len(a) = {len(a)}')
# len(a) = 23
print(f'len(b) = {len(b)}')
# len(b) = 24
print(f'a == b: {a == b}')
# a == b: False

norm_a = unicodedata.normalize('NFC', a)
norm_b = unicodedata.normalize('NFC', b)
print(f'len(norm_a) = {len(norm_a)}')
# len(norm_a) = 23
print(f'len(norm_b) = {len(norm_b)}')
# len(norm_b) = 23
print(f'norm_a == norm_b: {norm_a == norm_b}')
# norm_a == norm_b: True

Removing invisible characters from the end of a Java String

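You can use the trim() method of the String class to remove leading and trailing white space and line breaks:

String trimmed = original.trim();

Note that trim() only removes characters with code points up to U+0020 (the space and the ASCII control characters). On Java 11 and later, strip() removes white space as defined by Character.isWhitespace(), which covers more of Unicode.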
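
Built-in trimming functions usually stop at invisible characters that are not white space, such as U+200B ZERO WIDTH SPACE. As a sketch of how to trim those too, shown in Python rather than Java, the hypothetical helper below also strips trailing characters from the Unicode categories Cc (control) and Cf (format):

```python
import unicodedata


def rstrip_invisible(text: str) -> str:
    """Remove trailing white space, control (Cc) and format (Cf) characters."""
    end = len(text)
    while end > 0:
        ch = text[end - 1]
        if ch.isspace() or unicodedata.category(ch) in {"Cc", "Cf"}:
            end -= 1
        else:
            break
    return text[:end]


original = "value\u200b \n"
print(repr(original.rstrip()))          # the zero-width space survives
print(repr(rstrip_invisible(original))) # the zero-width space is removed
```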

How can I remove non-printable invisible characters from string?

First, let's figure out what the offending character is:

str = "Kanha‬"
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]

The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:

p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]

That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.

Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:

Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters

It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:

str = "Kanha‬"
p str.length # => 6

p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5

See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
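
The same investigation can be sketched in Python with the standard unicodedata module. This is an illustrative equivalent and not part of the Ruby answer. It lists each character's code point, category and name, then drops everything in the Cf category.

```python
import unicodedata

text = "Kanha\u202c"  # ends with the invisible U+202C

# Inspect every character: code point, general category, official name.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch, "<unnamed>"))

# Remove all "Other, Format" (Cf) characters.
clean = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
print(clean, len(clean))
```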

can't remove hidden characters in text?

If you encode the two strings as UTF-8:

str_1 = u"tác toàn diện giữa Việt Nam và Ukraine ."
str_2 = u"tác toàn diện giữa Việt Nam và Ukraine ."
print(str_1.encode('utf8'))
>> b'ta\xcc\x81c toa\xcc\x80n di\xc3\xaa\xcc\xa3n gi\xc6\xb0\xcc\x83a Vi\xc3\xaa\xcc\xa3t Nam va\xcc\x80 Ukraine .'
print(str_2.encode('utf8'))
>> b't\xc3\xa1c to\xc3\xa0n di\xe1\xbb\x87n gi\xe1\xbb\xafa Vi\xe1\xbb\x87t Nam v\xc3\xa0 Ukraine .'

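You can see that the two strings are in fact different. If you look closely, the difference between "diện" in str_1 and str_2 is that in str_1 the small dot appears under the n, while in str_2 it is under the e. The bytes show why: str_1 stores its accents as separate combining characters (for example \xcc\xa3, which is U+0323 COMBINING DOT BELOW, after the ê), whereas str_2 uses precomposed characters (\xe1\xbb\x87, which is U+1EC7).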
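
Because the byte dumps differ only in how the accents are encoded (separate combining marks in str_1, precomposed characters in str_2), normalising to NFC should make the strings compare equal. A minimal sketch, rebuilding both strings from the bytes printed above so that copy and paste cannot alter them:

```python
import unicodedata

# Both strings are rebuilt from the UTF-8 byte dumps shown above.
str_1 = (b'ta\xcc\x81c toa\xcc\x80n di\xc3\xaa\xcc\xa3n gi\xc6\xb0\xcc\x83a '
         b'Vi\xc3\xaa\xcc\xa3t Nam va\xcc\x80 Ukraine .').decode('utf8')
str_2 = (b't\xc3\xa1c to\xc3\xa0n di\xe1\xbb\x87n gi\xe1\xbb\xafa '
         b'Vi\xe1\xbb\x87t Nam v\xc3\xa0 Ukraine .').decode('utf8')

print(str_1 == str_2)  # the raw strings differ

# NFC composes base letters and combining marks into precomposed characters.
norm_1 = unicodedata.normalize('NFC', str_1)
norm_2 = unicodedata.normalize('NFC', str_2)
print(norm_1 == norm_2)
```

Normalising both sides before comparing, or once when the text is first read, avoids this kind of mismatch.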
