Removing hidden characters from within strings
You can remove all control characters from your input string with something like this:
// requires: using System.Linq; (for Where() and ToArray())
string input; // this is your input string
string output = new string(input.Where(c => !char.IsControl(c)).ToArray());
See the documentation for the char.IsControl() method.
Or, if you want to keep only letters and digits, you can use the IsLetter and IsDigit methods instead:
string output = new string(input.Where(c => char.IsLetter(c) || char.IsDigit(c)).ToArray());
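For comparison, the same two filters can be sketched in Python. This is only an illustration and not part of the C# answer: the helper names are made up here, and it assumes that Unicode category Cc (control) matches what char.IsControl() reports.

```python
import unicodedata


def remove_control_chars(text: str) -> str:
    """Drop every character in Unicode category Cc (control)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cc")


def keep_letters_and_digits(text: str) -> str:
    """Keep only characters that Python classifies as letters or digits."""
    return "".join(ch for ch in text if ch.isalpha() or ch.isdigit())


print(repr(remove_control_chars("ab\x00c\td")))       # NUL and TAB are both Cc
print(repr(keep_letters_and_digits("a-b 1\u200b2")))  # punctuation, space, U+200B go
```

Note that the first filter also removes tabs and line breaks, since those are control characters too.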
How to remove non-printable characters
Foreword: I released this utility in my github.com/icza/gox library; see stringsx.Clean().
You could remove runes for which unicode.IsGraphic() or unicode.IsPrint() reports false. To remove certain runes from a string, you may use strings.Map(), which drops a rune whenever the mapping function returns a negative value.
For example:
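// Needs imports "fmt", "strings" and "unicode".
invisibleChars := "Douglas\u200b" // ends with U+200B ZERO WIDTH SPACE, written as an escape
fmt.Printf("%q\n", invisibleChars)
fmt.Println(len(invisibleChars)) // len() counts bytes, not runes

// Keep only the runes that unicode.IsGraphic() accepts.
clean := strings.Map(func(r rune) rune {
	if unicode.IsGraphic(r) {
		return r
	}
	return -1 // a negative value drops the rune
}, invisibleChars)
fmt.Printf("%q\n", clean)
fmt.Println(len(clean))

// Same again, this time using unicode.IsPrint().
clean = strings.Map(func(r rune) rune {
	if unicode.IsPrint(r) {
		return r
	}
	return -1
}, invisibleChars)
fmt.Printf("%q\n", clean)
fmt.Println(len(clean))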
This outputs (try it on the Go Playground):
"Douglas\u200b"
10
"Douglas"
7
"Douglas"
7
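A rough Python counterpart of the same idea, offered as a sketch rather than a port of stringsx.Clean(): it relies on str.isprintable(), which treats characters in the Unicode "Other" and "Separator" categories (apart from the ASCII space) as non-printable. The function name is hypothetical.

```python
def remove_non_printable(text: str) -> str:
    """Keep only characters for which str.isprintable() is true."""
    return "".join(ch for ch in text if ch.isprintable())


invisible = "Douglas\u200b"  # ends with U+200B ZERO WIDTH SPACE
clean = remove_non_printable(invisible)

# len() counts code points in Python, so these are 8 and 7 rather than 10 and 7.
print(repr(clean), len(invisible), len(clean))
```

Be aware that this filter also drops tabs and line breaks, which may or may not be what you want.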
Can't remove hidden characters from string
The HH in the format string refers to the 24-hour clock hours, which doesn't work when using AM/PM in the format string for PM times. Change HH to hh.
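Python's datetime has an analogous pitfall, which may help to picture the problem. The sketch below is an analogy only, not the API from the question: %H is the 24-hour field and %I the 12-hour field, and the sample timestamp is made up.

```python
from datetime import datetime

stamp = "01:30 PM"  # hypothetical input

# %H is the 24-hour field, so the PM marker does not shift the hour.
wrong = datetime.strptime(stamp, "%H:%M %p")

# %I is the 12-hour field, which %p needs in order to take effect.
right = datetime.strptime(stamp, "%I:%M %p")

print(wrong.hour, right.hour)
```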
How to remove hidden characters in R from string imported from Excel?
There may be some non-ascii characters in your data. If you're happy to remove them, you can use textclean, like so (this example uses the first 4 values of your data):
vec <- c("Check Outside", "Check Plot", "Check Plot ",
"Check Plot (between treatments)")
unique(vec)
# [1] "Check Outside" "Check Plot"
# [3] "Check Plot " "Check Plot (between treatments)"
library(textclean)
vec2 <- replace_non_ascii(vec)
unique(vec2)
# [1] "Check Outside" "Check Plot" "Check Plot (between treatments)"
So, tl;dr, this should do what you're after:
library(dplyr)   # for %>% and mutate()
library(stringr) # for str_trim()
library(textclean)
moths <- moths %>%
mutate(Details = replace_non_ascii(str_trim(Details)))
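If the data passes through Python instead, a much cruder sketch of the same clean-up is shown below. Unlike replace_non_ascii(), it does not try to substitute ASCII equivalents; it simply discards non-ASCII characters. The function name is hypothetical, and the stray character is assumed to be a no-break space (U+00A0), which the original question does not confirm.

```python
def drop_non_ascii(text: str) -> str:
    """Discard every non-ASCII character, then trim surrounding whitespace."""
    return text.encode("ascii", errors="ignore").decode("ascii").strip()


# Assumption: the third value ends with a no-break space.
values = ["Check Outside", "Check Plot", "Check Plot\u00a0"]
cleaned = sorted({drop_non_ascii(v) for v in values})
print(cleaned)
```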
delete weird hidden characters from string in Python
Both are UTF-8, but there are different ways of rendering the same visual character. The first string contains U+00E4 — LATIN SMALL LETTER A WITH DIAERESIS. Your second string contains “a” followed by U+0308 — COMBINING DIAERESIS ( ̈ ), which, in combination, is rendered as “ä”.
You can inspect the strings yourself using unicodedata:
import unicodedata
for c in string:  # `string` is the text you want to inspect
    print(unicodedata.name(c))
Both of the above are valid ways of representing “ä”, and they count as equivalent under a suitable Unicode normalisation. You can use unicodedata.normalize to normalise different representations. For instance, you could transform both strings into normal form C (though the first one already happens to be in NFC):
a = 'kommunikationsf\u00e4higkeit'   # precomposed U+00E4
b = 'kommunikationsfa\u0308higkeit'  # "a" followed by U+0308 COMBINING DIAERESIS
print(f'len(a) = {len(a)}')
# len(a) = 23
print(f'len(b) = {len(b)}')
# len(b) = 24
print(f'a == b: {a == b}')
# a == b: False
norm_a = unicodedata.normalize('NFC', a)
norm_b = unicodedata.normalize('NFC', b)
print(f'len(norm_a) = {len(norm_a)}')
# len(norm_a) = 23
print(f'len(norm_b) = {len(norm_b)}')
# len(norm_b) = 23
print(f'norm_a == norm_b: {norm_a == norm_b}')
# norm_a == norm_b: True
Removing invisible characters from the end of a Java String
You can use the trim() method of the String class to remove trailing (and leading) white space and line breaks:
String trimmed = original.trim();
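Java's trim() only removes characters up to U+0020, so a trailing zero-width space or similar format character would survive it. As a language-neutral illustration, here is a Python sketch of a trailing strip that also covers those characters. The function name and the choice of categories (Cc and Cf) are assumptions made for this example.

```python
import unicodedata


def rstrip_invisible(text: str) -> str:
    """Remove trailing whitespace, control (Cc) and format (Cf) characters."""
    end = len(text)
    while end > 0:
        ch = text[end - 1]
        if ch.isspace() or unicodedata.category(ch) in ("Cc", "Cf"):
            end -= 1
        else:
            break
    return text[:end]


print(repr(rstrip_invisible("value \u200b\r\n")))
```

Invisible characters in the middle of the string are left untouched; only the tail is trimmed.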
How can I remove non-printable invisible characters from string?
First, let's figure out what the offending character is:
str = "Kanha\u202c" # the invisible trailing character, written as an escape
p str.codepoints
# => [75, 97, 110, 104, 97, 8236]
The first five codepoints are between 0 and 127, meaning they're ASCII characters. It's safe to assume they're the letters K-a-n-h-a, although this is easy to verify if you want:
p [75, 97, 110, 104, 97].map(&:chr)
# => ["K", "a", "n", "h", "a"]
That means the offending character is the last one, codepoint 8236. That's a decimal (base 10) number, though, and Unicode characters are usually listed by their hexadecimal (base 16) number. 8236 in hexadecimal is 202C (8236.to_s(16) # => "202c"), so we just have to google for U+202C.
Google very quickly tells us that the offending character is U+202C POP DIRECTIONAL FORMATTING and that it's a member of the "Other, Format" category of Unicode characters. Wikipedia says of this category:
Includes the soft hyphen, joining control characters (zwnj and zwj), control characters to support bi-directional text, and language tag characters
It also tells us that the "value" or code for the category is "Cf". If these sound like characters you want to remove from your string along with U+202C, you can use the \p{Cf} property in a Ruby regular expression. You can also use \P{Print} (note the capital P) as an equivalent to [^[:print:]]:
str = "Kanha\u202c"
p str.length # => 6
p str.gsub(/\P{Print}|\p{Cf}/, '') # => "Kanha"
p str.gsub(/\P{Print}|\p{Cf}/, '').length # => 5
See it on repl.it: https://repl.it/@jrunning/DutifulRashTag
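The same investigation can be sketched in Python with the standard unicodedata module. This is an illustrative equivalent, not the Ruby answer's code, and it removes only the Cf category.

```python
import unicodedata

text = "Kanha\u202c"  # same sample, with the hidden character as an escape

# List each code point in hex together with its category and name.
for ch in text:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))

# Drop everything in the "Other, Format" (Cf) category.
clean = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
print(repr(clean), len(text), len(clean))
```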
can't remove hidden characters in text?
If you try to encode the two strings as utf8:
str_1 = u"tác toàn diện giữa Việt Nam và Ukraine ."
str_2 = u"tác toàn diện giữa Việt Nam và Ukraine ."
print(str_1.encode('utf8'))
>> b'ta\xcc\x81c toa\xcc\x80n di\xc3\xaa\xcc\xa3n gi\xc6\xb0\xcc\x83a Vi\xc3\xaa\xcc\xa3t Nam va\xcc\x80 Ukraine .'
print(str_2.encode('utf8'))
>> b't\xc3\xa1c to\xc3\xa0n di\xe1\xbb\x87n gi\xe1\xbb\xafa Vi\xe1\xbb\x87t Nam v\xc3\xa0 Ukraine .'
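You can see that the two strings are in fact different. If you look closely at "diện", the difference is that in str_1 the dot below is a separate combining character (bytes \xcc\xa3, i.e. U+0323 COMBINING DOT BELOW), which may be rendered as a small dot under the n, whereas in str_2 it is part of the single precomposed character ệ (bytes \xe1\xbb\x87), so the dot sits under the e.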
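If the goal is to make such strings compare equal, normalising both to the same form is one option. The sketch below uses just the word "diện" with explicit escapes so that both encodings are visible; the variable names are made up for this example.

```python
import unicodedata

decomposed = "di\u00ea\u0323n"  # ê followed by U+0323 COMBINING DOT BELOW
precomposed = "di\u1ec7n"       # single code point U+1EC7

print(decomposed.encode("utf8"), precomposed.encode("utf8"))
print(decomposed == precomposed)  # the raw strings differ

# After NFC normalisation both use the precomposed form.
same = (unicodedata.normalize("NFC", decomposed)
        == unicodedata.normalize("NFC", precomposed))
print(same)
```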