How to Remove All Non Printable Characters in a String

How to remove all non printable characters in a string?

7 bit ASCII?

If your Tardis just landed in 1963, and you just want the 7 bit printable ASCII chars, you can rip out everything from 0-31 and 127-255 with this:

$string = preg_replace('/[\x00-\x1F\x7F-\xFF]/', '', $string);

It matches anything in range 0-31, 127-255 and removes it.
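
For comparison, the same 7 bit filter can be sketched in Python. This is only an illustration, not part of the original answer: the helper name strip_to_printable_ascii is made up, and the sketch assumes the input is a bytes object so that the ranges line up with the byte values the PHP pattern sees.

```python
import re

# Bytes 0-31 and 127-255, i.e. everything outside printable 7 bit ASCII.
NON_PRINTABLE_ASCII = re.compile(rb'[\x00-\x1F\x7F-\xFF]')

def strip_to_printable_ascii(data: bytes) -> bytes:
    """Hypothetical helper: drop every byte outside the range 32-126."""
    return NON_PRINTABLE_ASCII.sub(b'', data)

print(strip_to_printable_ascii(b'caf\xc3\xa9\tbar\x7f'))  # b'cafbar'
```

Note that the two bytes of the UTF-8 encoded "é" are both removed, which is exactly the lossy behaviour described above.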

8 bit extended ASCII?

You fell into a Hot Tub Time Machine, and you're back in the eighties.
If you've got some form of 8 bit ASCII, then you might want to keep the chars in range 128-255. An easy adjustment: just look for 0-31 and 127.

$string = preg_replace('/[\x00-\x1F\x7F]/', '', $string);

UTF-8?

Ah, welcome back to the 21st century. If you have a UTF-8 encoded string, then the /u modifier can be used on the regex:

$string = preg_replace('/[\x00-\x1F\x7F]/u', '', $string);

This just removes 0-31 and 127. This works in ASCII and UTF-8 because both share the same control set range (as noted by mgutt below). Strictly speaking, this would work without the /u modifier. But it makes life easier if you want to remove other chars...

If you're dealing with Unicode, there are potentially many non-printing elements, but let's consider a simple one: NO-BREAK SPACE (U+00A0)

In a UTF-8 string, this would be encoded as 0xC2A0. You could look for and remove that specific sequence, but with the /u modifier in place, you can simply add \xA0 to the character class:

$string = preg_replace('/[\x00-\x1F\x7F\xA0]/u', '', $string);
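
The difference between matching bytes and matching characters can be illustrated with a short Python sketch. It is not from the original answer; the sample string is invented, and Python's str patterns stand in for PHP's /u mode while bytes patterns stand in for the non-/u mode.

```python
import re

text = 'price:\u00a0100\x07'

# As UTF-8 bytes, NO-BREAK SPACE is the two-byte sequence C2 A0.
assert '\u00a0'.encode('utf-8') == b'\xc2\xa0'

# Character-level matching: \xA0 in the class means code point U+00A0.
cleaned = re.sub(r'[\x00-\x1F\x7F\xA0]', '', text)
print(repr(cleaned))  # 'price:100'

# Byte-level matching has to spell out the whole sequence; putting \xA0 alone
# in a byte class would also damage other characters, e.g. 'à' (C3 A0).
cleaned_bytes = re.sub(rb'[\x00-\x1F\x7F]|\xc2\xa0', b'', text.encode('utf-8'))
print(cleaned_bytes)  # b'price:100'
```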

Addendum: What about str_replace?

preg_replace is pretty efficient, but if you're doing this operation a lot, you could build an array of chars you want to remove, and use str_replace as noted by mgutt below, e.g.

//build an array we can re-use across several operations
$badchar=array(
// control characters
chr(0), chr(1), chr(2), chr(3), chr(4), chr(5), chr(6), chr(7), chr(8), chr(9), chr(10),
chr(11), chr(12), chr(13), chr(14), chr(15), chr(16), chr(17), chr(18), chr(19), chr(20),
chr(21), chr(22), chr(23), chr(24), chr(25), chr(26), chr(27), chr(28), chr(29), chr(30),
chr(31),
// non-printing characters
chr(127)
);

//replace the unwanted chars
$str2 = str_replace($badchar, '', $str);

Intuitively, this seems like it would be fast, but that's not always the case, so you should definitely benchmark to see if it saves you anything. I ran some benchmarks across a variety of string lengths with random data, and this pattern emerged using PHP 7.0.12:

    2 chars   str_replace    5.3439ms   preg_replace    2.9919ms   preg_replace is 44.01% faster
    4 chars   str_replace    6.0701ms   preg_replace    1.4119ms   preg_replace is 76.74% faster
    8 chars   str_replace    5.8119ms   preg_replace    2.0721ms   preg_replace is 64.35% faster
   16 chars   str_replace    6.0401ms   preg_replace    2.1980ms   preg_replace is 63.61% faster
   32 chars   str_replace    6.0320ms   preg_replace    2.6770ms   preg_replace is 55.62% faster
   64 chars   str_replace    7.4198ms   preg_replace    4.4160ms   preg_replace is 40.48% faster
  128 chars   str_replace   12.7239ms   preg_replace    7.5412ms   preg_replace is 40.73% faster
  256 chars   str_replace   19.8820ms   preg_replace   17.1330ms   preg_replace is 13.83% faster
  512 chars   str_replace   34.3399ms   preg_replace   34.0221ms   preg_replace is 0.93% faster
 1024 chars   str_replace   57.1141ms   preg_replace   67.0300ms   str_replace is 14.79% faster
 2048 chars   str_replace   94.7111ms   preg_replace  123.3189ms   str_replace is 23.20% faster
 4096 chars   str_replace  227.7029ms   preg_replace  258.3771ms   str_replace is 11.87% faster
 8192 chars   str_replace  506.3410ms   preg_replace  555.6269ms   str_replace is 8.87% faster
16384 chars   str_replace 1116.8811ms   preg_replace 1098.0589ms   preg_replace is 1.69% faster
32768 chars   str_replace 2299.3128ms   preg_replace 2222.8632ms   preg_replace is 3.32% faster

The timings themselves are for 10000 iterations, but what's more interesting is the relative differences. Up to 512 chars, I was seeing preg_replace always win. In the 1-8kb range, str_replace had a marginal edge.

I thought it was an interesting result, so I'm including it here. The important thing is not to take this result and use it to decide which method to use, but to benchmark against your own data and then decide.
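
The benchmark script itself is not shown, so here is a rough sketch of how such a comparison could be set up. It is written in Python rather than PHP, and it compares a compiled regex with str.translate as stand-ins for preg_replace and str_replace, so its timings will not match the table above. The function names are invented for the example.

```python
import random
import re
import timeit

# Control characters 0-31 plus DEL (127), as in the PHP examples.
BAD = [chr(i) for i in range(32)] + [chr(127)]
BAD_RE = re.compile(r'[\x00-\x1f\x7f]')
BAD_TABLE = dict.fromkeys(map(ord, BAD))  # maps each bad code point to None

def clean_regex(s):
    return BAD_RE.sub('', s)

def clean_translate(s):
    return s.translate(BAD_TABLE)

def bench(length, iterations=10000):
    """Return (regex_seconds, translate_seconds) for a random string of the given length."""
    rng = random.Random(0)  # fixed seed so runs are comparable
    s = ''.join(chr(rng.randrange(128)) for _ in range(length))
    assert clean_regex(s) == clean_translate(s)  # both methods must agree
    return (timeit.timeit(lambda: clean_regex(s), number=iterations),
            timeit.timeit(lambda: clean_translate(s), number=iterations))

if __name__ == '__main__':
    for length in (2, 64, 1024):
        t_re, t_tr = bench(length, iterations=1000)
        print(f'{length:5d} chars  regex {t_re:.4f}s  translate {t_tr:.4f}s')
```

Swap in your own sample data for the random string to get numbers that reflect your workload.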

How to remove non-printable characters

Foreword: I released this utility in my github.com/icza/gox library, see stringsx.Clean().


You could remove runes where unicode.IsGraphic() or unicode.IsPrint() reports false. To remove certain runes from a string, you may use strings.Map().

For example:

invisibleChars := "Douglas\u200b" // ends with U+200B ZERO WIDTH SPACE
fmt.Printf("%q\n", invisibleChars)
fmt.Println(len(invisibleChars))

clean := strings.Map(func(r rune) rune {
	if unicode.IsGraphic(r) {
		return r
	}
	return -1
}, invisibleChars)

fmt.Printf("%q\n", clean)
fmt.Println(len(clean))

clean = strings.Map(func(r rune) rune {
	if unicode.IsPrint(r) {
		return r
	}
	return -1
}, invisibleChars)

fmt.Printf("%q\n", clean)
fmt.Println(len(clean))

This outputs (try it on the Go Playground):

"Douglas\u200b"
10
"Douglas"
7
"Douglas"
7
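
A comparable filter can be sketched in Python for readers who do not use Go. This is an illustration rather than part of the original answer: it relies on the built-in str.isprintable(), which is similar to but not identical with unicode.IsPrint(). Python's len() counts code points while Go's len() counts bytes, so the byte length is printed as well to line up with the output above.

```python
s = 'Douglas\u200b'  # trailing ZERO WIDTH SPACE (category Cf)

# str.isprintable() is False for characters in the "Other" and "Separator"
# categories, except the ASCII space.
clean = ''.join(ch for ch in s if ch.isprintable())

print(repr(s), len(s), len(s.encode('utf-8')))              # 'Douglas\u200b' 8 10
print(repr(clean), len(clean), len(clean.encode('utf-8')))  # 'Douglas' 7 7
```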

Stripping non printable characters from a string in python

Iterating over strings is unfortunately rather slow in Python. Regular expressions are over an order of magnitude faster for this kind of thing. You just have to build the character class yourself. The unicodedata module is quite helpful for this, especially the unicodedata.category() function. See Unicode Character Database for descriptions of the categories.

import unicodedata, re, itertools, sys

all_chars = (chr(i) for i in range(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(chr, itertools.chain(range(0x00,0x20), range(0x7f,0xa0))))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For Python2

import unicodedata, re, sys

all_chars = (unichr(i) for i in xrange(sys.maxunicode))
categories = {'Cc'}
control_chars = ''.join(c for c in all_chars if unicodedata.category(c) in categories)
# or equivalently and much more efficiently
control_chars = ''.join(map(unichr, range(0x00,0x20) + range(0x7f,0xa0)))

control_char_re = re.compile('[%s]' % re.escape(control_chars))

def remove_control_chars(s):
    return control_char_re.sub('', s)

For some use cases, additional categories (e.g. all from the control group) might be preferable, although this might slow down processing and increase memory usage significantly. Number of characters per category:

  • Cc (control): 65
  • Cf (format): 161
  • Cs (surrogate): 2048
  • Co (private-use): 137468
  • Cn (unassigned): 836601

Edit: Added suggestions from the comments.

Remove non-printable character from a string in flutter/dart

If you wanted to only keep base ascii characters, you could try something like this:

  var c =
"Maintain central project files (hard copy and electronic) for administration.â¢Perform a wide variety of administrative duties";
var clean = c.replaceAll(RegExp(r'[^A-Za-z0-9().,;?]'), ' ');
print(clean);

and you get:

Maintain central project files (hard copy and electronic) for administration.  Perform a wide variety of administrative duties

Tweak the regex to include more or fewer characters, depending on how much cleanup you want (for example, you could also remove all the punctuation marks).
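
One way to make that tweaking explicit is to build the whitelist from a parameter. The sketch below uses Python rather than Dart and is only an illustration: keep_basic_ascii and its extra parameter are invented names, and unlike the Dart pattern it lists the space character in the whitelist explicitly.

```python
import re

def keep_basic_ascii(s: str, extra: str = '') -> str:
    """Hypothetical helper: replace anything outside the whitelist with a space."""
    # re.escape keeps characters such as '-' or ']' literal inside the class.
    allowed = 'A-Za-z0-9().,;? ' + re.escape(extra)
    return re.sub(f'[^{allowed}]', ' ', s)

print(repr(keep_basic_ascii('files.\u2022Perform 50% of duties')))       # 'files. Perform 50  of duties'
print(repr(keep_basic_ascii('files.\u2022Perform 50% of duties', '%')))  # 'files. Perform 50% of duties'
```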

Remove non printable character from a string in Java

Try using:

s.replaceAll("[^\\x00-\\xFF]", " ");

Your problem is that the pound sign is part of the Latin-1 Supplement Unicode block, which is not included when you filter up to 7F.
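
The effect of the two ranges can be checked with a small sketch. It is in Python rather than Java and uses an invented sample string; the character classes are the same as in the Java pattern.

```python
import re

s = 'Price: £5 \u2028'  # '£' is U+00A3; U+2028 is LINE SEPARATOR

# Keeping only U+0000-U+007F replaces the pound sign as well.
print(repr(re.sub(r'[^\x00-\x7F]', ' ', s)))  # 'Price:  5  '

# Widening the kept range to U+00FF preserves it.
print(repr(re.sub(r'[^\x00-\xFF]', ' ', s)))  # 'Price: £5  '
```

Keep in mind that the wider range also lets through the C1 control characters (U+0080 to U+009F) and NO-BREAK SPACE (U+00A0), which may or may not be what you want.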

How to remove all non printable characters in a string and keep some?

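The range \x00-\x1F contains \x0A (line feed), so a pattern that covers the whole range strips newlines as well.

To keep a character, you have to split the range around it: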

$string = preg_replace('/[\x00-\x09\x0B-\x1F\x7F\xA0]/u', '', $string);
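
The split range can be tried out with the following sketch, written in Python with an invented sample string. The character class is copied from the PHP pattern above.

```python
import re

# Same ranges as the PHP pattern: 0x00-0x09 and 0x0B-0x1F, so 0x0A (line feed) survives.
STRIP_EXCEPT_LF = re.compile(r'[\x00-\x09\x0B-\x1F\x7F\xA0]')

text = 'line one\r\n\tline\u00a0two\x00\n'
print(repr(STRIP_EXCEPT_LF.sub('', text)))  # 'line one\nlinetwo\n'
```

To keep further characters, such as tab (\x09) or carriage return (\x0D), split the ranges again in the same way.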

How can I replace non-printable Unicode characters in Java?

my_string.replaceAll("\\p{C}", "?");

See more about Unicode regex. Both java.util.regex.Pattern and String.replaceAll support these Unicode category classes.
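
Python's built-in re module does not support \p{...} classes, so a rough equivalent has to test the category directly. The sketch below is an illustration under that assumption; replace_other_chars is a made-up name, and it treats every category starting with "C" (Cc, Cf, Cs, Co, Cn) as a match.

```python
import unicodedata

def replace_other_chars(s: str, replacement: str = '?') -> str:
    """Hypothetical helper: replace every character whose category starts with 'C'."""
    return ''.join(
        replacement if unicodedata.category(ch).startswith('C') else ch
        for ch in s
    )

print(replace_other_chars('tab\there\u200b!'))  # tab?here?!
```

As the output shows, this also replaces ordinary whitespace controls such as tab and newline, since they belong to category Cc.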

VBA: How to remove non-printable characters from data

This is the top Google result when I search for a quick function to use. I've had a good look around, but nothing that fully solves my issue has come up.

The main issue is that all of these functions touch the original string even if there's nothing wrong with it, which slows things down.

I've rewritten it so that it only amends the string when it finds a bad character. I've also expanded it to cover all non-printable characters and characters beyond standard ASCII.

Public Function Clean_NonPrintableCharacters(Str As String) As String

    'Removes non-printable characters from a string

    Dim cleanString As String
    Dim i As Long 'Long rather than Integer, so strings over 32,767 characters do not overflow

    cleanString = Str

    'Walk backwards so that removing a character does not shift the positions still to be checked
    For i = Len(cleanString) To 1 Step -1
        'Debug.Print Asc(Mid(Str, i, 1))

        Select Case Asc(Mid(Str, i, 1))
            Case 1 To 31, Is >= 127
                'Bad stuff
                'https://www.ionos.com/digitalguide/server/know-how/ascii-codes-overview-of-all-characters-on-the-ascii-table/
                cleanString = Left(cleanString, i - 1) & Mid(cleanString, i + 1)

            Case Else
                'Keep

        End Select
    Next i

    Clean_NonPrintableCharacters = cleanString

End Function
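
The idea of leaving clean strings untouched can also be sketched in Python. This is an illustration, not a port of the VBA function: clean_non_printable is an invented name, and the sketch works on Unicode code points, whereas VBA's Asc() returns a code in the system's ANSI code page.

```python
import re

# Mirrors the VBA Select Case: codes 1-31 and everything from 127 upwards.
BAD_CHARS = re.compile(r'[\x01-\x1F\x7F-\U0010FFFF]')

def clean_non_printable(s: str) -> str:
    """Hypothetical helper: return s untouched unless it contains a bad character."""
    if BAD_CHARS.search(s) is None:
        return s  # fast path: nothing to remove, so nothing is rebuilt
    return BAD_CHARS.sub('', s)

print(repr(clean_non_printable('a\tb\x7fc\u00e9')))  # 'abc'
```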

