How to filter only printable characters in a file on Bash (linux) or Python?
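The hexdump shows that the dot in .[16D is actually an escape character, \x1b. Esc[nD is the ANSI escape code that moves the cursor back n columns; it does not delete anything by itself. So Esc[16D tells the terminal to move the cursor back 16 columns, and the spaces that follow it overwrite the text already there, which explains the cat output.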
There are various ways to remove ANSI escape codes from a file, either using Bash commands (e.g. using sed, as in Anubhava's answer) or Python.
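As a rough illustration of the plain "strip the escape codes" approach in Python, here is a minimal sketch. It assumes the file only contains CSI sequences (Esc[ followed by parameters and a final letter); the names CSI_RE and strip_ansi are made up for this example. Note that stripping leaves the overwritten text in place, so for the data in the question it does not reproduce what the terminal shows.

```python
import re

# Matches CSI sequences only: ESC [ , parameter bytes, intermediate bytes, final byte.
# Other escape families (OSC, single-character escapes) are not handled here.
CSI_RE = re.compile(r'\x1b\[[0-?]*[ -/]*[@-~]')

def strip_ansi(text):
    """Remove CSI escape sequences such as ESC[16D or ESC[1;31m from text."""
    return CSI_RE.sub('', text)

# The cursor movement is dropped, but the text it would have overwritten stays.
print(repr(strip_ansi('HELLO THIS IS THE TEST\x1b[16D  ')))
```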
However, in cases like this, it may be better to run the file through a terminal emulator to interpret any existing editing control sequences in the file, so you get the result the file's author intended after they applied those editing sequences.
One way to do that in Python is to use pyte, a Python module that implements a simple VTXXX-compatible terminal emulator. You can easily install it using pip, and its documentation is hosted on readthedocs.
Here's a simple demo program that interprets the data given in the question. It's written for Python 2, but it's easy to adapt to Python 3. pyte is Unicode-aware, and its standard Stream class expects Unicode strings, but this example uses a ByteStream, so I can pass it a plain byte string.
#!/usr/bin/env python
''' pyte VTxxx terminal emulator demo
Interpret a byte string containing text and ANSI / VTxxx control sequences
Code adapted from the demo script in the pyte tutorial at
http://pyte.readthedocs.org/en/latest/tutorial.html#tutorial
Posted to http://stackoverflow.com/a/30571342/4014959
Written by PM 2Ring 2015.06.02
'''
import pyte
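#hex dump of data
#00000000  48 45 4c 4c 4f 20 54 48  49 53 20 49 53 20 54 48  |HELLO THIS IS TH|
#00000010  45 20 54 45 53 54 1b 5b  31 36 44 20 20 20 20 20  |E TEST.[16D     |
#00000020  20 20 20 20 20 20 20 20  20 20 20 1b 5b 31 36 44  |           .[16D|
#00000030  20 20                                             |  |
#The runs of spaces are spelled out so they cannot be collapsed by accident:
#16 spaces after the first escape sequence, 2 after the second
data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D' + ' ' * 2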
#Create a default sized screen that tracks changed lines
screen = pyte.DiffScreen(80, 24)
screen.dirty.clear()
stream = pyte.ByteStream()
stream.attach(screen)
stream.feed(data)
#Get index of last line containing text
last = max(screen.dirty)
#Gather lines, stripping trailing whitespace
lines = [screen.display[i].rstrip() for i in range(last + 1)]
print '\n'.join(lines)
output
HELLO
hex dump of output
00000000 48 45 4c 4c 4f 0a |HELLO.|
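To see why the emulated result is just HELLO without installing pyte, here is a toy sketch that interprets only the Esc[nD (cursor back) sequence on a single line. It is not how pyte works internally, and render_line is a hypothetical helper written for this example; everything else a real terminal does is ignored.

```python
import re

# Only the "cursor back n columns" sequence is recognised.
CURSOR_BACK = re.compile(r'\x1b\[(\d*)D')

def render_line(data):
    """Replay data onto one line, honouring ESC[nD, and return the visible text."""
    cells = []   # characters currently on the line
    col = 0      # cursor column
    pos = 0      # read position in data
    while pos < len(data):
        match = CURSOR_BACK.match(data, pos)
        if match:
            count = int(match.group(1) or 1)  # a missing count means 1
            col = max(0, col - count)
            pos = match.end()
            continue
        if col == len(cells):
            cells.append(data[pos])   # writing at the end of the line
        else:
            cells[col] = data[pos]    # overwriting an existing cell
        col += 1
        pos += 1
    return ''.join(cells).rstrip()

data = 'HELLO THIS IS THE TEST\x1b[16D' + ' ' * 16 + '\x1b[16D' + ' ' * 2
print(render_line(data))  # prints HELLO
```

The first Esc[16D moves the cursor from column 22 back to column 6, the 16 spaces blank out "THIS IS THE TEST", and the second sequence plus two spaces change nothing visible.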
Trying to remove non-printable characters (junk values) from a UNIX file
Perhaps you could go with the complement of [:print:], which contains all printable characters:
tr -cd '[:print:]' < file > newfile
If your version of tr doesn't support multi-byte characters (it seems that many don't), this works for me with GNU sed (with UTF-8 locale settings):
sed 's/[^[:print:]]//g' file
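For comparison, a Unicode-aware filter along the same lines could be sketched in Python as follows. It relies on str.isprintable(), whose notion of "printable" is Python's own and may differ from your locale's [:print:] class; keeping the newline is an assumption of this sketch (the tr command above removes it), and keep_printable is a made-up name.

```python
def keep_printable(text):
    """Drop characters Python considers non-printable, but keep newlines."""
    return ''.join(ch for ch in text if ch.isprintable() or ch == '\n')

# The escape and NUL characters are dropped; the accented letter survives.
print(repr(keep_printable('caf\u00e9\x1b[16D\n\x00')))
```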
How to filter all words, which contain N or more characters?
egrep -o '[^ ]{N,}' <filename>
This finds all non-space constructs at least N characters long (replace N with the actual number). If you're concerned about "words", you might try [a-zA-Z] in place of [^ ].
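The same idea can be sketched in Python with re.findall. The function name long_tokens is invented for this example, and the newline is excluded from the character class because, unlike egrep, Python sees the whole text at once rather than one line at a time.

```python
import re

def long_tokens(text, n):
    """Return every run of at least n characters that contains no space or newline."""
    pattern = re.compile(r'[^ \n]{%d,}' % n)
    return pattern.findall(text)

print(long_tokens('a bcd efghi\nxy zzzz', 3))
```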
How do I grep for all non-ASCII characters?
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
In some systems, depending on your settings, the above will not work, so you can grep by the inverse
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp, so grep will interpret your pattern as a Perl regular expression. The grep documentation also says that this is highly experimental and grep -P may warn of unimplemented features.
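If grep -P is not available, a similar report can be produced with a few lines of Python. This is only a sketch: find_non_ascii is a hypothetical helper, it works on already-decoded text rather than raw bytes, and it does no highlighting.

```python
import re

# Anything outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]')

def find_non_ascii(lines):
    """Return (line_number, line) pairs for lines containing non-ASCII characters."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if NON_ASCII.search(line):
            hits.append((lineno, line.rstrip('\n')))
    return hits

for lineno, line in find_non_ascii(['plain\n', 'caf\u00e9\n']):
    print('%d:%s' % (lineno, line))
```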
Removing non-displaying characters from a file
It looks like your file is encoded in UTF-16 rather than an 8-bit character set. The '^@' is a notation for ASCII NUL '\0', which usually spoils string matching.
One technique for lossless handling of this would be to use a filter to convert UTF-16 to UTF-8 and then use grep on the output. Hypothetically, if the command was 'utf16-utf8', you'd write:
utf16-utf8 weirdo | grep Lunch
As an appallingly crude approximation to 'utf16-utf8', you could consider:
tr -d '\0' < weirdo | grep Lunch
This deletes ASCII NUL characters from the input file and lets grep operate on the 'cleaned up' output. In theory, it might give you false positives; in practice, it probably won't.
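The lossless route can also be sketched in Python, which can decode UTF-16 directly. This assumes the data really is UTF-16 with a byte order mark (the plain 'utf-16' codec uses the BOM to pick the byte order); grep_utf16 is a made-up helper name.

```python
def grep_utf16(raw_bytes, needle):
    """Decode UTF-16 bytes and return the lines that contain needle."""
    text = raw_bytes.decode('utf-16')
    return [line for line in text.splitlines() if needle in line]

# Build a small UTF-16 sample in memory instead of reading the real file.
sample = 'Breakfast\nLunch at noon\n'.encode('utf-16')
print(grep_utf16(sample, 'Lunch'))
```

For a real file you would read it in binary mode, for example open('weirdo', 'rb').read(), and pass the bytes to the function.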
Replace non-ASCII characters with a single space
Your ''.join() expression is filtering, removing anything non-ASCII; you could use a conditional expression instead:
return ''.join([i if ord(i) < 128 else ' ' for i in text])
This handles characters one by one and would still use one space per character replaced.
Your regular expression should just replace consecutive non-ASCII characters with a space:
re.sub(r'[^\x00-\x7F]+',' ', text)
Note the + there.
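To make the difference between the two approaches concrete, here is a small self-contained sketch; the function names replace_each and replace_runs are invented for the comparison.

```python
import re

def replace_each(text):
    """One space per non-ASCII character."""
    return ''.join(ch if ord(ch) < 128 else ' ' for ch in text)

def replace_runs(text):
    """One space per run of consecutive non-ASCII characters."""
    return re.sub(r'[^\x00-\x7F]+', ' ', text)

sample = 'ab\u00e9\u00e9cd'
print(repr(replace_each(sample)))  # 'ab  cd' (two spaces)
print(repr(replace_runs(sample)))  # 'ab cd' (one space)
```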
Remove non-ASCII characters from CSV
# -i (inplace)
sed -i 's/[\d128-\d255]//g' FILENAME
Remove non-ASCII characters in a file
If you want to use Perl, do it like this:
perl -pi -e 's/[^[:ascii:]]//g' filename
Detailed Explanation
The following explanation covers every part of the above command assuming the reader is unfamiliar with anything in the solution...
perl
Run the perl interpreter. Perl is a programming language that is typically available on all Unix-like systems. This command needs to be run at a shell prompt.
-p
The -p flag tells perl to iterate over every line in the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while(<>) { /* program... */; } continue { print; }. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.
-i
The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT, which you can then redirect to a new file.
Note that you cannot omit -i and redirect STDOUT to the input file, as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.
Perl and the shell allow you to combine multiple single-character parameters into one, which is why we can use -pi instead of -p -i.
The -i flag takes a single argument, which is a file extension to use if you want to make a backup of the original file, so if you used -i.bak, then perl would copy the input file to filename.bak before making changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)
-e
The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program, as that can get unreadable, but with a single-command program as we have here, its terseness can improve legibility.
Note that we cannot combine the -e flag with the -i flag, as both of them take a single argument, and perl would assume that the second flag is the argument. So, for example, if we used -ie <program> <filename>, perl would assume <program> and <filename> are both input files and try to create <program>e and <filename>e, assuming that e is the extension you want to use for the backup. This will fail as <program> is not really a file. The other way around (-ei) would also not work, as perl would try to execute i as a program, which would fail compilation.
s/.../.../
This is perl's regex-based substitution operator. It takes four arguments. The first comes before the operator and, if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.
$_
In our code, the first argument is $_, which is the default loop variable in perl. As mentioned above, the -p flag wraps our program in while(<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and all commands that take a single argument will use this if not specified (e.g. just calling print; will actually translate to print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.
[^[:ascii:]]
The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.
<empty string>
The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ASCII characters.
g
The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first instance will be replaced. Other possible flags are i for case-insensitive matches, s and m which are only relevant for multi-line strings (we have single-line strings here), o which specifies that the pattern should be precompiled (which could be useful here for long files), and x which specifies that the pattern could include whitespace and comments to make it more readable (but we should not write our program on a single line if that is the case).
filename
This is the input file that contains non-ascii characters that we'd like to strip out.
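For readers more at home in Python, the overall effect of the command (read each line, strip non-ASCII bytes, write the result back over the original file) can be approximated as below. This is a sketch under assumptions, not a description of how perl implements -i: it works on raw bytes, writes to a temporary file in the same directory and then replaces the original, and does not preserve file permissions. The name strip_non_ascii_in_place is hypothetical.

```python
import os
import re
import tempfile

# Any byte outside the 7-bit ASCII range.
NON_ASCII = re.compile(rb'[^\x00-\x7F]')

def strip_non_ascii_in_place(path):
    """Rewrite the file at path with every non-ASCII byte removed."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, 'rb') as src, tempfile.NamedTemporaryFile(
            'wb', dir=directory, delete=False) as dst:
        for line in src:                           # like perl's while(<>)
            dst.write(NON_ASCII.sub(b'', line))    # like s/[^[:ascii:]]//g
    # Only replace the original once the new content is fully written,
    # which avoids clobbering the input before it has been read.
    os.replace(dst.name, path)
```

A call such as strip_non_ascii_in_place('filename') would then play the role of the perl one-liner.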
[^[:ascii:]]
So now let's discuss [^[:ascii:]] in more detail.
As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set inside the expression. So, for example, [abc] will match either an a, a b, or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.
But what about [:ascii:] inside the bracket expression?
If you have a Unix-based system available, run man 7 re_format at the command line to read the man page. If not, read the online version.
[:ascii:] is a character class that represents the entire set of ASCII characters, but this kind of character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]], and it may be negated as with the abc case above or combined within a bracket expression with other characters. So, for example, [éç[:ascii:]] will match all ASCII characters and also é and ç, which are not ASCII, and [^éç[:ascii:]] will match all characters that are not ASCII and also not é or ç.
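Python's re module does not support POSIX classes such as [:ascii:], so the equivalent bracket expressions have to spell out the range \x00-\x7F. The sketch below mirrors the two negated examples above; the constant names are made up, and é and ç are written as \u escapes only to keep the source unambiguous.

```python
import re

# Equivalent of [^[:ascii:]]: any character outside the ASCII range.
NOT_ASCII = re.compile(r'[^\x00-\x7F]')

# Equivalent of [^éç[:ascii:]]: not ASCII and also not é (U+00E9) or ç (U+00E7).
NOT_ASCII_OR_EC = re.compile(r'[^\u00e9\u00e7\x00-\x7F]')

sample = 'a\u00e9\u00e7\u00f1b'             # "aéçñb"
print(NOT_ASCII.sub('', sample))            # removes é, ç and ñ
print(NOT_ASCII_OR_EC.sub('', sample))      # removes only ñ
```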
Removing all special characters from a string in Bash
You can use tr to print only the printable characters from a string, as shown below. Just run this command on your input file.
tr -cd "[:print:]\n" < file1
The -d flag deletes the character sets given in the arguments from the input stream, and -c complements them (inverts what's provided). So without -c the command would delete all printable characters from the input stream; with -c it deletes the non-printable characters instead. We also keep the newline character \n to preserve the line endings in the input file. Leaving it out would produce the final output as one big line.
The [:print:] is just a POSIX bracket expression which is a combination of the expressions [:alnum:], [:punct:] and space. The [:alnum:] is the same as [0-9A-Za-z] and [:punct:] includes the characters ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
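The composition of [:print:] described above can be mirrored with Python's string module, whose string.punctuation constant holds the same 32 punctuation characters. The sketch below assumes plain ASCII (C locale) semantics; tr_cd_print is a hypothetical name chosen to echo the tr command.

```python
import string

# alnum + punct + space, i.e. the ASCII version of [:print:].
ASCII_PRINT = set(string.ascii_letters + string.digits + string.punctuation + ' ')

def tr_cd_print(text):
    """Keep only ASCII printable characters and newlines, like tr -cd "[:print:]\\n"."""
    return ''.join(ch for ch in text if ch in ASCII_PRINT or ch == '\n')

print(repr(tr_cd_print('a\tb\x07 c!\n')))  # 'ab c!\n'
```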
Removing a small number of lines from a large file
You can use grep: -v inverts the match (keeping only the lines that do not match), -P uses Perl regex syntax, and [\x80-\xFF] is the character range for non-ASCII bytes.
grep -vP "[\x80-\xFF]" data.tsv > data-ASCII-only.tsv
See the question "How do I grep for all non-ASCII characters in UNIX" for more about searching for non-ASCII characters with grep.
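The same line-dropping filter can be sketched in Python. It is only an illustration: ascii_only_lines is an invented helper, and for a genuinely large file you would stream lines from the input file to the output file rather than build a list.

```python
def ascii_only_lines(lines):
    """Keep only the lines made up entirely of ASCII characters."""
    return [line for line in lines if all(ord(ch) < 128 for ch in line)]

rows = ['id\tname\n', '1\tcaf\u00e9\n', '2\ttea\n']
print(ascii_only_lines(rows))
```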