find and delete files with non-ascii names
Non-ASCII characters
ASCII character codes range from 0x00 to 0x7F in hex. Therefore, any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII codes are essentially a subset of UTF-8). For example, the Japanese character あ is encoded in hex in UTF-8 as E3 81 82.
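The byte values above can be checked with a short script. This is a minimal sketch in Python, used here for illustration only; the helper names utf8_hex and is_ascii_text are hypothetical and not part of any tool discussed in this answer.

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as space-separated uppercase hex bytes."""
    return " ".join(f"{byte:02X}" for byte in text.encode("utf-8"))


def is_ascii_text(text):
    """True if every character has a code in the ASCII range 0x00-0x7F."""
    return all(ord(ch) <= 0x7F for ch in text)


print(utf8_hex("あ"))         # E3 81 82
print(utf8_hex("A"))          # 41
print(is_ascii_text("abc"))   # True
print(is_ascii_text("あ"))    # False
```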
UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
ASCII control characters
Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.
On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
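To make the distinction concrete, here is a small Python sketch that classifies names by the two code ranges described above. The function names and the sample names are made up for illustration.

```python
def has_control_chars(name):
    """True if name contains an ASCII control character (0x00-0x1F or 0x7F)."""
    return any(ord(ch) <= 0x1F or ord(ch) == 0x7F for ch in name)


def has_non_ascii(name):
    """True if name contains a character with a code above 0x7F."""
    return any(ord(ch) > 0x7F for ch in name)


# Hypothetical sample names: a plain one, one with ESC in it, one in Japanese.
for name in ["report.txt", "bad\x1bname", "日本語.txt"]:
    print(repr(name), has_control_chars(name), has_non_ascii(name))
```

Only the second name contains a control character; the third is a legitimate name that a "delete everything non-ASCII" rule would destroy.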
Regular expressions for character codes
POSIX
POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):
[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
PCRE
Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax \x00. For example, a PCRE regex for the Japanese character あ would be \xE3\x81\x82. In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].
GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
Finding the files
GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'
The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.
To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete
I highly recommend running the command without the -delete option first, so you can see what will be deleted before it's too late.
If you also pass the -print option, you can see what is being deleted as the command runs:
find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete
To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete
With this command, if a directory name contains control characters, the directory and everything inside it will be deleted, even if none of the filenames inside the directory contain control characters themselves.
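The "list first, delete second" workflow can also be sketched in Python. This is not what find does internally, only an illustration of the same idea under stated assumptions: the function names are hypothetical, the control-character range is the C-locale one (0x00-0x1F and 0x7F), and the demo runs in a throwaway temporary directory.

```python
import os
import re
import tempfile

# ASCII control characters: 0x00-0x1F and 0x7F.
CONTROL = re.compile(r"[\x00-\x1f\x7f]")


def files_with_control_chars(root):
    """Return regular files under root whose own name contains a control character."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Test only the file name, not the directories leading to it.
            if CONTROL.search(name) and os.path.isfile(path) and not os.path.islink(path):
                matches.append(path)
    return sorted(matches)


def delete_files(paths, dry_run=True):
    """Print each path and remove it only when dry_run is False."""
    for path in paths:
        print(("would delete " if dry_run else "deleting ") + repr(path))
        if not dry_run:
            os.remove(path)


# Demo in a throwaway directory so nothing real is touched.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "normal.txt"), "w").close()
    open(os.path.join(root, "bad\x07name.txt"), "w").close()
    os.mkdir(os.path.join(root, "dir\x01"))
    open(os.path.join(root, "dir\x01", "fine.txt"), "w").close()

    found = files_with_control_chars(root)
    delete_files(found)                 # dry run: only prints
    delete_files(found, dry_run=False)  # actually removes the one match
    remaining = files_with_control_chars(root)
```

As with the first find command, fine.txt survives even though its parent directory name contains a control character, because only the file's own name is tested.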
Update: Finding both non-ASCII and control characters
It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.
To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported on both GNU and BSD versions); this separates records with a null character (0x00) instead of a newline, since the null character is the only character that can't appear in a path on Linux. We need to pass the corresponding flag -0 to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:
find . -print0 | perl -n0e 'print $_, "\n"'
Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.
Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):
[[:^ascii:][:cntrl:]]
The following command will print all paths (directories or files) in the current directory that match this regex:
find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'
The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:
find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'
This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).
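The null-separated record handling can be illustrated with a small Python sketch working on raw bytes. It is an assumption-laden illustration rather than a replacement for the Perl one-liner: the function names are hypothetical, and the byte pattern [^\x20-\x7e] is used as the bytes-level equivalent of "non-ASCII or control character".

```python
import re

# Any byte outside 0x20-0x7E is either an ASCII control character or non-ASCII.
SUSPECT = re.compile(rb"[^\x20-\x7e]")


def split_null_records(data):
    """Split find -print0 style output into paths, dropping the empty trailing record."""
    return [record for record in data.split(b"\0") if record]


def suspect_paths(data):
    """Return the paths that contain a control or non-ASCII byte."""
    return [path for path in split_null_records(data) if SUSPECT.search(path)]


# Simulated output of find . -print0: the second path contains a newline,
# the third contains the UTF-8 bytes for an accented e.
sample = b"./ok.txt\0./new\nline\0./caf\xc3\xa9\0"
for path in suspect_paths(sample):
    print(repr(path))
```

In real use the bytes could come from sys.stdin.buffer.read() at the end of a find . -print0 pipeline. Splitting on the null byte first means the newline inside the second path is seen as part of the name, not as a record separator.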
find files with non-ascii chars in file name
This seems to work for me in both default and posix-extended mode:
LC_COLLATE=C find . -regex '.*[^ -~].*'
There could be locale-related issues, though, and I don't have a large corpus of non-ascii filenames to test it on, but it catches the ones I have.
Delete space and replace non-ASCII characters in filenames via a loop with a makefile
You said you wanted to change all non-ASCII characters to -. However, based on your attempt, it seems you only want to transform to - those characters which are not digits or "plain" letters (by plain I mean non accented, non fancy, ...).
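# The $${var//pattern/replacement} expansions below are bash features,
# so make must run the recipe with bash rather than plain sh.
SHELL := /bin/bash

cleanfigures:
	for f in *; \
	do \
		ext="$${f##*.}" ; \
		base="$${f%.*}" ; \
		newbase="$${base//[^a-zA-Z0-9 ]/-}" ; \
		echo "$$f" "$${newbase// /}.$$ext" ; \
	done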
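The name transformation performed by the recipe can be sketched in Python to make each step visible. The function name clean_name is hypothetical, and one behaviour differs on purpose: a name without any dot is treated here as having no extension, whereas the shell expansions above would reuse the whole name as the extension.

```python
import re


def clean_name(filename):
    """Replace every character of the base name that is not a plain letter,
    digit, or space with '-', then delete the spaces. The extension is kept."""
    base, dot, ext = filename.rpartition(".")
    if not dot:
        # No dot at all: the whole name is the base and there is no extension.
        base, ext = filename, ""
    newbase = re.sub(r"[^a-zA-Z0-9 ]", "-", base).replace(" ", "")
    return newbase + dot + ext


print(clean_name("my fïg 1.png"))   # myf-g1.png
print(clean_name("a_b c.tar.gz"))   # a-bc-tar.gz
```

Like the recipe, this only computes the new name; renaming would be a separate step once the output looks right.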
Using find to locate files that have non-printable characters in their names
Yes, at least with GNU find, you can search for a name that contains non-printable characters.
The set of non-printable characters depends on your locale. If you specify that you are working with the C locale, non-printable characters are those with an ASCII code < 32 or an ASCII code >= 127.
LC_ALL=C find -name '*[^[:print:]]*'
Here [^[:print:]] represents any non-printable character.
Remove non-ASCII characters in a file
If you want to use Perl, do it like this:
perl -pi -e 's/[^[:ascii:]]//g' filename
Detailed Explanation
The following explanation covers every part of the above command assuming the reader is unfamiliar with anything in the solution...
perl
Run the perl interpreter. Perl is a programming language that is typically available on all unix like systems. This command needs to be run at a shell prompt.

-p
The -p flag tells perl to iterate over every line in the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while (<>) { <program> } continue { print; }. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.

-i
The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT, which you can then redirect to a new file.
Note that you cannot omit -i and redirect STDOUT to the input file, as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.
Perl allows you to combine multiple single character switches into one, which is why we can use -pi instead of -p -i.
The -i flag takes a single optional argument, which is a file extension to use if you want to make a backup of the original file, so if you used -i.bak, then perl would copy the input file to filename.bak before making changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)

-e
The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program as that can get unreadable, but with a single command program as we have here, its terseness can improve legibility.
Note that we cannot combine the -e flag with the -i flag in either order, because each of them takes an argument and perl would read the second flag as that argument. For example, if we used -ie <program> <filename>, perl would take e as the backup extension for -i and then, since no -e flag is left, treat <program> as the name of a script file to run. This will fail as <program> is not really a file. The other way around (-ei) would also not work, as perl would take i as the program text and the substitution would never run.

s/.../.../
This is perl's regex based substitution operator. It takes in four arguments. The first comes before the operator, and if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.

$_
In our code, the first argument is $_, which is the default loop variable in perl. As mentioned above, the -p flag wraps our program in while (<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and all commands that take in a single argument will use this if not specified (eg: just calling print; will actually translate to print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.

[^[:ascii:]]
The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.

<empty string>
The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ascii characters.

g
The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first instance on each line will be replaced. Other possible flags are i for case insensitive matches, s and m which are only relevant for multi-line strings (we have single line strings here), o which compiles a pattern containing interpolated variables only once (it makes no difference here, since our pattern is constant), and x which specifies that the pattern could include whitespace and comments to make it more readable (but we should not write our program on a single line if that is the case).

filename
This is the input file that contains non-ascii characters that we'd like to strip out.
[^[:ascii:]]
So now let's discuss [^[:ascii:]] in more detail.
As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set of characters inside the expression. So, for example, [abc] will match either an a, or a b or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.
But what about [:ascii:] inside the bracket expression?
If you have a unix based system available, run man 7 re_format at the command line to read the man page. If not, read the online version.
[:ascii:] is a character class that represents the entire set of ascii characters, but this kind of a character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]] and it may be negated as with the abc case above or combined within a bracket expression with other characters, so, for example, [éç[:ascii:]] will match all ascii characters and also é and ç which are not ascii, and [^éç[:ascii:]] will match all characters that are not ascii and also not é or ç.
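The negated and combined bracket expressions can be tried out in Python as a rough illustration. Python's re module does not provide POSIX classes such as [:ascii:], so this sketch spells the class out as the range \x00-\x7F; the variable names and the sample string are made up.

```python
import re

# [^\x00-\x7F] plays the role of [^[:ascii:]]: any single non-ASCII character.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

# Same, but with e-acute (U+00E9) and c-cedilla (U+00E7) added to the set,
# so those two characters are no longer matched by the negated expression.
NON_ASCII_EXCEPT = re.compile(r"[^\u00e9\u00e7\x00-\x7F]")

# "cafe facade n" written with e-acute, c-cedilla and n-tilde.
sample = "caf\u00e9 fa\u00e7ade \u00f1"

stripped_all = NON_ASCII.sub("", sample)
stripped_some = NON_ASCII_EXCEPT.sub("", sample)
print(repr(stripped_all))
print(repr(stripped_some))
```

The first substitution removes all three accented characters; the second keeps é and ç and removes only the ñ.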
Trying to delete non-ASCII characters only
The suggested solutions may fail with specific versions of sed, e.g. GNU sed 4.2.1.
Using tr:
tr -cd '[:print:]' < yourfile.txt
This will remove any characters not in [\x20-\x7e].
If you want to keep e.g. line feeds, just add \n:
tr -cd '[:print:]\n' < yourfile.txt
If you really want to keep all ASCII characters (even the control codes):
tr -cd '[:print:][:cntrl:]' < yourfile.txt
This will remove any characters not in [\x00-\x7f].
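For comparison, the same byte filters can be sketched in Python. This is only an illustration of what the tr commands keep and drop, assuming the C-locale meaning of [:print:] (bytes 0x20-0x7E); the function names are hypothetical.

```python
def keep_printable(data, keep_newlines=True):
    """Keep printable ASCII bytes (0x20-0x7E), optionally also line feeds."""
    allowed = set(range(0x20, 0x7F))
    if keep_newlines:
        allowed.add(0x0A)
    return bytes(b for b in data if b in allowed)


def keep_ascii(data):
    """Keep every ASCII byte (0x00-0x7F), control codes included."""
    return bytes(b for b in data if b <= 0x7F)


# "naive" with a diaeresis on the i, a tab, and a trailing line feed.
sample = "na\u00efve\tline\n".encode("utf-8")

print(keep_printable(sample))                       # b'naveline\n'
print(keep_printable(sample, keep_newlines=False))  # b'naveline'
print(keep_ascii(sample))                           # b'nave\tline\n'
```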
How to delete non-ASCII characters in a text file?
In Python you can specify the input encoding.
with open('trendx.log', 'r', encoding='utf-16le') as reader, \
        open('trendx.txt', 'w') as writer:
    for line in reader:
        if "ROW" in line:
            writer.write(line)
I have obviously copied over some stuff from your earlier questions. Kudos for finally identifying the actual problem.
Notice in particular how we avoid reading the entire file into memory and instead process a line at a time.
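Why the encoding matters can be seen from the raw bytes. The following sketch is an illustration with a made-up sample line, not taken from the actual log file: plain ASCII text stored as UTF-16LE has a null byte after every character, which is what looks like garbage when the file is read with the wrong encoding.

```python
text = "ROW 1\n"

# Each ASCII character becomes two bytes in UTF-16LE: the character, then 0x00.
raw = text.encode("utf-16le")
print(raw)

# Decoding with the wrong (single-byte) encoding leaves the null bytes in place.
wrong = raw.decode("latin-1")
print("ROW" in wrong)   # False

# Decoding with the right encoding recovers the original text.
right = raw.decode("utf-16le")
print("ROW" in right)   # True
```

If the file starts with a byte order mark, the 'utf-16' codec (without the le suffix) consumes it, while 'utf-16le' leaves it as the first character of the first line.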
How do I grep for all non-ASCII characters?
You can use the command:
grep --color='auto' -P -n "[\x80-\xFF]" file.xml
This will give you the line number, and will highlight non-ascii chars in red.
In some systems, depending on your settings, the above will not work, so you can grep by the inverse
grep --color='auto' -P -n "[^\x00-\x7F]" file.xml
Note also that the important bit is the -P flag, which equates to --perl-regexp, so it will interpret your pattern as a Perl regular expression. The grep man page also says that this is highly experimental and grep -P may warn of unimplemented features.
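Where grep -P is not available, the same search can be sketched in Python. This is a hedged illustration, not a drop-in grep replacement: the function name is hypothetical, it works on raw bytes, and it reports line numbers the way grep -n does.

```python
import re

# Same idea as the second grep pattern: any byte outside 0x00-0x7F.
NON_ASCII = re.compile(rb"[^\x00-\x7F]")


def grep_non_ascii(data):
    """Return (line_number, line) pairs for lines containing a non-ASCII byte."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        if NON_ASCII.search(line):
            hits.append((lineno, line))
    return hits


# Made-up sample: only the second line contains non-ASCII bytes.
sample = b"plain\ncaf\xc3\xa9\nok\n"
for lineno, line in grep_non_ascii(sample):
    print(lineno, line)
```

For a real file, the bytes could be read with open(path, 'rb').read(); for very large files a line-by-line loop over the open file would avoid loading everything into memory.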
Remove non-ASCII characters from CSV
# -i (inplace)
sed -i 's/[\d128-\d255]//g' FILENAME