Find and Delete Files with Non-ASCII Names

find and delete files with non-ASCII names


Non-ASCII characters

ASCII character codes range from 0x00 to 0x7F in hex, so any character with a code greater than 0x7F is a non-ASCII character. This includes the bulk of the characters in UTF-8 (ASCII is essentially a subset of UTF-8). For example, the Japanese character あ

is encoded in hex in UTF-8 as

E3 81 82

UTF-8 has been the default character encoding on, among others, Red Hat Linux since version 8.0 (2002), SuSE Linux since version 9.1 (2004), and Ubuntu Linux since version 5.04 (2005).
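
If you want to check the byte sequence of a character on your own machine, a quick way (a minimal sketch, assuming a UTF-8 terminal; printf and od are standard tools) is:

printf 'あ' | od -An -tx1
# prints: e3 81 82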

ASCII control characters

Out of the ASCII codes, 0x00 through 0x1F and 0x7F represent control characters such as ESC (0x1B). These control characters were not originally intended to be printable even though some of them, like the line feed character 0x0A, can be interpreted and displayed.

On my system, ls displays all control characters as ? by default, unless I pass the --show-control-chars option. I'm guessing that the files you want to delete contain ASCII control characters, as opposed to non-ASCII characters. This is an important distinction: if you delete filenames containing non-ASCII characters, you may blow away legitimate files that just happen to be named in another language.
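
If you want to see this behaviour for yourself, you can create a throwaway file whose name contains a tab, which is a control character (the name below is purely illustrative):

touch "$(printf 'bad\tname.txt')"
ls                        # GNU ls on a terminal typically shows bad?name.txt
ls --show-control-chars   # prints the raw tab instead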

Regular expressions for character codes

POSIX

POSIX provides a very handy collection of character classes for dealing with these types of characters (thanks to bashophil for pointing this out):

[:cntrl:] Control characters
[:graph:] Graphic printable characters (same as [:print:] minus the space character)
[:print:] Printable characters (same as [:graph:] plus the space character)
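
As a quick illustration (a minimal sketch, not tied to any particular files), piping some text through grep shows how [[:cntrl:]] picks out lines containing control characters:

printf 'plain line\nline with a\ttab\n' | grep '[[:cntrl:]]'
# prints only the line containing the tab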

PCRE

Perl Compatible Regular Expressions allow hexadecimal character codes using the syntax

\x00

For example, a PCRE regex for the Japanese character would be

\xE3\x81\x82

In addition to the POSIX character classes listed above, PCRE also provides the [:ascii:] character class, which is a convenient shorthand for [\x00-\x7F].

GNU's version of grep supports PCRE using the -P flag, but BSD grep (on Mac OS X, for example) does not. Neither GNU nor BSD find supports PCRE regexes.
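
For example, with GNU grep you could search a file for the UTF-8 bytes of あ directly (a sketch; file.txt is a hypothetical input, and forcing the C locale keeps PCRE in byte mode so the \xNN escapes match raw bytes):

LC_ALL=C grep -P '\xE3\x81\x82' file.txt   # lines containing the byte sequence E3 81 82 (あ)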

Finding the files

GNU find supports POSIX regexes (thanks to iscfrc for pointing out the pure find solution to avoid spawning additional processes). The following command will list all filenames (but not directory names) below the current directory that contain non-printable control characters:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$'

The regex is a little complicated because the -regex option has to match the entire file path, not just the filename, and because I'm assuming that we don't want to blow away files with normal names simply because they are inside directories with names containing control characters.

To delete the matching files, simply pass the -delete option to find, after all other options (this is critical; passing -delete as the first option will blow away everything in your current directory):

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -delete

I highly recommend running the command without the -delete first, so you can see what will be deleted before it's too late.

If you also pass the -print option, you can see what is being deleted as the command runs:

find -type f -regextype posix-basic -regex '^.*/[^/]*[[:cntrl:]][^/]*$' -print -delete

To blow away any paths (files or directories) that contain control characters, the regex can be simplified and you can drop the -type option:

find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print -delete

With this command, if a directory name contains control characters, the directory and everything inside it will be deleted, even if none of the filenames inside it contain control characters themselves.
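
If you want to rehearse this safely, you can build a scratch directory with one offending name and run the dry-run form first (the names below are made up; \x0b is a vertical tab, i.e. a control character):

mkdir scratch && cd scratch
mkdir "$(printf 'figs\x0bold')"
touch "$(printf 'figs\x0bold')/normal.txt" plain.txt
find -regextype posix-basic -regex '.*[[:cntrl:]].*' -print
# lists the directory and normal.txt inside it, but not ./plain.txt;
# how the control character is displayed depends on your terminal and find version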


Update: Finding both non-ASCII and control characters

It looks like your files contain both non-ASCII characters and ASCII control characters. As it turns out, [:ascii:] is not a POSIX character class, but it is provided by PCRE. I couldn't find a POSIX regex to do this, so it's Perl to the rescue. We'll still use find to traverse our directory tree, but we'll pass the results to Perl for processing.

To make sure we can handle filenames containing newlines (which seems likely in this case), we need to use the -print0 argument to find (supported by both the GNU and BSD versions); this separates records with a null byte (0x00) instead of a newline, since the null byte is the only character that can never appear in a path on Linux. We need to pass the corresponding -0 flag to our Perl code so it knows how records are separated. The following command will print every path inside the current directory, recursively:

find . -print0 | perl -n0e 'print $_, "\n"'

Note that this command only spawns a single instance of the Perl interpreter, which is good for performance. The starting path argument (in this case, . for CWD) is optional in GNU find but is required in BSD find on Mac OS X, so I've included it for the sake of portability.

Now for our regex. Here is a PCRE regex matching names that contain either non-ASCII or non-printable (i.e. control) characters (or both):

[[:^ascii:][:cntrl:]]

The following command will print all paths (directories or files) in the current directory that match this regex:

find . -print0 | perl -n0e 'chomp; print $_, "\n" if /[[:^ascii:][:cntrl:]]/'

The chomp is necessary because it strips off the trailing null character from each path, which would otherwise match our regex. To delete the matching files and directories, we can use the following:

find . -print0 | perl -MFile::Path=remove_tree -n0e 'chomp; remove_tree($_, {verbose=>1}) if /[[:^ascii:][:cntrl:]]/'

This will also print out what is being deleted as the command runs (although control characters are interpreted so the output will not quite match the output of ls).

find files with non-ASCII chars in file name

This seems to work for me in both default and posix-extended mode:

LC_COLLATE=C find . -regex '.*[^ -~].*'

There could be locale-related issues, though, and I don't have a large corpus of non-ASCII filenames to test it on, but it catches the ones I have.
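
For reference, the bracket expression [^ -~] matches any character outside the printable ASCII range from space (0x20) to tilde (0x7E). A cautious way to use it for deletion (GNU find; a sketch, dry run first as recommended above) would be:

LC_COLLATE=C find . -regex '.*[^ -~].*' -print
LC_COLLATE=C find . -regex '.*[^ -~].*' -print -delete   # only after checking the listing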

Delete space and replace non-ASCII characters in filenames via a loop with a makefile

You said you wanted to change all non-ASCII characters to -. However, based on your attempt, it seems you only want to turn into - those characters which are not digits or "plain" letters (by plain I mean unaccented, not fancy).

# The ${var//pattern/replacement} expansion below is a bash feature, so run the
# recipe with bash (GNU make); recipe lines must be indented with a tab.
SHELL := /bin/bash

cleanfigures:
	for f in *; \
	do \
		ext="$${f##*.}" ; \
		base="$${f%.*}" ; \
		newbase="$${base//[^a-zA-Z0-9 ]/-}" ; \
		echo "$$f" "$${newbase// /}.$$ext" ; \
	done
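
The recipe only echoes the old and new names, which works as a dry run; run it with make cleanfigures. Assuming the preview looks right, a variant of the echo line that actually renames (mv -- guards against names beginning with a dash) would be:

	mv -- "$$f" "$${newbase// /}.$$ext" ; \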

Using find to locate files that have non-printable characters in their names

Yes, at least with GNU find, you can search for a name that contains non-printable characters.

The set of non-printable characters depends on your locale. If you specify the C locale, the non-printable characters are those with a code below 32 or at or above 127, which also covers every non-ASCII byte.

LC_ALL=C find -name '*[^[:print:]]*'

Here [^[:print:]] represents any non-printable character.
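
A quick way to convince yourself that the locale matters here (café.txt is just a hypothetical test file; é is non-ASCII):

touch 'café.txt'
LC_ALL=C find . -name '*[^[:print:]]*'   # lists ./café.txt: the bytes of é are non-printable in the C locale
find . -name '*[^[:print:]]*'            # in a UTF-8 locale, é typically counts as printable, so nothing is listed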

Remove non-ASCII characters in a file

If you want to use Perl, do it like this:

perl -pi -e 's/[^[:ascii:]]//g' filename

Detailed Explanation

The following explanation covers every part of the above command, assuming the reader is not already familiar with any of it.

  • perl

    Run the Perl interpreter. Perl is a programming language that is typically available on all Unix-like systems. This command needs to be run at a shell prompt.

  • -p

    The -p flag tells perl to iterate over every line in the input file, run the specified commands (described later) on each line, and then print the result. It is equivalent to wrapping your perl program in while (<>) { ...program... } continue { print; }. There's a similar -n flag that does the same but omits the continue { print; } block, so you'd use that if you wanted to do your own printing.

  • -i

    The -i flag tells perl that the input file is to be edited in place and output should go back into that file. This is important to actually modify the file. Omitting this flag will write the output to STDOUT which you can then redirect to a new file.

    Note that you cannot omit -i and redirect STDOUT to the input file as this will clobber the input file before it has been read. This is just how the shell works and has nothing to do with perl. The -i flag works around this intelligently.

    Perl allows you to combine multiple single-character flags into one, which is why we can use -pi instead of -p -i.

    The -i flag takes a single argument, which is a file extension to use if you want to make a backup of the original file, so if you used -i.bak, then perl would copy the input file to filename.bak before making changes. In this example I've omitted creating a backup because I expect you'll be using version control anyway :)

  • -e

    The -e flag tells perl that the next argument is a complete perl program encapsulated in a string. This is not always a good idea if you have a very long program as that can get unreadable, but with a single command program as we have here, its terseness can improve legibility.

    Note that we cannot combine the -e flag with the -i flag, because both of them take an argument. If we used -ie <program> <filename>, perl would take e as the backup extension for -i, never see -e at all, and then treat <program> as the name of a script file to run, which fails because <program> is not really a file. The other way around (-ei) would also not work, as perl would take i as the one-line program and fail to run it.

  • s/.../.../

    This is perl's regex based substitution operator. It takes in four arguments. The first comes before the operator, and if not specified, uses the default of $_. The second and third are between the / symbols. The fourth is after the final / and is g in this case.

    • $_ In our code, the first argument is $_ which is the default loop variable in perl. As mentioned above, the -p flag wraps our program in while(<>), which creates a while loop that reads one line at a time (<>) from the input. It implicitly assigns this line to $_, and all commands that take in a single argument will use this if not specified (eg: just calling print; will actually translate to print $_;). So, in our code, the s/.../.../ operator operates once on each line of the input file.

    • [^[:ascii:]] The second argument is the pattern to search for in the input string. This pattern is a regular expression, so anything enclosed within [] is a bracket expression. This section is probably the most complex part of this example, so we will discuss it in detail at the end.

    • <empty string> The third argument is the replacement string, which in our case is the empty string since we want to remove all non-ascii characters.

    • g The fourth argument is a modifier flag for the substitution operator. The g flag specifies that the substitution should be global across all matches in the input. Without this flag, only the first instance will be replaced. Other possible flags are i for case insensitive matches, s and m which are only relevant for multi-line strings (we have single line strings here), o which specifies that the pattern should be precompiled (which could be useful here for long files), and x which specifies that the pattern could include whitespace and comments to make it more readable (but we should not write our program on a single line if that is the case).

  • filename

    This is the input file that contains non-ascii characters that we'd like to strip out.

[^[:ascii:]]

So now let's discuss [^[:ascii:]] in more detail.

As mentioned above, [] in a regular expression specifies a bracket expression, which tells the regex engine to match a single character in the input that matches any one of the characters in the set of characters inside the expression. So, for example, [abc] will match either an a, or a b or a c, and it will match only a single character. Using ^ as the first character inverts the match, so [^abc] will match any one character that is not an a, b, or c.

But what about [:ascii:] inside the bracket expression?

If you have a Unix-based system available, run man 7 re_format at the command line to read the man page; otherwise, read the online version.

[:ascii:] is a character class that represents the entire set of ascii characters, but this kind of a character class may only be used inside a bracket expression. The correct way to use this is [[:ascii:]] and it may be negated as with the abc case above or combined within a bracket expression with other characters, so, for example, [éç[:ascii:]] will match all ascii characters and also é and ç which are not ascii, and [^éç[:ascii:]] will match all characters that are not ascii and also not é or ç.
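
To make the -p wrapping described above concrete, the one-liner is roughly equivalent to the following explicit invocation, except that it writes to standard output instead of editing in place (cleaned is a hypothetical output file name):

perl -e 'while (<>) { s/[^[:ascii:]]//g; print; }' filename > cleaned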

Trying to delete non-ASCII characters only

The suggested solutions may fail with specific versions of sed, e.g. GNU sed 4.2.1.

Using tr:

tr -cd '[:print:]' < yourfile.txt

This will remove any characters not in [\x20-\x7e].

If you want to keep e.g. line feeds, just add \n:

tr -cd '[:print:]\n' < yourfile.txt

If you really want to keep all ASCII characters (even the control codes):

tr -cd '[:print:][:cntrl:]' < yourfile.txt

This will remove any characters not in [\x00-\x7f].
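
Since tr reads standard input and writes standard output, keeping the result means redirecting to a different file (cleaned.txt is a placeholder name); do not redirect onto the input file itself, or the shell will truncate it before tr reads it:

tr -cd '[:print:]\n' < yourfile.txt > cleaned.txt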

How to delete non-ASCII characters in a text file?

In Python you can specify the input encoding.

with open('trendx.log', 'r', encoding='utf-16le') as reader, \
        open('trendx.txt', 'w') as writer:
    for line in reader:
        if "ROW" in line:
            writer.write(line)

I have obviously copied over some stuff from your earlier questions. Kudos for finally identifying the actual problem.

Notice in particular how we avoid reading the entire file into memory and instead process one line at a time.

How do I grep for all non-ASCII characters?

You can use the command:

grep --color='auto' -P -n "[\x80-\xFF]" file.xml

This will give you the line number and will highlight the non-ASCII characters in red.

On some systems, depending on your settings, the above will not work, so you can grep for the inverse range instead:

grep --color='auto' -P -n "[^\x00-\x7F]" file.xml

Note also that the important bit is the -P flag, which equates to --perl-regexp: it tells grep to interpret your pattern as a Perl regular expression. The grep man page adds that "this is highly experimental and grep -P may warn of unimplemented features."
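
If you are not sure which grep you have, a rough way to test for -P support is to run it on a trivial pattern and check the exit status:

if echo x | grep -P 'x' >/dev/null 2>&1; then
    echo "this grep supports -P"
else
    echo "no -P support (e.g. BSD grep); use GNU grep or the perl approaches above"
fi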

Remove non-ASCII characters from CSV


The \dNNN escapes are a GNU sed extension for decimal character codes, so the bracket expression matches (and removes) the bytes 128 through 255:

sed -i 's/[\d128-\d255]//g' FILENAME   # -i edits the file in place
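
As with the other in-place edits above, it is worth previewing before committing; dropping -i sends the result to standard output instead (head just limits the preview):

sed 's/[\d128-\d255]//g' FILENAME | head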

