Removing Text Containing Non-English Characters

Removing text containing non-English characters

I would check out this related Stack Overflow post, which does the same thing in JavaScript: Regular expression to match non-English characters?

To translate this into R, you could do (to match non-ASCII):

res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]

res
# A tibble: 1 × 2
#   Name        Rank
#   <chr>      <dbl>
# 1 apple firm     1

And, per that same SO post, the equivalent using Unicode escapes:

res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]

res
# A tibble: 1 × 2
#   Name        Rank
#   <chr>      <dbl>
# 1 apple firm     1

Note: we had to take out the NUL character for this to work, so instead of starting at \u0000 or \x00 we start at \u0001 and \x01.
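For comparison, here is a rough Python sketch of the same non-ASCII filter, assuming a pandas DataFrame shaped like the R example (data with a Name column); it is an illustration, not part of the original answer:

import pandas as pd

# Hypothetical data mirroring the R example above
data = pd.DataFrame({'Name': ['apple firm', 'größe firm'], 'Rank': [1, 2]})

# Keep only rows whose Name contains no character outside \x01-\x7F
res = data[~data['Name'].str.contains(r'[^\x01-\x7F]', regex=True)]
print(res)
#          Name  Rank
# 0  apple firm     1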

How to delete some strings with non-English letters?

You have a list and want to filter it so it only contains elements that match some condition; list comprehensions with an if are perfect for that:

my_list = [1, 2, 3, 4, 5, 6]
# just even numbers:
print([x for x in my_list if x % 2 == 0])

And you want to filter for anything that consists of only letters 'a' through 'z' and 'A' through 'Z', which is where a regex is easy to use:

import re

my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']
print([x for x in my_try if re.match('^[a-zA-Z]+$', x)])

The regex starts with ^ and ends in $ to tell re.match() that it should match the entire string, from start to end. [a-zA-Z] defines a character class containing the letters you're after; often you'd use \w, but that also includes digits and the underscore. Finally, the + means there must be 1 or more of those characters in the string (as opposed to 0 or more if you used *).
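As a quick illustration of why the anchors and the + matter (a minimal sketch, not from the original answer):

import re

# re.match() anchors only at the start, so without $ a trailing digit slips through:
print(re.match('[a-zA-Z]+', 'aaaa1'))    # <re.Match ... match='aaaa'>
# With ^...$ the whole string must consist of letters:
print(re.match('^[a-zA-Z]+$', 'aaaa1'))  # None
# With * instead of +, even the empty string matches:
print(re.match('^[a-zA-Z]*$', ''))       # <re.Match ... match=''>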

Pandas: How to remove non-English characters from strings?

You can use regex to remove designated characters from your strings:

import re
import pandas as pd

records = [{'name':'Foo الفجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)

# Allow alphanumeric characters and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')
def clean_text(string):
    return pattern.sub('', string)

# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)

          name clean_name
0  Foo الفجيرة       Foo
1  Battery ÁÁÁ    Battery
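As an aside, pandas can also perform the substitution with its vectorized string methods, which avoids the explicit apply (a sketch under the same assumptions as above):

df['clean_name'] = df['name'].str.replace('[^A-Za-z0-9 ]+', '', regex=True)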

For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string

Remove rows that have non-English characters in PowerShell

Try the following:

PS> 'english only', 'mixed 多発性硬化', '多発性硬化', 'mixed склероз', 'склероз'  | 
Where-Object { $_ -cnotmatch '\P{IsBasicLatin}' }

english only
  • \p{IsBasicLatin} matches any ASCII-range character (any character in the 7-bit Unicode code-point range, 0x0 - 0x7f), and \P{IsBasicLatin} is its negation, i.e. matches any character outside that range.

  • -cnotmatch '\P{IsBasicLatin}' therefore only matches strings that contain no non-ASCII characters, in other words: strings that contain only ASCII-range characters.

    • Note (tip of the hat to js2010 for the pointer):
      • -cnotmatch, the case-sensitive variant of the case-insensitive -notmatch operator, is deliberately used so as to rule out false positives that would occur with case-insensitive matching, namely with the lowercase ASCII-range letters i and k.

      • The reason is that these characters are also considered the lowercase counterparts of non-ASCII-range characters, namely İ (LATIN CAPITAL LETTER I WITH DOT ABOVE, U+0130), as used in Turkic languages, and K (KELVIN SIGN, U+212A); therefore, with case-insensitive matching via -match, i and k report $true for both \p{IsBasicLatin} (falling inside the ASCII block) and \P{IsBasicLatin} (falling outside the ASCII block); that is, all of the following expressions return $true (a Python illustration of this quirk follows the code block):

        # !! All return $true; use -cmatch for the expected behavior.
        'i' -match '\p{IsBasicLatin}'; 'i' -match '\P{IsBasicLatin}'
        'k' -match '\p{IsBasicLatin}'; 'k' -match '\P{IsBasicLatin}'
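The same pitfall is easy to reproduce outside .NET. In Python, for instance, the KELVIN SIGN lowercases to a plain ASCII k, so lowercasing a string before an ASCII test silently changes the answer (a minimal sketch, not from the original answer; str.isascii() requires Python >= 3.7):

print('\u212A'.isascii())          # False: KELVIN SIGN is outside ASCII
print('\u212A'.lower())            # 'k': its lowercase form is plain ASCII
print('\u212A'.lower().isascii())  # True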

Removing non-English text from Corpus in R using tm()

Here's a method to remove words with non-ASCII characters before making a corpus:

# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, eg.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special, satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
# find indices of words with non-ASCII characters: iconv() substitutes the
# literal string "dat2" for any character it cannot convert to ASCII, and
# grep() then flags the elements containing that marker
dat3 <- grep("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
dat4 <- dat2[-dat3]
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)

A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID

[[1]]
Special, Happy, Sad, Potential
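For reference, the same word-level filter is a one-liner in Python (a sketch on the same input, not part of the tm answer; str.isascii() requires Python >= 3.7):

dat = "Special, satisfação, Happy, Sad, Potential, für"
# keep only words made up entirely of ASCII characters
ascii_words = [w for w in dat.split(", ") if w.isascii()]
print(", ".join(ascii_words))  # Special, Happy, Sad, Potential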

Removing rows containing non-English words in a Pandas dataframe

If using Python >= 3.7:

df[df['col'].map(lambda x: x.isascii())]

where col is your target column.


Data:

import pandas as pd

df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})

print(df.to_markdown())
|    | colA                                           |
|---:|:-----------------------------------------------|
|  0 | **She’s the Hollywood Power Behind Those ...** |
|  1 | Hello, world!                                  |
|  2 | Cainã                                          |
|  3 | another value                                  |
|  4 | test123*                                       |
|  5 | âbc                                            |

Identifying and filtering strings with non-English characters (see the ASCII printable characters):

df[df.colA.map(lambda x: x.isascii())]

Output:

            colA
1  Hello, world!
3  another value
4       test123*

The original approach was to use a user-defined function like this:

def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
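Applied the same way as the isascii() version, assuming the df defined above:

df[df['colA'].map(is_ascii)]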

Remove lines that contain non-English (non-ASCII) characters from a file

Perl supports an [:ascii:] character class.

perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
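If Perl isn't handy, a rough Python equivalent of the same line filter (a sketch: 'inputfile' is a placeholder name, and errors='replace' is an assumption about how undecodable bytes should be handled):

with open('inputfile', encoding='utf-8', errors='replace') as fh:
    for line in fh:
        stripped = line.rstrip('\n')
        # mirror ^[[:ascii:]]+$ : keep only non-empty, all-ASCII lines
        if stripped and stripped.isascii():
            print(stripped)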

