Removing text containing non-english character
I would check out this related Stack Overflow post for doing the same thing in JavaScript: "Regular expression to match non-English characters?"
To translate this into R, you could do (to match non-ASCII):
res <- data[which(!grepl("[^\x01-\x7F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
The same non-ASCII match can also be written with Unicode escapes, per that same SO post:
res <- data[which(!grepl("[^\u0001-\u007F]+", data$Name)),]
res
# A tibble: 1 × 2
# Name Rank
# <chr> <dbl>
#1 apple firm 1
Note: we had to take out the NUL character for this to work. So instead of starting at \u0000 or \x00, we start at \u0001 and \x01.
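For comparison, the same character-class idea can be sketched in Python. This is an illustration only, not part of the R answer: the `names` list is made-up sample data standing in for `data$Name`, and since Python's `re` module accepts the NUL escape, the range can start at `\x00` here.

```python
import re

# Matches any single character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")

# Hypothetical sample data standing in for data$Name.
names = ["apple firm", "苹果 firm", "café firm"]

# Keep only the names in which no non-ASCII character is found.
ascii_only = [name for name in names if not NON_ASCII.search(name)]
print(ascii_only)  # ['apple firm']
```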
How to delete some strings with non-English letters?
You have a list and want to filter it to only contain elements that match some condition; list comprehensions with an if are perfect for that:
my_list = [1, 2, 3, 4, 5, 6]
# just even numbers:
print([x for x in my_list if x % 2 == 0])
And you want to filter for anything that consists of only letters 'a' through 'z' and 'A' through 'Z', which is where a regex is easy to use:
import re
my_try = ['Aas','1Aasdf','cc)','ASD','.ASD','aaaa1','A']
print([x for x in my_try if re.match('^[a-zA-Z]+$', x)])
The regex starts with ^ and ends in $ to tell re.match() that it should match the entire string, from start to end. [a-zA-Z] defines a character class containing the letters you're after. Often you'd use \w, but that also includes digits and the underscore. And finally, the + means there needs to be 1 or more of the characters in the string (as opposed to 0 or more if you use *).
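The points above can be checked with a short sketch that reuses the same `my_try` list. `re.fullmatch()` is swapped in here as an alternative that needs no anchors; it is not what the answer itself uses.

```python
import re

my_try = ['Aas', '1Aasdf', 'cc)', 'ASD', '.ASD', 'aaaa1', 'A']

# Anchored pattern with re.match(), as in the answer above.
letters_only = [x for x in my_try if re.match(r'^[a-zA-Z]+$', x)]
print(letters_only)  # ['Aas', 'ASD', 'A']

# re.fullmatch() requires the whole string to match, so no anchors are needed.
also_letters_only = [x for x in my_try if re.fullmatch(r'[a-zA-Z]+', x)]

# \w is broader: it also accepts digits and the underscore.
word_chars = [x for x in my_try if re.fullmatch(r'\w+', x)]
print(word_chars)  # ['Aas', '1Aasdf', 'ASD', 'aaaa1', 'A']
```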
Pandas: How to remove character that include non english characters?
You can use regex to remove designated characters from your strings:
import re
import pandas as pd
records = [{'name':'Foo الÙجيرة'}, {'name':'Battery ÁÁÁ'}]
df = pd.DataFrame.from_records(records)
# Allow alpha numeric and spaces (add additional characters as needed)
pattern = re.compile('[^A-Za-z0-9 ]+')
def clean_text(string):
    # Replace every run of disallowed characters with an empty string
    return pattern.sub('', string)
# Apply to your df
df['clean_name'] = df['name'].apply(clean_text)
name clean_name
0 Foo الÙجيرة Foo
1 Battery ÁÁÁ Battery
For more solutions, you can read this SO Q: Python, remove all non-alphabet chars from string
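One detail worth checking in character classes of this kind: a range written as `A-z` spans every code point from `A` (65) to `z` (122), which includes the six punctuation characters that sit between `Z` and `a`. The sketch below uses only the standard library and a made-up `sample` string to compare that range with the explicit `A-Za-z` form.

```python
import re

# Characters that sit between 'Z' (90) and 'a' (97) in ASCII.
between = ''.join(chr(c) for c in range(ord('Z') + 1, ord('a')))
print(between)  # [\]^_`

loose = re.compile(r'[^A-z0-9 ]+')      # also keeps the punctuation above
strict = re.compile(r'[^A-Za-z0-9 ]+')  # letters, digits and spaces only

sample = 'Foo_bar [baz] ÁÁÁ'  # hypothetical input
print(repr(loose.sub('', sample)))   # 'Foo_bar [baz] '
print(repr(strict.sub('', sample)))  # 'Foobar baz '
```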
Remove rows that have Non-English characters in Powershell
Try the following:
PS> 'english only', 'mixed 多発性硬化', '多発性硬化', 'mixed склероз', 'склероз' |
Where-Object { $_ -cnotmatch '\P{IsBasicLatin}' }
english only
\p{IsBasicLatin} matches any ASCII-range character (any character in the 7-bit Unicode code-point range, 0x0 - 0x7f), and \P{IsBasicLatin} is its negation, i.e. it matches any character outside that range.
-cnotmatch '\P{IsBasicLatin}' therefore only matches strings that contain no non-ASCII characters, in other words: strings that contain only ASCII-range characters.
Note (tip of the hat to js2010 for the pointer): -cnotmatch, the case-sensitive variant of the case-insensitive -notmatch operator, is deliberately used, so as to rule out false positives that would occur with case-insensitive matching, namely with the lowercase ASCII-range letters i and k.
The reason is that these characters are also considered the lowercase counterparts to non-ASCII-range characters, namely İ (LATIN CAPITAL LETTER I WITH DOT ABOVE, U+0130), as used in Turkic languages, and K (KELVIN SIGN, U+212A). Therefore, with case-insensitive matching via -match, i and k report $true for both \p{IsBasicLatin} (falling into the ASCII block) and \P{IsBasicLatin} (falling outside the ASCII block); that is, all of the following expressions return $true:
# !! All return $true; use -cmatch for the expected behavior.
'i' -match '\p{IsBasicLatin}'; 'i' -match '\P{IsBasicLatin}'
'k' -match '\p{IsBasicLatin}'; 'k' -match '\P{IsBasicLatin}'
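A similar case-folding effect can be observed in Python's `re` module, sketched here for comparison only; it is not part of the PowerShell answer. Python's documentation notes that, with `re.IGNORECASE`, the Unicode pattern `[a-z]` also matches a few non-ASCII letters, among them the Kelvin sign, while adding `re.ASCII` restricts matching to ASCII letters.

```python
import re

kelvin = '\u212a'  # KELVIN SIGN, visually similar to an ASCII 'K'

# The Kelvin sign lowercases to the ASCII letter 'k'.
print(kelvin.lower() == 'k')  # True

# Case-insensitive Unicode matching treats it as a match for [a-z].
print(bool(re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE)))  # True

# With re.ASCII added, only ASCII letters can match.
print(bool(re.fullmatch(r'[a-z]', kelvin, re.IGNORECASE | re.ASCII)))  # False

# Case-sensitive matching is unaffected.
print(bool(re.fullmatch(r'[a-z]', kelvin)))  # False
```

As in the PowerShell case, the safer default when testing for ASCII-only text is a case-sensitive match.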
Removing non-English text from Corpus in R using tm()
Here's a method to remove words with non-ASCII characters before making a corpus:
# remove words with non-ASCII characters
# assuming you read your txt file in as a vector, eg.
# dat <- readLines('~/temp/dat.txt')
dat <- "Special, satisfação, Happy, Sad, Potential, für"
# convert string to vector of words
dat2 <- unlist(strsplit(dat, split=", "))
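# flag words with non-ASCII characters: iconv() replaces each
# unconvertible byte with the marker "dat2", which grepl() then finds
has_non_ascii <- grepl("dat2", iconv(dat2, "latin1", "ASCII", sub="dat2"))
# subset original vector of words to exclude words with non-ASCII char
# (a logical mask also works when no word is flagged, unlike dat2[-integer(0)])
dat4 <- dat2[!has_non_ascii]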
# convert vector back to a string
dat5 <- paste(dat4, collapse = ", ")
# make corpus
require(tm)
words1 <- Corpus(VectorSource(dat5))
inspect(words1)
A corpus with 1 text document
The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
create_date creator
Available variables in the data frame are:
MetaID
[[1]]
Special, Happy, Sad, Potential
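The same split, filter, and rejoin steps can be sketched in Python for comparison. This is an illustration rather than a translation of the tm workflow: it stops before the corpus is built and relies on `str.isascii()`, which assumes Python 3.7 or later.

```python
# Same sample string as in the R example.
dat = "Special, satisfação, Happy, Sad, Potential, für"

# Split into words, keep the ASCII-only ones, and join them back together.
words = dat.split(", ")
ascii_words = [w for w in words if w.isascii()]
result = ", ".join(ascii_words)
print(result)  # Special, Happy, Sad, Potential
```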
Removing rows contains non-english words in Pandas dataframe
If using Python >= 3.7:
df[df['col'].map(lambda x: x.isascii())]
where col is your target column.
Data:
import pandas as pd
df = pd.DataFrame({
    'colA': ['**She’s the Hollywood Power Behind Those ...**',
             'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']
})
print(df.to_markdown())
| | colA |
|---:|:------------------------------------------------------|
| 0 | **She’s the Hollywood Power Behind Those ...** |
| 1 | Hello, world! |
| 2 | Cainã |
| 3 | another value |
| 4 | test123* |
| 5 | âbc |
Identifying and filtering strings with non-English characters (see the ASCII printable characters):
df[df.colA.map(lambda x: x.isascii())]
Output:
colA
1 Hello, world!
3 another value
4 test123*
Original approach was to use a user-defined function like this:
def is_ascii(s):
    try:
        s.encode(encoding='utf-8').decode('ascii')
    except UnicodeDecodeError:
        return False
    else:
        return True
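A minimal sketch of how such a fallback compares with the built-in method, using the sample values from above in a plain list so that it runs without pandas. The variant shown encodes directly to ASCII, which is an assumed simplification of the function above rather than the original code.

```python
def is_ascii(s: str) -> bool:
    """Fallback for Python < 3.7: True if every character is ASCII."""
    try:
        s.encode('ascii')
    except UnicodeEncodeError:
        return False
    return True

values = ['**She’s the Hollywood Power Behind Those ...**',
          'Hello, world!', 'Cainã', 'another value', 'test123*', 'âbc']

kept = [v for v in values if is_ascii(v)]
print(kept)  # ['Hello, world!', 'another value', 'test123*']

# On Python 3.7+ the built-in method gives the same answer.
assert kept == [v for v in values if v.isascii()]
```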
Remove lines that contain non-english (Ascii) characters from a file
Perl supports an [:ascii:] character class.
perl -nle 'print if m{^[[:ascii:]]+$}' inputfile
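A rough Python counterpart to the one-liner, offered as a sketch: `ascii_lines` is a hypothetical helper name, and `io.StringIO` stands in for an open file. Like the Perl pattern with `+`, it drops empty lines as well as lines containing non-ASCII characters.

```python
import io

def ascii_lines(lines):
    """Yield non-empty lines made up solely of ASCII characters."""
    for line in lines:
        text = line.rstrip('\n')
        if text and text.isascii():
            yield text

# io.StringIO stands in for an open file; the contents are made up.
sample = io.StringIO("plain text\nnaïve text\n\nmore plain text\n")
print(list(ascii_lines(sample)))  # ['plain text', 'more plain text']
```

With a real file, the same generator could be fed from `open(path, encoding='utf-8')`, where the path and encoding are assumptions about your input.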