Removing unicode symbols from column names
iconv is one option:
names(foulbycountry1) <- iconv(names(foulbycountry1), to='ASCII', sub='')
names(foulbycountry1)
# [1] "Teams" "Teams" "Matches Played"
# [4] "Yellow Card" "Second yellow card and red card" "Red Cards"
# [7] "Fouls Committed" "Fouls Suffered\r\n" "Fouls causing a penalty"
This will remove any non-ASCII characters. One of the columns has linebreaks at the end of it. To remove these, too, you can use
gsub('\r|\n', '', iconv(names(foulbycountry1), to='ASCII', sub=''))
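The same two-step cleanup (drop non-ASCII, then strip linebreaks) can be sketched in pandas; the column names below are made up for illustration, with encode/decode playing the role of iconv:

```python
import pandas as pd

# Sample frame with a non-ASCII header and a trailing linebreak (invented names)
df = pd.DataFrame(columns=["Teams\u00ae", "Fouls Suffered\r\n"])

# Drop non-ASCII characters, then remove carriage returns and newlines
df.columns = [
    c.encode("ascii", "ignore").decode("ascii").replace("\r", "").replace("\n", "")
    for c in df.columns
]
print(list(df.columns))  # ['Teams', 'Fouls Suffered']
```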
How to remove non-ASCII characters and space from column names
One way using pandas.Series.str.replace and findall:
df.columns = ["".join(l) for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]
print(df)
Output:
Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
Remove non-ASCII characters from DataFrame column headers
Here's one possible option: Fix those headers after loading them in:
df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]
The str.encode call followed by the str.decode call will drop those special characters, leaving only the ones in the ASCII range behind:
>>> 'résumé'.encode('utf-8').decode('ascii', 'ignore')
'rsum'
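Applied to a whole header row, this is a one-line comprehension; a minimal sketch with made-up accented column names:

```python
import pandas as pd

df = pd.DataFrame({"café": [1], "naïve²": [2]})

# Encode to UTF-8 bytes, then decode as ASCII, silently dropping anything non-ASCII
df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]
print(list(df.columns))  # ['caf', 'nave']
```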
Remove non-ASCII characters from pandas column
Your code fails because you are applying it per word rather than per character, and ord errors out since it takes a single character. You would need:
df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))
You can also simplify the join using a chained comparison:
''.join([i if 32 <= ord(i) <= 126 else " " for i in x])
You could also use string.printable
to filter the chars:
from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))
The fastest is to use translate:
# Python 3: str.maketrans replaces the old string.maketrans
del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))
df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))
Interestingly that is faster than:
df['DB_user'] = df["DB_user"].str.translate(trans)
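An end-to-end Python 3 sketch of the translate approach, with sample data invented for illustration (note this table only maps code points up to 255; characters beyond that pass through unchanged):

```python
import pandas as pd

# Map every control character and every code point from 127 to 255 to a space
del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))

df = pd.DataFrame({"DB_user": ["alice\tbob", "carölñe"]})
df["DB_user"] = df["DB_user"].apply(lambda s: s.translate(trans))
print(df["DB_user"].tolist())  # ['alice bob', 'car l e']
```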
Delete specific symbols (unicode) from Pandas DataFrame Column
You can use the unicode RegEx flag (?u):
Source DF:
In [30]: df
Out[30]:
col
0 привет, Вася
1 как дела?
2 уиии 23 45!!
3 давай Вася, до свидания!
Solution (removing all digits, all trailing spaces and all non-characters, except spaces and question mark):
In [36]: df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['','',''], regex=True)
Out[36]:
col
0 привет Вася
1 как дела?
2 уиии
3 давай Вася до свидания
RegEx explained ...
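The replacement above is reproducible as a standalone script (raw strings used for the patterns; each regex is applied in turn, so digits go first, then punctuation, then trailing whitespace):

```python
import pandas as pd

df = pd.DataFrame({"col": [
    "привет, Вася",
    "как дела?",
    "уиии 23 45!!",
    "давай Вася, до свидания!",
]})

# Remove digits, then everything that is not a word char, whitespace, or '?',
# then any whitespace left dangling at the end of the string
out = df.replace([r"\d+", r"(?u)[^\w\s\?]+", r"\s*$"], ["", "", ""], regex=True)
print(out["col"].tolist())
```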
Unicode in column names
I got the characters to stick by bypassing colnames<-:
attr(df,"names") <- nm
print(df)
xβ y₂
1 1 4
2 2 5
3 3 6
colnames(df)
[1] "xβ" "y₂"
Use at your own risk.
sessionInfo()
#R version 4.0.2 (2020-06-22)
#Platform: x86_64-apple-darwin17.0 (64-bit)
#Running under: macOS Catalina 10.15.7
#
#Matrix products: default
#BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#
#locale:
#[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
PySpark remove special characters in all column names for all special characters
You can substitute away any character that is not a letter (a-z, A-Z), a digit, or $:
import pyspark.sql.functions as F
import re
df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+","",col)) for col in df.columns])
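The renaming expression itself is plain re.sub, so the pattern can be sanity-checked without Spark; the column names below are invented for illustration:

```python
import re

cols = ["first name", "total$", "a.b-c", "price (usd)"]

# Collapse every run of characters outside [0-9a-zA-Z$] to nothing
cleaned = [re.sub("[^0-9a-zA-Z$]+", "", c) for c in cols]
print(cleaned)  # ['firstname', 'total$', 'abc', 'priceusd']
```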
How can I remove u- unicode character from my data frame column which is string consisting of a dict?
If you are reading this from a CSV, just do:
df['CC'] = df['CC'].replace("u'", "'", regex=True)
Since the dtype is object, the general string replace method should work.
Removing non-ascii and special character in pyspark dataframe column
This should work.
First creating a temporary example dataframe:
df = spark.createDataFrame([
(0, "This is Spark"),
(1, "I wish Java could use case classes"),
(2, "Data science is cool"),
(3, "This is aSA")
], ["id", "words"])
df.show()
Output
+---+--------------------+
| id| words|
+---+--------------------+
| 0| This is Spark|
| 1|I wish Java could...|
| 2|Data science is ...|
| 3| This is aSA|
+---+--------------------+
Now write a UDF, because those functions cannot be applied directly to a column type; doing so raises the "Column object is not callable" error.
Solution
from pyspark.sql.functions import udf
def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii')
ascii_udf = udf(ascii_ignore)
df.withColumn("foo", ascii_udf('words')).show()
Output
+---+--------------------+--------------------+
| id| words| foo|
+---+--------------------+--------------------+
| 0| This is Spark| This is Spark|
| 1|I wish Java could...|I wish Java could...|
| 2|Data science is ...|Data science is ...|
| 3| This is aSA| This is aSA|
+---+--------------------+--------------------+
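The UDF body is plain Python, so it can be checked on its own before wrapping it for Spark (the sample string below is made up, with a trademark sign standing in for a non-ASCII character):

```python
def ascii_ignore(x):
    # Encode to ASCII, silently dropping anything that doesn't fit, then decode back
    return x.encode('ascii', 'ignore').decode('ascii')

print(ascii_ignore("Data science is cool\u2122"))  # Data science is cool
```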