Removing Unicode Symbols from Column Names

Removing unicode symbols from column names

iconv is one option ...

names(foulbycountry1) <- iconv(names(foulbycountry1), to='ASCII', sub='')
names(foulbycountry1)
# [1] "Teams" "Teams" "Matches Played"
# [4] "Yellow Card" "Second yellow card and red card" "Red Cards"
# [7] "Fouls Committed" "Fouls Suffered\r\n" "Fouls causing a penalty"

This will remove any non-ASCII characters. One of the column names still has linebreak characters at the end; to remove those too, you can use

gsub('\r|\n', '', iconv(names(foulbycountry1), to='ASCII', sub=''))

How to remove non-ASCII characters and space from column names

One way using pandas.Series.str.replace and findall:

df.columns = ["".join(l) for l in df.columns.str.replace("\s", "_").str.findall("[\w\d]+")]
print(df)

Output:

Empty DataFrame
Columns: [Col1name, Col_2_name, Col3__name, Col4__name]
Index: []
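
For context, here is a minimal reproducible sketch of that approach; the messy column names are assumed for illustration (the originals are not shown above):

import pandas as pd

# Hypothetical headers with spaces and non-word symbols (not the originals).
df = pd.DataFrame(columns=["Col1name", "Col 2 name", "Col3 ® name", "Col4 ™ name"])

# Replace whitespace with "_" and keep only runs of word characters.
df.columns = ["".join(l)
              for l in df.columns.str.replace(r"\s", "_", regex=True).str.findall(r"[\w\d]+")]

print(df.columns.tolist())
# ['Col1name', 'Col_2_name', 'Col3__name', 'Col4__name']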

Remove non-ASCII characters from DataFrame column headers

Here's one possible option: Fix those headers after loading them in:

df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]

The str.encode call followed by str.decode drops the non-ASCII characters, leaving only the ones in the ASCII range behind:

>>> 'añSéA'.encode('utf-8').decode('ascii', 'ignore')
'aSA'
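
As a quick hedged sketch, the same idea applied to a DataFrame whose headers are made up for this example:

import pandas as pd

# Hypothetical headers containing accented and symbol characters.
df = pd.DataFrame({"Água": [1], "Temp °C": [2]})

df.columns = [x.encode('utf-8').decode('ascii', 'ignore') for x in df.columns]
print(df.columns.tolist())  # ['gua', 'Temp C']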

Remove non-ASCII characters from pandas column

Your code fails because you are not applying it to each character; you are applying it per word, and ord errors because it takes a single character. You would need:

  df['DB_user'] = df["DB_user"].apply(lambda x: ''.join([" " if ord(i) < 32 or ord(i) > 126 else i for i in x]))

You can also simplify the join using a chained comparison:

   ''.join([i if 32 <= ord(i) <= 126 else " " for i in x])

You could also use string.printable to filter the chars:

from string import printable
st = set(printable)
df["DB_user"] = df["DB_user"].apply(lambda x: ''.join([" " if i not in st else i for i in x]))

The fastest is to use translate:

# Python 3: build the table with str.maketrans (string.maketrans no longer exists).
del_chars = "".join(chr(i) for i in list(range(32)) + list(range(127, 256)))
trans = str.maketrans(del_chars, " " * len(del_chars))

df['DB_user'] = df["DB_user"].apply(lambda s: s.translate(trans))

Interestingly that is faster than:

  df['DB_user'] = df["DB_user"].str.translate(trans)

Delete specific symbols (unicode) from Pandas DataFrame Column

You can use unicode RegEx (?u):

Source DF:

In [30]: df
Out[30]:
                        col
0              привет, Вася
1                 как дела?
2              уиии 23 45!!
3  давай Вася, до свидания!

Solution (removing all digits, all trailing spaces, and all non-word characters except spaces and question marks):

In [36]: df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['', '', ''], regex=True)
Out[36]:
                      col
0             привет Вася
1               как дела?
2                    уиии
3  давай Вася до свидания

RegEx explained ...
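
For reference, a self-contained reconstruction of the example above; the frame is rebuilt from the displayed output, not taken from the asker's code:

import pandas as pd

df = pd.DataFrame({"col": ["привет, Вася",
                           "как дела?",
                           "уиии 23 45!!",
                           "давай Вася, до свидания!"]})

# Drop digits, then non-word characters (except spaces and "?"), then trailing spaces.
cleaned = df.replace([r'\d+', r'(?u)[^\w\s\?]+', r'\s*$'], ['', '', ''], regex=True)
print(cleaned)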

Unicode in column names

I got the characters to stick by bypassing colnames<-:

attr(df,"names") <- nm
print(df)
xβ y₂
1 1 4
2 2 5
3 3 6

colnames(df)
[1] "xβ" "y₂"

Use at your own risk.

sessionInfo()
#R version 4.0.2 (2020-06-22)
#Platform: x86_64-apple-darwin17.0 (64-bit)
#Running under: macOS Catalina 10.15.7
#
#Matrix products: default
#BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
#LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#
#locale:
#[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

PySpark remove special characters in all column names for all special characters

You can substitute out any character that is not A-Z, a-z, or 0-9 (the $ is also kept by this character class):

import pyspark.sql.functions as F
import re

df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+","",col)) for col in df.columns])
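
A minimal sketch of the rename in action; the example frame and its messy headers are assumptions for illustration, not the asker's data:

import re
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical column names containing spaces, symbols and accents.
df = spark.createDataFrame([(1, 2.0, "a")], ["id #", "prix €", "naïve col"])

df = df.select([F.col(col).alias(re.sub("[^0-9a-zA-Z$]+", "", col)) for col in df.columns])
print(df.columns)  # ['id', 'prix', 'navecol']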

How can I remove the u unicode prefix from my data frame column, which is a string consisting of a dict?

If you are reading this from a CSV, just do:

df['CC'] = df['CC'].replace(r"u'", "'", regex=True)

Since the dtype is object, the general string replace method should work.
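
As a hedged illustration on a made-up column of Python-2-style dict strings:

import pandas as pd

# Made-up example: each cell is a string that looks like a Python 2 dict repr.
df = pd.DataFrame({"CC": ["{u'name': u'Anna', u'city': u'Oslo'}",
                          "{u'name': u'Luís', u'city': u'Porto'}"]})

df['CC'] = df['CC'].replace(r"u'", "'", regex=True)
print(df['CC'].iloc[0])  # {'name': 'Anna', 'city': 'Oslo'}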

Removing non-ascii and special character in pyspark dataframe column

This should work.

First creating a temporary example dataframe:

df = spark.createDataFrame([
    (0, "This is Spark"),
    (1, "I wish Java could use case classes"),
    (2, "Data science is cool"),
    (3, "This is añSéA")
], ["id", "words"])

df.show()

Output

+---+--------------------+
| id|               words|
+---+--------------------+
|  0|       This is Spark|
|  1|I wish Java could...|
|  2|Data science is cool|
|  3|       This is añSéA|
+---+--------------------+

Now write a UDF, because functions like encode/decode cannot be applied directly to a column type, and you would otherwise get a "Column object is not callable" error.

Solution

from pyspark.sql.functions import udf

def ascii_ignore(x):
    return x.encode('ascii', 'ignore').decode('ascii')

ascii_udf = udf(ascii_ignore)

df.withColumn("foo", ascii_udf('words')).show()

Output

+---+--------------------+--------------------+
| id|               words|                 foo|
+---+--------------------+--------------------+
|  0|       This is Spark|       This is Spark|
|  1|I wish Java could...|I wish Java could...|
|  2|Data science is cool|Data science is cool|
|  3|       This is añSéA|         This is aSA|
+---+--------------------+--------------------+
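
One caveat worth noting (an assumption about your data, not part of the original answer): the plain UDF raises an AttributeError if the words column contains nulls, so a None guard can be added:

def ascii_ignore_safe(x):
    # Hypothetical variant: pass None through instead of raising on null rows.
    return x.encode('ascii', 'ignore').decode('ascii') if x is not None else None

ascii_udf = udf(ascii_ignore_safe)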

