Text Pre-Processing + Python + CSV: Removing Special Characters from a Column of a CSV

Simple way to remove special characters and non-alphanumeric characters from dataframe column names

I import a lot of files, and the column names are often dirty: they contain unwanted special characters, and I don't know in advance which characters might appear. I only want letters, digits, and underscores in column names, with no spaces.

df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)
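For example, applied to a dataframe with some made-up messy column names (illustrative only), the lines above give:

```python
import pandas as pd

# Hypothetical dataframe with messy column names (illustrative only)
df = pd.DataFrame(columns=['  First Name ', 'Price($)', 'Qty#1'])

df.columns = df.columns.str.strip()                                   # drop leading/trailing spaces
df.columns = df.columns.str.replace(' ', '_')                         # spaces -> underscores
df.columns = df.columns.str.replace(r'[^a-zA-Z\d_]+', '', regex=True) # drop everything else

print(list(df.columns))  # ['First_Name', 'Price', 'Qty1']
```

Note `regex=True`: in recent pandas versions `str.replace` treats the pattern as a literal string by default, so the character class would not be interpreted without it.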

Remove Special Chars from a TSV file using Regex

You could use read_csv() to load the TSV file, specifying the columns you want to keep and \t as the delimiter:

import pandas as pd
import re

def normalise(text):
    # Remove special characters
    text = re.sub('[{}]'.format(re.escape('",$!@#$%^&*()')), ' ', text.strip())
    # Convert multiple whitespace characters into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
df = df.applymap(normalise)
print(df)

You can then use df.applymap() to apply a function to each cell to format it as you need. In this example it first removes any leading or trailing spaces, then removes your list of special characters, and finally converts multiple whitespace characters into a single space.

The resulting dataframe could then be further processed using your all_subsets() function before saving.
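As a quick illustration (the input string here is made up, not taken from the TSV), normalise behaves like this:

```python
import re

def normalise(text):
    # Replace the listed special characters with spaces
    text = re.sub('[{}]'.format(re.escape('",$!@#$%^&*()')), ' ', text.strip())
    # Collapse runs of whitespace into a single space
    text = re.sub(r'\s+', ' ', text)
    return text

# Illustrative input only
print(normalise('foo, bar!  baz'))  # foo bar baz
```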

Remove new line from CSV file

If you are using pyspark then I would suggest sparkContext's wholeTextFiles function to read the file, since your file needs to be read as whole text to be parsed appropriately.

After reading it with wholeTextFiles, you should parse the text by replacing end-of-line characters with , and apply some additional formatting so that the whole text can be broken down into groups of eight strings.

import re
rdd = sc.wholeTextFiles("path to your csv file")\
.map(lambda x: re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ', x[1].replace("\r\n", ",").replace(",,", ",")).split(","))\
.flatMap(lambda x: [x[k:k+8] for k in range(0, len(x), 8)])

You should get output like

[u'playerID', u'yearID', u'gameNum', u'gameName', u'teamName', u'lgID', u'GP', u'startingPos']
[u'gomezle01', u'1933', u'1', u'Cricket', u'Team1', u'NYA', u'AL', u'1']
[u'ferreri01', u'1933', u'2', u'Hockey', u'"This is Team2"', u'BOS', u'AL', u'1']
[u'gehrilo01', u'1933', u'3', u'"Game name is Cricket"', u'Team3', u'NYA', u'AL', u'1']
[u'gehrich01', u'1933', u'4', u'Hockey', u'"Here it is Team4"', u'DET', u'AL', u'1']
[u'dykesji01', u'1933', u'5', u'"Game name is Hockey"', u'"Team name Team5"', u'CHA', u'AL', u'1']

If you would like to convert each array row of the RDD into a string of comma-separated values, you can add

.map(lambda x: ", ".join(x))

and you should get

playerID, yearID, gameNum, gameName, teamName, lgID, GP, startingPos
gomezle01, 1933, 1, Cricket, Team1, NYA, AL, 1
ferreri01, 1933, 2, Hockey, "This is Team2", BOS, AL, 1
gehrilo01, 1933, 3, "Game name is Cricket", Team3, NYA, AL, 1
gehrich01, 1933, 4, Hockey, "Here it is Team4", DET, AL, 1
dykesji01, 1933, 5, "Game name is Hockey", "Team name Team5", CHA, AL, 1
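If you are not on Spark, note that Python's standard csv module already parses quoted fields that contain embedded newlines, so the same cleanup can be sketched without an RDD. The sample data below is illustrative, not the original file:

```python
import csv
import io

# Sample CSV with an embedded newline inside a quoted field (illustrative)
raw = 'playerID,gameName\ngomezle01,"Cricket\nindoor"\nferreri01,Hockey\n'

cleaned_rows = []
for row in csv.reader(io.StringIO(raw)):
    # csv.reader keeps the embedded newline inside the quoted field;
    # replace it with a space so each logical record fits on one line
    cleaned_rows.append([field.replace('\n', ' ') for field in row])

print(cleaned_rows)
# [['playerID', 'gameName'], ['gomezle01', 'Cricket indoor'], ['ferreri01', 'Hockey']]
```

This avoids hand-written regexes for quote tracking, since the quoting rules are handled by the parser itself.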

Is using temporary placeholders for CSV special characters a bad practice?

It is not terrible, but it is also not the best approach.

Use standard libraries wherever possible. There are several good CSV libraries, of which SuperCSV is particularly strong in supporting CSV variants. These libraries follow best practices: special characters are escaped when used inside a field, or the field is wrapped (usually with quotes).

If the CSV is already malformed so that special characters appear inside the fields without proper escaping or wrapping, then you have a data-cleaning problem on your hands, to be solved in some other way. Replacing the character with your temporary placeholder will not fix that, as the placeholder will likewise appear both inside the fields and between them.
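As a sketch of what such a library does for you, Python's standard csv module quotes and escapes automatically, and the result round-trips without any placeholders (the field values below are made up for illustration):

```python
import csv
import io

# Fields containing the delimiter, quote characters, and a newline (illustrative)
row = ['plain', 'has,comma', 'has "quotes"', 'has\nnewline']

# Write: the writer wraps problematic fields in quotes and doubles
# embedded quote characters; no placeholder substitution is needed
buf = io.StringIO()
csv.writer(buf).writerow(row)

# Read back: the reader undoes the quoting, recovering the original fields
restored = next(csv.reader(io.StringIO(buf.getvalue())))
print(restored == row)  # True
```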


