Simple way to remove special characters from dataframe column names
I import a lot of files, and the column names are often dirty: they contain unwanted special characters, and I can't predict in advance which characters might appear. I want only letters, digits, and underscores in the column names, with no spaces.
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)
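As a minimal sketch of what the three lines do (the messy column names here are invented for illustration):

```python
import pandas as pd

# Hypothetical dirty column names, purely for illustration
df = pd.DataFrame(columns=[' First Name! ', 'Total ($)', 'e-mail@addr'])

df.columns = df.columns.str.strip()                                    # drop leading/trailing spaces
df.columns = df.columns.str.replace(' ', '_')                          # spaces -> underscores
df.columns = df.columns.str.replace(r"[^a-zA-Z\d_]+", "", regex=True)  # strip everything else

print(list(df.columns))
```

Note that `regex=True` is needed on recent pandas versions, where `str.replace` defaults to literal matching.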
Remove Special Chars from a TSV file using Regex
You could use read_csv() to load the TSV file, specifying the columns you want to keep and \t as the delimiter:
import pandas as pd
import re
def normalise(text):
    text = re.sub('[{}]'.format(re.escape('",$!@#$%^&*()')), ' ', text.strip())  # Remove special characters
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple whitespace into a single space
    return text
fieldnames = ['title', 'abstract', 'keywords', 'general_terms', 'acm_classification']
df = pd.read_csv('xa.tsv', delimiter='\t', usecols=fieldnames, dtype='object', na_filter=False)
df = df.applymap(normalise)
print(df)
You can then use df.applymap() to apply a function to each cell. In this example the function first removes leading and trailing spaces, replaces your list of special characters with spaces, and collapses runs of whitespace into a single space. (In pandas 2.1+, DataFrame.applymap() has been renamed to DataFrame.map().)
The resulting dataframe could then be further processed using your all_subsets()
function before saving.
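Since xa.tsv isn't available here, this sketch applies the same normalise() to an in-memory frame (the sample values are invented):

```python
import re
import pandas as pd

def normalise(text):
    text = re.sub('[{}]'.format(re.escape('",$!@#$%^&*()')), ' ', text.strip())  # Remove special characters
    text = re.sub(r'\s+', ' ', text)  # Collapse multiple whitespace into a single space
    return text

# Invented sample data standing in for xa.tsv
df = pd.DataFrame({'title': ['  "Hello, world!"  '], 'abstract': ['a   b$c']})

# Series.map works on every pandas version; it is equivalent here to the
# answer's df.applymap(normalise)
df = df.apply(lambda col: col.map(normalise))
print(df.iloc[0].tolist())
```

One subtlety: because stripping happens before special characters are turned into spaces, a value that starts or ends with a special character can still come out with a leading or trailing space.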
Remove new line from CSV file
If you are using pyspark then I would suggest you go with sparkContext's wholeTextFiles
function to read the file, since your file needs to be read as whole text to be parsed appropriately.
After reading it with wholeTextFiles
, parse it by replacing end-of-line characters with commas, plus some additional formatting, so that the whole text can be broken into groups of eight strings.
import re

rdd = sc.wholeTextFiles("path to your csv file") \
    .map(lambda x: re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ',
                          x[1].replace("\r\n", ",").replace(",,", ",")).split(",")) \
    .flatMap(lambda x: [x[k:k+8] for k in range(0, len(x), 8)])
You should get output like
[u'playerID', u'yearID', u'gameNum', u'gameName', u'teamName', u'lgID', u'GP', u'startingPos']
[u'gomezle01', u'1933', u'1', u'Cricket', u'Team1', u'NYA', u'AL', u'1']
[u'ferreri01', u'1933', u'2', u'Hockey', u'"This is Team2"', u'BOS', u'AL', u'1']
[u'gehrilo01', u'1933', u'3', u'"Game name is Cricket"', u'Team3', u'NYA', u'AL', u'1']
[u'gehrich01', u'1933', u'4', u'Hockey', u'"Here it is Team4"', u'DET', u'AL', u'1']
[u'dykesji01', u'1933', u'5', u'"Game name is Hockey"', u'"Team name Team5"', u'CHA', u'AL', u'1']
If you would like to convert each array row of the rdd into a comma-separated string, you can add
.map(lambda x: ", ".join(x))
and you should get
playerID, yearID, gameNum, gameName, teamName, lgID, GP, startingPos
gomezle01, 1933, 1, Cricket, Team1, NYA, AL, 1
ferreri01, 1933, 2, Hockey, "This is Team2", BOS, AL, 1
gehrilo01, 1933, 3, "Game name is Cricket", Team3, NYA, AL, 1
gehrich01, 1933, 4, Hockey, "Here it is Team4", DET, AL, 1
dykesji01, 1933, 5, "Game name is Hockey", "Team name Team5", CHA, AL, 1
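The tricky part of the answer above is the regex: the negative lookahead counts the double quotes remaining ahead of each comma, so only commas that fall inside a quoted field get replaced. That substitution can be tested without Spark, on a made-up line:

```python
import re

def clean_quoted_commas(line):
    # Replace commas that appear INSIDE double-quoted fields with a space.
    # The lookahead checks whether an even number of quotes remains ahead;
    # if the count is odd, the comma sits inside an open quote.
    return re.sub(r'(?!(([^"]*"){2})*[^"]*$),', ' ', line)

print(clean_quoted_commas('gomezle01,1933,1,"Cricket, indoor",Team1'))
```

Field-delimiting commas are untouched; only the comma inside "Cricket, indoor" is replaced.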
Is using temporary placeholders for CSV special characters a bad practice?
It is not very bad, but also not the best approach.
Use standard libraries wherever possible. Here is a list of fine libraries, of which SuperCSV is particularly strong in supporting CSV variants. These libraries follow best practices: Special characters are escaped when used inside a field, or the field is wrapped (usually with quotes).
If the CSV is already malformed so that special characters appear inside the fields without proper escaping or wrapping, then you have a data-cleaning problem on your hands, to be solved in some other way. Replacing the character with your temporary placeholder will not fix that, as the placeholder will likewise appear both inside the fields and between them.
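To illustrate what "escaped or wrapped" means in practice, here is a sketch using Python's standard csv module (standing in for SuperCSV, which is Java): fields containing the delimiter are wrapped in quotes, and embedded quotes are doubled, so no placeholder is ever needed.

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# One plain field, one field containing the delimiter, one containing quotes
writer.writerow(['plain', 'has, comma', 'has "quote"'])
print(buf.getvalue())
```

A compliant reader reverses the wrapping and doubling automatically, which is exactly what hand-rolled placeholder schemes tend to get wrong.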