Read data from CSV file and transform from string to correct data-type, including a list-of-integer column
As the docs explain, the CSV reader doesn't perform automatic data conversion. There is the QUOTE_NONNUMERIC format option, but that would only convert all non-quoted fields into floats. This behaviour is very similar to other csv readers.
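For instance, a minimal sketch of what QUOTE_NONNUMERIC does on the reading side (the one-line input here is made up for illustration):

```python
import csv
import io

# with QUOTE_NONNUMERIC, every unquoted field is parsed as a float;
# quoted fields stay strings -- there is no int, bool, or list support
data = io.StringIO('"name",42,3.14\n')
reader = csv.reader(data, quoting=csv.QUOTE_NONNUMERIC)
row = next(reader)
# row == ['name', 42.0, 3.14]
```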
I don't believe Python's csv module would be of any help for this case at all. As others have already pointed out, literal_eval() is a far better choice.
The following works and converts:
- strings
- ints
- floats
- lists
- dictionaries
You may also use it for booleans and NoneType, although these have to be formatted accordingly for literal_eval() to parse them. LibreOffice Calc displays booleans in all capital letters (TRUE/FALSE), whereas in Python booleans are capitalized (True/False). Also, you would have to replace empty strings with None (without quotes).
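To illustrate what literal_eval() accepts: strings must carry their own quotes inside the field, and True/None must be capitalized exactly as in Python source:

```python
from ast import literal_eval

# each sample is the raw field text as it would appear in the csv
samples = ["42", "3.14", "[1, 2, 3]", "{'a': 1}", "'text'", "True", "None"]
parsed = [literal_eval(s) for s in samples]
# parsed == [42, 3.14, [1, 2, 3], {'a': 1}, 'text', True, None]
```

An unquoted string field like `alice` would raise ValueError, which is why the note below about exception handling matters.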
I'm writing an importer for mongodb that does all this. The following is part of the code I've written so far.
[NOTE: My csv uses tab as field delimiter. You may want to add some exception handling too]
def getFieldnames(csvFile):
    """
    Read the first row and store values in a tuple
    """
    with open(csvFile) as csvfile:
        firstRow = csvfile.readlines(1)
        fieldnames = tuple(firstRow[0].strip('\n').split("\t"))
    return fieldnames
import ast
from itertools import islice

def writeCursor(csvFile, fieldnames):
    """
    Convert csv rows into an array of dictionaries
    All data types are automatically checked and converted
    """
    cursor = []  # Placeholder for the dictionaries/documents
    with open(csvFile) as csvFile:
        for row in islice(csvFile, 1, None):
            values = list(row.strip('\n').split("\t"))
            for i, value in enumerate(values):
                nValue = ast.literal_eval(value)
                values[i] = nValue
            cursor.append(dict(zip(fieldnames, values)))
    return cursor
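As a quick illustration of the core idea on a single tab-separated row (the data is made up; string fields carry their own quotes so literal_eval() can parse them):

```python
import ast

# one hypothetical tab-delimited line: int, string, list, dict
row = "1\t'alice'\t[90, 85]\t{'math': 90}\n"
values = [ast.literal_eval(v) for v in row.strip('\n').split('\t')]
# values == [1, 'alice', [90, 85], {'math': 90}]
```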
Convert the first column data type from float to int, and write back to the original csv file
You can do so very easily by using pandas:
import pandas as pd
df = pd.read_csv("data.csv")
df["col"] = df["col"].astype(int)
df.to_csv("data.csv", index=False)
If you don't know the name of the first column and want to use integer-based indexing, then do this:
df.iloc[:, 0] = df.iloc[:, 0].astype(int)
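A self-contained sketch of the round trip, simulating a hypothetical data.csv (the "id"/"name" columns are made up) with an in-memory buffer:

```python
import io
import pandas as pd

# simulate data.csv whose first column was written as floats
csv_text = "id,name\n1.0,alice\n2.0,bob\n"
df = pd.read_csv(io.StringIO(csv_text))

df["id"] = df["id"].astype(int)  # replace the column with an int version
out = df.to_csv(index=False)
# out.splitlines() == ['id,name', '1,alice', '2,bob']
```

Note that in recent pandas versions an assignment through iloc may keep the column's original dtype (it sets values into the existing column), so assigning by column name as above is the more reliable form when you do know the name.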
Is there a way to find and set the DataType from CSV file, without specifying it prior?
The parsed variable is not really special to csv.reader (there is no separate "CSV Reader" library; csv is a module in the Python standard library). That syntax is a Python generator expression. While this code works, compressing multiple concepts into a single statement is not always the best way to illustrate a concept.
A more beginner-friendly form of this code might look like:
import io  # Python 3; use StringIO.StringIO on Python 2

reader = csv.reader(io.StringIO(data), delimiter=",")

# use a conventional for loop to build up the getBackData list
getBackData = []
for row in reader:
    converted_row = (
        row[0] == 'True',   # only the literal string 'True' becomes True
        row[1],
        int(row[2]),
        float(row[3]),
        row[4],
    )
    getBackData.append(converted_row)
Even cleaner would be to push all those converters into a convert_row function, and then build up getBackData with a list comprehension:
def convert_row(raw):
    return (
        raw[0] == 'True',
        raw[1],
        int(raw[2]),
        float(raw[3]),
        raw[4],
    )

reader = csv.reader(io.StringIO(data), delimiter=",")
getBackData = [convert_row(row) for row in reader]
Then you can modify the convert_row function however you like, but the reader and getBackData construction stay the same.
EDIT: getting the types (not very well tested, but this is the idea)
def try_bool(s):
    # will convert strings "True" and "False" to bools,
    # and raise an exception otherwise
    try:
        return {"True": True, "False": False}[s]
    except KeyError:
        raise ValueError("{!r} is not a valid bool".format(s))

def get_column_types(raw):
    types = []
    for col in raw:
        for test_type in (int, float, try_bool, str):
            try:
                test_type(col)
            except ValueError:
                # fail! not data of this type
                pass
            else:
                # it worked! add test_type to list of converters
                types.append(test_type)
                break
    return types
# read the first row and get the types of each column
first_row = next(csv.reader(input_file))
col_types = get_column_types(first_row)

# now create a list of new rows with converted data items
converted = []
for row in csv.reader(input_file):
    # use zip to walk list of converters and list of columns at the same time
    converted_row = [converter(raw_value)
                     for converter, raw_value in zip(col_types, row)]
    converted.append(converted_row)
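Put together, here is how those pieces might run end to end (the column values are made up for illustration; note the seek(0) so the first row, used for inference, gets converted as well):

```python
import csv
import io

def try_bool(s):
    # convert "True"/"False" to bools, raise ValueError otherwise
    try:
        return {"True": True, "False": False}[s]
    except KeyError:
        raise ValueError("{!r} is not a valid bool".format(s))

def get_column_types(raw):
    # same idea as above: the first converter that succeeds wins
    types = []
    for col in raw:
        for test_type in (int, float, try_bool, str):
            try:
                test_type(col)
            except ValueError:
                pass
            else:
                types.append(test_type)
                break
    return types

input_file = io.StringIO("True,alice,1,2.5\nFalse,bob,3,4.5\n")

# infer converters from the first row, then rewind so it is converted too
col_types = get_column_types(next(csv.reader(input_file)))
input_file.seek(0)

converted = [[conv(v) for conv, v in zip(col_types, row)]
             for row in csv.reader(input_file)]
# converted == [[True, 'alice', 1, 2.5], [False, 'bob', 3, 4.5]]
```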