Read Data from CSV File and Transform from String to Correct Data-Type, Including a List-Of-Integer Column

As the docs explain, the CSV reader doesn't perform automatic data conversion. There is the QUOTE_NONNUMERIC format option, but that would only convert all non-quoted fields into floats. This behaviour is very similar to that of other CSV readers.

I don't believe Python's csv module would be of any help for this case at all. As others have already pointed out, literal_eval() is a far better choice.

The following does work and converts:

  • strings
  • ints
  • floats
  • lists
  • dictionaries

You may also use it for booleans and NoneType, although these have to be formatted so that literal_eval() can parse them. LibreOffice Calc displays booleans in all capitals (TRUE/FALSE), whereas Python booleans are capitalized (True/False). Also, you would have to replace empty strings with None (without quotes).
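
A quick demonstration of what literal_eval() accepts (my own examples, not part of the original answer):

```python
from ast import literal_eval

# literal_eval() parses Python literal syntax only
assert literal_eval("42") == 42
assert literal_eval("3.14") == 3.14
assert literal_eval("[1, 2, 3]") == [1, 2, 3]
assert literal_eval("{'a': 1}") == {'a': 1}
assert literal_eval("True") is True   # must be 'True', not 'TRUE'
assert literal_eval("None") is None   # an empty cell must be rewritten as None

# an unquoted plain string is not a valid literal and raises ValueError
try:
    literal_eval("hello")
except ValueError:
    pass
```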

I'm writing an importer for MongoDB that does all of this. The following is part of the code I've written so far.

[NOTE: My CSV uses tab as the field delimiter. You may want to add some exception handling too.]

import ast
from itertools import islice

def getFieldnames(csvFile):
    """
    Read the first row and store values in a tuple
    """
    with open(csvFile) as csvfile:
        firstRow = csvfile.readlines(1)
    fieldnames = tuple(firstRow[0].strip('\n').split("\t"))
    return fieldnames

def writeCursor(csvFile, fieldnames):
    """
    Convert csv rows into an array of dictionaries
    All data types are automatically checked and converted
    """
    cursor = []  # Placeholder for the dictionaries/documents
    with open(csvFile) as csvfile:
        for row in islice(csvfile, 1, None):
            values = row.strip('\n').split("\t")
            for i, value in enumerate(values):
                try:
                    values[i] = ast.literal_eval(value)
                except (ValueError, SyntaxError):
                    # keep the raw string when the cell is not a valid literal
                    pass
            cursor.append(dict(zip(fieldnames, values)))
    return cursor
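
For comparison, the same conversion can be sketched with csv.DictReader, which handles the header row and the fieldname/value pairing for you (my own sketch with made-up sample data, not the author's importer code):

```python
import ast
import csv
import io

# sample tab-delimited data; every cell is a valid Python literal
data = "name\tcount\tscores\n'alice'\t3\t[1, 2, 3]\n"

cursor = []
with io.StringIO(data) as f:  # or open("data.tsv")
    for row in csv.DictReader(f, delimiter="\t"):
        # convert each cell from its string form to the literal it spells
        cursor.append({k: ast.literal_eval(v) for k, v in row.items()})

print(cursor)
# [{'name': 'alice', 'count': 3, 'scores': [1, 2, 3]}]
```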

Convert the first column data type from float to int, and write back to the original csv file

You can do so very easily by using pandas:

import pandas as pd

df = pd.read_csv("data.csv")
df["col"] = df["col"].astype(int)
df.to_csv("data.csv", index=False)

If you don't know the name of the first column and want to use integer-based indexing, do this:

df.iloc[:, 0] = df.iloc[:, 0].astype(int)
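
One caveat worth adding (my suggestion, not part of the original answer): astype(int) truncates fractional parts and raises an error if the column contains missing values. pandas' nullable Int64 dtype handles the latter case:

```python
import pandas as pd

# a float column with a missing value, as read_csv would produce it
df = pd.DataFrame({"col": [1.0, 2.0, None]})

# plain astype(int) would raise on the NaN;
# "Int64" is pandas' nullable integer dtype and keeps the gap as <NA>
df["col"] = df["col"].astype("Int64")
print(df["col"].tolist())
```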

Is there a way to find and set the DataType from CSV file, without specifying it prior?

The parsed variable is not really special to csv.reader (there is no separate "CSV Reader" library; csv is a module in the Python standard library). That syntax is a Python generator expression. While the code works, compressing multiple concepts down into a single statement is not always the best way to illustrate a concept.
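
If the generator-expression syntax is unfamiliar, here is a minimal illustration (my own example; the asker's actual expression is not shown in this excerpt):

```python
import csv
import io

reader = csv.reader(io.StringIO("1,2\n3,4\n"))

# a generator expression: each row is converted lazily, only when iterated
parsed = (tuple(int(cell) for cell in row) for row in reader)

first = next(parsed)  # nothing is read from the reader until this point
rest = list(parsed)   # consuming the rest drains the underlying reader
print(first)          # (1, 2)
print(rest)           # [(3, 4)]
```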

A more beginner-friendly form of this code might look like:

import csv
import io

reader = csv.reader(io.StringIO(data), delimiter=",")

# use a conventional for loop to build up the getBackData list
getBackData = []
for row in reader:
    converted_row = (
        row[0] == 'True',
        row[1],
        int(row[2]),
        float(row[3]),
        row[4],
    )
    getBackData.append(converted_row)

Even cleaner would be to push all those converters into a convert_row function, and then build up getBackData with a list comprehension:

def convert_row(raw):
    return (
        raw[0] == 'True',
        raw[1],
        int(raw[2]),
        float(raw[3]),
        raw[4],
    )

reader = csv.reader(io.StringIO(data), delimiter=",")
getBackData = [convert_row(row) for row in reader]

Then you can modify the convert_row function however you like, but the reader and getBackData construction stay the same.

EDIT: getting the types (not very well tested, but this is the idea)

def try_bool(s):
    # will convert strings "True" and "False" to bools,
    # and raise an exception otherwise
    try:
        return {"True": True, "False": False}[s]
    except KeyError:
        raise ValueError("{!r} is not a valid bool".format(s))

def get_column_types(raw):
    types = []
    for col in raw:
        for test_type in (int, float, try_bool, str):
            try:
                test_type(col)
            except ValueError:
                # fail! not data of this type
                pass
            else:
                # it worked! add test_type to list of converters
                types.append(test_type)
                break
    return types

# read the first row and get the types of each column
first_row = next(csv.reader(input_file))
col_types = get_column_types(first_row)

# now create a list of new rows with converted data items
converted = []
for row in csv.reader(input_file):
    # use zip to walk list of converters and list of columns at the same time
    converted_row = [converter(raw_value)
                     for converter, raw_value in zip(col_types, row)]
    converted.append(converted_row)
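
Putting the pieces together on in-memory data (a self-contained sketch of the same idea; note the seek(0) rewind so the row used for type-sniffing is converted too, whereas the snippet above consumes and loses it):

```python
import csv
import io

def try_bool(s):
    try:
        return {"True": True, "False": False}[s]
    except KeyError:
        raise ValueError("{!r} is not a valid bool".format(s))

def get_column_types(raw):
    # probe each column with increasingly permissive converters
    types = []
    for col in raw:
        for test_type in (int, float, try_bool, str):
            try:
                test_type(col)
            except ValueError:
                continue
            types.append(test_type)
            break
    return types

data = "3,2.5,True,alice\n7,0.25,False,bob\n"
input_file = io.StringIO(data)

col_types = get_column_types(next(csv.reader(input_file)))
input_file.seek(0)  # rewind so the sniffed row is converted as well

converted = [[conv(v) for conv, v in zip(col_types, row)]
             for row in csv.reader(input_file)]
print(converted[0])  # [3, 2.5, True, 'alice']
```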


