How to Read a Column of CSV as Dtype List Using Pandas

Pandas read csv column values as list

You can try using pickle

Ex:

import pandas as pd

filename = "data.pkl"  # any path works

df = pd.DataFrame({"Col": [[1, 2, 3], [4, 5, 6]]})
df.to_pickle(filename)

# Read the pickle file
df = pd.read_pickle(filename)
print(df["Col"])
print(df["Col"][0][0])

Output:

0    [1, 2, 3]
1    [4, 5, 6]
Name: Col, dtype: object
1
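To see why pickle helps here, a minimal sketch (using an in-memory buffer in place of a file) showing that round-tripping the same frame through CSV loses the list dtype:

```python
import io

import pandas as pd

# Write a frame whose cells are lists out to CSV, then read it back.
df = pd.DataFrame({"Col": [[1, 2, 3], [4, 5, 6]]})
buf = io.StringIO()
df.to_csv(buf, index=False)
buf.seek(0)
df2 = pd.read_csv(buf)

# Each cell comes back as the literal string "[1, 2, 3]", not a list.
print(type(df2["Col"][0]))  # <class 'str'>
```

Pickle sidesteps this because it serializes the Python objects themselves instead of flattening them to text.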


Pandas read_csv dtype read all columns but few as string

For Pandas 1.5.0+, there's an easy way to do this. If you use a defaultdict instead of a normal dict for the dtype argument, any columns which aren't explicitly listed in the dictionary will use the default as their type. E.g.

from collections import defaultdict

import pandas as pd

types = defaultdict(str, A="int", B="float")
df = pd.read_csv("/path/to/file.csv", dtype=types, keep_default_na=False)

(I haven't tested this, but I assume you still need keep_default_na=False)


For older versions of Pandas:

You can read the entire csv as strings then convert your desired columns to other types afterwards like this:

df = pd.read_csv('/path/to/file.csv', dtype=str, keep_default_na=False)
# example df; yours will be from pd.read_csv() above
df = pd.DataFrame({'A': ['1', '3', '5'], 'B': ['2', '4', '6'], 'C': ['x', 'y', 'z']})
types_dict = {'A': int, 'B': float}
for col, col_type in types_dict.items():
    df[col] = df[col].astype(col_type)

keep_default_na=False is necessary if some of the columns contain empty strings or values like NA, which pandas converts to float NaN by default; without it you would end up with columns of mixed str/float type.

Another approach, if you really want to specify the proper types for all columns when reading the file in and not change them afterwards: read in just the column names (no rows), then use those to fill in which columns should be strings:

col_names = pd.read_csv('file.csv', nrows=0).columns
types_dict = {'A': int, 'B': float}
types_dict.update({col: str for col in col_names if col not in types_dict})
df = pd.read_csv('file.csv', dtype=types_dict)
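The same two-pass idea as a self-contained sketch, with io.StringIO standing in for 'file.csv':

```python
import io

import pandas as pd

# Hypothetical file contents.
raw = "A,B,C\n1,2.5,x\n3,4.5,y\n"

# First pass: read only the header row to get the column names.
col_names = pd.read_csv(io.StringIO(raw), nrows=0).columns

# Every column not explicitly typed becomes str.
types_dict = {'A': int, 'B': float}
types_dict.update({col: str for col in col_names if col not in types_dict})

# Second pass: full read with every column's dtype specified up front.
df = pd.read_csv(io.StringIO(raw), dtype=types_dict, keep_default_na=False)
print(df.dtypes)
```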

Reading csv containing a list in Pandas

One option is to use ast.literal_eval as converter:

>>> import ast
>>> df = pd.read_clipboard(header=None, quotechar='"', sep=',',
... converters={1:ast.literal_eval})
>>> df
    0                                             1
0  HK  [5328.1, 5329.3, 2013-12-27 13:58:57.973614]
1  HK  [5328.1, 5329.3, 2013-12-27 13:58:59.237387]
2  HK  [5328.1, 5329.3, 2013-12-27 13:59:00.346325]

And convert those lists to a DataFrame if needed, for example with:

>>> df = pd.DataFrame.from_records(df[1].tolist(), index=df[0],
... columns=list('ABC')).reset_index()
>>> df['C'] = pd.to_datetime(df['C'])
>>> df
    0       A       B                          C
0  HK  5328.1  5329.3 2013-12-27 13:58:57.973614
1  HK  5328.1  5329.3 2013-12-27 13:58:59.237387
2  HK  5328.1  5329.3 2013-12-27 13:59:00.346325
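read_clipboard only works in an interactive session; here is the same converter trick with read_csv on an in-memory sample. Note the timestamp inside the bracketed field must itself be quoted, or ast.literal_eval will reject it:

```python
import ast
import io

import pandas as pd

# One sample row mirroring the data above; the second field is a quoted
# CSV cell containing a Python list literal.
raw = 'HK,"[5328.1, 5329.3, \'2013-12-27 13:58:57.973614\']"\n'
df = pd.read_csv(io.StringIO(raw), header=None,
                 converters={1: ast.literal_eval})

# The converter parses the cell into a real Python list.
print(type(df[1][0]))  # <class 'list'>
```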

Pandas reading csv as string type

Update: this has been fixed: from 0.11.1, passing str/np.str is equivalent to using object.

Use the object dtype:

In [11]: pd.read_csv('a', dtype=object, index_col=0)
Out[11]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236

or better yet, just don't specify a dtype:

In [12]: pd.read_csv('a', index_col=0)
Out[12]:
           A         B
1A  0.356331  0.745585
1B  0.200374  0.013922

but bypassing the type sniffer and truly returning only strings requires a hacky use of converters:

In [13]: pd.read_csv('a', converters={i: str for i in range(100)})
Out[13]:
                      A                     B
1A  0.35633069074776547     0.745585398803751
1B  0.20037376323337375  0.013921830784260236

where 100 is some number equal to or greater than your total number of columns.

It's best to avoid the str dtype, see for example here.

How to drop a specific column of csv file while reading it using pandas?

If you know the column names beforehand, you can do it by setting the usecols parameter.

When you know which columns to use

Suppose you have a csv file with columns ['id','name','last_name'] and you want just ['name','last_name']. You can do it as below:

import pandas as pd
df = pd.read_csv("sample.csv", usecols = ['name','last_name'])

When you want the first N columns

If you don't know the column names but want the first N columns of the dataframe, you can do it with:

import pandas as pd
df = pd.read_csv("sample.csv", usecols=range(n))  # n = how many leading columns to keep
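A runnable sketch of keeping the first N columns, with a made-up three-column sample standing in for sample.csv; usecols accepts any list-like of integer positions, including range(n):

```python
import io

import pandas as pd

# Hypothetical file with columns id, name, last_name.
raw = "id,name,last_name\n1,ann,lee\n2,bob,kim\n"

n = 2  # keep only the first two columns
df = pd.read_csv(io.StringIO(raw), usecols=range(n))
print(list(df.columns))  # ['id', 'name']
```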

Edit

When you know name of the column to be dropped

# Read column names from file
cols = list(pd.read_csv("sample_data.csv", nrows=1))
print(cols)

# Use a list comprehension to exclude the unwanted column from usecols
df = pd.read_csv("sample_data.csv", usecols=[i for i in cols if i != 'name'])
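The same two-step drop-by-name approach as a self-contained sketch, with an in-memory sample standing in for sample_data.csv:

```python
import io

import pandas as pd

# Hypothetical file contents.
raw = "id,name,last_name\n1,ann,lee\n2,bob,kim\n"

# Step 1: read just the header to learn the column names.
cols = list(pd.read_csv(io.StringIO(raw), nrows=1))

# Step 2: re-read, selecting every column except 'name'.
df = pd.read_csv(io.StringIO(raw),
                 usecols=[i for i in cols if i != 'name'])
print(list(df.columns))  # ['id', 'last_name']
```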

Setting column types while reading csv with pandas

In your loop you are doing:

for col in dp.columns:
    print('column', col, ':', type(col[0]))

and you are correctly seeing str as the output everywhere because col[0] is the first letter of the name of the column, which is a string.

For example, if you run this loop:

for col in dp.columns:
    print('column', col, ':', col[0])

you will see the first letter of the string of each column name is printed out - this is what col[0] is.

Your loop only iterates on the column names, not on the series data.

What you really want is to check the type of each column's data (not its header or part of its header) in a loop.

So do this instead to get the types of the column data (non-header data):

for col in dp.columns:
    print('column', col, ':', type(dp[col][0]))

This is similar to what you did when printing the type of the rating column separately.
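For completeness, pandas can also report this directly: df.dtypes gives every column's dtype at once, without looping over cell values. A sketch, with dp as a made-up stand-in for the frame in the question:

```python
import pandas as pd

# Hypothetical frame standing in for dp.
dp = pd.DataFrame({"name": ["ann", "bob"], "rating": [4.5, 3.0]})

# Per-cell check, as in the loop above.
for col in dp.columns:
    print('column', col, ':', type(dp[col][0]))

# The idiomatic shortcut: one dtype per column, no loop needed.
print(dp.dtypes)
```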


