Using numpy.genfromtxt to read a csv file with strings containing commas
You can use pandas (fast becoming the default library for working with dataframes, i.e. heterogeneous data, in scientific Python) for this. Its read_csv
can handle this. From the docs:
quotechar : string
The character used to denote the start and end of a quoted item. Quoted items
can include the delimiter and it will be ignored.
The default value is ". An example:
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: s="""year, city, value
...: 2012, "Louisville KY", 3.5
...: 2011, "Lexington, KY", 4.0"""
In [4]: pd.read_csv(StringIO(s), quotechar='"', skipinitialspace=True)
Out[4]:
year city value
0 2012 Louisville KY 3.5
1 2011 Lexington, KY 4.0
The trick here is that you also have to use skipinitialspace=True
to deal with the spaces after the comma-delimiter.
Apart from its powerful csv reader, I can also strongly advise using pandas with the heterogeneous data you have (the example numpy output you give is all strings, although you could use structured arrays).
Reading CSV file with numpy.genfromtxt() - delimiter as a part of a row name
The best way, in my opinion, to read a CSV or any other character-delimited file is to use the DataFrame
class from pandas. You won't have to deal with the presence of commas, since pandas' read_csv
follows the common CSV quoting conventions.
import pandas as pd
data = pd.read_csv(source)
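As a quick, self-contained sketch (the inline CSV text here is made up for illustration), read_csv keeps a comma inside a quoted field intact, and the result converts to a NumPy array if needed:

```python
import io

import pandas as pd

# Made-up CSV content with a comma inside a quoted field
csv_text = 'year,city,value\n2011,"Lexington, KY",4.0\n'

df = pd.read_csv(io.StringIO(csv_text))
print(df["city"][0])   # the quoted comma is preserved, not split
arr = df.to_numpy()    # convert to a NumPy array if you need one
print(arr.shape)       # one row, three columns
```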
numpy read CSV file where some fields have commas?
It turns out the easiest way to do this is to use the standard library's csv
module to read the file into tuples, then use those tuples as input to a numpy array. I wish I could just read it in with numpy, but that doesn't seem to work.
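A minimal sketch of that approach (the inline CSV text is made up for illustration): the csv module respects quoting, so embedded commas stay in one field, and the resulting tuples feed straight into np.array:

```python
import csv
import io

import numpy as np

# Made-up CSV content with a comma inside a quoted field
csv_text = 'year,city,value\n2011,"Lexington, KY",4.0\n'

# csv.reader honors the quoting, keeping "Lexington, KY" as one field
with io.StringIO(csv_text) as f:
    rows = [tuple(row) for row in csv.reader(f)]

# Skip the header row; every field becomes a string in the array
arr = np.array(rows[1:])
print(arr)
```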
How to use NumPy to read in a CSV file containing strings and float values into a 2-D array
For a consistent number of columns and mixed datatypes, use:
import numpy as np
np.genfromtxt('filename', dtype=None, delimiter=",")
dtype=None
results in a structured array, so to access a field you must use its name (or the attribute, if you view it as a recarray).
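For example (with made-up inline data), combining dtype=None with names=True gives a structured array whose fields you index by name:

```python
import io

import numpy as np

csv_text = "id,city,value\n1,Louisville,3.5\n2,Lexington,4.0\n"

# dtype=None infers a type per column; names=True takes
# field names from the header row
data = np.genfromtxt(io.StringIO(csv_text), dtype=None,
                     delimiter=",", names=True, encoding="utf-8")

print(data["city"])    # string column, accessed by field name
print(data["value"])   # float column
```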
When should I use the numpy.genfromtxt instead of pandas.read_csv to read a csv file?
As I understand it, the pandas reader is an optimized parser written in C and is faster in most situations. genfromtxt
is an older pure-Python function with weaker type inference, which you can forget about if you have pandas.
In [45]: df=pd.DataFrame(np.arange(10**6).reshape(1000,1000))
In [46]: df.to_csv("data.csv")
In [47]: %time v=np.genfromtxt("data.csv",delimiter=',',dtype=int,skip_header=1)
Wall time: 5.62 s
In [48]: %time u=pd.read_csv("data.csv",engine='python')
Wall time: 3.97 s
In [49]: %time u=pd.read_csv("data.csv")
Wall time: 781 ms
The docs describe the engine option:
engine : {‘c’, ‘python’}, optional
Parser engine to use. The C engine is faster while the python engine
is currently more feature-complete.
Python : Reading CSV using np.genfromtxt resulting in different number of columns
You're passing a ,
as the delimiter while many of your column values contain commas themselves. You'd need to specify an explicit quotechar to get this to work.
Fortunately, pandas
handles this really well without much handholding. You could try loading your data with read_csv
and then converting the loaded dataframe to an array.
import pandas as pd
array = pd.read_csv(name, index_col=[0]).values
The loaded dataframe (what you get before calling .values
) looks like this:
df = pd.read_csv(name, index_col=[0])
print(df)
mean_fit_time mean_score_time mean_test_score mean_train_score \
0 0.341662 0.001036 0.842927 0.846898
1 0.554314 0.001825 0.846525 0.852755
2 0.526688 0.001368 0.843761 0.847841
3 0.494591 0.001116 0.840646 0.845428
4 0.617542 0.002490 0.844902 0.850814
param_NN__alpha param_NN__hidden_layer_sizes \
0 0.1 (7,)
1 0.1 (25, 7)
2 0.1 (11, 7)
3 0.1 (7, 5)
4 0.1 (25, 11, 7)
params rank_test_score \
0 {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (... 25
1 {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (... 5
2 {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (... 17
3 {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (... 32
4 {'NN__alpha': 0.1, 'NN__hidden_layer_sizes': (... 11
split0_test_score split0_train_score ... split2_test_score \
0 0.842071 0.847529 ... 0.845361
1 0.846019 0.853014 ... 0.847993
2 0.842509 0.847968 ... 0.845580
3 0.838342 0.848462 ... 0.846896
4 0.841413 0.849394 ... 0.850186
split2_train_score split3_test_score split3_train_score \
0 0.846158 0.838526 0.848689
1 0.849668 0.840061 0.851486
2 0.852027 0.843352 0.851596
3 0.851478 0.831286 0.838105
4 0.851972 0.845985 0.856477
split4_test_score split4_train_score std_fit_time std_score_time \
0 0.848804 0.845736 0.050932 0.000182
1 0.852535 0.850617 0.108354 0.000189
2 0.851876 0.844420 0.104162 0.000323
3 0.843757 0.838936 0.103976 0.000189
4 0.844196 0.841568 0.194023 0.000476
std_test_score std_train_score
0 0.003738 0.001075
1 0.004014 0.003307
2 0.005278 0.003603
3 0.005422 0.005727
4 0.003050 0.005209
[5 rows x 22 columns]
And yes, columns are automatically converted to the appropriate datatypes.
print(df.dtypes)
mean_fit_time float64
mean_score_time float64
mean_test_score float64
mean_train_score float64
param_NN__alpha float64
param_NN__hidden_layer_sizes object
params object
rank_test_score int64
split0_test_score float64
split0_train_score float64
split1_test_score float64
split1_train_score float64
split2_test_score float64
split2_train_score float64
split3_test_score float64
split3_train_score float64
split4_test_score float64
split4_train_score float64
std_fit_time float64
std_score_time float64
std_test_score float64
std_train_score float64
dtype: object
Statutory warning: this data, owing to its nature, will probably be more useful to you as a Python list than as a numpy array (which is optimised for homogeneous numeric data).
Read CSV file to numpy array, first row as strings, rest as float
You can keep the column names if you use the names=True
argument to np.genfromtxt
data = np.genfromtxt(path_to_csv, dtype=float, delimiter=',', names=True)
Please note the dtype=float
; that will convert your data to floats. This is more efficient than dtype=None
, which asks np.genfromtxt
to guess the datatype for you.
The output will be a structured array, where you can access individual columns by their name. The names are taken from your first row, possibly with some modifications: spaces in a column name are changed to _
, for example. The documentation should cover most questions you could have.
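A small sketch with made-up inline data, showing both the named-column access and the space-to-underscore renaming:

```python
import io

import numpy as np

# Made-up CSV; note the spaces in the header names
csv_text = "mean value,max speed\n1.5,2.5\n3.0,4.0\n"

data = np.genfromtxt(io.StringIO(csv_text), dtype=float,
                     delimiter=",", names=True)

# Spaces in the header are replaced with underscores in the field names
print(data["mean_value"])
print(data["max_speed"])
```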