How to Read CSV Data into a Record Array in Numpy

How do I read CSV data into a record array in NumPy?

Use numpy.genfromtxt() by setting the delimiter kwarg to a comma:

from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
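If you need an actual record array with named fields rather than a plain float array, genfromtxt can also infer a dtype per column and read the header row. A minimal sketch (same file name as above; the column name used for access is hypothetical):

from numpy import genfromtxt
# dtype=None infers a type for each column; names=True reads the header row,
# producing a structured array whose fields can be accessed by name
my_data = genfromtxt('my_file.csv', delimiter=',', dtype=None, names=True, encoding=None)
# e.g. my_data['some_column'] selects one column by its header name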

Load data from csv into numpy array

This should give you the required output.

import numpy as np

def load_data(path):
    return np.loadtxt(path, delimiter=',')
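Note that np.loadtxt expects complete, purely numeric data. If the file has a header row you can skip it, and passing a structured dtype yields a record-style array. A sketch, assuming a two-column file with a header (the field names 'x' and 'y' are made up):

import numpy as np

def load_data(path):
    # skiprows=1 skips the header line; the structured dtype gives named fields
    return np.loadtxt(path, delimiter=',', skiprows=1,
                      dtype=[('x', float), ('y', float)])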

How to import CSV file as NumPy array?

numpy.genfromtxt() is the best thing to use here

import numpy as np

csv = np.genfromtxt('file.csv', delimiter=',')
second = csv[:, 1]
third = csv[:, 2]

>>> second
array([ 432.,  300.,  432.])

>>> third
array([ 1.,  1.,  0.])
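If you only need particular columns, genfromtxt can also select them while parsing via usecols (with unpack=True to split them into separate arrays), for example:

import numpy as np
# usecols picks the second and third columns (0-based indices 1 and 2);
# unpack=True returns them as separate arrays
second, third = np.genfromtxt('file.csv', delimiter=',', usecols=(1, 2), unpack=True)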

Reading Two Numpy Arrays from csv file

This should work. There may be a more NumPy-native way, but this is how I built the arrays:

import numpy as np
import pandas as pd

A = []
B = []
df1 = pd.read_csv('numpy.csv', sep=';')
for x in range(len(df1.A)):
    A.append(df1.A[x].split(','))
for x in range(len(df1.B)):
    B.append(df1.B[x].split(','))
A = np.array(A).astype(int)
B = np.array(B).astype(int)

A
# array([[1, 1, 2, 2],
#        [6, 7, 3, 7],
#        [1, 8, 5, 3]])

B
# array([[3, 3, 4, 4],
#        [3, 5, 3, 5],
#        [6, 1, 7, 5]])
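A more vectorized variant of the same idea, assuming the same 'numpy.csv' layout (comma-separated integers inside each semicolon-separated cell), lets pandas do the splitting:

import numpy as np
import pandas as pd

df1 = pd.read_csv('numpy.csv', sep=';')
# str.split with expand=True turns each comma-separated cell into its own columns;
# to_numpy(dtype=int) then converts the result to an integer array in one step
A = df1['A'].str.split(',', expand=True).to_numpy(dtype=int)
B = df1['B'].str.split(',', expand=True).to_numpy(dtype=int)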

CSV data to Numpy structured array?

Using the latest NumPy (1.14) on Python 3.

Your sample, cleaned up:

In [93]: txt = """Name --- Class --- Numbers
...: a ---------- 1 -------- 3
...: b ---------- 2 -------- 4
...: c ---------- 3 -------- 2
...: a ---------- 1 -------- 3
...: b ---------- 2 ------- 1
...: c ---------- 3 --------- 2"""
In [94]: data = np.genfromtxt(txt.splitlines(), dtype=None, names=True, encoding=None)
In [95]: data
Out[95]:
array([('a', '----------', 1, '--------', 3),
       ('b', '----------', 2, '--------', 4),
       ('c', '----------', 3, '--------', 2),
       ('a', '----------', 1, '--------', 3),
       ('b', '----------', 2, '-------', 1),
       ('c', '----------', 3, '---------', 2)],
      dtype=[('Name', '<U1'), ('f0', '<U10'), ('Class', '<i8'), ('f1', '<U9'), ('Numbers', '<i8')])

Or skipping the dashed columns:

In [96]: data = np.genfromtxt(txt.splitlines(), dtype=None, names=True, encoding=None, usecols=[0,2,4])
In [97]: data
Out[97]:
array([('a', 1, 3),
       ('b', 2, 4),
       ('c', 3, 2),
       ('a', 1, 3),
       ('b', 2, 1),
       ('c', 3, 2)],
      dtype=[('Name', '<U1'), ('Class', '<i8'), ('Numbers', '<i8')])
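Fields of the resulting structured array are accessed by name, and a recarray view gives attribute-style access if you prefer it; a continuation of the same session might look like this (illustrative):

In [98]: data['Class']
Out[98]: array([1, 2, 3, 1, 2, 3])

In [99]: data.view(np.recarray).Numbers
Out[99]: array([3, 4, 2, 3, 1, 2])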

What is a fast way to read a matrix from a CSV file to NumPy if the size is known in advance?

Parsing CSV files correctly while supporting several data types (e.g. floating-point numbers, integers, strings) and possibly ill-formed input files is clearly not easy, and doing so efficiently is actually pretty hard. Moreover, decoding UTF-8 strings is much slower than reading plain ASCII strings. This is the reason why most CSV libraries are pretty slow. Not to mention that wrapping a library in Python can introduce significant overhead depending on the input types (especially strings).

Fortunately, if you need to read a CSV file containing a square matrix of integers that is assumed to be correctly formed, you can write much faster code dedicated to your specific needs (code that does not care about floating-point numbers, strings, UTF-8, header decoding, error handling, etc.).

That being said, any call to a basic CPython function tends to introduce a large overhead. Even a simple call to open+read is relatively slow (binary mode is significantly faster than text mode, but unfortunately still not that fast). The trick is to use NumPy to load the whole binary file into RAM with np.fromfile. This function is extremely fast: it just reads the whole file at once, puts its binary content in a raw memory buffer and returns a view on it. When the file is in the operating system cache or on a high-throughput NVMe SSD, it can be loaded at several GiB/s.

Once the file is loaded, you can decode it with Numba (or Cython) so the decoding runs nearly as fast as native code. Note that Numba does not handle strings/bytes well or efficiently. Fortunately, np.fromfile produces a contiguous byte array, which Numba can process very quickly. You can determine the size of the matrix just by reading the first line and counting the commas. Then you can fill the matrix very efficiently by decoding the integers on the fly, packing them into a flattened matrix, and treating end-of-line characters as regular separators. Note that both \r and \n can appear in the file since it is read in binary mode.

Here is the resulting implementation:

import numba as nb
import numpy as np

@nb.njit('int32[:,:](uint8[::1],)', cache=True)
def decode_csv_buffer(rawData):
    COMMA = np.uint8(ord(','))
    CR = np.uint8(ord('\r'))
    LF = np.uint8(ord('\n'))
    ZERO = np.uint8(ord('0'))

    # Find the size of the matrix (`n`) by scanning the first line
    n = 0
    lineSize = 0

    for i in range(rawData.size):
        c = rawData[i]
        if c == CR or c == LF:
            break
        n += rawData[i] == COMMA
        lineSize += 1

    n += 1

    # Empty matrix
    if lineSize == 0:
        return np.empty((0, 0), dtype=np.int32)

    # Initialization
    res = np.empty(n * n, dtype=np.int32)

    # Fill the matrix
    curInt = 0
    curPos = 0
    lastCharIsDigit = True

    for i in range(len(rawData)):
        c = rawData[i]
        if c == CR or c == LF or c == COMMA:
            if lastCharIsDigit:
                # Write the last parsed integer into the flattened matrix
                res[curPos] = curInt
                curPos += 1
            curInt = 0
            lastCharIsDigit = False
        else:
            curInt = curInt * 10 + (c - ZERO)
            lastCharIsDigit = True

    return res.reshape(n, n)


def load_numba(filename):
    # Load the whole file into a raw memory buffer
    rawData = np.fromfile(filename, dtype=np.uint8)

    # Decode the buffer with the Numba-compiled function.
    # This only works for this specific format and can simply
    # crash if the file content is invalid.
    return decode_csv_buffer(rawData)

Be aware that the code is not robust (any bad input results in undefined behaviour, possibly including a crash), but it is very fast.
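As a quick sanity check (a sketch assuming a square matrix of non-negative integers written to a hypothetical matrix.csv, with load_numba defined as above), the result should match what np.loadtxt gives:

import numpy as np

# Write a small square matrix of non-negative integers as CSV
mat = np.random.randint(0, 1000, size=(500, 500))
np.savetxt('matrix.csv', mat, fmt='%d', delimiter=',')

# On well-formed input, the fast Numba loader and the generic loader should agree
assert np.array_equal(load_numba('matrix.csv'),
                      np.loadtxt('matrix.csv', delimiter=',', dtype=np.int32))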

On my machine, the Numba implementation above is at least one order of magnitude faster than all the other approaches tested. Note that you could write an even faster version by decoding with multiple threads, but this would make the code significantly more complex.

How to read csv file into numpy arrays without quotation mark

Use X = np.array(line[1].split(" ")).astype(int) (note that np.int has been removed from recent NumPy versions, so use the built-in int).
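If the quotation marks come from the CSV file itself, another option is to let the csv module strip them while parsing. A sketch, assuming a hypothetical data.csv whose second field holds space-separated integers:

import csv
import numpy as np

with open('data.csv', newline='') as f:
    # csv.reader removes the surrounding quotation marks from quoted fields
    rows = [np.array(line[1].split(' '), dtype=int) for line in csv.reader(f)]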


