How do I read CSV data into a record array in NumPy?
Use numpy.genfromtxt()
by setting the delimiter
kwarg to a comma:
from numpy import genfromtxt
my_data = genfromtxt('my_file.csv', delimiter=',')
Load data from csv into numpy array
This should give you your required output.
import numpy as np
def load_data(path):
return np.loadtxt(path,delimiter=',')
How to import CSV file as NumPy array?
numpy.genfromtxt()
is the best thing to use here
import numpy as np
csv = np.genfromtxt ('file.csv', delimiter=",")
second = csv[:,1]
third = csv[:,2]
>>> second
Out[1]: array([ 432., 300., 432.])
>>> third
Out[2]: array([ 1., 1., 0.])
Reading Two Numpy Arrays from csv file
This should work, there may be a more numpy way, but this is how i made the array:
import numpy as np
import pandas as pd
A=[]
B=[]
df1=pd.read_csv('numpy.csv', sep=";")
for x in range(len(df1.A)):
A.append(df1.A[x].split(','))
for x in range(len(df1.B)):
B.append(df1.B[x].split(','))
A=np.array(A).astype(np.int)
B=np.array(B).astype(np.int)
A
#array([[1, 1, 2, 2],
# [6, 7, 3, 7],
# [1, 8, 5, 3]])
B
Out[251]:
#array([[3, 3, 4, 4],
# [3, 5, 3, 5],
# [6, 1, 7, 5]])
CSV data to Numpy structured array?
Using latest numpy (1.14) on Py3.
Your sample, cleaned up:
In [93]: txt = """Name --- Class --- Numbers
...: a ---------- 1 -------- 3
...: b ---------- 2 -------- 4
...: c ---------- 3 -------- 2
...: a ---------- 1 -------- 3
...: b ---------- 2 ------- 1
...: c ---------- 3 --------- 2"""
In [94]: data = np.genfromtxt(txt.splitlines(), dtype=None, names=True, encoding=None)
In [95]: data
Out[95]:
array([('a', '----------', 1, '--------', 3),
('b', '----------', 2, '--------', 4),
('c', '----------', 3, '--------', 2),
('a', '----------', 1, '--------', 3),
('b', '----------', 2, '-------', 1),
('c', '----------', 3, '---------', 2)],
dtype=[('Name', '<U1'), ('f0', '<U10'), ('Class', '<i8'), ('f1', '<U9'), ('Numbers', '<i8')])
Or skipping the dashed columns:
In [96]: data = np.genfromtxt(txt.splitlines(), dtype=None, names=True, encoding=None, usecols=[0,2,4])
In [97]: data
Out[97]:
array([('a', 1, 3),
('b', 2, 4),
('c', 3, 2),
('a', 1, 3),
('b', 2, 1),
('c', 3, 2)],
dtype=[('Name', '<U1'), ('Class', '<i8'), ('Numbers', '<i8')])
What is a fast way to read a matrix from a CSV file to NumPy if the size is known in advance?
Parsing CSV files correctly while supporting several data types (eg. floating-point numbers, integers, strings) and possibly ill-formed input files is clearly not easy, and doing so efficiently is actually pretty hard. Moreover, decoding UTF-8 strings is also much slower than reading directly ASCII strings. This is the reasons why most CSV libraries are pretty slow. Not to mention wrapping library in Python could introduce pretty big overheads regarding the input types (especially string).
Hopefully, if you need to read a CSV file containing a square matrix of integers that is assumed to be correctly formed, then you can write a much faster specific code dedicated to your needs (which does not care about floating-point numbers, strings, UTF-8, header decoding, error handling, etc.).
That being said, any call to a basic CPython function tends to introduce a huge overhead. Even a simple call to open
+read
is relatively slow (the binary mode is significantly faster than the text mode but unfortunately not so fast). The trick is to use Numpy to load the whole binary file in RAM with np.fromfile
. This function is extremely fast: it just read the whole file at once, put its binary content in a raw memory buffer and return a view on it. When the file is in the operating system cache or a high-throughput NVMe SSD storage device, it can load the file at the speed of several GiB/s.
One the file is loaded, you can decode it with Numba (or Cython) so the decoding can be nearly as fast as a native code. Note that Numba does not support well/efficiently strings/bytes. Hopefully, the function np.fromfile
produces a contiguous byte array and Numba can compute it very quickly. You can know the size of the matrix by just reading the first line and counting the number of comma. Then you can fill the matrix very efficiently by decoding integer on-the-fly, packing them in a flatten matrix and just consider end-of-line characters as regular separators. Note that \r
and \n
can both appear in the file since the file is read in binary mode.
Here is the resulting implementation:
import numba as nb
import numpy as np
@nb.njit('int32[:,:](uint8[::1],)', cache=True)
def decode_csv_buffer(rawData):
COMMA = np.uint8(ord(','))
CR = np.uint8(ord('\r'))
LF = np.uint8(ord('\n'))
ZERO = np.uint8(ord('0'))
# Find the size of the matrix (`n`)
n = 0
lineSize = 0
for i in range(rawData.size):
c = rawData[i]
if c == CR or c == LF:
break
n += rawData[i] == COMMA
lineSize += 1
n += 1
# Empty matrix
if lineSize == 0:
return np.empty((0, 0), dtype=np.int32)
# Initialization
res = np.empty(n * n, dtype=np.int32)
# Fill the matrix
curInt = 0
curPos = 0
lastCharIsDigit = True
for i in range(len(rawData)):
c = rawData[i]
if c == CR or c == LF or c == COMMA:
if lastCharIsDigit:
# Write the last int in the flatten matrix
res[curPos] = curInt
curPos += 1
curInt = 0
lastCharIsDigit = False
else:
curInt = curInt * 10 + (c - ZERO)
lastCharIsDigit = True
return res.reshape(n, n)
def load_numba(filename):
# Load fully the file in a raw memory buffer
rawData = np.fromfile(filename, dtype=np.uint8)
# Decode the buffer using the Numba JIT
# This method only work for your specific needs and
# can simply crash if the file content is invalid.
return decode_csv_buffer(rawData)
Be aware that the code is not robust (any bad input results in an undefined behaviour, including a crash), but it is very fast.
Here are the results on my machine:
As you can see, the above Numba implementation is at least one order of magnitude faster than all others. Note that you can write an even faster code using multiple threads during the decoding, but this makes the code significantly more complex.
How to read csv file into numpy arrays without quotation mark
Use X = np.array(line[1].split(" ")).astype(np.int)
Related Topics
Python Requests Throwing Sslerror
Finding and Replacing Elements in a List
Set Value For Particular Cell in Pandas Dataframe Using Index
Why Should We Not Use Sys.Setdefaultencoding("Utf-8") in a Py Script
How to Fill Out a Python String With Spaces
How to Test If a String Contains One of the Substrings in a List, in Pandas
Refering to a Directory in a Flask App Doesn't Work Unless the Path Is Absolute
How to Programmatically Set an Attribute
Multiple Prints on the Same Line in Python
Pygame Doesn't Let Me Use Float For Rect.Move, But I Need It
Text Progress Bar in Terminal With Block Characters
What Is an Alternative to Execfile in Python 3
How to Write to an Excel Spreadsheet Using Python
Count the Number of Occurrences of a Character in a String