Sklearn Error ValueError: Input Contains NaN, Infinity or a Value Too Large for dtype('float64')

sklearn error ValueError: Input contains NaN, infinity or a value too large for dtype('float64')

This error can be raised inside scikit-learn itself, and the cause depends on what you are doing. I recommend reading the documentation for the functions you are using. You might be calling one that requires, for example, a positive definite matrix, and your input does not meet that criterion.

EDIT: How could I miss that:

np.isnan(mat.any()) #and gets False
np.isfinite(mat.all()) #and gets True

is obviously wrong. The correct checks are:

np.any(np.isnan(mat))

and

np.all(np.isfinite(mat))

You want to check whether any of the elements is NaN, not whether the single boolean returned by any() is NaN (it never is).
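As a minimal sketch of the corrected checks, the helper below counts NaN and infinite entries in an array. The function name report_non_finite and the sample matrix are illustrative assumptions, not part of the original question.

```python
import numpy as np

def report_non_finite(mat):
    """Count NaN and infinite entries in a numeric array (hypothetical helper)."""
    arr = np.asarray(mat, dtype=np.float64)
    return {
        "nan": int(np.isnan(arr).sum()),        # element-wise NaN test, then count
        "inf": int(np.isinf(arr).sum()),        # counts both +inf and -inf
        "all_finite": bool(np.isfinite(arr).all()),
    }

# Made-up sample data with one NaN and one infinity.
mat = np.array([[1.0, np.nan], [np.inf, 4.0]])
report = report_non_finite(mat)

# The mistaken form reduces the array to one boolean first, so it cannot see the NaN.
wrong = bool(np.isnan(mat.any()))
```

The element-wise form tests every entry before reducing, which is why it finds the problem that the mistaken form hides.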

Input contains NaN, infinity or a value too large for dtype('float32') when I train a DecisionTreeClassifier

There are some infinite values in your mass_error_min column:

data_new_2.describe()

               mass       mass_error_min
count   1425.000000       1425.0000
mean    6.060956          inf
std     13.568726         NaN
min     0.000002          0.0000
25%     0.054750          0.0116
50%     0.725000          0.0700
75%     3.213000          0.5300
max     135.300000        inf

So you have to replace those inf values with a finite value, for example the 98th percentile of the column:

value = data_new_2['mass_error_min'].quantile(0.98)
data_new_2 = data_new_2.replace(np.inf, value)
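One caveat: if the column contains many infinite entries, a quantile computed on the raw column can itself be inf. A hedged variant, sketched below on a small made-up frame standing in for data_new_2, takes the quantile from the finite entries only. It also counts values outside the float32 range, since tree-based estimators in scikit-learn cast their input to float32 and larger magnitudes would overflow to inf.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for data_new_2 with one infinite entry.
data_new_2 = pd.DataFrame({"mass_error_min": [0.0, 0.0116, 0.07, 0.53, np.inf]})

col = data_new_2["mass_error_min"]
finite = col[np.isfinite(col)]      # keeps only entries that are not inf, -inf or NaN
value = finite.quantile(0.98)       # replacement value taken from finite entries only
data_new_2["mass_error_min"] = col.replace(np.inf, value)

# Count values whose magnitude would not fit into float32.
float32_max = np.finfo(np.float32).max
too_large = int((data_new_2["mass_error_min"].abs() > float32_max).sum())
```

Negative infinity, if present, would need the same treatment with a lower bound of your choosing.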

Scikit-learn : Input contains NaN, infinity or a value too large for dtype ('float64')

The problem with your regression is that NaNs have somehow crept into your data. You can check this with the following snippet:

import pandas as pd
import numpy as np
from  sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20

reader = pd.read_csv("./data/all-stocks-cleaned.csv")
stock = np.array(reader)

openingPrice = stock[:, 1]
closingPrice = stock[:, 5]

openingPriceTrain, openingPriceTest, closingPriceTrain, closingPriceTest = \
    train_test_split(openingPrice, closingPrice, test_size=0.25, random_state=42)

openingPriceTrain = openingPriceTrain.reshape(openingPriceTrain.size,1)
openingPriceTrain = openingPriceTrain.astype(np.float64, copy=False)

closingPriceTrain = closingPriceTrain.reshape(closingPriceTrain.size,1)
closingPriceTrain = closingPriceTrain.astype(np.float64, copy=False)

openingPriceTest = openingPriceTest.reshape(openingPriceTest.size,1)
openingPriceTest = openingPriceTest.astype(np.float64, copy=False)

np.isnan(openingPriceTrain).any(), np.isnan(closingPriceTrain).any(), np.isnan(openingPriceTest).any()

(True, True, True)

If you try imputing missing values like below:

openingPriceTrain[np.isnan(openingPriceTrain)] = np.median(openingPriceTrain[~np.isnan(openingPriceTrain)])
closingPriceTrain[np.isnan(closingPriceTrain)] = np.median(closingPriceTrain[~np.isnan(closingPriceTrain)])
openingPriceTest[np.isnan(openingPriceTest)] = np.median(openingPriceTest[~np.isnan(openingPriceTest)])

your regression will run smoothly without a problem:

regression = linear_model.LinearRegression()

regression.fit(openingPriceTrain, closingPriceTrain)

predicted = regression.predict(openingPriceTest)

predicted[:5]

array([[ 13598.74748173],
       [ 53281.04442146],
       [ 18305.4272186 ],
       [ 50753.50958453],
       [ 14937.65782778]])

In short: you have missing values in your data, as the error message said.
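The manual imputation above fills the test set with its own median. A common alternative, sketched here under the assumption that scikit-learn 0.20 or later is installed, is to put SimpleImputer and the regressor into one pipeline so that the median is learned from the training data only. The tiny arrays are made-up illustration data, not the stock file from the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Made-up data: one missing feature value in train, one in test.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([2.0, 4.0, 6.0, 8.0])
X_test = np.array([[3.0], [np.nan]])

# The imputer learns the training median during fit and reuses it in predict.
model = make_pipeline(SimpleImputer(strategy="median"), LinearRegression())
model.fit(X_train, y_train)
predicted = model.predict(X_test)
```

Note that this only handles missing values in the features; NaNs in the target still have to be dropped or filled before calling fit.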

EDIT:

Perhaps an easier and more straightforward approach would be to check whether you have any missing data right after you read the data with pandas:

data = pd.read_csv('./data/all-stocks-cleaned.csv')
data.isnull().any()
Date                    False
Open                     True
High                     True
Low                      True
Last                     True
Close                    True
Total Trade Quantity     True
Turnover (Lacs)          True

and then impute the data with either of the two lines below (fillna does not accept a function, so pass the per-column medians directly):

data = data.fillna(data.median(numeric_only=True))

data = data.ffill()
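To illustrate the median-based fill, here is a small sketch on a made-up frame with the same kind of layout (a text Date column plus numeric price columns). The column values are assumptions for demonstration only.

```python
import numpy as np
import pandas as pd

# Made-up sample resembling the stock data: one text column, two numeric columns.
data = pd.DataFrame({
    "Date": ["2017-01-02", "2017-01-03", "2017-01-04"],
    "Open": [100.0, np.nan, 104.0],
    "Close": [101.0, 103.0, np.nan],
})

# Per-column medians of the numeric columns; NaNs are skipped by default.
medians = data.median(numeric_only=True)

# fillna with a Series fills each column with the value at the matching label.
filled = data.fillna(medians)
```

Forward fill is the other option, but it leaves a NaN in place when the first row of a column is missing, so it is worth re-checking with isnull afterwards.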

ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). sklearn

It looks like the column hours_worked_each_week contains nulls.

Do you get the same error if you drop that column?

X = df.drop(['infected', 'hours_worked_each_week'], axis=1).values

Alternatively, you can replace the nulls with 0:

df.fillna(0,inplace=True)
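Before choosing between dropping the column and filling with 0, it can help to confirm which columns actually hold nulls. The sketch below uses a made-up frame; only the column names infected and hours_worked_each_week come from the question, the age column and all values are assumptions. It shows a third option, dropping only the affected rows.

```python
import numpy as np
import pandas as pd

# Made-up sample; "age" and all values are hypothetical.
df = pd.DataFrame({
    "infected": [0, 1, 0, 1],
    "age": [34, 51, 29, 42],
    "hours_worked_each_week": [40.0, np.nan, 35.0, np.nan],
})

# Count nulls per column and keep the names of the affected columns.
null_counts = df.isnull().sum()
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

# Drop only the rows that are missing a value in one of those columns.
clean = df.dropna(subset=cols_with_nulls)
X = clean.drop(columns=["infected"]).values
```

Which option is appropriate depends on the data: filling with 0 changes the meaning of the feature, while dropping rows shrinks the training set.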
