Error in a function for Replacing all nan values in a dataframe
You can define which values are treated as null when you read the file with pd.read_csv(). Per the docs:
na_values : scalar, str, list-like, or dict, optional
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
In your case, you can try:
data=pd.read_csv("diabetes.csv", na_values=["_","-","?","","na","n/a"])
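To see the effect without the original file, here is a minimal sketch that reads a small in-memory CSV. The column names and values are made up for illustration and are not taken from diabetes.csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for diabetes.csv; the column names are assumptions.
csv_text = "glucose,bmi\n148,33.6\n?,26.6\n183,-\n"

data = pd.read_csv(
    io.StringIO(csv_text),
    na_values=["_", "-", "?", "", "na", "n/a"],
)

# Count the missing values per column after parsing.
print(data.isna().sum())
```

Because the placeholders are converted to NaN at parse time, the columns are read as numeric rather than as strings.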
Fill in missing values by group in data.table
There is now a native data.table way of filling missing values (as of 1.12.4).
This question spawned a GitHub issue, which was recently closed with the creation of the functions nafill and setnafill. You can now use:
DT[, value_filled_in := nafill(value, type = "locf")]
It is also possible to fill NA with a constant value or with the next observation carried backward.
One difference from the approach in the question is that these functions currently only work on NA, not NaN, whereas is.na is TRUE for NaN. This is planned to be fixed in the next release through an extra argument.
I have no involvement with the project but I saw that although the github issue links here, there was no link the other way so I'm answering on behalf of future visitors.
Update: By default, NaN is now treated the same as NA.
pandas replace np.nan based on multiple conditions
Try rewriting your np.where statement:
df['is_less'] = np.where( (df['A'].isnull()) | (df['B'].isnull() ),np.nan, # check if A or B are np.nan
np.where(df['B'].ge(df['A']),'no','yes')) # check if B >= A
prints:
      A     B is_less
0   NaN  10.0     nan
1  10.0   NaN     nan
2   1.0   5.0      no
3   5.0   1.0     yes
See also: Series.ge, the method form of greater than or equal (>=).
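Note that the printed column holds the string 'nan' rather than a real missing value, because np.where builds a single string array from mixed inputs. If a true NaN is wanted, one possible variation (a sketch, not part of the original answer) is to build the labels first and then mask the rows where either input is missing. The sample frame is reconstructed from the printed output above.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the printed output above.
df = pd.DataFrame({"A": [np.nan, 10, 1, 5], "B": [10, np.nan, 5, 1]})

# Rows where either input is missing.
either_missing = df["A"].isnull() | df["B"].isnull()

# Label every row first, then blank out the rows that cannot be compared.
labels = pd.Series(np.where(df["B"].ge(df["A"]), "no", "yes"), index=df.index)
df["is_less"] = labels.mask(either_missing)

print(df)
```

With this version, df['is_less'].isna() reports the first two rows as missing, so later calls such as fillna or dropna treat them as expected.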
Is there a python function to fill missing data with consecutive value
One way is to use loc with an array:
df.loc[df['b'].isnull(), 'b'] = [1, 2]
What you're attempting is possible but cumbersome with fillna:
nulls = df['b'].isnull()
df['b'] = df['b'].fillna(pd.Series([1, 2], index=nulls[nulls].index))
You may be looking for interpolate, but the above solutions are generic given an input list or array.
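The difference between the two ideas can be sketched on a small made-up column: loc writes exactly the values you supply, while interpolate derives them from the neighbouring entries (linearly, by default). The data and the replacement values below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up column with two consecutive gaps.
df = pd.DataFrame({"b": [0.0, np.nan, np.nan, 3.0]})

# Explicit replacement values, one per missing row, as in the loc approach.
explicit = df.copy()
explicit.loc[explicit["b"].isnull(), "b"] = [10, 20]

# interpolate() computes the fill values from the surrounding non-null entries.
interpolated = df["b"].interpolate()

print(explicit["b"].tolist())
print(interpolated.tolist())
```

The list passed to loc must have exactly as many elements as there are missing rows, otherwise pandas raises an error.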
If, on the other hand, you want to fill nulls with a sequence 1, 2, 3, etc., you can use cumsum:
# fillna solution
df['b'] = df['b'].fillna(df['b'].isnull().cumsum())
# loc solution
nulls = df['b'].isnull()
df.loc[nulls, 'b'] = nulls.cumsum()
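To make the cumsum trick concrete, here is a small sketch on made-up data. The running count only increases at missing rows, so each null receives the next integer, while existing values are left alone.

```python
import numpy as np
import pandas as pd

# Made-up column: the second row already has a value.
df = pd.DataFrame({"b": [np.nan, 5.0, np.nan, np.nan]})

nulls = df["b"].isnull()

# Running count of missing rows seen so far.
counter = nulls.cumsum()

# fillna only uses the counter at positions where 'b' is null.
filled = df["b"].fillna(counter)

print(counter.tolist())
print(filled.tolist())
```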
How to find the difference between elements, ignoring NA values
You can filter out the NaN values and then use diff:
s = pd.Series([np.nan, np.nan, np.nan, '2019-12-11', np.nan, '2019-12-14', np.nan, np.nan, '2019-12-20', '2019-12-23'])
s = pd.to_datetime(s)
s[~s.isna()].diff()
# 3 NaT
# 5 3 days
# 8 6 days
# 9 3 days
# dtype: timedelta64[ns]
Another option would be:
s.ffill().diff()
# 0 NaT
# 1 NaT
# 2 NaT
# 3 NaT
# 4 0 days
# 5 3 days
# 6 0 days
# 7 0 days
# 8 6 days
# 9 3 days
# dtype: timedelta64[ns]
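If the differences between non-missing dates are needed on the original index, one possible variation (an assumption about what might be wanted, not something the answer states) is to reindex the filtered result. Rows that were missing then show NaT instead of 0 days. The sketch uses a shortened version of the series above.

```python
import numpy as np
import pandas as pd

# Shortened version of the series used above.
s = pd.Series([np.nan, "2019-12-11", np.nan, "2019-12-14", "2019-12-20"])
s = pd.to_datetime(s)

# Differences between consecutive non-missing dates, placed back on the
# original index; positions that were missing become NaT.
gaps = s.dropna().diff().reindex(s.index)

print(gaps)
```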
Replace NaN Values with the Means of other Cols based on Condition
You could implement the function like this:
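def replace_missing_with_conditional_mean(df, condition_cols, cols):
    # Per-group mean of each target column, aligned to the original rows
    s = df.groupby(condition_cols)[cols].transform('mean')
    # Fill each column's NaN values from its matching per-group mean Series
    return df.fillna(s.to_dict('series'))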
res = replace_missing_with_conditional_mean(df, ['Col1', 'Col2'], ['Col3'])
print(res)
Output
  Col1 Col2  Col3
0    A    c   1.0
1    A    c   3.0
2    B    c   5.0
3    A    d   6.0
4    A    c   2.0
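For a self-contained run, here is a sketch with an input frame guessed from the output above (the last Col3 value is assumed to have been NaN). It uses a column-by-column fillna as a slightly more explicit variant of the same idea, and the function name fill_with_group_mean is hypothetical.

```python
import numpy as np
import pandas as pd

# Input guessed from the printed output; the last Col3 value is assumed missing.
df = pd.DataFrame({
    "Col1": ["A", "A", "B", "A", "A"],
    "Col2": ["c", "c", "c", "d", "c"],
    "Col3": [1.0, 3.0, 5.0, 6.0, np.nan],
})


def fill_with_group_mean(df, condition_cols, cols):
    """Return a copy of df with NaN in cols replaced by the per-group mean."""
    out = df.copy()
    # Group means, broadcast back to one value per original row.
    means = out.groupby(condition_cols)[cols].transform("mean")
    for col in cols:
        out[col] = out[col].fillna(means[col])
    return out


res = fill_with_group_mean(df, ["Col1", "Col2"], ["Col3"])
print(res)
```

The missing value in row 4 belongs to the group (A, c), whose other values are 1.0 and 3.0, so it is filled with 2.0.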
Cannot assign nan /empty value in np.where
I was able to fix it by using pandas.NA instead, which fillna, for some reason, recognizes as a blank to fill with ffill.
Fix:
df['x'] = np.where(df['y']>0.05,1,pd.NA)
df['x'] = df['x'].ffill()  # same as fillna(method="ffill"), which is deprecated in newer pandas
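An alternative that stays inside pandas and avoids np.where altogether is Series.where, which leaves NaN wherever the condition is false. This is a sketch on made-up data, not the fix described above, and the threshold is simply copied from the snippet.

```python
import pandas as pd

# Made-up values for the 'y' column.
df = pd.DataFrame({"y": [0.01, 0.10, 0.02, 0.30]})

# 1.0 where y > 0.05, NaN everywhere else.
flag = pd.Series(1.0, index=df.index).where(df["y"] > 0.05)

# Carry the last valid flag forward; leading NaN values stay missing.
df["x"] = flag.ffill()

print(df)
```

Because the column stays a float column with real NaN values, ffill recognizes the gaps without any special handling.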