Replace negative values by zero
Thanks for the reproducible example. This is pretty basic R stuff. You can assign to selected elements of a vector (note an array has dimensions, and what you've given is a vector not an array):
> pred_precipitation[pred_precipitation<0] <- 0
> pred_precipitation
[1] 1.2091281 0.0000000 7.7665555 0.0000000 0.0000000 0.0000000 0.5151504 0.0000000 1.8281251
[10] 0.5098688 2.8370263 0.4895606 1.5152191 4.1740177 7.1527742 2.8992215 4.5322934 6.7180530
[19] 0.0000000 1.1914052 3.6152333 0.0000000 0.3778717 0.0000000 1.4940469
Benchmark wars!
@James has found an even faster method and left it in a comment. I upvoted him, if only because I know his victory will be short-lived.
First, I try compiling, but that doesn't seem to help anyone:
p <- rnorm(10000)
gsk3 <- function(x) { x[x<0] <- 0; x }
jmsigner <- function(x) ifelse(x<0, 0, x)
joshua <- function(x) pmin(x,0)
james <- function(x) (abs(x)+x)/2
library(compiler)
gsk3.c <- cmpfun(gsk3)
jmsigner.c <- cmpfun(jmsigner)
joshua.c <- cmpfun(joshua)
james.c <- cmpfun(james)
microbenchmark(joshua(p),joshua.c(p),gsk3(p),gsk3.c(p),jmsigner(p),james(p),jmsigner.c(p),james.c(p))
expr min lq median uq max
1 gsk3.c(p) 251.782 255.0515 266.8685 269.5205 457.998
2 gsk3(p) 256.262 261.6105 270.7340 281.3560 2940.486
3 james.c(p) 38.418 41.3770 43.3020 45.6160 132.342
4 james(p) 38.934 42.1965 43.5700 47.2085 4524.303
5 jmsigner.c(p) 2047.739 2145.9915 2198.6170 2291.8475 4879.418
6 jmsigner(p) 2047.502 2169.9555 2258.6225 2405.0730 5064.334
7 joshua.c(p) 237.008 244.3570 251.7375 265.2545 376.684
8 joshua(p) 237.545 244.8635 255.1690 271.9910 430.566
But wait! Dirk wrote this Rcpp thing. Can a complete C++ incompetent read his JSS paper, adapt his example, and write the fastest function of them all? Stay tuned, dear listeners.
library(inline)
cpp_if_src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
for(int i=0; i < n_xa; i++) {
if(xa[i]<0) xa[i] = 0;
}
return xa;
'
cpp_if <- cxxfunction(signature(a="numeric"), cpp_if_src, plugin="Rcpp")
microbenchmark(joshua(p),joshua.c(p),gsk3(p),gsk3.c(p),jmsigner(p),james(p),jmsigner.c(p),james.c(p), cpp_if(p))
expr min lq median uq max
1 cpp_if(p) 8.233 10.4865 11.6000 12.4090 69.512
2 gsk3(p) 170.572 172.7975 175.0515 182.4035 2515.870
3 james(p) 37.074 39.6955 40.5720 42.1965 2396.758
4 jmsigner(p) 1110.313 1118.9445 1133.4725 1164.2305 65942.680
5 joshua(p) 237.135 240.1655 243.3990 250.3660 2597.429
That's affirmative, captain.
This modifies the input p
even if you don't assign to it. If you want to avoid that behavior, you have to clone:
cpp_ifclone_src <- '
Rcpp::NumericVector xa(Rcpp::clone(a));
int n_xa = xa.size();
for(int i=0; i < n_xa; i++) {
if(xa[i]<0) xa[i] = 0;
}
return xa;
'
cpp_ifclone <- cxxfunction(signature(a="numeric"), cpp_ifclone_src, plugin="Rcpp")
Which unfortunately kills the speed advantage.
How to replace negative numbers in Pandas Data Frame by zero
If all your columns are numeric, you can use boolean indexing:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})
In [3]: df
Out[3]:
a b
0 0 -3
1 -1 2
2 2 1
In [4]: df[df < 0] = 0
In [5]: df
Out[5]:
a b
0 0 0
1 0 2
2 2 1
For the more general case, this answer shows the private method _get_numeric_data
:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
'c': ['foo', 'goo', 'bar']})
In [3]: df
Out[3]:
a b c
0 0 -3 foo
1 -1 2 goo
2 2 1 bar
In [4]: num = df._get_numeric_data()
In [5]: num[num < 0] = 0
In [6]: df
Out[6]:
a b c
0 0 0 foo
1 0 2 goo
2 2 1 bar
With timedelta
type, boolean indexing seems to work on separate columns, but not on the whole dataframe. So you can do:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': pd.to_timedelta([0, -1, 2], 'd'),
...: 'b': pd.to_timedelta([-3, 2, 1], 'd')})
In [3]: df
Out[3]:
a b
0 0 days -3 days
1 -1 days 2 days
2 2 days 1 days
In [4]: for k, v in df.iteritems():
...: v[v < 0] = 0
...:
In [5]: df
Out[5]:
a b
0 0 days 0 days
1 0 days 2 days
2 2 days 1 days
Update: comparison with a pd.Timedelta
works on the whole DataFrame:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame({'a': pd.to_timedelta([0, -1, 2], 'd'),
...: 'b': pd.to_timedelta([-3, 2, 1], 'd')})
In [3]: df[df < pd.Timedelta(0)] = 0
In [4]: df
Out[4]:
a b
0 0 days 0 days
1 0 days 2 days
2 2 days 1 days
How to replace negative values with 0 in pyspark dataframe
You can replace null values with 0 (or any value of your choice) across all columns with df.fillna(0) method. However to replace negative values across columns, I don't there is any direct approach, except using case when on each column as below.
from pyspark.sql import functions as F
df.withColumn(
"col1",
F.when(df["col1"] < 0, 0).when(F.col("col1").isNull(), 0).otherwise(F.col("col1")),
)
Power Query - Replace negative numbers with zero
Right click the column, transform .. absolute value
edit the end of the code from something that looks like this
= Table.TransformColumns(#"PriorStep",{{"Column1", Number.Abs, type number}})
to look like this
= Table.TransformColumns(#"PriorStep",{{"Column1", each if _ < 0 then 0 else _ , type number}})
What is the fastest way to replace negative values with 0 and values greater than 1 with 1 in an array using Python?
You want to use np.clip
:
>>> import numpy as np
>>> list_values = [-0.01, 0, 0.5, 0.9, 1.0, 1.01]
>>> arr = np.array(list_values)
>>> np.clip(arr, 0.0, 1.0)
array([0. , 0. , 0.5, 0.9, 1. , 1. ])
This is likely the fastest approach, if you can ignore the cost of converting to an array. Should be a lot better for larger lists/arrays.
Involving pandas
in this operation isn't the way to go unless you eventually want a pandas data structure.
Replacing all negative values in all columns by zero in python
Use pandas.DataFrame.clip
:
df.iloc[:, 1:] = df.iloc[:, 1:].clip(0)
print(df)
Output:
date T1 T2 T3 T4
0 1-1-2010 00:10 20 0 4 3
1 1-1-2010 00:20 85 0 34 21
2 1-1-2010 00:30 0 22 31 75
3 1-1-2010 00:40 0 5 7 0
Not only clip
is faster than mask
in your sample, but also in the larger dataset:
# Your sample -> 3x faster
%timeit df.iloc[:, 1:].clip(0)
# 1.74 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df.iloc[:,1:].mask(df.iloc[:,1:] < 0, 0)
# 5.25 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# Large Sample -> 1,000,000 elements --> about 30x
large_df = pd.DataFrame(pd.np.random.randint(-5, 5, (1000, 1000)))
%timeit large_df.clip(0)
# 17.2 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit large_df.mask(large_df< 0, 0)
# 498 ms ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Related Topics
Get Values and Positions to Label a Ggplot Histogram
Seeing If Data Is Normally Distributed in R
Alternative to Expand.Grid for Data.Frames
Min for Each Row in a Data Frame
Ggplot2: Facet_Wrap Strip Color Based on Variable in Data Set
Line Break When No Data in Ggplot2
Center X and Y Axis with Ggplot2
Converting a Data Frame to Xts
How to Paste a String on Each Element of a Vector of Strings Using Apply in R
Add Nas to Make All List Elements Equal Length
Twitter, Roauth and Windows: Register Ok, But Certificate Verify Failed
How to Remove an Element from a List
Using Lists Inside Data.Table Columns
Sum Cells of Certain Columns for Each Row
How to Add Table of Contents in Rmarkdown
Data.Table - Select First N Rows Within Group