Replace Negative Values by Zero

Thanks for the reproducible example. This is pretty basic R stuff. You can assign to selected elements of a vector (note that an array has dimensions; what you've given is a vector, not an array):

> pred_precipitation[pred_precipitation<0] <- 0
> pred_precipitation
[1] 1.2091281 0.0000000 7.7665555 0.0000000 0.0000000 0.0000000 0.5151504 0.0000000 1.8281251
[10] 0.5098688 2.8370263 0.4895606 1.5152191 4.1740177 7.1527742 2.8992215 4.5322934 6.7180530
[19] 0.0000000 1.1914052 3.6152333 0.0000000 0.3778717 0.0000000 1.4940469

Benchmark wars!

@James has found an even faster method and left it in a comment. I upvoted him, if only because I know his victory will be short-lived.

First, I try compiling, but that doesn't seem to help anyone:

p <- rnorm(10000)
gsk3 <- function(x) { x[x<0] <- 0; x }
jmsigner <- function(x) ifelse(x<0, 0, x)
joshua <- function(x) pmax(x,0)
james <- function(x) (abs(x)+x)/2
library(compiler)
library(microbenchmark)
gsk3.c <- cmpfun(gsk3)
jmsigner.c <- cmpfun(jmsigner)
joshua.c <- cmpfun(joshua)
james.c <- cmpfun(james)

microbenchmark(joshua(p),joshua.c(p),gsk3(p),gsk3.c(p),jmsigner(p),james(p),jmsigner.c(p),james.c(p))
           expr       min         lq     median         uq       max
1     gsk3.c(p)   251.782   255.0515   266.8685   269.5205   457.998
2       gsk3(p)   256.262   261.6105   270.7340   281.3560  2940.486
3    james.c(p)    38.418    41.3770    43.3020    45.6160   132.342
4      james(p)    38.934    42.1965    43.5700    47.2085  4524.303
5 jmsigner.c(p)  2047.739  2145.9915  2198.6170  2291.8475  4879.418
6   jmsigner(p)  2047.502  2169.9555  2258.6225  2405.0730  5064.334
7   joshua.c(p)   237.008   244.3570   251.7375   265.2545   376.684
8     joshua(p)   237.545   244.8635   255.1690   271.9910   430.566

[Figure: boxplot comparing compiled vs. uncompiled functions]

But wait! Dirk wrote this Rcpp thing. Can a complete C++ incompetent read his JSS paper, adapt his example, and write the fastest function of them all? Stay tuned, dear listeners.

library(inline)
cpp_if_src <- '
Rcpp::NumericVector xa(a);
int n_xa = xa.size();
for(int i=0; i < n_xa; i++) {
  if (xa[i] < 0) xa[i] = 0;
}
return xa;
'
cpp_if <- cxxfunction(signature(a="numeric"), cpp_if_src, plugin="Rcpp")
microbenchmark(joshua(p),joshua.c(p),gsk3(p),gsk3.c(p),jmsigner(p),james(p),jmsigner.c(p),james.c(p), cpp_if(p))
         expr       min         lq     median         uq        max
1   cpp_if(p)     8.233    10.4865    11.6000    12.4090     69.512
2     gsk3(p)   170.572   172.7975   175.0515   182.4035   2515.870
3    james(p)    37.074    39.6955    40.5720    42.1965   2396.758
4 jmsigner(p)  1110.313  1118.9445  1133.4725  1164.2305  65942.680
5   joshua(p)   237.135   240.1655   243.3990   250.3660   2597.429

[Figure: boxplot comparison including the Rcpp version]

That's affirmative, captain.

Note that cpp_if modifies the input p in place, even if you don't assign the result. If you want to avoid that behavior, you have to clone:

cpp_ifclone_src <- '
Rcpp::NumericVector xa(Rcpp::clone(a));
int n_xa = xa.size();
for(int i=0; i < n_xa; i++) {
  if (xa[i] < 0) xa[i] = 0;
}
return xa;
'
cpp_ifclone <- cxxfunction(signature(a="numeric"), cpp_ifclone_src, plugin="Rcpp")

Which unfortunately kills the speed advantage.

How to replace negative numbers in a pandas DataFrame by zero

If all your columns are numeric, you can use boolean indexing:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})

In [3]: df
Out[3]:
   a  b
0  0 -3
1 -1  2
2  2  1

In [4]: df[df < 0] = 0

In [5]: df
Out[5]:
   a  b
0  0  0
1  0  2
2  2  1
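
The boolean assignment above edits df in place. If you prefer an expression that returns a new frame and leaves the original untouched, a minimal sketch using the public where method:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1]})

# where() keeps values where the condition is True and substitutes
# the replacement (0) everywhere else, returning a new DataFrame.
clipped = df.where(df >= 0, 0)

print(clipped)
#    a  b
# 0  0  0
# 1  0  2
# 2  2  1
print(df['a'].tolist())  # original unchanged: [0, -1, 2]
```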

For the more general case of mixed dtypes, this answer points to the private method _get_numeric_data (being private, it may change without warning between pandas versions):

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
'c': ['foo', 'goo', 'bar']})

In [3]: df
Out[3]:
   a  b    c
0  0 -3  foo
1 -1  2  goo
2  2  1  bar

In [4]: num = df._get_numeric_data()

In [5]: num[num < 0] = 0

In [6]: df
Out[6]:
   a  b    c
0  0  0  foo
1  0  2  goo
2  2  1  bar
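
A sketch that avoids the private method entirely, using the public select_dtypes to find the numeric columns and clip them back into place (assuming a reasonably recent pandas):

```python
import pandas as pd

df = pd.DataFrame({'a': [0, -1, 2], 'b': [-3, 2, 1],
                   'c': ['foo', 'goo', 'bar']})

# Select the numeric columns via the public API, clip them at 0,
# and assign the result back into the original frame.
num_cols = df.select_dtypes(include='number').columns
df[num_cols] = df[num_cols].clip(lower=0)

print(df)
#    a  b    c
# 0  0  0  foo
# 1  0  2  goo
# 2  2  1  bar
```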

With the timedelta dtype, boolean indexing seems to work on individual columns, but not on the whole DataFrame. So you can do:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': pd.to_timedelta([0, -1, 2], 'd'),
...: 'b': pd.to_timedelta([-3, 2, 1], 'd')})

In [3]: df
Out[3]:
        a       b
0  0 days -3 days
1 -1 days  2 days
2  2 days  1 days

In [4]: for k, v in df.items():
   ...:     v[v < 0] = 0
   ...:

In [5]: df
Out[5]:
        a       b
0  0 days  0 days
1  0 days  2 days
2  2 days  1 days

Update: comparison with a pd.Timedelta works on the whole DataFrame:

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({'a': pd.to_timedelta([0, -1, 2], 'd'),
...: 'b': pd.to_timedelta([-3, 2, 1], 'd')})

In [3]: df[df < pd.Timedelta(0)] = 0

In [4]: df
Out[4]:
        a       b
0  0 days  0 days
1  0 days  2 days
2  2 days  1 days
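
The assignment above edits df in place. A non-mutating variant (a sketch using DataFrame.mask, which substitutes a replacement wherever the condition holds and returns a new frame) works with the same whole-frame Timedelta comparison:

```python
import pandas as pd

df = pd.DataFrame({'a': pd.to_timedelta([0, -1, 2], 'd'),
                   'b': pd.to_timedelta([-3, 2, 1], 'd')})

# mask() replaces values where the condition is True,
# leaving df itself untouched.
out = df.mask(df < pd.Timedelta(0), pd.Timedelta(0))

print(out)
```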

How to replace negative values with 0 in pyspark dataframe

You can replace null values with 0 (or any value of your choice) across all columns with the df.fillna(0) method. To replace negative values across columns, however, I don't think there is a direct approach, except using case-when logic on each column, as below. Note that withColumn returns a new DataFrame, so the result must be assigned:

from pyspark.sql import functions as F

df = df.withColumn(
    "col1",
    F.when(df["col1"] < 0, 0).when(F.col("col1").isNull(), 0).otherwise(F.col("col1")),
)

Power Query - Replace negative numbers with zero

Right-click the column and choose Transform → Absolute Value.

Then edit the end of the generated code from something that looks like this

= Table.TransformColumns(#"PriorStep",{{"Column1", Number.Abs, type number}})

to look like this

= Table.TransformColumns(#"PriorStep",{{"Column1", each if _ < 0 then 0 else _ , type number}})

What is the fastest way to replace negative values with 0 and values greater than 1 with 1 in an array using Python?

You want to use np.clip:

>>> import numpy as np
>>> list_values = [-0.01, 0, 0.5, 0.9, 1.0, 1.01]
>>> arr = np.array(list_values)
>>> np.clip(arr, 0.0, 1.0)
array([0. , 0. , 0.5, 0.9, 1. , 1. ])

This is likely the fastest approach, if you can ignore the cost of converting to an array, and it should scale much better for larger lists/arrays.

Involving pandas in this operation isn't the way to go unless you eventually want a pandas data structure.
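
If the data is already a NumPy array and you want to avoid allocating a second one, np.clip also accepts an out argument; here it writes the clipped values back into the input array itself:

```python
import numpy as np

arr = np.array([-0.01, 0, 0.5, 0.9, 1.0, 1.01])

# Clip in place: out=arr reuses the input buffer for the result,
# so no new array is allocated.
np.clip(arr, 0.0, 1.0, out=arr)

print(arr.tolist())  # [0.0, 0.0, 0.5, 0.9, 1.0, 1.0]
```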

Replacing all negative values in all columns by zero in python

Use pandas.DataFrame.clip:

df.iloc[:, 1:] = df.iloc[:, 1:].clip(0)
print(df)

Output:

             date  T1  T2  T3  T4
0  1-1-2010 00:10  20   0   4   3
1  1-1-2010 00:20  85   0  34  21
2  1-1-2010 00:30   0  22  31  75
3  1-1-2010 00:40   0   5   7   0

clip is faster than mask not only on your sample, but also on a larger dataset:

# Your sample -> 3x faster
%timeit df.iloc[:, 1:].clip(0)
# 1.74 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit df.iloc[:,1:].mask(df.iloc[:,1:] < 0, 0)
# 5.25 ms ± 573 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# Large Sample -> 1,000,000 elements --> about 30x
import numpy as np
large_df = pd.DataFrame(np.random.randint(-5, 5, (1000, 1000)))

%timeit large_df.clip(0)
# 17.2 ms ± 2.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit large_df.mask(large_df< 0, 0)
# 498 ms ± 47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

