What Does the diff() Function in R Do

What does the diff() function in R do?

The function calculates the differences between all consecutive values of a vector. For your example vector, the differences are:

1 - 10 = -9
1 - 1 = 0
1 - 1 = 0
...
3 - 1 = 2
10 - 3 = 7
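As a quick check, the example vector can be reconstructed from the outputs quoted in this answer. The literal values of temp below are an inference from those outputs, not copied from the question, so treat them as an assumption:

```r
# Assumed reconstruction of the question's vector (inferred from the outputs shown here)
temp <- c(10, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3, 10)

# Each element minus the element before it
d1 <- diff(temp)
print(d1)
# -9 0 0 0 0 0 1 -1 0 0 0 0 0 0 2 7

# The result is always one element shorter than the input
length(d1) == length(temp) - 1
```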

The argument differences allows you to specify the order of the differences.

E.g., the command

diff(temp, differences = 2) 
[1] 9 0 0 0 0 1 -2 1 0 0 0 0 0 2 5

produces the same result as

diff(diff(temp))
[1] 9 0 0 0 0 1 -2 1 0 0 0 0 0 2 5

Hence, it returns the differences of differences.


The argument lag allows you to specify the lag.

For example, if lag = 2, the differences between the third and the first value, between the fourth and the second value, between the fifth and the third value etc. are calculated.

diff(temp, lag = 2)
[1] -9 0 0 0 0 1 0 -1 0 0 0 0 0 2 9
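Both arguments can also be checked by hand on a short vector. The sketch below uses a made-up vector x (not the question's data) and compares diff() with plain index arithmetic:

```r
# Hypothetical vector, chosen only for illustration
x <- c(2, 5, 4, 9, 7)
n <- length(x)

# lag = 2: each value minus the value two positions earlier
by_hand_lag2 <- x[3:n] - x[1:(n - 2)]
lag2 <- diff(x, lag = 2)

# differences = 2: diff() applied twice
twice <- diff(diff(x))
order2 <- diff(x, differences = 2)

print(lag2)    # 2 4 3
print(order2)  # -4 6 -7
```

In both cases the result has length(x) - lag * differences elements, which is why each output here has three values.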

R - apply diff() function or equivalent self-defined function on multiple columns in a data.table

The error seems to occur because diff(any_vector) returns a vector that is one element shorter than any_vector. For example:

diff(1:5)
[1] 1 1 1 1

So if diff() is to be applied to a column in a table, one element has to be added to the result, either at the start or at the end. I am not sure of your expected outcome, so I am assuming the following: I add NA at the start of the resulting vector. (You may add 0 instead, if so desired.)

library(dplyr)
# cols is defined in the question (assumed here to be a character vector of column names)
df %>% mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))

ID Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
1 1 2020-03-01 AB A33 250 12 NA NA
2 1 2020-04-01 B B25 NA 14 NA 2
3 1 2020-05-01 AB A44 270 20 NA 6
4 1 2020-06-01 AC C33 9 13 -261 -7
5 2 2019-09-01 X C55 280 11 271 -2
6 2 2019-10-01 K C89 120 12 -160 1
7 2 2019-11-01 A C89 320 NA 200 NA
8 2 2019-12-01 AB A88 200 25 -120 NA

Or, if the differences should be computed within each ID group:

df %>% group_by(ID) %>%
  mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))

# A tibble: 8 x 8
# Groups: ID [2]
ID Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
<int> <chr> <chr> <chr> <int> <int> <int> <int>
1 1 2020-03-01 AB A33 250 12 NA NA
2 1 2020-04-01 B B25 NA 14 NA 2
3 1 2020-05-01 AB A44 270 20 NA 6
4 1 2020-06-01 AC C33 9 13 -261 -7
5 2 2019-09-01 X C55 280 11 NA NA
6 2 2019-10-01 K C89 120 12 -160 1
7 2 2019-11-01 A C89 320 NA 200 NA
8 2 2019-12-01 AB A88 200 25 -120 NA
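The same padding idea can be sketched in base R without dplyr. The data frame below is a small made-up stand-in (only the column names ID and Var4 echo the example above), and ave() is used here as one possible way to apply the padded diff within each group:

```r
# Hypothetical data, for illustration only
df <- data.frame(ID   = c(1, 1, 1, 2, 2),
                 Var4 = c(12, 14, 20, 11, 12))

# Ungrouped: pad with a leading NA so the result has one value per row
df$Var4_diff <- c(NA, diff(df$Var4))

# Grouped: apply the same padded diff separately within each ID
df$Var4_diff_by_id <- ave(df$Var4, df$ID, FUN = function(v) c(NA, diff(v)))

print(df)
```

The only row where the two new columns differ is the first row of ID 2: the ungrouped version subtracts the last value of ID 1, while the grouped version starts over with NA.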

R: Conditional diff function

If you are using dplyr, you need to move diff() inside a mutate() call. But diff() returns a vector that is one element shorter than its input, which makes it difficult to keep the same number of rows. An alternative is to use dplyr's lead() function to grab the "next" value in the group:

df %>%
  group_by(id) %>%
  mutate(diff = lead(t1) - t1)
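To see how this relates to diff(), here is a base R stand-in for the lead() step. The vector t1 is made up, and c(t1[-1], NA) is used to mimic shifting the values forward by one position:

```r
# Hypothetical values for a single group
t1 <- c(3, 8, 10, 15)

# "Next value minus current value", with NA where there is no next value
next_minus_current <- c(t1[-1], NA) - t1

# The same numbers as diff(), padded with a trailing NA
padded_diff <- c(diff(t1), NA)

print(next_minus_current)  # 5 2 5 NA
```

So the lead-based version puts the NA on the last row of each group, whereas c(NA, diff(x)) puts it on the first.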

Using diff function in R

First, load your data with a custom sep and dec (or use read.csv2, which defaults to ; as the field separator and , as the decimal mark):

data <- read.csv(file.choose(), header=TRUE, sep=";", dec=",")
# OR
data <- read.csv2(file.choose(), header=TRUE)

You can refer to columns by name instead of by index. The second plot can then be drawn as:

plot(diff(as.numeric(data$Total.cases)), ylab="First Difference", col="red", font.lab=2, type="o")
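The effect of sep and dec can be tried without a file by reading from a string. The sketch below uses made-up values; only the column name Total.cases is taken from the answer above:

```r
# Hypothetical semicolon-separated data with decimal commas
raw <- "Date;Total.cases\n2020-03-01;1,5\n2020-03-02;4,0\n2020-03-03;9,5"

# Explicit separator and decimal mark
a <- read.csv(text = raw, header = TRUE, sep = ";", dec = ",")

# read.csv2 uses the same settings by default
b <- read.csv2(text = raw, header = TRUE)

identical(a, b)
first_diff <- diff(a$Total.cases)
print(first_diff)  # 2.5 5.5
```

If the decimal commas were not declared, the column would not be read as numeric, which is a common reason diff() appears to fail on imported data.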

Why is R diff function slow?

Here is some code to help expand and illustrate the points I made in my comment.

library(microbenchmark)

mb.diff2 <- compiler::cmpfun(function(vec) {
  n <- length(vec)
  vec[2:n] - vec[1:(n - 1L)]
})

times.diff1 <- c()
times.diff2 <- c()
times.diff3 <- c()
vec.sizes <- c(1e1, 1e2, 1e3, 1e4, 1e5)

for (n in vec.sizes) {
  set.seed(21)
  vec <- runif(n)
  bench <- microbenchmark(diff(vec), mb.diff2(vec), diff.default(vec))
  # median time per benchmarked expression
  times.median <- aggregate(bench$time, by = list(bench$expr), FUN = median)
  times.diff1 <- c(times.diff1, times.median[1,2])
  times.diff2 <- c(times.diff2, times.median[2,2])
  times.diff3 <- c(times.diff3, times.median[3,2])
}

setNames(times.diff1/times.diff2, vec.sizes)
setNames(times.diff1/times.diff3, vec.sizes)

First, you'll notice that I compiled the mb.diff2 function. This is because diff and diff.default are byte-compiled. I also put the calculation of n inside mb.diff2, since calculating the vector length should be part of the measured function call.

Here are the results of the timings, along with my sessionInfo():

R> setNames(times.diff1/times.diff2, vec.sizes)
10 100 1000 10000 1e+05
3.5781536 2.3330988 1.2488135 0.9011312 0.9660411
R> setNames(times.diff1/times.diff3, vec.sizes)
10 100 1000 10000 1e+05
1.5945010 1.4609283 1.1021190 1.0034623 0.9987618
R> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] microbenchmark_1.4-2

loaded via a namespace (and not attached):
[1] Rcpp_0.12.9 digest_0.6.8 MASS_7.3-45 grid_3.3.2
[5] plyr_1.8.4 gtable_0.1.2 magrittr_1.5 scales_0.3.0
[9] ggplot2_1.0.1 stringi_0.4-1 reshape2_1.4.1 proto_0.3-10
[13] tools_3.3.2 stringr_1.0.0 munsell_0.4.2 compiler_3.3.2
[17] colorspace_1.2-6
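Separately from speed, it is worth checking that the hand-rolled subtraction returns the same values as diff(). The sketch below re-declares the function without byte-compilation under the hypothetical name manual_diff, and also probes a length-one input:

```r
# Same body as mb.diff2 above, under a hypothetical name and without cmpfun
manual_diff <- function(vec) {
  n <- length(vec)
  vec[2:n] - vec[1:(n - 1L)]
}

set.seed(21)
vec <- runif(1000)

# For vectors of length >= 2 the two agree exactly
same_values <- identical(manual_diff(vec), diff(vec))

# Edge case: for a single value, 2:n and 1:(n - 1) count downwards
single_builtin <- diff(5)        # zero-length result
single_manual  <- manual_diff(5) # NA 0, not a zero-length result
```

Part of what diff() does beyond the subtraction is handling such edge cases and dispatching on the class of its argument, so the shortcut is only a drop-in replacement for plain vectors with at least two elements.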

Using the diff function in a for loop

You could use rowDiffs from the matrixStats package and invert the differences:

library(matrixStats)
df <- structure(list(Line_1 = c(NA, 0.4054731, 0.4048527, 0.404176,
0.4079322), Line_2 = c(NA, 0.3193632, 0.3195507, 0.3226145, 0.3264623
), Line_3 = c(NA, 0.2667026, 0.269325, 0.2731347, 0.2750645),
Line_4 = c(NA, 0.8494675, 0.8664931, 0.8756971, 0.8770746
), Line_5 = c(NA, 0.2394639, 0.2380499, 0.2338797, 0.227358
), Line_6 = c(NA, 0.2936054, 0.2931895, 0.2876017, 0.2866682
), Line_7 = c(0, 0.2453124, 0.2437657, 0.2432391, 0.2476563
)), class = "data.frame", row.names = c(NA, -5L))

-rowDiffs(as.matrix(df))
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] NA NA NA NA NA NA
#> [2,] 0.0861099 0.0526606 -0.5827649 0.6100036 -0.0541415 0.0482930
#> [3,] 0.0853020 0.0502257 -0.5971681 0.6284432 -0.0551396 0.0494238
#> [4,] 0.0815615 0.0494798 -0.6025624 0.6418174 -0.0537220 0.0443626
#> [5,] 0.0814699 0.0513978 -0.6020101 0.6497166 -0.0593102 0.0390119

Edit:

If, contrary to your question, you want the differences of Line_2 - Line_1, etc., then it would be:

setNames(data.frame(rowDiffs(as.matrix(df))), 
paste0(colnames(df)[-1], "-", colnames(df)[-ncol(df)]))
#> Line_2-Line_1 Line_3-Line_2 Line_4-Line_3 Line_5-Line_4 Line_6-Line_5
#> 1 NA NA NA NA NA
#> 2 -0.0861099 -0.0526606 0.5827649 -0.6100036 0.0541415
#> 3 -0.0853020 -0.0502257 0.5971681 -0.6284432 0.0551396
#> 4 -0.0815615 -0.0494798 0.6025624 -0.6418174 0.0537220
#> 5 -0.0814699 -0.0513978 0.6020101 -0.6497166 0.0593102
#> Line_7-Line_6
#> 1 NA
#> 2 -0.0482930
#> 3 -0.0494238
#> 4 -0.0443626
#> 5 -0.0390119

Created on 2020-07-01 by the reprex package (v0.3.0)
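If installing matrixStats is not an option, the same row-wise differences can be sketched in base R by subtracting column-shifted copies of the matrix. The matrix m below is a small made-up example, not the data above:

```r
# Hypothetical 2 x 3 matrix, for illustration only
m <- matrix(c(0.40, 0.32, 0.27,
              0.41, 0.33, 0.26),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("Line_1", "Line_2", "Line_3")))

# Each column minus the column to its left (Line_2 - Line_1, Line_3 - Line_2)
next_minus_prev <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]

# Inverted differences (Line_1 - Line_2, Line_2 - Line_3)
prev_minus_next <- -next_minus_prev

# The same numbers via diff() applied to each row
via_apply <- t(apply(m, 1, diff))
```

The drop = FALSE arguments keep the result a matrix even when the input has only two columns.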

diff() function returns an empty object

I cannot reproduce the problem perfectly, but I have some thoughts.

TL;DR (edit): don't use factors; convert to either character or Date objects before zoo-ifying things.

I hunted this down by looking at the source for zoo:::diff.zoo. Namely, it was failing at

x - lag(x, k=-1)
# Data:
# numeric(0)
# Index:
# factor(0)
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 1991-01 1991-02 1991-03 1991-04 1991-05 1991-06 1991-07 1991-08 1991-09 1991-10 1991-11 1991-12 1992-01 1992-02 1992-03 1992-04 ... 2018-02

I believe that typically zoo objects are indexed based on some form of time-progression. This might be simple integers, as in

str(zoo(2:5))
# 'zoo' series from 1 to 4
# Data: int [1:4] 2 3 4 5
# Index: int [1:4] 1 2 3 4

or something more explicit/intentional, such as a Date or POSIXct timestamp. In your case, it's a factor. I don't know whether zoo is trying to treat it like an integer (probably not, otherwise it should have come up with something) or like a categorical character value, which is most likely not what you want in a time series. (Correction: as 42- pointed out, this is actually quite fine.)

So even if zoo deals with factors intelligently, there is also the problem that the dates you have listed are ambiguous (they are not time-based objects). For instance, by "1990-01" do you mean "1990-01-01"? Though it might seem intuitive and obvious to make that assumption, R typically will not make that leap for you.

Try this:

(ind <- index(x))
# [1] 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 ... 2018-02
(ind <- as.Date(paste0(ind, "-01"), format="%Y-%m-%d"))
# [1] "1990-01-01" "1990-02-01" "1990-03-01" "1990-04-01" "1990-05-01" "1990-06-01"
index(x) <- ind

(The surrounding parentheses are merely a shortcut to dump the output post-assignment. They can be safely removed for production.) That now allows

x - lag(x, k=-1)
# 1990-01-01 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01
# NA 2.28 7.92 -4.05 2.20 0.96

which means diff() on your spread series should now work:

diff(x)
# 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01
# 2.28 7.92 -4.05 2.20 0.96

If my guess is right, your data import should instead look like:

data <- read.csv("base_form.csv", sep=",") # import .csv
indice <- data$Index
dates <- as.Date(paste0(data$Dates, "-01"), format="%Y-%m-%d")
spread <- zoo(indice, order.by=dates)

or more simply

data <- read.csv("base_form.csv",sep=",")
dates = as.character(data$Dates)

or even more simply

data <- read.csv("base_form.csv",sep=",", stringsAsFactors=FALSE)
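The date conversion itself can be tried in base R, without zoo. The factor ym below is a made-up stand-in for the year-month labels in the question:

```r
# Hypothetical year-month labels stored as a factor
ym <- factor(c("1990-01", "1990-02", "1990-03"))

# Without a day component the labels are not complete dates,
# so a direct conversion does not yield usable Date values
direct <- tryCatch(as.Date(as.character(ym)), error = function(e) NA)

# Appending a day makes each label an unambiguous date
dates <- as.Date(paste0(ym, "-01"), format = "%Y-%m-%d")

# Date vectors have a well-defined ordering, so differencing works
gaps <- as.numeric(diff(dates))
print(gaps)  # 31 28
```

Choosing the first of the month is itself an assumption about what "1990-01" means; any fixed day would make the index usable.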

Adding new column with diff() function when there is one less row in R

Here are two approaches. Both put an NA in the first row of diff_qsec and put diff(qsec) in the remaining rows:

library(dplyr)  
mtcars %>% mutate(diff_qsec = qsec - lag(qsec)) # dplyr has its own version of lag

transform(mtcars, diff_qsec = c(NA, diff(qsec)))

Also, on the general issue of padding see: How can I pad a vector with NA from the front?
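As a sanity check, the two approaches can be compared in base R on the built-in mtcars data. The shifted vector below stands in for dplyr's lag(), which is an assumption about its behaviour for a plain numeric column:

```r
qsec <- mtcars$qsec

# Approach with diff(): pad the front so the length matches the row count
padded_front <- c(NA, diff(qsec))

# Approach with a shifted copy: current value minus previous value
shifted <- qsec - c(NA, head(qsec, -1))

length(padded_front) == nrow(mtcars)
all.equal(padded_front, shifted)
```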


