What Does the Diff() Function in R Do

What does the diff() function in R do?

The function calculates the differences between consecutive values of a vector: each value minus the one before it. For your example vector, the differences are:

 1 - 10 = -9
 1 -  1 =  0
 1 -  1 =  0
.
.
.
 3 -  1 =  2
10 -  3 =  7
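
The example vector itself is not shown in this answer. As a rough sketch, the vector below is an assumption: it was reconstructed so that it reproduces the outputs quoted in this answer. It illustrates that diff() is equivalent to subtracting each value from its successor.

```r
# Assumed input: reconstructed from the outputs quoted in the answer,
# not taken from the original question.
temp <- c(10, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 3, 10)

d <- diff(temp)

# Manual equivalent: every element except the first,
# minus every element except the last.
manual <- temp[-1] - temp[-length(temp)]

print(d)
print(identical(d, manual))
```

The result is always one element shorter than the input, because the first value has no predecessor to subtract.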

The argument differences allows you to specify the order of the differences.

E.g., the command

diff(temp, differences = 2) 
[1]  9  0  0  0  0  1 -2  1  0  0  0  0  0  2  5

produces the same result as

diff(diff(temp))
[1]  9  0  0  0  0  1 -2  1  0  0  0  0  0  2  5

Hence, it returns the differences of differences.
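
The equivalence can be checked on any numeric vector. The following is a minimal sketch with illustrative values (not the vector from the question); it also shows that differences = k behaves like applying diff() k times in a row.

```r
x <- c(1, 4, 9, 16, 25, 36)   # illustrative values (squares)

second <- diff(x, differences = 2)
nested <- diff(diff(x))

# Generalisation: apply diff() k times in a row
k <- 3
repeated <- x
for (i in seq_len(k)) repeated <- diff(repeated)

print(second)
print(isTRUE(all.equal(second, nested)))
print(isTRUE(all.equal(diff(x, differences = k), repeated)))
```

Each round of differencing removes one element, so the result of differences = k has length(x) - k elements.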

The argument lag allows you to specify the lag.

For example, if lag = 2, the differences between the third and the first value, between the fourth and the second value, between the fifth and the third value etc. are calculated.

diff(temp, lag = 2)
[1] -9  0  0  0  0  1  0 -1  0  0  0  0  0  2  9
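
To make the indexing behind lag explicit, here is a small sketch with illustrative values (an assumption, not the vector from the question) that compares diff(x, lag = k) with the manual subtraction it corresponds to.

```r
x <- c(10, 1, 1, 2, 1, 3, 10)   # illustrative values
k <- 2                          # the lag
n <- length(x)

lagged <- diff(x, lag = k)

# Manual equivalent: value at position i minus value at position i - k
manual <- x[(k + 1):n] - x[1:(n - k)]

print(lagged)
print(identical(lagged, manual))
```

With lag = k the result is k elements shorter than the input.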

R - apply diff() function or equivalent self-defined function on multiple columns in a data.table

The error seems to occur because diff(any_vector) returns a vector that is one element shorter than any_vector. See this:

diff(1:5)
[1] 1 1 1 1

So if diff() is to be applied to a column of a table, one element has to be added to the result, either at the start or at the end. I am not sure of your expected outcome, so I am assuming the following: I add NA at the start of the resulting vector. (You may add 0 instead, if desired.)

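library(dplyr)
# cols is a character vector of column names, here c("Var3", "Var4");
# all_of() tells across() to treat it as an external vector of names
df %>% mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))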

  ID       Date Var1 Var2 Var3 Var4 Var3_diff Var4_diff
1  1 2020-03-01   AB  A33  250   12        NA        NA
2  1 2020-04-01    B  B25   NA   14        NA         2
3  1 2020-05-01   AB  A44  270   20        NA         6
4  1 2020-06-01   AC  C33    9   13      -261        -7
5  2 2019-09-01    X  C55  280   11       271        -2
6  2 2019-10-01    K  C89  120   12      -160         1
7  2 2019-11-01    A  C89  320   NA       200        NA
8  2 2019-12-01   AB  A88  200   25      -120        NA

Or, if the differences should be computed within each ID group:

df %>% group_by(ID) %>%
  mutate(across(all_of(cols), ~c(NA, diff(.)), .names = "{.col}_diff"))

# A tibble: 8 x 8
# Groups:   ID [2]
     ID Date       Var1  Var2   Var3  Var4 Var3_diff Var4_diff
  <int> <chr>      <chr> <chr> <int> <int>     <int>     <int>
1     1 2020-03-01 AB    A33     250    12        NA        NA
2     1 2020-04-01 B     B25      NA    14        NA         2
3     1 2020-05-01 AB    A44     270    20        NA         6
4     1 2020-06-01 AC    C33       9    13      -261        -7
5     2 2019-09-01 X     C55     280    11        NA        NA
6     2 2019-10-01 K     C89     120    12      -160         1
7     2 2019-11-01 A     C89     320    NA       200        NA
8     2 2019-12-01 AB    A88     200    25      -120        NA
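
For readers without dplyr, the grouped result can also be approximated in base R. This is only a sketch that swaps in ave() for group_by()/mutate(); it is not the method used in the answer above. The data frame is rebuilt from the ID, Var3 and Var4 columns of the printed output.

```r
# Columns copied from the printed output above
df <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2, 2),
  Var3 = c(250, NA, 270, 9, 280, 120, 320, 200),
  Var4 = c(12, 14, 20, 13, 11, 12, NA, 25)
)
cols <- c("Var3", "Var4")

# ave() applies the function within each ID group and returns a vector
# of the original length, so the NA padding restarts for every group.
for (col in cols) {
  df[[paste0(col, "_diff")]] <- ave(df[[col]], df$ID,
                                    FUN = function(v) c(NA, diff(v)))
}

print(df)
```

Because the padding happens inside each group, the first row of every ID gets NA instead of a difference that crosses group boundaries.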

R: Conditional diff function

You would need to move diff() inside a mutate() call if you are using dplyr. But diff() returns a vector that is one element shorter than its input, which makes it difficult to keep the same number of rows. An alternative is to use dplyr's lead() function to grab the "next" value in the group:

df%>%
  group_by(id)%>%
  mutate(diff=lead(t1)-t1)
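
Note the alignment: lead(t1) - t1 stores "next minus current" in the current row, so the NA lands in the last row of each group rather than the first. A base R sketch of the same alignment, using hypothetical id and t1 values and ave() in place of group_by(), might look like this:

```r
id <- c("a", "a", "a", "b", "b")   # hypothetical groups
t1 <- c(3, 5, 10, 4, 6)            # hypothetical values

# "Next minus current" within each group: NA in each group's last row
next_diff <- ave(t1, id, FUN = function(v) c(diff(v), NA))

# "Current minus previous" within each group: NA in each group's first row
prev_diff <- ave(t1, id, FUN = function(v) c(NA, diff(v)))

print(data.frame(id, t1, next_diff, prev_diff))
```

Which of the two alignments is appropriate depends on whether the difference should be attributed to the earlier or the later observation.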

Using diff function in R

First, load your data with a custom sep and dec, or use read.csv2, which defaults to ";" as the field separator and "," as the decimal mark:

data <- read.csv(file.choose(), header=TRUE, sep=";", dec=",")
# OR
data <- read.csv2(file.choose(), header=TRUE)

You can use the column names instead of indexing. The second plot can then be drawn as:

plot(diff(as.numeric(data$Total.cases)), ylab="First Difference", col="red", font.lab=2, type="o")
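
To see why the decimal mark matters, here is a self-contained sketch that reads a few made-up rows from a string instead of a file. Only the column name Total.cases comes from the answer; the values and the Date column are assumptions for illustration.

```r
# Made-up data in the "semicolon separator, comma decimal" layout
csv_text <- paste(
  "Date;Total.cases",
  "2020-03-01;1,5",
  "2020-03-02;4,0",
  "2020-03-03;9,5",
  sep = "\n"
)

# read.csv2 uses sep = ";" and dec = "," by default
data <- read.csv2(text = csv_text, header = TRUE)

# With only the separator set, "1,5" is not parsed as a number
wrong <- read.csv(text = csv_text, header = TRUE, sep = ";")

print(is.numeric(data$Total.cases))
print(is.numeric(wrong$Total.cases))
print(diff(data$Total.cases))
```

When the column is read as numeric in the first place, diff() can be applied to it directly.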

Why is R diff function slow?

Here is some code to help expand and illustrate the points I made in my comment.

library(microbenchmark)

mb.diff2 <- compiler::cmpfun(function(vec) {
  n <- length(vec)
  vec[2:n]-vec[1:(n-1L)]
})

times.diff1 <- c()  
times.diff2 <- c()
times.diff3 <- c()
vec.sizes <- c(1e1, 1e2, 1e3, 1e4, 1e5)

for (n in vec.sizes) {
  set.seed(21)
  vec <- runif(n)
  bench <- microbenchmark(diff(vec), mb.diff2(vec), diff.default(vec))
  times.median <- aggregate(bench$time, by = list(bench$expr), FUN = median)
  times.diff1 <- c(times.diff1, times.median[1,2])
  times.diff2 <- c(times.diff2, times.median[2,2])
  times.diff3 <- c(times.diff3, times.median[3,2])
}

setNames(times.diff1/times.diff2, vec.sizes)
setNames(times.diff1/times.diff3, vec.sizes)

First, you'll notice that I compiled the mb.diff2 function. This is because diff and diff.default are byte-compiled. I also put the calculation of n inside mb.diff2, since calculating the vector length should be part of the measured function call.
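
One caveat if mb.diff2 is reused outside this benchmark: for vectors with fewer than two elements, 2:n counts downwards, so the result no longer matches diff(). A possible guarded variant is sketched below; safe_diff2 is a hypothetical name and was not part of the timings.

```r
# Hypothetical variant of mb.diff2 that also handles short vectors
safe_diff2 <- function(vec) {
  n <- length(vec)
  if (n < 2L) return(vec[0L])   # diff() returns an empty vector here
  vec[-1L] - vec[-n]            # negative indices avoid the 2:n pitfall
}

x <- c(1, 4, 9, 16)
print(identical(safe_diff2(x), diff(x)))
print(length(safe_diff2(5)))
```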

Here are the results of the timings, along with my sessionInfo():

R> setNames(times.diff1/times.diff2, vec.sizes)
       10       100      1000     10000     1e+05 
3.5781536 2.3330988 1.2488135 0.9011312 0.9660411 
R> setNames(times.diff1/times.diff3, vec.sizes)
       10       100      1000     10000     1e+05 
1.5945010 1.4609283 1.1021190 1.0034623 0.9987618 
R> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] microbenchmark_1.4-2

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.9      digest_0.6.8     MASS_7.3-45      grid_3.3.2      
 [5] plyr_1.8.4       gtable_0.1.2     magrittr_1.5     scales_0.3.0    
 [9] ggplot2_1.0.1    stringi_0.4-1    reshape2_1.4.1   proto_0.3-10    
[13] tools_3.3.2      stringr_1.0.0    munsell_0.4.2    compiler_3.3.2  
[17] colorspace_1.2-6

Using the diff function in a for loop

You could use rowDiffs from the matrixStats package and negate the result, since rowDiffs subtracts each column from the one to its right:

library(matrixStats)
df <- structure(list(Line_1 = c(NA, 0.4054731, 0.4048527, 0.404176, 
                                0.4079322), Line_2 = c(NA, 0.3193632, 0.3195507, 0.3226145, 0.3264623
                                ), Line_3 = c(NA, 0.2667026, 0.269325, 0.2731347, 0.2750645), 
                     Line_4 = c(NA, 0.8494675, 0.8664931, 0.8756971, 0.8770746
                     ), Line_5 = c(NA, 0.2394639, 0.2380499, 0.2338797, 0.227358
                     ), Line_6 = c(NA, 0.2936054, 0.2931895, 0.2876017, 0.2866682
                     ), Line_7 = c(0, 0.2453124, 0.2437657, 0.2432391, 0.2476563
                     )), class = "data.frame", row.names = c(NA, -5L))

-rowDiffs(as.matrix(df))
#>           [,1]      [,2]       [,3]      [,4]       [,5]      [,6]
#> [1,]        NA        NA         NA        NA         NA        NA
#> [2,] 0.0861099 0.0526606 -0.5827649 0.6100036 -0.0541415 0.0482930
#> [3,] 0.0853020 0.0502257 -0.5971681 0.6284432 -0.0551396 0.0494238
#> [4,] 0.0815615 0.0494798 -0.6025624 0.6418174 -0.0537220 0.0443626
#> [5,] 0.0814699 0.0513978 -0.6020101 0.6497166 -0.0593102 0.0390119

Edit:

If, contrary to your question, you want the differences of Line_2 - Line_1, etc., then it would be:

setNames(data.frame(rowDiffs(as.matrix(df))), 
         paste0(colnames(df)[-1], "-", colnames(df)[-ncol(df)]))
#>   Line_2-Line_1 Line_3-Line_2 Line_4-Line_3 Line_5-Line_4 Line_6-Line_5
#> 1            NA            NA            NA            NA            NA
#> 2    -0.0861099    -0.0526606     0.5827649    -0.6100036     0.0541415
#> 3    -0.0853020    -0.0502257     0.5971681    -0.6284432     0.0551396
#> 4    -0.0815615    -0.0494798     0.6025624    -0.6418174     0.0537220
#> 5    -0.0814699    -0.0513978     0.6020101    -0.6497166     0.0593102
#>   Line_7-Line_6
#> 1            NA
#> 2    -0.0482930
#> 3    -0.0494238
#> 4    -0.0443626
#> 5    -0.0390119

Created on 2020-07-01 by the reprex package (v0.3.0)
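
If matrixStats is not available, the same column-to-column differences can be sketched in base R by subtracting two shifted copies of the matrix. The values below are taken from rows 2 and 3 of the first three columns of the data above; the variable names are illustrative.

```r
m <- matrix(c(0.4054731, 0.3193632, 0.2667026,
              0.4048527, 0.3195507, 0.2693250),
            nrow = 2, byrow = TRUE,
            dimnames = list(NULL, c("Line_1", "Line_2", "Line_3")))

# Each column minus the column to its left (Line_2 - Line_1, ...)
right_minus_left <- m[, -1, drop = FALSE] - m[, -ncol(m), drop = FALSE]

# Negated: each column minus the column to its right (Line_1 - Line_2, ...)
left_minus_right <- -right_minus_left

print(left_minus_right)
```

drop = FALSE keeps the result a matrix even when only two columns are present.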

diff() function returns an empty object

I cannot reproduce the problem perfectly, but I have some thoughts.

TL;DR (edited): don't use factors; convert the index to character or Date objects before zoo-ifying things.

I hunted this down by looking at the source for zoo:::diff.zoo. Namely, it was failing at

x - lag(x, k=-1)
# Data:
# numeric(0)
# Index:
# factor(0)
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 1991-01 1991-02 1991-03 1991-04 1991-05 1991-06 1991-07 1991-08 1991-09 1991-10 1991-11 1991-12 1992-01 1992-02 1992-03 1992-04 ... 2018-02

I believe that typically zoo objects are indexed based on some form of time-progression. This might be simple integers, as in

str(zoo(2:5))
# 'zoo' series from 1 to 4
#   Data: int [1:4] 2 3 4 5
#   Index:  int [1:4] 1 2 3 4

or something more explicit/intentional, such as a Date or POSIXct timestamp. In your case, it's a factor. I don't know if zoo is trying to treat it like an integer (probably not, otherwise it should have come up with something) or like some categorical character. I originally thought that was most likely not what you want in a time series, but as 42- pointed out, this is actually quite fine.

So even if zoo deals intelligently with factors, there is also the problem that the dates you have listed are ambiguous, because they are not stored as a time-based object. For instance, by "1990-01" do you mean "1990-01-01"? Though it might seem intuitive and obvious to make that assumption, R typically does not follow you on that leap.

Try this:

(ind <- index(x))
# [1] 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06
# 338 Levels: 1990-01 1990-02 1990-03 1990-04 1990-05 1990-06 1990-07 1990-08 1990-09 1990-10 1990-11 1990-12 ... 2018-02
(ind <- as.Date(paste0(ind, "-01"), format="%Y-%m-%d"))
# [1] "1990-01-01" "1990-02-01" "1990-03-01" "1990-04-01" "1990-05-01" "1990-06-01"
index(x) <- ind

(The surrounding parentheses are merely a shortcut to dump the output post-assignment. They can be safely removed for production.) That now allows

x - lag(x, k=-1)
# 1990-01-01 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01 
#         NA       2.28       7.92      -4.05       2.20       0.96

which means your spread is likely working now:

diff(x)
# 1990-02-01 1990-03-01 1990-04-01 1990-05-01 1990-06-01 
#       2.28       7.92      -4.05       2.20       0.96

If my guess is right, your data import should instead look like:

data <- read.csv("base_form.csv",sep=",") #import .csv
indice = data$Index
dates = as.Date(paste0(data$Dates, "-01"), format="%Y-%m-%d")
spread <- zoo(indice, order.by=dates)

or more simply

data <- read.csv("base_form.csv",sep=",")
dates = as.character(data$Dates)

or even more simply

data <- read.csv("base_form.csv",sep=",", stringsAsFactors=FALSE)
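
The conversion step can be checked in isolation, without zoo. The sketch below uses a made-up three-element factor in the same "YYYY-MM" layout and assumes that each label should map to the first day of its month.

```r
# Made-up factor in the same "YYYY-MM" layout as the index above
ym <- factor(c("1990-01", "1990-02", "1990-03"))

# A factor stores integer codes, not dates
print(as.integer(ym))

# Assumption: each "YYYY-MM" label stands for the first day of that month
dates <- as.Date(paste0(as.character(ym), "-01"), format = "%Y-%m-%d")

print(dates)
print(diff(dates))   # differences between consecutive dates, in days
```

Once the index is a proper Date vector, the gaps between observations are well defined.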

Adding new column with diff() function when there is one less row in R

Here are two approaches. Both put an NA in the first row of diff_qsec and put diff(qsec) in the remaining rows:

library(dplyr)  
mtcars %>% mutate(diff_qsec = qsec - lag(qsec)) # dplyr has its own version of lag

transform(mtcars, diff_qsec = c(NA, diff(qsec)))

Also, on the general issue of padding, see the question "How can I pad a vector with NA from the front?"
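
The padding idea can be generalised to other lags and difference orders. The helper below is a hypothetical sketch (pad_diff is not part of base R or dplyr) that prepends as many NA values as diff() drops, so the result always fits into the original data frame.

```r
# Hypothetical helper: diff() padded with leading NAs so that the
# result has the same length as the input.
pad_diff <- function(x, lag = 1L, differences = 1L) {
  pad <- min(length(x), lag * differences)   # diff() drops this many values
  c(rep(NA, pad), diff(x, lag = lag, differences = differences))
}

out <- transform(mtcars,
                 diff_qsec  = pad_diff(qsec),
                 diff2_qsec = pad_diff(qsec, lag = 2))

print(head(out[, c("qsec", "diff_qsec", "diff2_qsec")]))
```

With lag = 2 the first two rows are NA, because neither has a value two rows earlier to subtract.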
