How to Use the 'Sweep' Function

sweep() is typically used when you apply an operation to a matrix row by row or column by column, and the second operand of that operation is a different value for each row or column. Whether you operate by row or by column is set by MARGIN, as in apply(). The values that serve as the second operand are given by STATS.
So, for each row (or column), one value is taken from STATS and used in the operation defined by FUN.

For instance, if you want to add 1 to the 1st row, 2 to the 2nd, and so on, of a matrix M with four rows, you would do:

sweep(M, 1, 1:4, "+")
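
As a minimal sketch of that call, assume M is a 4 x 3 matrix of zeros (M is not shown in this answer, so its shape and contents here are an assumption). Each row should then end up holding its own row number:

```r
# Hypothetical input: the original M is not shown, so a 4 x 3 zero matrix is assumed.
M <- matrix(0, nrow = 4, ncol = 3)

# MARGIN = 1: one STATS value per row; FUN = "+": add it to every element of that row.
res <- sweep(M, 1, 1:4, "+")
print(res)
```

Passing MARGIN = 2 instead would require a STATS vector with one value per column (three values for this M).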

Frankly, I did not understand the definition in the R documentation either; I just learned by looking up examples.

How to use STATS in the sweep function of R

I think there is a misunderstanding as to what sweep does; please take a look at the post How to use the sweep function for some great examples.

The bottom line is that you need both a summary statistic (of a suitable dimension, see below) and a mathematical operation according to which you "sweep" that summary statistic from your input matrix.

In your case, the summary statistic is a vector of length length(ak) = 3. You can therefore sweep ak from a using the mathematical operation defined in FUN; how we sweep depends on MARGIN; since a is a 3x3 matrix, we can sweep ak from a either column-wise or row-wise.

In the former case (column-wise, MARGIN = 2):

sweep(a, 2, ak, FUN = "+")
# [,1] [,2] [,3]
#[1,] 101 204 307
#[2,] 102 205 308
#[3,] 103 206 309

and in the latter case (row-wise, MARGIN = 1):

sweep(a, 1, ak, FUN = "+")
# [,1] [,2] [,3]
#[1,] 101 104 107
#[2,] 202 205 208
#[3,] 303 306 309

Here we are column/row-wise sweeping by adding (FUN = "+") ak to a.
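
The printed results above are consistent with the inputs in the sketch below. The definition of a matches the one used later in this answer; ak = c(100, 200, 300) is inferred from the output rather than stated, so treat it as an assumption:

```r
a  <- matrix(1:9, 3)
ak <- c(100, 200, 300)  # assumed: inferred from the printed results, not given in the text

# MARGIN = 2: ak[j] is added to every element of column j.
col_wise <- sweep(a, 2, ak, FUN = "+")

# MARGIN = 1: ak[i] is added to every element of row i.
row_wise <- sweep(a, 1, ak, FUN = "+")

print(col_wise)
print(row_wise)
```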


You can also supply your own function. For example, to add ak column-wise to the squared values of a, use sweep as follows:

sweep(a, 2, ak, FUN = function(x, y) x^2 + y)
# [,1] [,2] [,3]
#[1,] 101 216 349
#[2,] 104 225 364
#[3,] 109 236 381

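sweep calls FUN only once, so FUN must be vectorized: the first function argument x is the whole matrix a, and the second function argument y is ak expanded to an array with the same dimensions as a (here every column is filled with its matching entry of ak, because MARGIN = 2).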
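
One way to check what FUN actually receives is to record the shape of its arguments from inside the function. The sketch below does that for the call above; the `seen` environment is a hypothetical helper introduced only for this illustration, and ak = c(100, 200, 300) is assumed as before:

```r
a  <- matrix(1:9, 3)
ak <- c(100, 200, 300)  # assumed: inferred from the printed results

seen <- new.env()       # hypothetical helper: stores what FUN was called with

res <- sweep(a, 2, ak, FUN = function(x, y) {
  seen$calls <- if (is.null(seen$calls)) 1L else seen$calls + 1L
  seen$x_dim <- dim(x)  # dimensions of the first argument
  seen$y     <- y       # the expanded STATS
  x^2 + y
})

print(seen$calls)  # how many times FUN was invoked
print(seen$x_dim)  # shape of x
print(seen$y)      # STATS laid out to match a
```

Because both arguments arrive as full arrays of equal shape, any element-wise vectorized function of two arguments can be used as FUN.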


It is important to ensure that dimensions "match"; for example, if we do

a <- matrix(1:9, 3)
ak <- c(100, 200)
sweep(a, 2, ak, FUN = "+")

we get the warning

Warning message:
In sweep(a, 2, ak, FUN = "+") :
STATS does not recycle exactly across MARGIN

as we are trying to add ak to a column-wise, but a has 3 columns and ak only 2 entries.
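
If you would rather detect the mismatch programmatically than read it off the console, a sketch along the following lines may help. It captures the warning with tryCatch() and also shows a plain length check that could be run before calling sweep; the variable names msg and lengths_match are illustrative only:

```r
a  <- matrix(1:9, 3)
ak <- c(100, 200)

# Capture the warning text instead of letting it print; NA means no warning was raised.
msg <- tryCatch(
  {
    sweep(a, 2, ak, FUN = "+")
    NA_character_
  },
  warning = function(w) conditionMessage(w)
)
print(msg)

# A simple pre-check: STATS should have one entry per column when MARGIN = 2.
lengths_match <- length(ak) == dim(a)[2]
print(lengths_match)
```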

Why does R sweep function use 1 for column and 2 for row?

Try thinking of the arguments of sweep in a very verbose way:

sweep( mydata, margin, stats = stats_to_combine_with_each_element_along_the_margin )

I think this makes it clearer that stats must be the same length as dim( mydata )[ margin ], and that the elements of stats will be subtracted (or applied using another function) from all the elements along the margin chosen by margin.

Like this, when margin is 1, it's (hopefully) intuitive to see that each element of stats will be subtracted from each element along the rows of mydata.

You could also picture it as a loop (which gives identical output):

mydata = matrix(rep(1:12,each=8),nrow=8,ncol=12)

# create a stat for each row:
stats_for_each_row = 1:8

# sweep with margin=1, so combining each element of stat with elements of each row:
s = sweep( mydata, 1, stats_for_each_row )

# loop over each row, changing mydata
for(row in seq_len(nrow(mydata))) { mydata[row,] = mydata[row,]-stats_for_each_row[row] }

identical( s, mydata )
# TRUE

Compare rows with sweep function - how to do it properly?

Use sweep with MARGIN = 2.

sweep(gap_size_uncut[1,], 2, NA_Count[1,], `<=`)

# a b c d e f g h i j k l m
#1 FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE

If the dimensions are the same, you can compare the two data frames directly without sweep.

gap_size_uncut <= NA_Count

# a b c d e f g h i j k l m
#[1,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE TRUE
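
The data behind gap_size_uncut and NA_Count is not shown, so the sketch below uses small made-up stand-ins (a numeric matrix `gap` and a per-column threshold vector `na_count`, both hypothetical) to illustrate the same column-wise comparison:

```r
# Hypothetical stand-ins for the objects in the question.
gap <- matrix(c(3, 1, 4,
                2, 5, 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("a", "b", "c")))
na_count <- c(a = 2, b = 2, c = 5)

# MARGIN = 2: each column of gap is compared against its own threshold.
res <- sweep(gap, 2, na_count, `<=`)
print(res)
```

With plain matrices and a numeric STATS vector, the result is a logical matrix of the same shape as the input.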

An intuitive way to understand MARGIN in sweep and apply

The MARGIN argument means exactly the same thing in both functions: MARGIN = 1 is a row-wise operation. I have been confused by sweep many times in the past, but I think here you are confused by apply.

I am printing the matrix below so that it is easy to visually compare with apply and sweep later on:

> m
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 3 6

First of all the sweep function does a row-wise operation when MARGIN is 1. I will slightly change the third argument so that this is more obvious:

> sweep(m, MARGIN = 1, 1:3, "-")
[,1] [,2]
[1,] 0 3
[2,] 0 3
[3,] 0 3

In the above case number 1 was deducted from row 1, number 2 from row 2 and number 3 from row 3. So, clearly this is a row-wise operation.

Now let's see below the apply function:

> apply(m, MARGIN = 1, sum)
[1] 5 7 9

Clearly, the matrix has 3 rows and 2 columns. It is easy to infer that this is also a row-wise operation, since we get 3 results, i.e. the same as the number of rows. This is also confirmed if we check the numbers: row 1 sums to 5, row 2 to 7 and row 3 to 9.

So, clearly MARGIN in both cases implies a row-wise operation.
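
To see the same correspondence for columns, here is a short sketch that rebuilds the matrix m printed above and runs both functions with MARGIN = 1 and MARGIN = 2. The column statistics c(1, 4) are chosen here only for illustration:

```r
m <- matrix(1:6, nrow = 3)  # same values as the matrix m printed above

# MARGIN = 1: one value per row in sweep, one result per row in apply.
row_sweep <- sweep(m, MARGIN = 1, 1:3, "-")
row_sums  <- apply(m, MARGIN = 1, sum)

# MARGIN = 2: one value per column in sweep, one result per column in apply.
col_sweep <- sweep(m, MARGIN = 2, c(1, 4), "-")
col_sums  <- apply(m, MARGIN = 2, sum)

print(row_sums)
print(col_sums)
print(col_sweep)
```

In both functions the length tied to MARGIN (STATS for sweep, the result for apply) equals dim(m)[MARGIN].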

Multidimensional STAT in sweep function of R

Using a simple for loop.

res <- array(dim=dim(a))
for (i in seq_len(dim(a)[3])) res[, , i] <- a[, , i]/a[, 1, i]

res
# , , 1
#
# [,1] [,2] [,3]
# [1,] 1 5.000000 9.000000
# [2,] 1 3.000000 5.000000
# [3,] 1 2.333333 3.666667
# [4,] 1 2.000000 3.000000
#
# , , 2
#
# [,1] [,2] [,3]
# [1,] 1 1.307692 1.615385
# [2,] 1 1.285714 1.571429
# [3,] 1 1.266667 1.533333
# [4,] 1 1.250000 1.500000

This appears to outperform *apply functions.

a <- array(rnorm(1e3), c(4, 3, 20000))
microbenchmark::microbenchmark(
  sapply = array(sapply(seq_len(dim(a)[3]), \(x) a[, , x] / a[, 1, x]), dim = dim(a)),
  vapply = array(vapply(seq_len(dim(a)[3]), \(x) a[, , x] / a[, 1, x],
                        vector('numeric', prod(dim(a)[1:2]))), dim = dim(a)),
  `for` = {
    res <- array(dim = dim(a))
    for (i in seq_len(dim(a)[3])) res[, , i] <- a[, , i] / a[, 1, i]
  },
  apply = apply(a, 3, function(x) x / x[, 1])
)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# sapply 73.22494 78.46897 88.27141 86.18902 90.08711 232.8461 100 b
# vapply 72.03503 78.24985 86.89068 84.87590 90.58196 226.0209 100 b
# for 58.87842 64.48287 74.33136 70.23162 77.25822 209.9818 100 a
# apply 110.48710 118.79229 130.51282 124.47029 134.64933 294.0801 100 c
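
For comparison, the same division can also be written with sweep itself by passing a two-dimensional STATS over MARGIN = c(1, 3). This is only a sketch and was not part of the benchmark above; it assumes the small example array is array(1:24, c(4, 3, 2)), which is consistent with the printed result but is not stated in the answer:

```r
a <- array(1:24, c(4, 3, 2))  # assumed input, inferred from the printed output

# Reference: the loop from the answer.
res_loop <- array(dim = dim(a))
for (i in seq_len(dim(a)[3])) res_loop[, , i] <- a[, , i] / a[, 1, i]

# sweep version: a[, 1, ] is a 4 x 2 matrix indexed by dimensions 1 and 3,
# so element [i, j, k] of a is divided by a[i, 1, k].
res_sweep <- sweep(a, MARGIN = c(1, 3), STATS = a[, 1, ], FUN = "/")

print(all.equal(res_loop, res_sweep))
```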

How does the R sweep function work with a multi-dimensional array?

The dimensions of the statistics given in STATS should be the same as the dimensions resulting from MARGINalizing the data in the input array or, although not recommended, a size that is a sub-multiple of the number of elements in that result (e.g. length 2 in a 2x3 array; or a 2x4 array in a 2x4x3 array; or 2x2 array in a 2x4x3 array, etc.).

In order to understand the dimensions resulting from MARGINalizing the data, let's look at an example:

# Example data in a 3D array of size 2x3x4
set.seed(1717)
x = array(runif(2*3*4), c(2,3,4))

# We MARGINalize the data by computing the mean on all dimensions *other than*
# the stated ones: (1, 3)
# This gives a 2D result whose dimension is of size
# "length of dim 1" x "length of dim 3", i.e. 2x4
marginalize_on_dims = c(1,3)
m = apply(x, marginalize_on_dims, mean)

which results in the following 2x4 "means" array:

> m
[,1] [,2] [,3] [,4]
[1,] 0.3662613 0.2971481 0.155660 0.5121214
[2,] 0.5808111 0.7322553 0.662044 0.4984720

We now sweep out the computed means m from the original x array:

x_swept_out_of_means_m = sweep(x, STATS=m, MARGIN=marginalize_on_dims)

which results in:

> x_swept_out_of_means_m
, , 1

[,1] [,2] [,3]
[1,] -0.2934119 -0.3224825 0.6158943
[2,] -0.4540748 0.1814070 0.2726678

, , 2

[,1] [,2] [,3]
[1,] -0.1452443 0.3631910 -0.21794673
[2,] -0.1205201 0.0873856 0.03313448

, , 3

[,1] [,2] [,3]
[1,] -0.0766162667 -0.14700413 0.22362039
[2,] 0.0006661599 0.05828265 -0.05894881

, , 4

[,1] [,2] [,3]
[1,] 0.2341822 -0.4071083 0.1729261
[2,] -0.2680816 0.4772658 -0.2091843

We now note that the summary of the swept-out result shows a mean of 0, which is consistent with having subtracted the mean:

> summary(x_swept_out_of_means_m)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.45407 -0.21137 -0.02914 0.00000 0.19196 0.61589
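
The overall mean reported by summary() is a fairly coarse check. A slightly stricter one, sketched below with the same seed and dimensions as above, recomputes the means over the same margins and confirms each of them is numerically zero (the name residual_means is illustrative):

```r
set.seed(1717)
x <- array(runif(2 * 3 * 4), c(2, 3, 4))

# Means over dimension 2, kept for every combination of dimensions 1 and 3 (a 2 x 4 matrix).
m <- apply(x, c(1, 3), mean)

# Subtract (the default FUN is "-") each mean from the elements it was computed from.
swept <- sweep(x, MARGIN = c(1, 3), STATS = m)

# After sweeping, every one of the 2 x 4 group means should be ~0.
residual_means <- apply(swept, c(1, 3), mean)
print(max(abs(residual_means)))
```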

Therefore in your example, since you are marginalizing on dimensions 1 and 2, you should use a STATS value that is of dimension 2x3, for instance:

x <- array(1, dim=c(2,3,4,5))
sweep(x, STATS=matrix(data=c(2,3,-2,4,0,-3), nrow=2), MARGIN=c(1,2), FUN='*')

where the result should be a 2x3x4x5 array with the following 2x3 array repeated 4 x 5 = 20 times:

     [,1] [,2] [,3]
[1,]    2   -2    0
[2,]    3    4   -3
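
The claim about the repeated 2x3 block can be checked directly. The sketch below builds STATS with nrow = 2 (so that it is a 2x3 matrix) and then tests every slice over dimensions 3 and 4; the names stats and slice_matches are illustrative:

```r
x     <- array(1, dim = c(2, 3, 4, 5))
stats <- matrix(c(2, 3, -2, 4, 0, -3), nrow = 2)  # 2 x 3, filled column by column

res <- sweep(x, STATS = stats, MARGIN = c(1, 2), FUN = "*")

# For each of the 4 x 5 combinations of dimensions 3 and 4,
# check that the 2 x 3 slice equals stats.
slice_matches <- apply(res, c(3, 4), function(s) all(s == stats))

print(dim(res))
print(all(slice_matches))
```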

Session Info:

> sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Can you implement 'sweep' using apply in R?

This is close:

t(apply(df, 1, `*`, c(5,10)))

The row names are lost, but otherwise the output is the same:

> t(apply(df, 1, '*', c(5,10)))
a b
[1,] 5 20
[2,] 10 30
[3,] 15 40

To break this down, say we were doing this by hand for the first row of df, we'd write

> df[1, ] * c(5, 10)
a b
1 5 20

which is the same as calling the '*'() function with arguments df[1, ] and c(5, 10)

> '*'(df[1, ], c(5, 10))
a b
1 5 20

From this, we have enough to set up an apply() call:

  1. we work by rows, hence MARGIN = 1,
  2. we apply the function '*'() so FUN = '*'
  3. we need to supply the second argument, c(5,10), to '*'(), which we do via the ... argument of apply().

The only extra thing to realise is how apply() sticks together the vector resulting from each "iteration"; here they are bound column-wise and hence we need to transpose the result from apply() so that we get the same output as sweep().
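
Putting the pieces together, the following sketch compares the apply() version with sweep. The data frame df is not shown in the answer; df <- data.frame(a = 1:3, b = 2:4) is inferred from the printed output and is therefore an assumption, and the sweep call is run on as.matrix(df) to keep the comparison between two matrices:

```r
df <- data.frame(a = 1:3, b = 2:4)  # assumed: inferred from the printed output

# apply() returns one column per row of df, hence the transpose.
via_apply <- t(apply(df, 1, `*`, c(5, 10)))

# sweep over columns (MARGIN = 2): column a is multiplied by 5, column b by 10.
via_sweep <- sweep(as.matrix(df), 2, c(5, 10), `*`)

print(via_apply)
print(all(via_apply == via_sweep))
```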

Apply a function from different list in R using sweep function

Converting the data frames to matrices inside the sweep calls produced the desired output.

form <- Map(function(x, y)
  abs(sweep(as.matrix(x), 1, as.matrix(y), FUN = "-")) /
    sweep(abs(as.matrix(x)), 1, abs(as.matrix(y)), FUN = "+"),
  lista, listb)
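
The contents of lista and listb are not shown, so here is a small sketch with made-up inputs: each element of lista is a data frame of values, and the matching element of listb is a one-column data frame with one reference value per row. These shapes, and the column names p, q and ref, are assumptions made only for illustration:

```r
# Hypothetical inputs: one pair of data frames with matching row counts.
lista <- list(data.frame(p = c(1, 2), q = c(3, 4)))
listb <- list(data.frame(ref = c(1, 4)))

form <- Map(function(x, y) {
  xm <- as.matrix(x)
  ym <- as.matrix(y)
  # Row-wise |x - y| / (|x| + |y|), with y supplying one value per row.
  abs(sweep(xm, 1, ym, FUN = "-")) / sweep(abs(xm), 1, abs(ym), FUN = "+")
}, lista, listb)

print(form[[1]])
```

Map() walks the two lists in parallel, so lista and listb need the same length and each pair needs a compatible number of rows.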

