Creating 2D Bins in R

You can get the binned data using the bin2 function from the ash package.

Regarding the sparsity of data in the region around the red point, one possible solution is the average shifted histogram (ASH). It bins your data after shifting the bin origin several times and averaging the bin counts, which alleviates the sensitivity to the choice of origin: e.g., imagine how the number of points in the bin containing the red point changes depending on whether the red point sits at the top left or the bottom right of its bin.
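
Since the original x and y are not shown, here is a hypothetical stand-in so the snippet below runs end to end (the distribution is an arbitrary assumption):

set.seed(1)
# hypothetical stand-in data; the real x and y come from the question
x <- rnorm(1000)
y <- x + rnorm(1000, sd = 0.5)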

library(ash)

bins <- bin2(cbind(x, y))     # bivariate binning of the (x, y) pairs
f <- ash2(bins, m = c(5, 5))  # average shifted histogram estimate

image(f$x, f$y, f$z)
contour(f$x, f$y, f$z, add = TRUE)

If you would like smoother bins, you could try increasing the argument m, which is a vector of length 2 controlling the smoothing parameters along each variable.

f2 <- ash2(bins, m = c(10, 10))
image(f2$x, f2$y, f2$z)
contour(f2$x, f2$y, f2$z, add = TRUE)

Compare f and f2 to see the effect of the larger smoothing parameter.

The binning algorithm is implemented in Fortran and is very fast.
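
As a rough illustration (a sketch; the sample size and nbin values are arbitrary choices, and timings vary by machine):

set.seed(2)
xy <- cbind(rnorm(1e6), rnorm(1e6))
system.time(bins <- bin2(xy, nbin = c(200, 200)))  # typically a fraction of a second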

Summarise X,Y,theta data in 2D bins before passing to `geom_spoke`

Here's one approach. You'll need to determine the number and range of bins in each dimension (x and y) once; everything else is handled by the code:

# adjust range & number of bins here
x.range <- pretty(dat$x, n = 3)
y.range <- pretty(dat$y, n = 3)

> x.range
[1] 0 50 100
> y.range
[1] 0 50 100

Automatically assign each row to a bin based on which x & y intervals it falls into:

library(dplyr)

dat <- dat %>%
  rowwise() %>%
  mutate(x.bin = max(which(x > x.range)),
         y.bin = max(which(y > y.range)),
         bin = paste(x.bin, y.bin, sep = "_")) %>%
  ungroup()

> head(dat)
# A tibble: 6 x 6
         x        y    angle x.bin y.bin   bin
     <dbl>    <dbl>    <dbl> <int> <int> <chr>
1 26.55087 65.47239 1.680804     1     2   1_2
2 37.21239 35.31973 1.373789     1     1   1_1
3 57.28534 27.02601 3.247130     2     1   2_1
4 90.82078 99.26841 1.689866     2     2   2_2
5 20.16819 63.34933 1.138314     1     2   1_2
6 89.83897 21.32081 3.258310     2     1   2_1
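
As an aside (not part of the original answer), the rowwise() / max(which(...)) pattern can be replaced by the vectorised base function findInterval(), which is faster on large data and sidesteps the edge case where a value equals the lowest break. A sketch:

# vectorised alternative (sketch): findInterval() returns the index of the
# interval each value falls into; all.inside = TRUE clamps boundary values
dat <- dat %>%
  mutate(x.bin = findInterval(x, x.range, all.inside = TRUE),
         y.bin = findInterval(y, y.range, all.inside = TRUE),
         bin = paste(x.bin, y.bin, sep = "_"))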

Calculate the mean values for each bin:

dat <- dat %>%
  group_by(bin) %>%
  mutate(x.mean = mean(x),
         y.mean = mean(y),
         angle.mean = mean(angle),
         n = n()) %>%
  ungroup()

> head(dat)
# A tibble: 6 x 10
         x        y    angle x.bin y.bin   bin   x.mean   y.mean angle.mean     n
     <dbl>    <dbl>    <dbl> <int> <int> <chr>    <dbl>    <dbl>      <dbl> <int>
1 26.55087 65.47239 1.680804     1     2   1_2 26.66662 68.56461   2.672454    29
2 37.21239 35.31973 1.373789     1     1   1_1 33.05887 28.86027   2.173177    23
3 57.28534 27.02601 3.247130     2     1   2_1 74.71214 24.99131   3.071629    23
4 90.82078 99.26841 1.689866     2     2   2_2 77.05622 77.91031   3.007859    25
5 20.16819 63.34933 1.138314     1     2   1_2 26.66662 68.56461   2.672454    29
6 89.83897 21.32081 3.258310     2     1   2_1 74.71214 24.99131   3.071629    23

Plot without hard-coding any bin number / bin width:

library(ggplot2)

ggplot(dat, aes(x, y, fill = bin)) +
  geom_bin2d(binwidth = c(diff(x.range)[1],
                          diff(y.range)[1])) +
  geom_point(aes(x = x.mean, y = y.mean)) +
  geom_spoke(aes(x = x.mean, y = y.mean, angle = angle.mean, radius = n / 2),
             arrow = arrow(length = unit(0.2, "cm"))) +
  coord_equal()


Other details such as the choice of fill palette, legend label, and plot title can be tweaked subsequently.

Packing the 2D data into same-size bins

Assumptions

  • the source is an array of some dimensions, meaning we have a datum for each cell; this has not been tested on sparse matrices, though I don't see why it would not work in theory; and

  • when you say bins of the same size, I'm assuming you will accept "approximate" when the dimensions of the original matrix do not divide evenly.

Some perks of this implementation, as obscure or difficult-to-read as it may be:

  • I think it will work with arbitrary dimensionality of the source array; though I've tested it with some cases, I am not 100% confident that it is fool-proof; and

  • it will allow arbitrary summary functions, defaulting to mean; this might be useful when reducing a matrix but still desiring metrics of its central tendency (e.g., mean, median) and dispersion (e.g., sd, var, range, IQR). Obviously any number of functions can be used here, so long as they accept as their first argument an array of arbitrary dimensionality and nothing else.

The Code

reduceMatrix <- function(x, newdim, func = mean) {
  if (length(newdim) == 1)
    newdim <- rep(newdim, length(dim(x)))
  if (length(dim(x)) != length(newdim))
    stop('newdim must be of length 1 or the same length as dimensions of x')

  # for each dimension, split the index range 1:dim(x)[d] into newdim[d]
  # roughly equal-length runs of indices
  allCuts <- lapply(1:length(newdim), function(d) {
    tmp <- round(sapply(dim(x)[d], function(y) seq(1, 1 + y, len = 1 + newdim[d])),
                 digits = 0)
    mapply(seq, head(tmp, n = -1), tmp[-1] - 1, SIMPLIFY = FALSE)
  })

  # every combination of bin indices across all dimensions
  newIndices <- lapply(newdim, function(d) seq(1, d))
  eg <- do.call(expand.grid, newIndices)

  # extract each sub-array with `[` and apply the summary function to it
  f <- function(m, cuts, ...) func(do.call(`[`, c(list(m), mapply(`[`, cuts, list(...)))))
  ret <- do.call(mapply, c(list(FUN = f), list(list(x)), list(list(allCuts)), eg))

  array(ret, dim = newdim)
}

Simple 2D Matrix

dim2 <- c(8,8)
mtx2 <- array(1:prod(dim2), dim = dim2)
mtx2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 9 17 25 33 41 49 57
## [2,] 2 10 18 26 34 42 50 58
## [3,] 3 11 19 27 35 43 51 59
## [4,] 4 12 20 28 36 44 52 60
## [5,] 5 13 21 29 37 45 53 61
## [6,] 6 14 22 30 38 46 54 62
## [7,] 7 15 23 31 39 47 55 63
## [8,] 8 16 24 32 40 48 56 64
reduceMatrix(mtx2, newdim = 2)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
matrix(c(mean(mtx2[1:4, 1:4]), mean(mtx2[1:4, 5:8]),
         mean(mtx2[5:8, 1:4]), mean(mtx2[5:8, 5:8])),
       nrow = 2, byrow = TRUE)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5

A Third Dimension

dim3 <- c(4,4,16)
mtx3 <- array(1:prod(dim3), dim = dim3)
mtx3[,,1:2]
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 17 21 25 29
## [2,] 18 22 26 30
## [3,] 19 23 27 31
## [4,] 20 24 28 32
reduceMatrix(mtx3, newdim = c(2, 4, 2))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 57.5 61.5 65.5 69.5
## [2,] 59.5 63.5 67.5 71.5
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 185.5 189.5 193.5 197.5
## [2,] 187.5 191.5 195.5 199.5
mean(mtx3[1:2, 4, 1:8])
## [1] 69.5

This can even reduce the depth, collapsing from 3D to 2D (in this case):

reduceMatrix(mtx3, newdim = c(2, 4, 1))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5
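
Note that the result above still carries a trailing dimension of extent 1 (it is a 2 x 4 x 1 array). Wrapping the call in drop() removes it, a minor convenience not in the original:

drop(reduceMatrix(mtx3, newdim = c(2, 4, 1)))
##       [,1]  [,2]  [,3]  [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5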

Arbitrary Functions

reduceMatrix(mtx3, newdim = c(2, 4, 1), func = min)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 3 7 11 15
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = sd)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 74.93825 74.93825 74.93825 74.93825
## [2,] 74.93825 74.93825 74.93825 74.93825

In this case, the results from sd are not surprising given the data's sequential nature ...
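
Since func receives only the sub-array and nothing else, functions that need extra arguments can be supplied through an anonymous wrapper; a sketch (the 0.9 quantile is an arbitrary choice):

## wrap quantile() so the sub-array is the only argument func must accept
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = function(z) quantile(z, 0.9))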

Much Larger Matrix

hugedim <- c(100, 100, 100, 100)
## this may take a few seconds ...
hugemtx <- array(rnorm(prod(hugedim)), dim = hugedim)
reduceMatrix(hugemtx, newdim = c(4, 4, 4, 4))[,,1,1]
## [,1] [,2] [,3] [,4]
## [1,] -0.001348164 -0.0007825642 -0.000369729 -0.0021077956
## [2,] 0.001736173 0.0023445047 -0.001259591 -0.0024143239
## [3,] -0.001820090 -0.0011529691 -0.001941874 -0.0005125225
## [4,] 0.002675196 -0.0009186329 -0.002009339 -0.0009181220
reduceMatrix(hugemtx, newdim = c(4, 4, 2, 1), func = sd)
## , , 1, 1
## [,1] [,2] [,3] [,4]
## [1,] 1.000269 1.000427 1.000454 0.9999282
## [2,] 1.000533 0.999885 1.000669 1.0003171
## [3,] 1.000580 1.000232 1.000167 0.9999460
## [4,] 1.000744 1.000150 1.000510 0.9994936
## , , 2, 1
## [,1] [,2] [,3] [,4]
## [1,] 0.9997954 0.9991735 1.000129 0.9996085
## [2,] 0.9999844 1.0002422 1.000016 0.9999593
## [3,] 1.0002473 1.0001445 1.000639 0.9998599
## [4,] 1.0005914 0.9993628 0.999104 1.0002434

(I realize on re-reading your problem statement for the seventeenth time that your output is not what I have been envisioning. I wonder if this can be adapted ...)

R 2D binning of data frame with secondary complex calculations

Here's another faster way to do it, one that includes unpopulated bin combinations:

fasterWay <- function(df.data) {
  # sum vx/vy (columns 3:4) within each 3-unit x/y bin
  a1 <- aggregate(df.data[, 3:4],
                  list(x = floor(df.data$x / 3), y = floor(df.data$y / 3)), sum)
  # count the points in each bin
  a2 <- aggregate(list(count = rep(NA, nrow(df.data))),
                  list(x = floor(df.data$x / 3), y = floor(df.data$y / 3)), length)
  # expand.grid supplies the unpopulated bin combinations
  result <- merge(expand.grid(y = 0:3, x = 0:3), merge(a1, a2),
                  by = c("x", "y"), all = TRUE)
  result[is.na(result)] <- 0
  result <- result[order(result$y, result$x), ]
  rownames(result) <- NULL
  result
}

It gives me:

   x y vx vy count
1  0 0  0  0     1
2  0 1  0  0     0
3  0 2 -1 -1     1
4  0 3  0  0     0
5  1 0 -1 -1     1
6  1 1  0  0     0
7  1 2  0  0     0
8  1 3 -1  0     2
9  2 0 -1 -1     1
10 2 1  0  0     0
11 2 2 -1  1     2
12 2 3  0  0     1
13 3 0  0  0     0
14 3 1  0  0     0
15 3 2 -1  0     1
16 3 3  0  0     0
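
For comparison (not part of the original answer), here is a tidyverse sketch of the same aggregation, assuming df.data has columns x, y, vx, and vy as the output suggests; tidyr::complete() supplies the unpopulated bin combinations:

library(dplyr)
library(tidyr)

tidyWay <- function(df.data) {
  df.data %>%
    mutate(x = floor(x / 3), y = floor(y / 3)) %>%   # 3-unit bins, as above
    group_by(x, y) %>%
    summarise(vx = sum(vx), vy = sum(vy), count = n(), .groups = "drop") %>%
    complete(x = 0:3, y = 0:3, fill = list(vx = 0, vy = 0, count = 0)) %>%
    arrange(y, x)
}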

R generate 2D histogram from raw data

The ggplot solution is elegant, fast, and pretty, as usual. But if you want to use base graphics (image, contour, persp) and display your actual frequencies (instead of a smoothed 2D kernel estimate), you first have to do the binning yourself and build a matrix of frequencies. Here's some code (not necessarily elegant, but pretty robust) that does 2D binning and generates plots somewhat similar to the ones above:

require(mvtnorm)
xy <- rmvnorm(1000, c(5, 10), sigma = rbind(c(3, -2), c(-2, 3)))

nbins <- 20
x.bin <- seq(floor(min(xy[, 1])), ceiling(max(xy[, 1])), length = nbins)
y.bin <- seq(floor(min(xy[, 2])), ceiling(max(xy[, 2])), length = nbins)

# tabulate which bin each point falls in along each axis
freq <- as.data.frame(table(findInterval(xy[, 1], x.bin),
                            findInterval(xy[, 2], y.bin)))
freq[, 1] <- as.numeric(freq[, 1])
freq[, 2] <- as.numeric(freq[, 2])

# fill an nbins x nbins matrix of frequencies
freq2D <- matrix(0, nrow = nbins, ncol = nbins)
freq2D[cbind(freq[, 1], freq[, 2])] <- freq[, 3]

par(mfrow = c(1, 2))
image(x.bin, y.bin, freq2D, col = topo.colors(max(freq2D)))
contour(x.bin, y.bin, freq2D, add = TRUE, col = rgb(1, 1, 1, .7))

# colour each persp facet by the mean of its four corner frequencies
# (dropping the last row/column, i.e. -nbins, pairs the corners correctly)
palette(rainbow(max(freq2D)))
cols <- (freq2D[-1, -1] + freq2D[-1, -nbins] +
         freq2D[-nbins, -nbins] + freq2D[-nbins, -1]) / 4
persp(freq2D, col = cols)


For a really fun time, try making an interactive, zoomable, 3D surface:

require(rgl)
surface3d(x.bin, y.bin, freq2D / 10, col = "red")
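
If the rgl scene comes up without context, axes and labels can be added with the standard rgl helpers; a small sketch:

axes3d()                                    # add bounding axes to the scene
title3d(xlab = "x", ylab = "y", zlab = "freq / 10")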


2D irregular aggregation of a matrix

This answer provides a great starting point using tapply (here, a is the input matrix and breaks is the vector of cut points from the question):

library(reshape2)  # for melt()

b <- melt(a)

bb <- with(b, tapply(value,
                     list(y = cut(Var1, breaks = c(0, breaks, Inf), include.lowest = TRUE),
                          x = cut(Var2, breaks = c(0, breaks, Inf), include.lowest = TRUE)),
                     sum))

bb
# x
# y [0,12] (12,14] (14,25] (25,60] (60,71] (71,89] (89,Inf]
# [0,12] 297 48 260 825 242 416 246
# (12,14] 48 3 43 141 46 59 33
# (14,25] 260 43 261 794 250 369 240
# (25,60] 825 141 794 2545 730 1303 778
# (60,71] 242 46 250 730 193 394 225
# (71,89] 416 59 369 1303 394 597 369
# (89,Inf] 246 33 240 778 225 369 230

These can then be plotted as rectangular bins using a base plot and rect():

library("reshape2")
library("magrittr")

bsq <- melt(bb)

# convert range notation to numerics
getNum <- . %>%
# rm brackets
gsub("\\[|\\(|\\]|\\)", "", .) %>%
# split digits and convert
strsplit(",") %>%
unlist %>% as.numeric

y <- t(sapply(bsq[,1], getNum))
x <- t(sapply(bsq[,2], getNum))

# normalise bin intensity by area
bsq$size <- (y[,2] - y[,1]) * (x[,2] - x[,1])
bsq$norm <- bsq$value / bsq$size

# draw rectangles on top of empty plot
plot(1:100, 1:100, type="n", frame=F, axes=F)
rect(ybottom=y[,1], ytop=y[,2],
xleft=x[,1], xright=x[,2],
col=rgb(colorRamp(c("white", "steelblue4"))(bsq$norm / max(bsq$norm)),
alpha=255*(bsq$norm / max(bsq$norm)), max=255),
border="white")

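As a ggplot2 alternative (a sketch, not part of the original answer), the same rectangles can be drawn with geom_rect, reusing the x, y, and bsq objects computed above; the Inf endpoints of the open-ended bins simply extend to the panel edge:

library(ggplot2)

rects <- data.frame(xmin = x[, 1], xmax = x[, 2],
                    ymin = y[, 1], ymax = y[, 2],
                    norm = bsq$norm)
ggplot(rects) +
  geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax,
                fill = norm, alpha = norm), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue4") +
  theme_minimal()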

Binning values in R with multiple files

There are several ways to do it; I'll provide one method using base functions. (An alternative would be to use dplyr, which is also well suited for this, but the base example should be simple enough.)

Generate Data

(This is here merely because we don't have any of your data.)

n <- 10
for (ii in 1:3) {
  dat <- runif(n)
  writeLines(paste(dat, collapse = ','),
             con = sprintf('user2062207-file%s.txt', ii))
}
readLines('user2062207-file1.txt')
## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"

Read the Data

This is where you'll start, assuming you have a simple pattern for finding the files.

fnames <- list.files(pattern = 'user2062207-file.*.txt')
allData <- unlist(sapply(fnames, read.table, sep = ','))
allRange <- range(allData)
df <- data.frame(x = allData)
head(df)
## x
## 1 0.9294723
## 2 0.9219381
## 3 0.7077764
## 4 0.2367014
## 5 0.2713224
## 6 0.3887664
dim(df)
## [1] 30 1

Set the Bins

The floor/ceiling scaling below puts the bins on rounded boundaries and makes the break sequence cover the full data range. Note that since each bin includes only its right side by default, a value falling exactly on the lowest break would be left unbinned (cut's include.lowest argument can guard against this).

binSize <- 0.05
allBins <- seq(floor(allRange[1] / binSize) * binSize,
               ceiling(allRange[2] / binSize) * binSize,
               by = binSize)
## bin the data
df$bin <- cut(df$x, breaks = allBins)
head(df)
## x bin
## 1 0.9294723 (0.9,0.95]
## 2 0.9219381 (0.9,0.95]
## 3 0.7077764 (0.7,0.75]
## 4 0.2367014 (0.2,0.25]
## 5 0.2713224 (0.25,0.3]
## 6 0.3887664 (0.35,0.4]

Statistics on Each Bin

sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971
## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7]
## 0.38009637 0.42184059 NA 0.53826028 0.57820253 0.64165116 0.67825992
## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95] (0.95,1]
## 0.74243926 NA 0.80759621 0.88974267 0.92406966 0.95691077

This is an area where numerous other options could be advantageous. For instance, the base function by can work, though dealing with its data structure is not always intuitive even if the function call itself is easy to read:

head(by(df$x, df$bin, median, na.rm = TRUE))
## df$bin
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242

You could also use dplyr with ease. This example starts with the original allData and allBins:

library(dplyr)
data.frame(x = allData) %>%
  mutate(bin = cut(x, breaks = allBins)) %>%
  group_by(bin) %>%
  summarise(median(x))
## Source: local data frame [17 x 2]
## bin median(x)
## 1 (0,0.05] 0.03802277
## 2 (0.1,0.15] 0.11528715
## 3 (0.15,0.2] 0.18195392
## 4 (0.2,0.25] 0.22918094
## 5 (0.25,0.3] 0.27132242
#### ..snip..

The first example preserves empty bins, whereas the other methods are not aware of them. There are other ways of using by and dplyr that would include these empty bins; one dplyr variant is sketched below.
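
Since cut() returns a factor, the .drop = FALSE argument to group_by() (available in dplyr >= 0.8) retains unused levels, yielding NA medians for the empty bins; a sketch:

data.frame(x = allData) %>%
  mutate(bin = cut(x, breaks = allBins)) %>%
  group_by(bin, .drop = FALSE) %>%   # keep empty factor levels as groups
  summarise(med = median(x))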

EDIT

After a bit of chat, it was determined that the range of the data was too wide with a bin width of 0.0005. A better solution was devised. (No sample data to provide, sorry, not mine to give ...) I'll use random data to mimic the process:

set.seed(42)
x <- 5e7 * runif(5e5)

library(dplyr)
binSize <- 0.0005
df <- data.frame(dat = sort(x))
df$bin <- floor(df$dat / binSize) * binSize
head(df)
## dat bin
## 1 410.9577 410.9575
## 2 456.6275 456.6270
## 3 552.3674 552.3670
## 4 875.4898 875.4895
## 5 1018.6806 1018.6805
## 6 1102.2436 1102.2435
system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
## user system elapsed
## 12.08 0.00 12.11
head(results)
## Source: local data frame [6 x 2]
## bin med
## 1 410.9575 410.9577
## 2 456.6270 456.6275
## 3 552.3670 552.3674
## 4 875.4895 875.4898
## 5 1018.6805 1018.6806
## 6 1102.2435 1102.2436
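
If the twelve seconds above matters, data.table is usually much faster for a grouped median at this scale; a sketch on the same df (timings vary by machine):

library(data.table)
dt <- as.data.table(df)
results2 <- dt[, .(med = median(dat)), by = bin]   # same grouping, data.table syntax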

R code to show the actual continuous values stored in each bin?

You have already done everything you need, except wrap it into a data.frame:

head(data.frame(Values = x, Bin = binned, Rowname = seq_along(x))[order(binned), ])
#       Values          Bin Rowname
# 2  -66.88718 [-189,-64.7]       2
# 5  -99.08521 [-189,-64.7]       5
# 8  -95.06063 [-189,-64.7]       8
# 10 -95.04592 [-189,-64.7]      10
# 15 -78.48819 [-189,-64.7]      15
# 28 -78.49396 [-189,-64.7]      28

You don't need a column for rownames, though, since a data.frame keeps a rownames attribute; see rownames(yourData).
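
If you want the raw values grouped per bin rather than a flat table, base split() gives a named list with one element per bin; a small sketch reusing the same x and binned:

byBin <- split(x, binned)   # list: one numeric vector of values per bin
lengths(byBin)              # how many values landed in each bin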


