Creating 2D bins in R
You can get the binned data using the bin2
function in the ash
library.
Regarding the problem of the sparsity of data in the region around the red point, one possible solution is with the average shifted histogram. It bins your data after shifting the histogram several times and averaging the bin counts. This alleviates the problem of the bin origin. e.g., imagine how the number of points in the bin containing the red point changes if the red point is the topleft of the bin or the bottom right of the bin.
library(ash)
bins <- bin2(cbind(x,y))
f <- ash2(bins, m = c(5,5))
image(f$x,f$y,f$z)
contour(f$x,f$y,f$z,add=TRUE)
If you would like smoother bins, you could try increasing the argument m
, which is a vector of length 2 controlling the smoothing parameters along each variable.
f2 <- ash2(bins, m = c(10,10))
image(f2$x, f2$y, f2$z)
contour(f2$x,f2$y,f2$z,add=TRUE)
Compare f
and f2
The binning algorithm is implemented in fortran and is very fast.
Summarise X,Y,theta data in 2D bins before passing to `geom_spoke`
Here's one approach. You'll need to determine the number / range of bins in each dimension (x & y) once, & everything else should be covered by code:
# adjust range & number of bins here
x.range <- pretty(dat$x, n = 3)
y.range <- pretty(dat$y, n = 3)
> x.range
[1] 0 50 100
> y.range
[1] 0 50 100
Automatically assign each row to a bin based on which x & y intervals it falls into:
dat <- dat %>%
rowwise() %>%
mutate(x.bin = max(which(x > x.range)),
y.bin = max(which(y > y.range)),
bin = paste(x.bin, y.bin, sep = "_")) %>%
ungroup()
> head(dat)
# A tibble: 6 x 6
x y angle x.bin y.bin bin
<dbl> <dbl> <dbl> <int> <int> <chr>
1 26.55087 65.47239 1.680804 1 2 1_2
2 37.21239 35.31973 1.373789 1 1 1_1
3 57.28534 27.02601 3.247130 2 1 2_1
4 90.82078 99.26841 1.689866 2 2 2_2
5 20.16819 63.34933 1.138314 1 2 1_2
6 89.83897 21.32081 3.258310 2 1 2_1
Calculate the mean values for each bin:
dat <- dat %>%
group_by(bin) %>%
mutate(x.mean = mean(x),
y.mean = mean(y),
angle.mean = mean(angle),
n = n()) %>%
ungroup()
> head(dat)
# A tibble: 6 x 10
x y angle x.bin y.bin bin x.mean y.mean angle.mean n
<dbl> <dbl> <dbl> <int> <int> <chr> <dbl> <dbl> <dbl> <int>
1 26.55087 65.47239 1.680804 1 2 1_2 26.66662 68.56461 2.672454 29
2 37.21239 35.31973 1.373789 1 1 1_1 33.05887 28.86027 2.173177 23
3 57.28534 27.02601 3.247130 2 1 2_1 74.71214 24.99131 3.071629 23
4 90.82078 99.26841 1.689866 2 2 2_2 77.05622 77.91031 3.007859 25
5 20.16819 63.34933 1.138314 1 2 1_2 26.66662 68.56461 2.672454 29
6 89.83897 21.32081 3.258310 2 1 2_1 74.71214 24.99131 3.071629 23
Plot without hard-coding any bin number / bin width:
ggplot(dat,
aes(x, y, fill = bin)) +
geom_bin2d(binwidth = c(diff(x.range)[1],
diff(y.range)[1])) +
geom_point(aes(x = x.mean, y = y.mean)) +
geom_spoke(aes(x = x.mean, y = y.mean, angle = angle.mean, radius = n/2),
arrow=arrow(length = unit(0.2,"cm"))) +
coord_equal()
Other details such as the choice of fill palette, legend label, plot title, etc can be tweaked subsequently.
Packing the 2D data into same size bins
Assumptions
the source is an array of some dimensions, meaning we have a datum for each cell; this has not been tested on sparse matrices, though I don't see why it would not work in theory; and
when you say bins of the same size, I'm assuming you will accept "approximate" when the dimensions of the original matrix do not divide evenly.
Some perks of this implementation, as obscure or difficult-to-read as it may be:
I think it will work with arbitrary dimensionality of the source array; though I've tested it with some cases, I am not 100% confident that it is fool-proof; and
it will allow arbitrary summary functions, defaulting to
mean
; this might be useful when reducing a matrix but still desiring metrics of its central tendency (e.g.,mean
,median
) and dispersion (e.g.,sd
,var
,range
,IQR
). Obviously any number of functions can be used here, so long as they accept as their first argument an array of arbitrary dimensionality and nothing else.
The Code
reduceMatrix <- function(x, newdim, func = mean) {
if (length(newdim) == 1)
newdim <- rep(newdim, length(dim(x)))
if (length(dim(x)) != length(newdim))
stop('newdim must be of length 1 or the same length as dimensions of x')
allCuts <- lapply(1:length(newdim), function(d) {
tmp <- round(sapply(dim(x)[d], function(y) seq(1, 1 + y, len = 1 + newdim[d])),
digits = 0)
mapply(seq, head(tmp, n = -1), tmp[-1] - 1, SIMPLIFY = FALSE)
})
newIndices <- lapply(newdim, function(d) seq(1, d))
eg <- do.call(expand.grid, newIndices)
f <- function(m, cuts, ...) func(do.call(`[`, c(list(m), mapply(`[`, cuts, list(...)))))
ret <- do.call(mapply, c(list(FUN=f), list(list(x)), list(list(allCuts)), eg))
array(ret, dim = newdim)
}
Simple 2D Matrix
dim2 <- c(8,8)
mtx2 <- array(1:prod(dim2), dim = dim2)
mtx2
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## [1,] 1 9 17 25 33 41 49 57
## [2,] 2 10 18 26 34 42 50 58
## [3,] 3 11 19 27 35 43 51 59
## [4,] 4 12 20 28 36 44 52 60
## [5,] 5 13 21 29 37 45 53 61
## [6,] 6 14 22 30 38 46 54 62
## [7,] 7 15 23 31 39 47 55 63
## [8,] 8 16 24 32 40 48 56 64
reduceMatrix(mtx2, newdim = 2)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
matrix(c(mean(mtx2[1:4,1:4]), mean(mtx2[1:4,5:8]),
mean(mtx2[5:8,1:4]), mean(mtx2[5:8,5:8])),
nrow = 2, byrow = TRUE)
## [,1] [,2]
## [1,] 14.5 46.5
## [2,] 18.5 50.5
A Third Dimension
dim3 <- c(4,4,16)
mtx3 <- array(1:prod(dim3), dim = dim3)
mtx3[,,1:2]
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 17 21 25 29
## [2,] 18 22 26 30
## [3,] 19 23 27 31
## [4,] 20 24 28 32
reduceMatrix(mtx3, newdim = c(2, 4, 2))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 57.5 61.5 65.5 69.5
## [2,] 59.5 63.5 67.5 71.5
## , , 2
## [,1] [,2] [,3] [,4]
## [1,] 185.5 189.5 193.5 197.5
## [2,] 187.5 191.5 195.5 199.5
mean(mtx3[1:2, 4, 1:8])
## [1] 69.5
This can even reduce the depth from a 3D to a 2D (in this case):
reduceMatrix(mtx3, newdim = c(2, 4, 1))
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 121.5 125.5 129.5 133.5
## [2,] 123.5 127.5 131.5 135.5
Arbitrary Functions
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = min)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 1 5 9 13
## [2,] 3 7 11 15
reduceMatrix(mtx3, newdim = c(2, 4, 1), func = sd)
## , , 1
## [,1] [,2] [,3] [,4]
## [1,] 74.93825 74.93825 74.93825 74.93825
## [2,] 74.93825 74.93825 74.93825 74.93825
In this case, the results from sd
are not surprising given the data's sequential nature ...
Much Larger Matrix
hugedim <- c(100, 100, 100, 100)
## this may take a few seconds ...
hugemtx <- array(rnorm(prod(hugedim)), dim = hugedim)
reduceMatrix(hugemtx, newdim = c(4, 4, 4, 4))[,,1,1]
## [,1] [,2] [,3] [,4]
## [1,] -0.001348164 -0.0007825642 -0.000369729 -0.0021077956
## [2,] 0.001736173 0.0023445047 -0.001259591 -0.0024143239
## [3,] -0.001820090 -0.0011529691 -0.001941874 -0.0005125225
## [4,] 0.002675196 -0.0009186329 -0.002009339 -0.0009181220
reduceMatrix(hugemtx, newdim = c(4, 4, 2, 1), func = sd)
## , , 1, 1
## [,1] [,2] [,3] [,4]
## [1,] 1.000269 1.000427 1.000454 0.9999282
## [2,] 1.000533 0.999885 1.000669 1.0003171
## [3,] 1.000580 1.000232 1.000167 0.9999460
## [4,] 1.000744 1.000150 1.000510 0.9994936
## , , 2, 1
## [,1] [,2] [,3] [,4]
## [1,] 0.9997954 0.9991735 1.000129 0.9996085
## [2,] 0.9999844 1.0002422 1.000016 0.9999593
## [3,] 1.0002473 1.0001445 1.000639 0.9998599
## [4,] 1.0005914 0.9993628 0.999104 1.0002434
(I realize on re-reading your problem statement for the seventeenth time that your output is not what I have been envisioning. I wonder if this can be adapted ...)
R 2D binning of data frame with secondary complex calculations
Here's another faster way to do it, one that includes unpopulated bin combinations:
fasterWay <- function(df.data) {
a1 <- aggregate(df.data[,3:4], list(x=floor(df.data$x/3), y=floor(df.data$y/3)), sum)
a2 <- aggregate(list(count=rep(NA,nrow(df.data))), list(x=floor(df.data$x/3), y=floor(df.data$y/3)), length)
result <- merge(expand.grid(y=0:3,x=0:3), merge(a1,a2), by=c("x","y"), all=TRUE)
result[is.na(result)] <- 0
result <- result[order(result$y, result$x),]
rownames(result) <- NULL
result
}
It gives me:
x y vx vy count
1 0 0 0 0 1
2 0 1 0 0 0
3 0 2 -1 -1 1
4 0 3 0 0 0
5 1 0 -1 -1 1
6 1 1 0 0 0
7 1 2 0 0 0
8 1 3 -1 0 2
9 2 0 -1 -1 1
10 2 1 0 0 0
11 2 2 -1 1 2
12 2 3 0 0 1
13 3 0 0 0 0
14 3 1 0 0 0
15 3 2 -1 0 1
16 3 3 0 0 0
R generate 2D histogram from raw data
The ggplot
is elegant and fast and pretty, as usual. But if you want to use base graphics (image
, contour
, persp
) and display your actual frequencies (instead of the smoothing 2D kernel), you have to first obtain the binnings yourself and create a matrix of frequencies. Here's some code (not necessarily elegant, but pretty robust) that does 2D binning and generates plots somewhat similar to the ones above:
require(mvtnorm)
xy <- rmvnorm(1000,c(5,10),sigma=rbind(c(3,-2),c(-2,3)))
nbins <- 20
x.bin <- seq(floor(min(xy[,1])), ceiling(max(xy[,1])), length=nbins)
y.bin <- seq(floor(min(xy[,2])), ceiling(max(xy[,2])), length=nbins)
freq <- as.data.frame(table(findInterval(xy[,1], x.bin),findInterval(xy[,2], y.bin)))
freq[,1] <- as.numeric(freq[,1])
freq[,2] <- as.numeric(freq[,2])
freq2D <- diag(nbins)*0
freq2D[cbind(freq[,1], freq[,2])] <- freq[,3]
par(mfrow=c(1,2))
image(x.bin, y.bin, freq2D, col=topo.colors(max(freq2D)))
contour(x.bin, y.bin, freq2D, add=TRUE, col=rgb(1,1,1,.7))
palette(rainbow(max(freq2D)))
cols <- (freq2D[-1,-1] + freq2D[-1,-(nbins-1)] + freq2D[-(nbins-1),-(nbins-1)] + freq2D[-(nbins-1),-1])/4
persp(freq2D, col=cols)
For a really fun time, try making an interactive, zoomable, 3D surface:
require(rgl)
surface3d(x.bin,y.bin,freq2D/10, col="red")
2D irregular aggregation of a matrix
This answer provides a great starting point using tapply
:
b <- melt(a)
bb <- with(b, tapply(value,
list(
y=cut(Var1, breaks=c(0, breaks, Inf), include.lowest=T),
x=cut(Var2, breaks=c(0, breaks, Inf), include.lowest=T)
),
sum)
)
bb
# x
# y [0,12] (12,14] (14,25] (25,60] (60,71] (71,89] (89,Inf]
# [0,12] 297 48 260 825 242 416 246
# (12,14] 48 3 43 141 46 59 33
# (14,25] 260 43 261 794 250 369 240
# (25,60] 825 141 794 2545 730 1303 778
# (60,71] 242 46 250 730 193 394 225
# (71,89] 416 59 369 1303 394 597 369
# (89,Inf] 246 33 240 778 225 369 230
These can then be plotted as rectangular bins using a base plot and rect
— i.e.:
library("reshape2")
library("magrittr")
bsq <- melt(bb)
# convert range notation to numerics
getNum <- . %>%
# rm brackets
gsub("\\[|\\(|\\]|\\)", "", .) %>%
# split digits and convert
strsplit(",") %>%
unlist %>% as.numeric
y <- t(sapply(bsq[,1], getNum))
x <- t(sapply(bsq[,2], getNum))
# normalise bin intensity by area
bsq$size <- (y[,2] - y[,1]) * (x[,2] - x[,1])
bsq$norm <- bsq$value / bsq$size
# draw rectangles on top of empty plot
plot(1:100, 1:100, type="n", frame=F, axes=F)
rect(ybottom=y[,1], ytop=y[,2],
xleft=x[,1], xright=x[,2],
col=rgb(colorRamp(c("white", "steelblue4"))(bsq$norm / max(bsq$norm)),
alpha=255*(bsq$norm / max(bsq$norm)), max=255),
border="white")
Binning values in R with multiple files
There are several ways to do it, I'll provide one method using base functions. (An alternative would be to use dplyr
, also well suited for this. However, the base example should be simple enough.)
Generate Data
(This is here merely because we don't have any of your data.)
n <- 10
for (ii in 1:3) {
dat <- runif(n)
writeLines(paste(dat, collapse = ','),
con = sprintf('user2062207-file%s.txt', ii))
}
readLines('user2062207-file1.txt')
## [1] "0.929472318384796,0.921938128070906,0.707776406314224,0.236701443558559,0.271322417538613,0.388766387710348,0.422867075540125,0.324589917669073,0.92406965768896,0.171326051233336"
Read the Data
This is where you'll start, assuming you have a simple pattern for finding the files.
fnames <- list.files(pattern = 'user2062207-file.*.txt')
allData <- unlist(sapply(fnames, read.table, sep = ','))
allRange <- range(allData)
df <- data.frame(x = allData)
head(df)
## x
## 1 0.9294723
## 2 0.9219381
## 3 0.7077764
## 4 0.2367014
## 5 0.2713224
## 6 0.3887664
dim(df)
## [1] 30 1
Set the Bins
The {floor,ceiling} +/- binSize
below is because the bins include only one side of the range (default: right side), so the minimum value(s) will not be binned. It also ensures the bins are on rounded boundaries.
binSize <- 0.05
allBins <- seq(floor(allRange[1] / binSize) * binSize,
ceiling(allRange[2] / binSize) * binSize,
by = binSize)
## bin the data
df$bin <- cut(df$x, breaks = allBins)
head(df)
## x bin
## 1 0.9294723 (0.9,0.95]
## 2 0.9219381 (0.9,0.95]
## 3 0.7077764 (0.7,0.75]
## 4 0.2367014 (0.2,0.25]
## 5 0.2713224 (0.25,0.3]
## 6 0.3887664 (0.35,0.4]
Statistics on Each Bin
sapply(levels(df$bin), function(lvl) median(df$x[df$bin == lvl], na.rm = TRUE))
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3] (0.3,0.35]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242 0.33626971
## (0.35,0.4] (0.4,0.45] (0.45,0.5] (0.5,0.55] (0.55,0.6] (0.6,0.65] (0.65,0.7]
## 0.38009637 0.42184059 NA 0.53826028 0.57820253 0.64165116 0.67825992
## (0.7,0.75] (0.75,0.8] (0.8,0.85] (0.85,0.9] (0.9,0.95] (0.95,1]
## 0.74243926 NA 0.80759621 0.88974267 0.92406966 0.95691077
This is an area where numerous other options could be advantageous. For instance, the base function by
can work, though dealing with its data structure is not always intuitive even if the function call itself is easy to read:
head(by(df$x, df$bin, median, na.rm = TRUE))
## df$bin
## (0,0.05] (0.05,0.1] (0.1,0.15] (0.15,0.2] (0.2,0.25] (0.25,0.3]
## 0.03802277 NA 0.11528715 0.18195392 0.22918094 0.27132242
You could also use dplyr
with ease. This example starts with the original allData
and allBins
:
library(dplyr)
data.frame(x = allData) %>%
mutate(bin = cut(x, breaks = allBins)) %>%
group_by(bin) %>%
summarise(median(x))
## Source: local data frame [17 x 2]
## bin median(x)
## 1 (0,0.05] 0.03802277
## 2 (0.1,0.15] 0.11528715
## 3 (0.15,0.2] 0.18195392
## 4 (0.2,0.25] 0.22918094
## 5 (0.25,0.3] 0.27132242
#### ..snip..
The first example preserves empty bins whereas the other methods are not aware of empty bins. There are possibly other ways of using by
and dplyr
that would include these empty bins, but this seems to suffice.
EDIT
After a bit of chat, it was determined that the range of the data was too wide with a bin width of 0.0005. A better solution was devised. (No sample data to provide, sorry, not mine to give ...) I'll use random data to mimic the process:
set.seed(42)
x <- 5e7 * runif(5e5)
library(dplyr)
binSize <- 0.0005
df <- data.frame(dat = sort(x))
df$bin <- floor(df$dat / binSize) * binSize
head(df)
## dat bin
## 1 410.9577 410.9575
## 2 456.6275 456.6270
## 3 552.3674 552.3670
## 4 875.4898 875.4895
## 5 1018.6806 1018.6805
## 6 1102.2436 1102.2435
system.time(results <- df %>% group_by(bin) %>% summarize(med = median(dat)))
## user system elapsed
## 12.08 0.00 12.11
head(results)
## Source: local data frame [6 x 2]
## bin med
## 1 410.9575 410.9577
## 2 456.6270 456.6275
## 3 552.3670 552.3674
## 4 875.4895 875.4898
## 5 1018.6805 1018.6806
## 6 1102.2435 1102.2436
R code to show the actual continuous values stored in each bin?
You already have done everything you need, except wrap it into a data.frame
head(data.frame(Values=x, Bin=binned, Rowname=seq_along(x))[order(binned), ])
# Values Bin Rowname
# 2 -66.88718 [-189,-64.7] 2
# 5 -99.08521 [-189,-64.7] 5
# 8 -95.06063 [-189,-64.7] 8
# 10 -95.04592 [-189,-64.7] 10
# 15 -78.48819 [-189,-64.7] 15
# 28 -78.49396 [-189,-64.7] 28
You don't need a column for rownames though, since data.frame
keep a rowname attribute, ie rownames(yourData)
Related Topics
How to Define the Version of a Package in R Install.Packages
Rmarkdown::Render() in a Loop - Cannot Allocate Vector of Size
How to Put a Complicated Equation into a R Formula
R: Pass a List of Filtering Conditions into a Dataframe
How to Make Join Operations in Dplyr Silent
Parsimonious Way to Add North Arrow and Scale Bar to Ggmap
Making Gsub Only Replace Entire Words
Multiple Groups in Geom_Density() Plot
How to Use a Character as Attribute of a Function
Dplyr: Mutate_At + Coalesce: Dynamic Names of Columns
How to Change the Size of the Strip on Facets in a Ggplot
Gadm-Maps Cross-Country Comparison Graphics
Scraping a Complex HTML Table into a Data.Frame in R