How to calculate the median of a grouped dataset?
Since you already know the formula, it should be easy enough to create a function to do the calculation for you.
Here, I've created a basic function to get you started. The function takes four arguments:
frequencies: A vector of frequencies ("number" in your first example).
intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly trim) to have the function automatically create the required matrix for you.
sep: The separator character in the "intervals" column of your data.frame.
trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
Here's the function (with comments showing how I used your instructions to put it together):
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  # cumulative frequency of the classes before the median class
  # (0 when the median class is the first one)
  cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
  n_2 <- max(cf)/2               # total observations divided by 2
  unname(L + (n_2 - cf2)/f * h)
}
Here's a sample data.frame to work with:
mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800",
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10
Now, we can simply do:
GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
Here's an example of the function in action on some made up data:
set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6
### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231
### ... and the output of median on the original vector
median(x)
# [1] 49.5
By the way, the sample data that you provided appears to contain a mistake in one of its ranges (all were separated by dashes except one, which was separated by a comma). Since strsplit splits on a regular expression by default, you can still use the function like this:
x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294
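For readers working in Python, the label-cleaning step that sep and trim perform above can be sketched roughly as follows. This is only an illustration: parse_intervals and its default patterns are hypothetical names chosen here, not part of the R function, and the sketch assumes non-negative bounds (a "-" separator would clash with negative numbers).

```python
import re

def parse_intervals(labels, sep=r"-|,", trim=r"[\[\]()\s]"):
    """Turn labels such as "(1600-1700]" into (lower, upper) float pairs.

    Hypothetical helper: `sep` and `trim` play the same roles as in the
    R function above, but the defaults here are assumptions.
    """
    bounds = []
    for label in labels:
        cleaned = re.sub(trim, "", label)       # drop brackets and stray spaces
        lower, upper = re.split(sep, cleaned)   # split on either separator
        bounds.append((float(lower), float(upper)))
    return bounds

labels = ["[1500-1600]", "(1600-1700]", " (2000,2100]"]
print(parse_intervals(labels))
# [(1500.0, 1600.0), (1600.0, 1700.0), (2000.0, 2100.0)]
```

Using one regular expression for both separators is what lets the mixed dash and comma labels go through a single call.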
Find median of interval data in Python
If your grouped data is discrete, you can approximate the median of the entire data set by interpolating within the median interval:
median = L + interval * (N / 2 - CF) / F
L = lower limit of the median interval
interval = width of the median interval
N = total number of data points
CF = number of data points below the median interval
F = number of data points in the median interval
# Approximating the median with plain Python and pandas
import pandas as pd
df = pd.DataFrame.from_dict({'low_range':[1,11,21,31,41,51], 'high_range':[10,20,30,40,50,60], 'frequency':[123,350,200,1700,360,60]})
N = df['frequency'].sum()
cumsum = df['frequency'].cumsum()
# the median interval is the first one whose cumulative frequency reaches N/2
index = (cumsum >= N/2).idxmax()
L1 = df['low_range'][index]
# cumulative frequency below the median interval (0 if it is the first interval)
cumsum_before = cumsum[index - 1] if index > 0 else 0
freq_median = df['frequency'][index]
# +1 because the discrete ranges include both end points
width = df['high_range'][index] - df['low_range'][index] + 1
median = L1 + (N/2 - cumsum_before) / freq_median * width
print("L1 = {} , cumsum_before = {}, freq_median = {}, width = {}".format(L1, cumsum_before, freq_median, width))
print("Approximated median = ", median)
L1 = 31 , cumsum_before = 673, freq_median = 1700, width = 10
Approximated median = 35.25588235294118
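The same interpolation can also be wrapped in a small function that does not need pandas. The following is a minimal sketch, not a library API: the name grouped_median and its argument layout (parallel lists of lower limits, widths and frequencies) are assumptions made for illustration.

```python
def grouped_median(lowers, widths, freqs):
    """Interpolated median: L + width * (N/2 - CF) / F.

    Hypothetical helper; the three arguments are parallel lists
    describing the classes in ascending order.
    """
    half = sum(freqs) / 2
    cum_before = 0  # CF: number of observations below the current class
    for lower, width, freq in zip(lowers, widths, freqs):
        # the median class is the first one whose cumulative count reaches N/2
        if freq > 0 and cum_before + freq >= half:
            return lower + (half - cum_before) / freq * width
        cum_before += freq
    raise ValueError("frequencies must contain at least one positive count")

# Discrete classes 1-10, 11-20, ... from the example above (width 10 each)
print(grouped_median([1, 11, 21, 31, 41, 51], [10] * 6,
                     [123, 350, 200, 1700, 360, 60]))  # about 35.256
```

Starting the running count at zero means the function also works when the median falls in the very first class.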
If you have continuous data, you can use the median_grouped function from the statistics module.
# Approximating the median with statistics.median_grouped for continuous values and fixed intervals
import statistics as st
import pandas as pd
df = pd.DataFrame.from_dict({'low_range':[1,10,21,31,41,51], 'high_range':[10,21,31,41,51,60], 'frequency':[123,350,200,1700,360,60]})
X = ((df['low_range'] + df['high_range'])/2).tolist()
f = df['frequency'].tolist()
# repeating values based on their frequencies
Y = [item for i,item in enumerate(X)
for count in range(f[i])]
width = df['high_range'][0] - df['low_range'][0] + 1
median = st.median_grouped(Y, width)
print("Approximated median = ", median)
Approximated median = 35.25588235294118
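As a cross-check, median_grouped can be applied to the salary table from the first answer on this page (classes 1500-1600 up to 2400-2500, all of width 100). The sketch below assumes every class has the same width, since median_grouped takes a single interval argument, and represents each class by its midpoint.

```python
import statistics
from itertools import chain, repeat

# Class midpoints and counts taken from the salary example
midpoints = [1550, 1650, 1750, 1850, 1950, 2050, 2150, 2250, 2350, 2450]
counts = [110, 180, 320, 460, 850, 250, 130, 70, 20, 10]

# Repeat each midpoint as many times as its class was observed
expanded = list(chain.from_iterable(repeat(m, c) for m, c in zip(midpoints, counts)))

result = statistics.median_grouped(expanded, interval=100)
print(round(result, 3))  # 1915.294
```

The rounded value agrees with the GroupedMedian output shown for the same data.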
Access VBA: Calculating Median on data using GROUP BY on two columns
Consider an extension of @Fionnuala's great answer to calculate median in MS Access by accommodating an open-ended number of grouping variables.
VBA (save below in a standard module of Access project)
The code builds a dynamic SQL string, opens a DAO recordset from it, and calculates the median from the sorted records. Groupings with 0-2 records and null grouping values need special handling.
Public Function MedianVBA(ParamArray Arr() As Variant) As Double
    On Error GoTo ErrHandle
    Dim N As Long
    Dim tblName As String, numCol As String, grpVals As String
    Dim strSQL As String
    Dim db As DAO.Database, rs As DAO.Recordset
    Dim varMedian As Double, fMedian As Double

    'BUILD DYNAMIC SQL
    tblName = Arr(0)
    numCol = Arr(1)
    grpVals = " WHERE " & numCol & " IS NOT NULL "
    For N = 2 To UBound(Arr) Step 2
        If Arr(N + 1) = "" Or IsNull(Arr(N + 1)) Then
            grpVals = grpVals & " AND " & Arr(N) & " IS NULL"
        ElseIf IsDate(Arr(N + 1)) Then
            grpVals = grpVals & " AND " & Arr(N) & " = #" & Arr(N + 1) & "#"
        Else
            grpVals = grpVals & " AND CStr(" & Arr(N) & ") = '" & Arr(N + 1) & "'"
        End If
    Next N
    strSQL = "SELECT " & numCol _
           & " FROM " & tblName _
           & grpVals _
           & " ORDER BY " & numCol

    'CALCULATE MEDIAN
    Set db = CurrentDb
    Set rs = db.OpenRecordset(strSQL, dbOpenDynaset)
    'RecordCount is only reliable once the last record has been visited
    If Not rs.EOF Then
        rs.MoveLast
        rs.MoveFirst
    End If

    If rs.RecordCount = 0 Then
        MedianVBA = fMedian
        GoTo ExitHandle
    ElseIf rs.RecordCount = 1 Then
        MedianVBA = rs.Fields(numCol)
        GoTo ExitHandle
    End If

    If rs.RecordCount Mod 2 = 0 Then
        'Even count: average the two middle records
        rs.Move rs.RecordCount \ 2 - 1
        varMedian = rs.Fields(numCol)
        If rs.RecordCount = 2 Then
            rs.MoveLast
        Else
            rs.MoveNext
        End If
        fMedian = (varMedian + rs.Fields(numCol)) / 2
    Else
        'Odd count: integer division lands on the middle record
        'without depending on how a fractional offset is rounded
        rs.Move rs.RecordCount \ 2
        fMedian = rs.Fields(numCol)
    End If
    rs.Close
    MedianVBA = fMedian

ExitHandle:
    Set rs = Nothing: Set db = Nothing
    Exit Function
ErrHandle:
    MsgBox Err.Number & ": " & Err.Description, vbCritical, "RUNTIME ERROR"
    Resume ExitHandle
End Function
Note that the VBA function above uses a ParamArray: the first argument is the source table, the second argument is the numeric column, and the remaining arguments are an open-ended list of group column name and value pairs. The signature of a call is as follows:
=MedianVBA("table_name",
"numeric_col",
"group1_column", "group1_value",
"group2_column", "group2_value",
...)
SQL (stored query that calls above VBA function)
The query below runs a one-group and a two-group median calculation.
SELECT t.typeA, t.typeB
, MedianVBA('[myTable]', '[total]', '[typeA]', t.typeA) AS MedianGrp1
, MedianVBA('[myTable]', '[total]', '[typeA]', t.typeA, '[typeB]', t.typeB) AS MedianGrp2
FROM myTable t
GROUP BY t.typeA, t.typeB
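To make clear what the two median columns are expected to contain, here is a rough Python sketch of the same grouping logic using only the standard library. The rows are made-up illustration data; only the column names typeA, typeB and total come from the query above.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows of (typeA, typeB, total); None marks a missing total
rows = [
    ("a", "x", 10), ("a", "x", 30), ("a", "y", 20),
    ("b", "x", 5), ("b", "x", None), ("b", "x", 15), ("b", "x", 25),
]

by_a = defaultdict(list)    # one-group median: keyed by typeA
by_ab = defaultdict(list)   # two-group median: keyed by (typeA, typeB)
for type_a, type_b, total in rows:
    if total is None:       # mirrors the IS NOT NULL filter in the SQL string
        continue
    by_a[type_a].append(total)
    by_ab[(type_a, type_b)].append(total)

median_grp1 = {key: median(values) for key, values in by_a.items()}
median_grp2 = {key: median(values) for key, values in by_ab.items()}
print(median_grp1)  # {'a': 20, 'b': 15}
print(median_grp2)  # {('a', 'x'): 20.0, ('a', 'y'): 20, ('b', 'x'): 15}
```

A small table like this is a convenient way to sanity-check the Access results by hand for groups with one, two, and three records.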
How to get median with frequency table in R?
EDIT:
Here's how you calculate the mean patient age by hospital:
df %>%
group_by(hospital) %>%
summarise(
mean_age = sum(patient_age*number_patients)/sum(number_patients)
)
or simply:
df %>%
group_by(hospital) %>%
summarise(
mean_age = mean(rep(patient_age,number_patients))
)
Here are the medians:
df %>%
group_by(hospital) %>%
summarise(
median_age = sort(rep(patient_age,number_patients))[ceiling(sum(number_patients)/2)]
)
Here, we subset sort(rep(patient_age,number_patients)) on its middle position, ceiling(sum(number_patients)/2). For an even number of patients this picks the lower of the two middle values rather than their average.
EDIT 2:
or simply:
df %>%
group_by(hospital) %>%
summarise(
median_age = median(rep(patient_age,number_patients))
)
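When the counts are large, expanding every value with rep can be avoided by walking the cumulative counts instead. The following Python sketch shows one way to do that; freq_median is a hypothetical helper, the rows are made-up data, and only the column names hospital, patient_age and number_patients are taken from the answer above.

```python
from collections import defaultdict

def freq_median(pairs):
    """Median of values given as (value, count) pairs, without expanding them."""
    pairs = sorted((v, c) for v, c in pairs if c > 0)
    n = sum(c for _, c in pairs)
    if n == 0:
        raise ValueError("no observations")
    # 1-based positions of the middle observation(s); equal when n is odd
    lo_pos, hi_pos = (n + 1) // 2, n // 2 + 1
    lo_val = hi_val = None
    seen = 0
    for value, count in pairs:
        seen += count
        if lo_val is None and seen >= lo_pos:
            lo_val = value
        if seen >= hi_pos:
            hi_val = value
            break
    return (lo_val + hi_val) / 2

# Hypothetical rows of (hospital, patient_age, number_patients)
rows = [("A", 30, 2), ("A", 40, 1), ("A", 50, 2), ("B", 20, 2), ("B", 70, 2)]

by_hospital = defaultdict(list)
for hospital, age, count in rows:
    by_hospital[hospital].append((age, count))

medians = {h: freq_median(p) for h, p in by_hospital.items()}
print(medians)  # {'A': 40.0, 'B': 45.0}
```

For an even number of observations this averages the two middle values, which matches what median() does on the expanded vector.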
R - Median of a Frequency distribution, grouped by another variable
We can try with dplyr
library(dplyr)
Clean1 <- Clean[rep(1:nrow(Clean), Clean$Frequency),]
Clean1 %>%
group_by(State) %>%
summarise(Median = median(medicare_average_payment))
Or using data.table
library(data.table)
setDT(Clean)[, .(Median = median(rep(medicare_average_payment, Frequency))) , State]