How to Calculate the Median on Grouped Dataset

How to calculate the median on a grouped dataset?

Since you already know the formula, it should be easy enough to create a function to do the calculation for you.

Here, I've created a basic function to get you started. The function takes four arguments:

  • frequencies: A vector of frequencies ("number" in your first example)
  • intervals: A 2-row matrix with the same number of columns as the length of frequencies, with the first row being the lower class boundary, and the second row being the upper class boundary. Alternatively, "intervals" may be a column in your data.frame, and you may specify sep (and possibly, trim) to have the function automatically create the required matrix for you.
  • sep: The separator character in your "intervals" column in your data.frame.
  • trim: A regular expression of characters that need to be removed before trying to coerce to a numeric matrix. One pattern is built into the function: trim = "cut". This sets the regular expression pattern to remove (, ), [, and ] from the input.
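To make the roles of sep and trim concrete, here is a minimal Python sketch of the same parsing step. The name parse_intervals and its default patterns are hypothetical, chosen only for illustration; the sketch also assumes non-negative bounds, since a "-" separator would clash with a minus sign.

```python
import re

def parse_intervals(labels, sep=r"-|,", trim=r"[\[\]()\s]"):
    """Turn labels such as "(1.9,11.7]" or "1500-1600" into (lower, upper) pairs.

    sep  : regular expression that separates the two bounds (assumed default)
    trim : regular expression of characters to delete before conversion
    """
    bounds = []
    for label in labels:
        cleaned = re.sub(trim, "", label)      # drop brackets and whitespace
        lower, upper = re.split(sep, cleaned)  # split on the separator pattern
        bounds.append((float(lower), float(upper)))
    return bounds

parsed = parse_intervals(["[1500-1600]", " (2000,2100]", "(1.9,11.7]"])
print(parsed)
```

Each label is cleaned first and split second, which is the same order the R function uses (gsub, then strsplit).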

Here's the function (with comments showing how I used your instructions to put it together):

GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }

  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]      # lower class boundary of median class
  h <- diff(intervals[, Midrow]) # size of median class
  f <- frequencies[Midrow]       # frequency of median class
  # cumulative frequency of the classes before the median class
  # (0 when the median class is the first class)
  cf2 <- if (Midrow > 1) cf[Midrow - 1] else 0
  n_2 <- max(cf)/2               # total observations divided by 2

  unname(L + (n_2 - cf2)/f * h)
}

Here's a sample data.frame to work with:

mydf <- structure(list(salary = c("1500-1600", "1600-1700", "1700-1800", 
"1800-1900", "1900-2000", "2000-2100", "2100-2200", "2200-2300",
"2300-2400", "2400-2500"), number = c(110L, 180L, 320L, 460L,
850L, 250L, 130L, 70L, 20L, 10L)), .Names = c("salary", "number"),
class = "data.frame", row.names = c(NA, -10L))
mydf
# salary number
# 1 1500-1600 110
# 2 1600-1700 180
# 3 1700-1800 320
# 4 1800-1900 460
# 5 1900-2000 850
# 6 2000-2100 250
# 7 2100-2200 130
# 8 2200-2300 70
# 9 2300-2400 20
# 10 2400-2500 10

Now, we can simply do:

GroupedMedian(mydf$number, mydf$salary, sep = "-")
# [1] 1915.294
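For readers who want to check this result outside R, the calculation can be sketched in a few lines of Python. This is only an illustration of the same interpolation formula; the function name grouped_median and the list-based inputs are assumptions, not part of the R answer.

```python
from bisect import bisect_right
from itertools import accumulate

def grouped_median(lowers, uppers, freqs):
    """Interpolated median: L + (n/2 - cf) / f * h."""
    cum = list(accumulate(freqs))          # cumulative frequencies
    half = cum[-1] / 2                     # n / 2
    k = bisect_right(cum, half)            # first class whose cumulative count exceeds n/2
    below = cum[k - 1] if k > 0 else 0     # cumulative frequency before the median class
    width = uppers[k] - lowers[k]          # size of the median class
    return lowers[k] + (half - below) / freqs[k] * width

# Same salary table as mydf above
lowers = list(range(1500, 2500, 100))
uppers = [low + 100 for low in lowers]
freqs = [110, 180, 320, 460, 850, 250, 130, 70, 20, 10]

result = grouped_median(lowers, uppers, freqs)
print(round(result, 3))
```

Here the median class is 1900-2000 (cumulative frequency 1070 before it, 850 observations in it), so the value is 1900 + (1200 - 1070) / 850 * 100.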

Here's an example of the function in action on some made-up data:

set.seed(1)
x <- sample(100, 100, replace = TRUE)
y <- data.frame(table(cut(x, 10)))
y
# Var1 Freq
# 1 (1.9,11.7] 8
# 2 (11.7,21.5] 8
# 3 (21.5,31.4] 8
# 4 (31.4,41.2] 15
# 5 (41.2,51] 13
# 6 (51,60.8] 5
# 7 (60.8,70.6] 11
# 8 (70.6,80.5] 15
# 9 (80.5,90.3] 11
# 10 (90.3,100] 6

### Here's GroupedMedian's output on the grouped data.frame...
GroupedMedian(y$Freq, y$Var1, sep = ",", trim = "cut")
# [1] 49.49231

### ... and the output of median on the original vector
median(x)
# [1] 49.5

By the way, I think there was a mistake in one of the ranges in the sample data you provided: all were separated by dashes except one, which was separated by a comma. Since strsplit treats its split argument as a regular expression by default, you can handle both separators at once and use the function like this:

x<-c(110,180,320,460,850,250,130,70,20,10)
colnames<-c("numbers")
rownames<-c("[1500-1600]","(1600-1700]","(1700-1800]","(1800-1900]",
"(1900-2000]"," (2000,2100]","(2100-2200]","(2200-2300]",
"(2300-2400]","(2400-2500]")
y<-matrix(x,nrow=length(x),dimnames=list(rownames,colnames))
GroupedMedian(y[, "numbers"], rownames(y), sep="-|,", trim="cut")
# [1] 1915.294

Find median of interval data in python

To approximate the median of discrete grouped data, interpolate within the median interval using the formula:

median = L + interval * (N / 2 - CF) / F

L = lower limit of the median interval

interval = width of the median interval

N = total number of data points

CF = number of data points below the median interval

F = number of data points in the median interval

# Approximating the median with pure Python and pandas functions
import pandas as pd

df = pd.DataFrame.from_dict({'low_range':[1,11,21,31,41,51], 'high_range':[10,20,30,40,50,60], 'frequency':[123,350,200,1700,360,60]})

N = df['frequency'].sum()
cumsum = df['frequency'].cumsum()
# the median interval is the first one whose cumulative frequency reaches N/2
index = (cumsum >= N/2).idxmax()
L1 = df['low_range'][index]
# cumulative frequency below the median interval (0 if it is the first interval)
cumsum_before = cumsum[index - 1] if index > 0 else 0
freq_median = df['frequency'][index]
# discrete classes such as 31-40 cover 10 values, hence the + 1
width = df['high_range'][index] - df['low_range'][index] + 1

median = L1 + (N/2 - cumsum_before) / freq_median * width

print("L1 = {} , cumsum_before = {}, freq_median = {}, width = {}".format(L1, cumsum_before, freq_median, width))
print("Approximated median = ", median)

L1 = 31 , cumsum_before = 673, freq_median = 1700, width = 10

Approximated median = 35.25588235294118

If you have continuous data, you can use the median_grouped function from the standard-library statistics module.

# Approximating the median with statistics.median_grouped for continuous values and fixed intervals

import statistics as st
import pandas as pd

df = pd.DataFrame.from_dict({'low_range':[1,10,21,31,41,51], 'high_range':[10,21,31,41,51,60], 'frequency':[123,350,200,1700,360,60]})

X = ((df['low_range'] + df['high_range'])/2).tolist()

f = df['frequency'].tolist()

# repeating values based on their frequencies
Y = [item for i,item in enumerate(X)
for count in range(f[i])]

# median_grouped assumes one fixed class width for every interval (this evaluates to 10 here)
width = df['high_range'][0] - df['low_range'][0] + 1

median = st.median_grouped(Y, width)
print("Approximated median = ", median)

Approximated median = 35.25588235294118
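Both approaches give the same number because median_grouped applies the same interpolation formula to the class midpoints. The following standard-library sketch checks that on the frequencies used above; the contiguous class boundaries (1-11, 11-21, ...) are an assumption made so that every class has width 10.

```python
import statistics

# Hypothetical contiguous classes of width 10: [1, 11), [11, 21), ...
lows = [1, 11, 21, 31, 41, 51]
width = 10
freqs = [123, 350, 200, 1700, 360, 60]

# Interpolation formula applied by hand
n = sum(freqs)
running = 0                                # data points below the current class
for low, freq in zip(lows, freqs):
    if running + freq >= n / 2:            # first class whose cumulative count reaches n/2
        by_formula = low + width * (n / 2 - running) / freq
        break
    running += freq

# Same data expanded to class midpoints and passed to median_grouped
midpoints = [low + width / 2 for low in lows]
expanded = [m for m, freq in zip(midpoints, freqs) for _ in range(freq)]
by_library = statistics.median_grouped(expanded, interval=width)

print(by_formula, by_library)
```

Expanding every class to its midpoint is wasteful for large tables, so the hand-written formula is usually the better choice when only the frequency table is available.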

Access VBA: Calculating Median on data using GROUP BY on two columns

Consider an extension of @Fionnuala's great answer to calculate median in MS Access by accommodating an open-ended number of grouping variables.

VBA (save below in a standard module of Access project)

The code builds a dynamic SQL string, opens it as a DAO recordset, and computes the median from the sorted records. Groupings with very few records and null grouping values need special handling.

Public Function MedianVBA(ParamArray Arr() As Variant) As Double
On Error GoTo ErrHandle
    Dim N As Long, recCount As Long
    Dim tblName As String, numCol As String, grpVals As String
    Dim strSQL As String
    Dim db As DAO.Database, rs As DAO.Recordset
    Dim fMedian As Double

    'BUILD DYNAMIC SQL
    tblName = Arr(0)
    numCol = Arr(1)
    grpVals = " WHERE " & numCol & " IS NOT NULL "

    For N = 2 To UBound(Arr) Step 2
        If Arr(N + 1) = "" Or IsNull(Arr(N + 1)) Then
            grpVals = grpVals & " AND " & Arr(N) & " IS NULL"
        ElseIf IsDate(Arr(N + 1)) Then
            grpVals = grpVals & " AND " & Arr(N) & " = #" & Arr(N + 1) & "#"
        Else
            grpVals = grpVals & " AND CStr(" & Arr(N) & ") = '" & Arr(N + 1) & "'"
        End If
    Next N

    strSQL = "SELECT " & numCol _
           & " FROM " & tblName _
           & grpVals _
           & " ORDER BY " & numCol

    'CALCULATE MEDIAN
    Set db = CurrentDb
    Set rs = db.OpenRecordset(strSQL, dbOpenDynaset)

    'A dynaset only reports an accurate RecordCount after MoveLast
    If Not (rs.BOF And rs.EOF) Then
        rs.MoveLast
        rs.MoveFirst
    End If
    recCount = rs.RecordCount

    If recCount = 0 Then
        'Empty grouping: return the default value of 0
        MedianVBA = fMedian
        GoTo ExitHandle
    End If

    'Move to the middle record (the lower of the two middle records for an even count)
    rs.Move (recCount - 1) \ 2
    fMedian = rs.Fields(numCol)

    If recCount Mod 2 = 0 Then
        'Even count: average the two middle records
        rs.MoveNext
        fMedian = (fMedian + rs.Fields(numCol)) / 2
    End If

    rs.Close
    MedianVBA = fMedian

ExitHandle:
    Set rs = Nothing: Set db = Nothing
    Exit Function

ErrHandle:
    MsgBox Err.Number & ": " & Err.Description, vbCritical, "RUNTIME ERROR"
    Resume ExitHandle
End Function

Note that the VBA function above uses a ParamArray: the first argument is the source table, the second argument is the numeric column, and the remaining arguments are an open-ended list of group column name and value pairs. The call signature is as follows:

=MedianVBA("table_name",
"numeric_col",
"group1_column", "group1_value",
"group2_column", "group2_value",
...)

SQL (stored query that calls above VBA function)

The query below runs a one-group and a two-group median calculation.

SELECT t.typeA, t.typeB
, MedianVBA('[myTable]', '[total]', '[typeA]', t.typeA) AS MedianGrp1
, MedianVBA('[myTable]', '[total]', '[typeA]', t.typeA, '[typeB]', t.typeB) AS MedianGrp2
FROM myTable t
GROUP BY t.typeA, t.typeB
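As a cross-check of what the query is expected to return, the same idea (a median per combination of an open-ended list of grouping columns) can be sketched in Python. The function group_medians and the sample rows are hypothetical and only mirror the typeA, typeB, and total columns used above.

```python
from collections import defaultdict
from statistics import median

def group_medians(rows, value_col, group_cols):
    """Median of value_col for every distinct combination of group_cols."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_col]
        if value is None:                  # mirrors the IS NOT NULL filter
            continue
        key = tuple(row[col] for col in group_cols)
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}

# Made-up rows for illustration only
rows = [
    {"typeA": "x", "typeB": "p", "total": 10},
    {"typeA": "x", "typeB": "p", "total": 30},
    {"typeA": "x", "typeB": "q", "total": 50},
    {"typeA": "y", "typeB": "p", "total": 7},
    {"typeA": "y", "typeB": "p", "total": None},
]

by_a = group_medians(rows, "total", ["typeA"])            # one-group median
by_ab = group_medians(rows, "total", ["typeA", "typeB"])  # two-group median
print(by_a)
print(by_ab)
```

The group key is a tuple, so adding a third grouping column only means passing a longer list.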

How to get median with frequency table in R?

EDIT:

Here's how you calculate the mean patient age by hospital:

library(dplyr)

df %>%
  group_by(hospital) %>%
  summarise(
    mean_age = sum(patient_age*number_patients)/sum(number_patients)
  )

or simply:

df %>%
  group_by(hospital) %>%
  summarise(
    mean_age = mean(rep(patient_age,number_patients))
  )

Here are the medians:

df %>%
  group_by(hospital) %>%
  summarise(
    median_age = sort(rep(patient_age,number_patients))[ceiling(sum(number_patients)/2)]
  )

Here, we subset sort(rep(patient_age,number_patients)) on its middle position, which is ceiling(sum(number_patients)/2). For an even number of patients this picks the lower of the two middle values rather than their average.

EDIT 2:

or simply:

df %>%
  group_by(hospital) %>%
  summarise(
    median_age = median(rep(patient_age,number_patients))
  )
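Expanding the table with rep is fine for small counts, but the median can also be read directly from the cumulative frequencies. Below is a minimal Python sketch of that idea for a single frequency table (one hospital, say); the function name freq_median and the sample ages are made up for illustration.

```python
from statistics import median

def freq_median(values, counts):
    """Median of a frequency table without expanding it."""
    n = sum(counts)
    # 1-based positions of the middle observation(s); equal when n is odd
    lo_pos, hi_pos = (n + 1) // 2, n // 2 + 1
    lo_val = hi_val = None
    running = 0
    for value, count in sorted(zip(values, counts)):
        running += count                   # cumulative frequency so far
        if lo_val is None and running >= lo_pos:
            lo_val = value
        if running >= hi_pos:
            hi_val = value
            break
    return (lo_val + hi_val) / 2

ages = [30, 40, 50]
patients = [2, 1, 3]

direct = freq_median(ages, patients)
# Same result via explicit expansion, the Python analogue of rep()
expanded = [age for age, count in zip(ages, patients) for _ in range(count)]
print(direct, median(expanded))
```

Applying this per hospital gives the same numbers as median(rep(patient_age, number_patients)) while keeping memory use proportional to the number of table rows.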

R - Median of a Frequency distribution, grouped by another variable

We can try with dplyr

library(dplyr)
Clean1 <- Clean[rep(1:nrow(Clean), Clean$Frequency),]
Clean1 %>%
  group_by(State) %>%
  summarise(Median = median(medicare_average_payment))

Or using data.table

library(data.table)
setDT(Clean)[, .(Median = median(rep(medicare_average_payment, Frequency))) , State]

