Count Occurrences of Value in a Set of Variables in R (Per Row)

Try

apply(df,MARGIN=1,table)

Here df is your data.frame. The call returns a list with one element per row of the data.frame, in the same order. Each element is a table whose names are the distinct values found in that row and whose entries are the number of times each value occurs. (If every row happens to have the same number of distinct values, apply simplifies the result to a matrix instead of a list.)

For instance:

df <- data.frame(V1 = c(10, 20, 10, 20), V2 = c(20, 30, 20, 30), V3 = c(20, 10, 20, 10)) # create a data.frame containing some data
df # show the data.frame
  V1 V2 V3
1 10 20 20
2 20 30 10
3 10 20 20
4 20 30 10
apply(df, MARGIN = 1, table) # apply the function table on each row (MARGIN = 1)
[[1]]

10 20
 1  2

[[2]]

10 20 30
 1  1  1

[[3]]

10 20
 1  2

[[4]]

10 20 30
 1  1  1

#desired result
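
One caveat is worth checking on your own data: apply() only returns a list when the per-row tables differ in length. The following is a minimal sketch, using a small made-up data frame (df2 is a hypothetical name, not part of the answer above), of how the result collapses to a matrix and how lapply() over the row indices can be used to always get a list.

```r
# Hypothetical data: every row has exactly two distinct values
df2 <- data.frame(V1 = c(10, 20), V2 = c(20, 30), V3 = c(20, 30))

# apply() simplifies equal-length results into a matrix
res_apply <- apply(df2, MARGIN = 1, table)
is.matrix(res_apply)

# lapply() over the row indices always returns a list of tables
res_list <- lapply(seq_len(nrow(df2)), function(i) table(unlist(df2[i, ])))
res_list[[1]]
```

With the list form, the count of a given value in a row can be looked up by name, for example res_list[[1]][["20"]].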

Count occurrences of value in a set of variables in R (per row) - with weights

One option is to apply table() to each row to find how often each value occurs, and then look up that frequency for the value in each column. The frequencies are multiplied by the per-column weights defined in V, and the value in the column with the largest freq*V product is the desired value for that row.

# Multiplier (weight) for the occurrence count in each column
V <- c(0.25, 0.25, 0.5)

# data frame
df8 <- data.frame(V1 = c(10, 20, 10, 20), V2 = c(20, 30, 20, 30), V3 = c(20, 10, 20, 10))

# This function receives all column values of one row. It finds the frequency
# of each column's value within the row and multiplies it by V (column-wise).
# Finally, the value at the index with max(freq * V) is returned.

find_max_freq_val <- function(x) {
  freq_df <- as.data.frame(table(x))
  freq_vec <- mapply(function(y) freq_df[freq_df$x == y, "Freq"], x)
  # multiply the frequencies by V and find the index of max(freq * V),
  # then return the item at that index from x
  # (if several columns tie for the maximum, all tied values are returned)
  x[which((freq_vec * V) == max(freq_vec * V))]
}

# call the function above to add a column with the desired value
df8$new_val <- apply(df8, 1, find_max_freq_val)

df8
# V1 V2 V3 new_val
#1 10 20 20 20
#2 20 30 10 10
#3 10 20 20 20
#4 20 30 10 10
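
A different reading of the weighting, offered here only as a sketch and not as the method of the answer above, is to sum the column weights per distinct value and pick the value with the largest total. The function name weighted_mode and the variable names w and dfw are hypothetical; for this particular data set the result happens to agree with new_val.

```r
# Column weights and data as in the example above
w <- c(0.25, 0.25, 0.5)
dfw <- data.frame(V1 = c(10, 20, 10, 20), V2 = c(20, 30, 20, 30), V3 = c(20, 10, 20, 10))

# Hypothetical helper: total weight per distinct value, then the heaviest value
weighted_mode <- function(x, w) {
  totals <- tapply(w, x, sum)                  # sum of weights for each value in the row
  as.numeric(names(totals)[which.max(totals)]) # first value with the largest total
}

new_val_alt <- apply(dfw, 1, weighted_mode, w = w)
new_val_alt
```

Because which.max() returns only the first maximum, this variant always yields one value per row, even when two values tie.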

R count number of variables with value =mq per row

You can use the apply function to count a particular value in each row of your existing data frame df:

df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))

Here the second argument is 1 since you want to count for each row. You can read more about it from https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
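
As a small self-contained sketch (the data frame df_mq and its columns are made up for illustration), the apply() call can be compared with an equivalent rowSums() call. Note that which() silently drops NA comparisons, whereas rowSums() needs na.rm = TRUE to do the same.

```r
# Hypothetical example data with one missing value
df_mq <- data.frame(a = c("mq", "x", "mq"),
                    b = c("mq", NA, "y"),
                    stringsAsFactors = FALSE)

# Row-wise count with apply(), as in the answer above
count_apply <- apply(df_mq, 1, function(x) length(which(x == "mq")))

# Same count with rowSums() on the logical comparison matrix
count_rowsums <- rowSums(df_mq == "mq", na.rm = TRUE)

count_apply
count_rowsums
```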

Count occurrence of string values per row in dataframe in R (dplyr)

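You can use across with rowSums. Here cols is assumed to be a character vector naming the columns to check and bcde the vector of values to count; neither is defined in this excerpt. Passing extra arguments through across() is deprecated as of dplyr 1.1.0, so the check is written as a lambda:

library(dplyr)

df %>% mutate(d9 = rowSums(across(all_of(cols), ~ .x %in% bcde)))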

# d1 d2 d3 d4 d5 d6 d7 d8 d9
# <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b a a a a a a a 0
#2 a a a a c a a a 1
#3 a b a a a a a a 1
#4 a a c a a b a a 2
#5 a a a a a a a a 0
#6 a a b a a a a a 1
#7 a a a a a d a a 1
#8 a a a d a a a a 1

This can also be written in base R:

df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))
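
Since cols and bcde are not shown here, the following minimal sketch fills them in with made-up values (df_str, cols and bcde below are all assumptions for illustration) so that the base R version can be run as is.

```r
# Hypothetical data and lookup vectors
df_str <- data.frame(d1 = c("b", "a", "a"),
                     d2 = c("a", "c", "d"),
                     d3 = c("a", "a", "b"),
                     stringsAsFactors = FALSE)
cols <- c("d2", "d3")          # columns to check (d1 is deliberately left out)
bcde <- c("b", "c", "d", "e")  # values to count

# One logical column per checked variable, then count TRUEs per row
df_str$d9 <- rowSums(sapply(df_str[cols], `%in%`, bcde))
df_str
```

With a single-row data frame, sapply() returns a vector rather than a matrix and rowSums() fails, so that case would need separate handling.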

How to count occurrences of several strings per row in a data frame in R

If I understand correctly, the OP has several lists of agents, each of which can be clustered by purpose, not just one list of beta blockers; the OP also mentions statins, for example. The OP wants to count how many different agents belonging to each cluster are being taken by each subject. The counts for each agent cluster are to be appended to each row.

I suggest computing the sums for all clusters at once rather than doing this manually list by list.

For this, we first need to set up a data frame with the clustering:

cluster
    Purpose              Agent
 1:    BETA         METOPROLOL
 2:    BETA         BISOPROLOL
 3:    BETA            NEBILET
 4:    BETA          METOHEXAL
 5:    BETA            SOTALEX
 6:    BETA             QUERTO
 7:    BETA          NEBIVOLOL
 8:    BETA         CARVEDILOL
 9:    BETA METOPROLOLSUCCINAT
10:    BETA              BELOC
11:  STATIN       ATORVASTATIN
12:  STATIN        SIMVASTATIN
13:  STATIN         LOVASTATIN
14:  STATIN        PRAVASTATIN
15:  STATIN        FLUVASTATIN
16:  STATIN         PITAVASTIN

cluster can be created, e.g., by

library(data.table)
library(magrittr)
cluster <- list(
  BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
           "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN",
             "FLUVASTATIN", "PITAVASTIN")
) %>%
  lapply(data.table) %>%
  rbindlist(idcol = "Purpose") %>%
  setnames("V1", "Agent")

To count the occurrences, we need to join (merge) this table with the list of agents each subject is taking, dat, after dat has been reshaped from wide to long format.

While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection, the database-style long format is often more suitable for data processing.

taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
  Agent != ""][
  , Agent := toupper(Agent)][]
    ID variable           Agent
 1:  1     Med1       AMLODIPIN
 2:  2     Med1          PLAVIX
 3:  3     Med1      BISOPROLOL
 4:  4     Med1             ASS
 5:  5     Med1             ASS
 6:  6     Med1             ASS
 7:  1     Med2        RAMIPRIL
 8:  2     Med2     SIMVASTATIN
 9:  3     Med2       AMLODIPIN
10:  4     Med2       ENALAPRIL
11:  5     Med2    ATORVASTATIN
12:  6     Med2         FRAGMIN
13:  1     Med3      METOPROLOL
14:  2     Med3      MIRTAZAPIN
15:  3     Med3             ASS
16:  4     Med3      L-THYROXIN
17:  5     Med3         FOSAMAX
18:  6     Med3       TORASEMID
19:  3     Med4       VALSARTAN
20:  4     Med4         LITALIR
21:  5     Med4         CALCIUM
22:  6     Med4   SPIRONOLACTON
23:  3     Med5    CHLORALDURAT
24:  4     Med5         LITALIR
25:  5     Med5        PANTOZOL
26:  6     Med5 LORZAAR PROTECT
27:  3     Med6       DOXOZOSIN
28:  4     Med6       AMLODIPIN
29:  5     Med6   NOVAMINSULFON
30:  6     Med6         VESIKUR
31:  3     Med7      TAMSULOSIN
32:  4     Med7       CETIRIZIN
33:  6     Med7       ROCALTROL
34:  3     Med8        CIPRAMIL
35:  4     Med8             HCT
36:  6     Med8    ATORVASTATIN
37:  4     Med9            NACL
38:  6     Med9     PREDNISOLON
39:  4    Med10          CARMEN
40:  6    Med10       LACTULOSE
41:  4    Med11      PROTEIN 88
42:  6    Med11      MIRTAZAPIN
43:  4    Med12        NOVALGIN
44:  6    Med12          LANTUS
45:  6    Med13        ACTRAPID
46:  6    Med14        PANTOZOL
47:  6    Med15      SALBUTAMOL
48:  6    Med16   AMPHO MORONAL
    ID variable           Agent

dat is modified by appending a row number which identifies each subject, then it is reshaped to long format using melt(). Missing or empty entries are removed and agent names are converted to uppercase for consistency.

Edit: In long format, it is also easy to check for duplicate agents per subject:

taken[duplicated(taken, by = c("ID", "Agent"))]
   ID variable   Agent
1:  4     Med5 LITALIR

and remove the duplicates:

taken <- unique(taken, by = c("ID", "Agent"))

The final step creates what I believe is the expected result:

   ID BETA STATIN       Med1         Med2       Med3          Med4            Med5          Med6       Med7         Med8
1:  1    1      0  AMLODIPIN     RAMIPRIL METOPROLOL
2:  2    0      1     PLAVIX  SIMVASTATIN MIRTAZAPIN
3:  3    1      0 BISOPROLOL    AMLODIPIN        ASS     VALSARTAN    CHLORALDURAT     Doxozosin TAMSULOSIN     CIPRAMIL
4:  4    0      0        ASS    ENALAPRIL L-THYROXIN       LITALIR         LITALIR     AMLODIPIN  CETIRIZIN          HCT
5:  5    0      1        ASS ATORVASTATIN    FOSAMAX       CALCIUM        PANTOZOL NOVAMINSULFON
6:  6    0      1        ASS      FRAGMIN  TORASEMID SPIRONOLACTON LORZAAR PROTECT       VESIKUR  ROCALTROL ATORVASTATIN

Please note the additional columns with the counts by cluster. (Due to limited space, not all columns of the result are shown here.) This is created by

cluster[taken, on = .(Agent)][
  , dcast(.SD, ID ~ Purpose, length)][
  dat, on = "ID"][
  , "NA" := NULL][]

using the following operations:

  1. Join cluster and taken to have Purpose appended
  2. Reshape to wide format, one row per subject and one column per purpose, thereby counting the number of occurrences
  3. Join this result with the original data dat
  4. Remove the superfluous column of NA counts
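
For comparison, the same idea can be sketched in base R without data.table. This is an illustrative alternative, not the answer's method, and it uses a deliberately small made-up sample (cluster_small, dat_small and count_cluster are hypothetical names). Agent names are upper-cased and de-duplicated per subject before counting, mirroring the steps above.

```r
# Hypothetical, shortened cluster lists and medication data
cluster_small <- list(BETA = c("METOPROLOL", "BISOPROLOL"),
                      STATIN = c("ATORVASTATIN", "SIMVASTATIN"))
dat_small <- data.frame(Med1 = c("AMLODIPIN", "PLAVIX", "Bisoprolol"),
                        Med2 = c("METOPROLOL", "SIMVASTATIN", "BISOPROLOL"),
                        stringsAsFactors = FALSE)

# For one cluster: number of distinct agents of that cluster per subject (row)
count_cluster <- function(agents) {
  apply(dat_small, 1, function(x) sum(unique(toupper(x)) %in% agents))
}

# One count column per cluster, appended to the original data
counts <- sapply(cluster_small, count_cluster)
result <- cbind(counts, dat_small)
result
```

The third subject lists the same beta blocker twice in different letter case, so it is counted only once.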

Data

dat <- structure(list(Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", 
"ASS", "ASS", "ASS"), Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN",
"ENALAPRIL", "ATORVASTATIN", "FRAGMIN"), Med3 = c("METOPROLOL",
"MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"), Med4 = c("",
"", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"), Med5 = c("",
"", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"),
Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON",
"VESIKUR"), Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "",
"ROCALTROL"), Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"
), Med9 = c("", "", "", "NACL", "", "PREDNISOLON"), Med10 = c("",
"", "", "CARMEN", "", "LACTULOSE"), Med11 = c("", "", "",
"PROTEIN 88", "", "MIRTAZAPIN"), Med12 = c("", "", "", "NOVALGIN",
"", "LANTUS"), Med13 = c("", "", "", "", "", "ACTRAPID"),
Med14 = c("", "", "", "", "", "PANTOZOL"), Med15 = c("",
"", "", "", "", "SALBUTAMOL"), Med16 = c("", "", "", "",
"", "AMPHO MORONAL")), class = "data.frame", row.names = c(NA,
-6L))

Counting number of instances of a condition per row R

You can use rowSums.

df$no_calls <- rowSums(df == "nc")
df
#  rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd      aa      bb      nc         nc        2
#2 efgh      nc      nc      nc         nc        4
#3 ijkl      aa      ab      aa         nc        1

Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach to

df$no_calls <- rowSums(df[-1] == "nc")

Regarding the row names: They are not counted in rowSums and you can make a simple test to demonstrate it:

rownames(df)[1] <- "nc"  # name first row "nc"
rowSums(df == "nc") # compute the row sums
#nc  2  3
# 2  4  1 # still the same in first row

Count occurrences of a variable having two given values corresponding to one value of another variable

The optimal solution in terms of memory would be one row for each pair, i.e. 700*699/2 rows. This problem is still relatively small, though, and the simplicity of manipulating a 700*700 matrix is probably worth more than the 700*701/2 cells you would save, which works out to about 240 kB at one byte per cell. It could be even less if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.

Here's what the code would look like:

First we want to create a dataframe with as many rows and columns as there are materials. Matrices are easier to create so we create one that we convert to a dataframe afterwards.

all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))

(Here, X is your dataset)

We then set the row names and column names to be able to access the rows and columns directly with the identifiers of the materials which are apparently not necessarily numbered from 1 to 700.

colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials

Then we iterate over the dataset

for (order in levels(as.factor(X$Order.number))) {
  # get the materials in this order, as character so they index Pairs by name
  materials_for_order <- as.character(X[X$Order.number == order, "Materials"])
  if (length(materials_for_order) > 1) {
    # find each possible pair from the materials list
    all_pairs_in_order <- combn(x = materials_for_order, m = 2)
    # increment the cell at the row and column corresponding to each pair
    for (i in seq_len(ncol(all_pairs_in_order))) {
      row_id <- all_pairs_in_order[1, i]
      col_id <- all_pairs_in_order[2, i]
      Pairs[row_id, col_id] <- Pairs[row_id, col_id] + 1
    }
  }
}

At the end of the loop, the Pairs table should contain everything you need.
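
As an alternative to the loop, a co-occurrence matrix can also be sketched with crossprod() on an order-by-material incidence matrix. This is a different technique from the answer above and yields a symmetric matrix (each pair appears in both [a, b] and [b, a]). The data frame X_small below is made up for illustration; only the column names Order.number and Materials are taken from the text.

```r
# Hypothetical orders: order 1 has A, B, C; order 2 has A, B; order 3 has C
X_small <- data.frame(Order.number = c(1, 1, 1, 2, 2, 3),
                      Materials = c("A", "B", "C", "A", "B", "C"),
                      stringsAsFactors = FALSE)

# Incidence matrix: 1 if the material appears in the order, else 0
incidence <- 1 * (unclass(table(X_small$Order.number, X_small$Materials)) > 0)

# t(incidence) %*% incidence counts, for each pair, the orders containing both
co_occurrence <- crossprod(incidence)
diag(co_occurrence) <- 0  # the diagonal would hold orders per material; not a pair
co_occurrence
```

Here co_occurrence["A", "B"] is the number of orders that contain both A and B.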


