Count Occurrences of Value in a Set of Variables in R (Per Row) - ITCodar

Count occurrences of value in a set of variables in R (per row)

Try

``apply(df,MARGIN=1,table)``

Here, `df` is your `data.frame`. When the rows yield tables of different lengths, this returns a list with one element per row of the data frame, in the same order. Each element is a table whose names are the values found in that row and whose entries are the number of occurrences. (If every row happens to yield a table of the same length, `apply` simplifies the result to a matrix instead.)

For instance:

```r
df=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10)) #create a data.frame containing some data
df #show the data.frame
#  V1 V2 V3
#1 10 20 20
#2 20 30 10
#3 10 20 20
#4 20 30 10
apply(df,MARGIN=1,table) #apply the function table on each row (MARGIN=1)
#[[1]]
#
#10 20
# 1  2
#
#[[2]]
#
#10 20 30
# 1  1  1
#
#[[3]]
#
#10 20
# 1  2
#
#[[4]]
#
#10 20 30
# 1  1  1
#
#desired result
```
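`apply()` simplifies its result to a matrix whenever every row yields a table of the same length, so it does not always return a list. The following is a minimal sketch of one way around that, using the same example data; `row_tables` and `count_20` are names introduced here for illustration only.

```r
df <- data.frame(V1 = c(10, 20, 10, 20),
                 V2 = c(20, 30, 20, 30),
                 V3 = c(20, 10, 20, 10))

# lapply() never simplifies, so the result is always a list of tables,
# one per row
row_tables <- lapply(seq_len(nrow(df)), function(i) table(unlist(df[i, ])))

# Number of times one particular value (here 20) occurs in each row
count_20 <- vapply(row_tables,
                   function(tb) sum(tb[names(tb) == "20"]),
                   numeric(1))
count_20

# Rows 2 and 4 both contain three distinct values, so apply() returns a matrix
is.matrix(apply(df[c(2, 4), ], 1, table))
```

If only the count of a single value is needed, `rowSums(df == 20)` gives the same numbers without building the tables.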

Count occurrences of value in a set of variables in R (per row) - with weights

One option is to apply the `table` function to each row to find how often each column's value occurs in that row. The weights defined in `V` are then multiplied column-wise with these frequencies, and the index of the column with the largest `freq*V` is located. The row's value at that index is the desired value.

```r
#Multiplier for occurrence in each column
V = c(0.25,0.25,0.5)

#data frame
df8=data.frame(V1=c(10,20,10,20),V2=c(20,30,20,30),V3=c(20,10,20,10))

# This function accepts all columns for a row. Finds frequencies for each
# column values and then multiply with V (column wise)
# Finally value in row at index with max(freq*V) is returned.
find_max_freq_val <- function(x){
  freq_df <- as.data.frame(table(x))
  freq_vec <- mapply(function(y)freq_df[freq_df$x==y,"Freq"], x)
  #multiply with V with freq and find index of max(a*V)
  #Then return item at that index from x
  x[which((freq_vec*V) == max(freq_vec*V))]
}

# call above function to add an column with desired value
df8$new_val <- apply(df8, 1, find_max_freq_val)
df8
#  V1 V2 V3 new_val
#1 10 20 20      20
#2 20 30 10      10
#3 10 20 20      20
#4 20 30 10      10
```
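Because `which(... == max(...))` returns every tied position, a row with two equally weighted winners would yield more than one value. The sketch below is a variation, not the answer's own code: it uses `which.max()`, which keeps only the first maximum. The helper name `weighted_pick` and its `w` argument are hypothetical.

```r
V <- c(0.25, 0.25, 0.5)  # per-column weights, as in the answer above
df8 <- data.frame(V1 = c(10, 20, 10, 20),
                  V2 = c(20, 30, 20, 30),
                  V3 = c(20, 10, 20, 10))

# Hypothetical helper: weight each value's within-row frequency by its column
weighted_pick <- function(x, w) {
  freq <- ave(seq_along(x), x, FUN = length)  # how often x[i] occurs in the row
  unname(x[which.max(freq * w)])              # first position with the top score
}

df8$new_val <- apply(df8, 1, weighted_pick, w = V)
df8$new_val
```

Taking the first maximum is a design choice; if ties should be resolved differently (for example, by preferring the rightmost column), the helper would need to be adapted.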

R count number of variables with value =mq per row

You can use the `apply` function to count a particular value in your existing data frame `df`:

``df$count.MQ <- apply(df, 1, function(x) length(which(x=="mq")))``

Here the second argument is 1 since you want to count for each row. You can read more about it from https://www.rdocumentation.org/packages/base/versions/3.5.1/topics/apply
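The same count can usually be obtained without an explicit per-row function. The following is a small sketch on made-up data (the columns `a`, `b`, `c` are placeholders, not from the question); it relies on `rowSums()` treating `TRUE` as 1.

```r
# Hypothetical example data; column names are placeholders
df <- data.frame(a = c("mq", "x", "mq"),
                 b = c("mq", NA, "y"),
                 c = c("z", "mq", "mq"),
                 stringsAsFactors = FALSE)

# df == "mq" is a logical matrix; rowSums() counts the TRUEs in each row.
# na.rm = TRUE makes missing cells count as "no match".
df$count.MQ <- rowSums(df == "mq", na.rm = TRUE)
df$count.MQ
```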

Count occurrence of string values per row in dataframe in R (dplyr)

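You can use `across` with `rowSums`, where `cols` holds the names of the columns to search and `bcde` the values to count. (From dplyr 1.1.0 on, passing extra arguments through `across()` is deprecated; the equivalent lambda form is `across(all_of(cols), ~ .x %in% bcde)`.)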

```r
library(dplyr)
df %>% mutate(d9 = rowSums(across(all_of(cols), `%in%`, bcde)))

#  d1    d2    d3    d4    d5    d6    d7    d8       d9
#  <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
#1 b     a     a     a     a     a     a     a         0
#2 a     a     a     a     c     a     a     a         1
#3 a     b     a     a     a     a     a     a         1
#4 a     a     c     a     a     b     a     a         2
#5 a     a     a     a     a     a     a     a         0
#6 a     a     b     a     a     a     a     a         1
#7 a     a     a     a     a     d     a     a         1
#8 a     a     a     d     a     a     a     a         1
```

This can also be written in base R -

``df$d9 <- rowSums(sapply(df[cols], `%in%`, bcde))``
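Since the answer does not show how `df`, `cols` and `bcde` are defined, here is a self-contained sketch of the base R version with hypothetical stand-ins for all three. The column `d1` is deliberately left out of `cols` to show that only the listed columns are searched.

```r
# Hypothetical stand-ins for the objects used above
df <- data.frame(d1 = c("b", "a", "a"),
                 d2 = c("a", "c", "a"),
                 d3 = c("d", "a", "a"),
                 stringsAsFactors = FALSE)
cols <- c("d2", "d3")           # columns to search
bcde <- c("b", "c", "d", "e")   # values to count

# One logical column per entry in cols: TRUE where the cell is one of bcde
hits <- sapply(df[cols], `%in%`, bcde)
df$n_hits <- rowSums(hits)
df$n_hits
```

Note that `sapply()` only returns a matrix when the data frame has more than one row; for a single-row data frame the result would be a plain vector and `rowSums()` would fail.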

How to count occurrences of several strings per row in a data frame in R

If I understand correctly, the OP has multiple lists of agents, each clustered for one purpose, not just one list of beta blockers; the OP also mentions statins, for example. The OP wants to count how many different agents belonging to each cluster are being taken by each subject. The counts for each agent cluster are to be appended to each row.

I suggest computing the sums for all clusters at once rather than doing this manually, list by list.

For this, we first need to set up a data frame with the clustering:

``cluster``

```
    Purpose              Agent
 1:    BETA         METOPROLOL
 2:    BETA         BISOPROLOL
 3:    BETA            NEBILET
 4:    BETA          METOHEXAL
 5:    BETA            SOTALEX
 6:    BETA             QUERTO
 7:    BETA          NEBIVOLOL
 8:    BETA         CARVEDILOL
 9:    BETA METOPROLOLSUCCINAT
10:    BETA              BELOC
11:  STATIN       ATORVASTATIN
12:  STATIN        SIMVASTATIN
13:  STATIN         LOVASTATIN
14:  STATIN        PRAVASTATIN
15:  STATIN        FLUVASTATIN
16:  STATIN         PITAVASTIN
```

`cluster` can be created, e.g., by

```r
library(data.table)
library(magrittr)
cluster <- list(
  BETA = c("METOPROLOL", "BISOPROLOL", "NEBILET", "METOHEXAL", "SOTALEX",
           "QUERTO", "NEBIVOLOL", "CARVEDILOL", "METOPROLOLSUCCINAT", "BELOC"),
  STATIN = c("ATORVASTATIN", "SIMVASTATIN", "LOVASTATIN", "PRAVASTATIN",
            "FLUVASTATIN", "PITAVASTIN")
  ) %>%
  lapply(data.table) %>%
  rbindlist(idcol = "Purpose") %>%
  setnames("V1", "Agent")
```

For counting the occurrences, we need to join (merge) this table with `dat`, the list of agents each subject is taking, after `dat` has been reshaped from wide to long format.

While data in spreadsheet-style wide format, i.e., with one row per subject and many columns, are often suitable for data entry and inspection, the database-style long format is often more suitable for data processing.

```r
taken <- melt(setDT(dat)[, ID := .I], "ID", value.name = "Agent", na.rm = TRUE)[
  Agent != ""][
    , Agent := toupper(Agent)][]
```
```
    ID variable           Agent
 1:  1     Med1       AMLODIPIN
 2:  2     Med1          PLAVIX
 3:  3     Med1      BISOPROLOL
 4:  4     Med1             ASS
 5:  5     Med1             ASS
 6:  6     Med1             ASS
 7:  1     Med2        RAMIPRIL
 8:  2     Med2     SIMVASTATIN
 9:  3     Med2       AMLODIPIN
10:  4     Med2       ENALAPRIL
11:  5     Med2    ATORVASTATIN
12:  6     Med2         FRAGMIN
13:  1     Med3      METOPROLOL
14:  2     Med3      MIRTAZAPIN
15:  3     Med3             ASS
16:  4     Med3      L-THYROXIN
17:  5     Med3         FOSAMAX
18:  6     Med3       TORASEMID
19:  3     Med4       VALSARTAN
20:  4     Med4         LITALIR
21:  5     Med4         CALCIUM
22:  6     Med4   SPIRONOLACTON
23:  3     Med5    CHLORALDURAT
24:  4     Med5         LITALIR
25:  5     Med5        PANTOZOL
26:  6     Med5 LORZAAR PROTECT
27:  3     Med6       DOXOZOSIN
28:  4     Med6       AMLODIPIN
29:  5     Med6   NOVAMINSULFON
30:  6     Med6         VESIKUR
31:  3     Med7      TAMSULOSIN
32:  4     Med7       CETIRIZIN
33:  6     Med7       ROCALTROL
34:  3     Med8        CIPRAMIL
35:  4     Med8             HCT
36:  6     Med8    ATORVASTATIN
37:  4     Med9            NACL
38:  6     Med9     PREDNISOLON
39:  4    Med10          CARMEN
40:  6    Med10       LACTULOSE
41:  4    Med11      PROTEIN 88
42:  6    Med11      MIRTAZAPIN
43:  4    Med12        NOVALGIN
44:  6    Med12          LANTUS
45:  6    Med13        ACTRAPID
46:  6    Med14        PANTOZOL
47:  6    Med15      SALBUTAMOL
48:  6    Med16   AMPHO MORONAL
    ID variable           Agent
```

`dat` is modified by appending a row number which identifies each subject, then it is reshaped to long format using `melt()`. Missing or empty entries are removed and agent names are converted to uppercase for consistency.

Edit: In long format, it is also easy to check for duplicate agents per subject:

``taken[duplicated(taken, by = c("ID", "Agent"))]``
```
   ID variable   Agent
1:  4     Med5 LITALIR
```

and remove the duplicates:

``taken <- unique(taken, by = c("ID", "Agent"))``

The final step creates what I believe is the expected result:

```
   ID BETA STATIN       Med1         Med2       Med3          Med4            Med5          Med6       Med7         Med8
1:  1    1      0  AMLODIPIN     RAMIPRIL METOPROLOL
2:  2    0      1     PLAVIX  SIMVASTATIN MIRTAZAPIN
3:  3    1      0 BISOPROLOL    AMLODIPIN        ASS     VALSARTAN    CHLORALDURAT     Doxozosin TAMSULOSIN     CIPRAMIL
4:  4    0      0        ASS    ENALAPRIL L-THYROXIN       LITALIR         LITALIR     AMLODIPIN  CETIRIZIN          HCT
5:  5    0      1        ASS ATORVASTATIN    FOSAMAX       CALCIUM        PANTOZOL NOVAMINSULFON
6:  6    0      1        ASS      FRAGMIN  TORASEMID SPIRONOLACTON LORZAAR PROTECT       VESIKUR  ROCALTROL ATORVASTATIN
```

Please note the additional columns with the counts by cluster (due to limited space, not all columns of the result are shown here). The result is created by

```r
cluster[taken, on = .(Agent)][
  , dcast(.SD, ID ~ Purpose, length)][
    dat, on = "ID"][
      , "NA" := NULL][]
```

using the following operations:

1. Join `cluster` and `taken` to have `Purpose` appended
2. Reshape to wide format, one row per subject and one column per purpose, thereby counting the number of occurrences
3. Join this result with the original data `dat`
4. Remove the superfluous column of NA counts
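For readers without data.table, the idea of counting distinct agents per cluster and subject can also be sketched in base R directly on the wide data. This is an alternative illustration, not the approach used in this answer; `dat_small` and `cluster_list` are toy objects made up for the example.

```r
# Toy data: two subjects and a hypothetical subset of the agents above
dat_small <- data.frame(Med1 = c("AMLODIPIN", "BISOPROLOL"),
                        Med2 = c("SIMVASTATIN", "metoprolol"),
                        Med3 = c("", "ATORVASTATIN"),
                        stringsAsFactors = FALSE)
cluster_list <- list(BETA   = c("METOPROLOL", "BISOPROLOL"),
                     STATIN = c("ATORVASTATIN", "SIMVASTATIN"))

# Normalise case so that "metoprolol" matches "METOPROLOL"
meds <- toupper(as.matrix(dat_small))

# For each cluster, count the distinct agents of that cluster in each row;
# intersect() drops duplicates, so an agent listed twice is counted once
counts <- sapply(cluster_list, function(agents) {
  apply(meds, 1, function(row) length(intersect(row, agents)))
})

# Append the counts to the original wide data
cbind(counts, dat_small)
```

The long-format join shown above scales better to many clusters, while this variant avoids reshaping when there are only a few.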

Data

```r
dat <- structure(list(
  Med1 = c("AMLODIPIN", "PLAVIX", "BISOPROLOL", "ASS", "ASS", "ASS"),
  Med2 = c("RAMIPRIL", "SIMVASTATIN", "AMLODIPIN", "ENALAPRIL", "ATORVASTATIN", "FRAGMIN"),
  Med3 = c("METOPROLOL", "MIRTAZAPIN", "ASS", "L-THYROXIN", "FOSAMAX", "TORASEMID"),
  Med4 = c("", "", "VALSARTAN", "LITALIR", "CALCIUM", "SPIRONOLACTON"),
  Med5 = c("", "", "CHLORALDURAT", "LITALIR", "PANTOZOL", "LORZAAR PROTECT"),
  Med6 = c("", "", "Doxozosin", "AMLODIPIN", "NOVAMINSULFON", "VESIKUR"),
  Med7 = c("", "", "TAMSULOSIN", "CETIRIZIN", "", "ROCALTROL"),
  Med8 = c("", "", "CIPRAMIL", "HCT", "", "ATORVASTATIN"),
  Med9 = c("", "", "", "NACL", "", "PREDNISOLON"),
  Med10 = c("", "", "", "CARMEN", "", "LACTULOSE"),
  Med11 = c("", "", "", "PROTEIN 88", "", "MIRTAZAPIN"),
  Med12 = c("", "", "", "NOVALGIN", "", "LANTUS"),
  Med13 = c("", "", "", "", "", "ACTRAPID"),
  Med14 = c("", "", "", "", "", "PANTOZOL"),
  Med15 = c("", "", "", "", "", "SALBUTAMOL"),
  Med16 = c("", "", "", "", "", "AMPHO MORONAL")),
  class = "data.frame", row.names = c(NA, -6L))
```

Counting number of instances of a condition per row R

You can use `rowSums`.

```r
df$no_calls <- rowSums(df == "nc")
df
#  rsID sample1 sample2 sample3 sample1304 no_calls
#1 abcd      aa      bb      nc         nc        2
#2 efgh      nc      nc      nc         nc        4
#3 ijkl      aa      ab      aa         nc        1
```

Or, as pointed out by MrFlick, to exclude the first column from the row sums, you can slightly modify the approach to

``df$no_calls <- rowSums(df[-1] == "nc")``
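As a runnable illustration, the sketch below rebuilds the data frame from the printed output above and excludes the identifier column by name rather than by position. The variable `sample_cols` is introduced here only for the example.

```r
# Data reconstructed from the printed output above
df <- data.frame(rsID       = c("abcd", "efgh", "ijkl"),
                 sample1    = c("aa", "nc", "aa"),
                 sample2    = c("bb", "nc", "ab"),
                 sample3    = c("nc", "nc", "aa"),
                 sample1304 = c("nc", "nc", "nc"),
                 stringsAsFactors = FALSE)

# Every column except the identifier
sample_cols <- setdiff(names(df), "rsID")

# Comparison yields a logical matrix; rowSums() counts TRUE as 1
df$no_calls <- rowSums(df[sample_cols] == "nc")
df$no_calls
```

Selecting by name keeps the count correct even if the identifier column is not the first one.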

Regarding the row names: They are not counted in `rowSums` and you can make a simple test to demonstrate it:

```r
rownames(df)[1] <- "nc"  # name first row "nc"
rowSums(df == "nc")      # compute the row sums
#nc  2  3
# 2  4  1        # still the same in first row
```

Count occurrences of a variable having two given values corresponding to one value of another variable

The optimal solution in terms of memory would store one row for each pair, i.e. 700*699/2 rows. The problem is still relatively small, though, and the simplicity of manipulating a 700*700 matrix is probably worth more than the 700*701/2 cells you would save, which work out to about 240 kB at one byte per cell. Memory use could be lower still if the matrix is sparse (i.e. most pairs of materials are never ordered together) and you use an appropriate data structure.

Here's what the code would look like:

First we create a data frame with as many rows and columns as there are materials. Matrices are easier to create, so we build one and convert it to a data frame afterwards.
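The cell counts quoted here can be checked with a few lines of arithmetic. This is only a sanity check of the numbers; the variable names are made up for the illustration.

```r
n <- 700                                 # number of materials, as stated above
full_cells  <- n * n                     # cells in the full n x n matrix
pair_cells  <- choose(n, 2)              # unordered pairs: n * (n - 1) / 2
saved_cells <- full_cells - pair_cells   # equals n * (n + 1) / 2

c(full = full_cells, pairs = pair_cells, saved = saved_cells)
saved_cells / 1024                       # size in KiB at one byte per cell
```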

```r
all_materials = levels(as.factor(X$Materials))
number_materials = length(all_materials)
Pairs <- as.data.frame(matrix(data = 0, nrow = number_materials, ncol = number_materials))
```

We then set the row names and column names so that rows and columns can be accessed directly by the material identifiers, which are apparently not necessarily numbered from 1 to 700.

```r
colnames(Pairs) <- all_materials
rownames(Pairs) <- all_materials
```

Then we iterate over the dataset:

```r
for(order in levels(as.factor(X$Order.number))){
  # getting the materials in each order (as character, so that Pairs is
  # indexed by material name rather than by position)
  materials_for_order = as.character(X[X$Order.number==order, "Materials"])
  if (length(materials_for_order)>1) {
    # finding each possible pair from the materials list
    all_pairs_in_order = combn(x=materials_for_order, m=2)
    # incrementing the cell at the line and column corresponding to each pair
    for(i in seq_len(ncol(all_pairs_in_order))){
      Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] = Pairs[all_pairs_in_order[1, i], all_pairs_in_order[2, i]] + 1
    }
  }
}
```

At the end of the loop, the `Pairs` table should contain everything you need.
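For comparison, the same kind of pair counts can be obtained without explicit loops via a cross-product of the order-by-material incidence table. This is a sketch of a different technique, not the code above; it assumes each material appears at most once per order, and the data in `X` is invented for the example.

```r
# Hypothetical orders; column names follow the ones used above
X <- data.frame(Order.number = c(1, 1, 1, 2, 2, 3),
                Materials    = c("A", "B", "C", "A", "B", "C"),
                stringsAsFactors = FALSE)

# Order-by-material incidence table: 1 if the material is part of the order
incidence <- table(X$Order.number, X$Materials)

# t(incidence) %*% incidence: cell [i, j] is the number of orders that
# contain both material i and material j
co_occurrence <- crossprod(incidence)

# The diagonal only says how often each material was ordered at all
diag(co_occurrence) <- 0
co_occurrence
```

Unlike the loop, this yields a symmetric matrix, so each pair appears in both `[i, j]` and `[j, i]`.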