Take the Subsets of a Data.Frame with the Same Feature and Select a Single Row from Each Subset

How to run a function on each subset of a dataframe based on multiple conditions

Here's how I'd approach this problem. You don't necessarily need to resort to mapping here, since the problem isn't actually dataframe-in, dataframe-out (the only input is the Text vector in each subset). This means we can simply use a grouped filter to obtain either of the dataframes of interest (uniques or duplicates).

library(stringdist)
library(dplyr)
test_df <- data.frame(
  "ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811, 812, 820, 831),
  "Type" = c('A', 'A', 'A', 'A', 'B', 'B', 'C', 'C', 'A', 'A', 'B', 'B', 'C', 'C', 'C'),
  "Group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  "Text" = c('Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sit amet', 'consectetur adipiscing eli', 'et dolore magna aliqua. Ut', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sing eli', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sit amet', 'consectetur adipiscing eli', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sit amet', 'Lorem ipsum dolor sing eli', 'sed do eiusmod temporo eli'),
  stringsAsFactors = FALSE
)

The key thing to realise is that group_by exposes only one group's slice of each column to whatever function we call afterwards, so we need a function that takes a vector as input. It should return TRUE for each string that is too similar to any other element of the vector, so we use apply with any to check each row of the distance matrix for this condition. We first have to blank out the diagonal elements to avoid self-comparison. This is also a good time to parameterise the threshold.

any_string_duplicates <- function(text_vector, threshold = 10) {
  mat <- stringdistmatrix(text_vector, text_vector)
  mat <- mat < threshold
  diag(mat) <- NA # Simpler way to remove self-comparisons
  apply(mat, 1, any, na.rm = TRUE)
}
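
To see what the helper returns for a single group, here is a minimal sketch. It swaps in base R's adist() (Levenshtein distance) for stringdistmatrix() so that it runs without extra packages. The name any_string_duplicates_base and that substitution are assumptions of this example, not part of the answer above.

```r
# Hypothetical base-R variant: adist() computes Levenshtein distances,
# used here as a stand-in for stringdist::stringdistmatrix().
any_string_duplicates_base <- function(text_vector, threshold = 10) {
  mat <- adist(text_vector) < threshold  # TRUE where two strings are "too similar"
  diag(mat) <- NA                        # ignore each string's comparison with itself
  apply(mat, 1, any, na.rm = TRUE)       # one logical per input string
}

# The four Text values of the Type A / Group 1 subset
texts <- c("Lorem ipsum dolor sit amet", "Lorem ipsum dolor sit amet",
           "consectetur adipiscing eli", "et dolore magna aliqua. Ut")
flags <- any_string_duplicates_base(texts)
print(flags)
```

The two identical strings are flagged and the other two are not, which is why a grouped filter can use the result directly as a row mask.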

Now the duplicate values and the unique values can easily be retrieved with a grouped filter.

test_df %>% # Duplicates
  group_by(Type, Group) %>%
  filter(any_string_duplicates(Text))
#> # A tibble: 10 x 4
#> # Groups:   Type, Group [5]
#>       ID Type  Group Text                      
#>    <dbl> <chr> <dbl> <chr>                     
#>  1   100 A         1 Lorem ipsum dolor sit amet
#>  2   103 A         1 Lorem ipsum dolor sit amet
#>  3   107 B         1 Lorem ipsum dolor sit amet
#>  4   209 B         1 Lorem ipsum dolor sing eli
#>  5   300 C         1 Lorem ipsum dolor sit amet
#>  6   501 C         1 Lorem ipsum dolor sit amet
#>  7   799 B         2 Lorem ipsum dolor sit amet
#>  8   811 B         2 Lorem ipsum dolor sit amet
#>  9   812 C         3 Lorem ipsum dolor sit amet
#> 10   820 C         3 Lorem ipsum dolor sing eli

test_df %>% # Uniques
  group_by(Type, Group) %>%
  filter(!any_string_duplicates(Text))
#> # A tibble: 5 x 4
#> # Groups:   Type, Group [3]
#>      ID Type  Group Text                      
#>   <dbl> <chr> <dbl> <chr>                     
#> 1   105 A         1 consectetur adipiscing eli
#> 2   106 A         1 et dolore magna aliqua. Ut
#> 3   503 A         2 Lorem ipsum dolor sit amet
#> 4   711 A         2 consectetur adipiscing eli
#> 5   831 C         3 sed do eiusmod temporo eli

Created on 2019-09-03 by the reprex package (v0.3.0)

Create subsets based on a certain sequence of values

You can try this. Please let me know if it's your desired outcome:

library(stringr)
## Adding 1 to m maps the values -1, 0, 1 to the single digits 0, 1, 2, so the minus sign
## cannot shift string positions; the sequence -1, 0..., 1, 0..., -1 becomes 0, 1..., 2, 1..., 0.
## Collapse the column into one string and let str_locate_all find the start and end of each match.
pattrn <- data.frame(str_locate_all(paste0(df$m+1,collapse=''),'0[1]*?2[1]*?0')[[1]])
## Expand each start/end pair into a sequence of row numbers
pattrn_rows <- Map(seq, from=pattrn$start, to=pattrn$end)
## Finally, subset df by each sequence; this returns a list with one data frame per match
lapply(pattrn_rows,function(x)df[x,])
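
The same idea can be sketched without stringr. The example below is self-contained: the data frame is a hypothetical reconstruction chosen to match the printed output (rows outside the two matches are assumed to be 0), and base R's gregexpr() stands in for str_locate_all().

```r
# Hypothetical input; only rows 5-11 and 14-20 are known from the printed output
df <- data.frame(
  x = 0:19,
  y = 50:69,
  m = c(0, 0, 0, 0, -1, 0, 0, 1, 0, 0, -1, 0, 0, -1, 0, 0, 1, 0, 0, -1)
)

# Encode -1/0/1 as the digits 0/1/2, one character per row
encoded <- paste0(df$m + 1, collapse = "")

# Locate every run of the form 0, 1..., 2, 1..., 0
hits <- gregexpr("01*21*0", encoded)[[1]]
starts <- as.integer(hits)
ends <- starts + attr(hits, "match.length") - 1L

# Character positions equal row numbers, so expand and subset
subsets <- lapply(Map(seq, starts, ends), function(rows) df[rows, ])
print(subsets)
```

Because regex matches do not overlap, two runs that share a boundary -1 would only yield the first one. If there is no match at all, gregexpr() returns -1, so that case needs its own check.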

Output:

[[1]]
    x  y  m
5   4 54 -1
6   5 55  0
7   6 56  0
8   7 57  1
9   8 58  0
10  9 59  0
11 10 60 -1

[[2]]
    x  y  m
14 13 63 -1
15 14 64  0
16 15 65  0
17 16 66  1
18 17 67  0
19 18 68  0
20 19 69 -1

How to create a for loop for combining several data frames and df subsets into one data frame?

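You can define a function that sums all numeric columns of a data.frame, leaves the other columns as NA, and appends the result (followed by an all-NA separator row) to the original data frame:

numericCols = sapply(iris,is.numeric)

func = function(df,numCols){
  # column sums of the numeric columns only
  iris_sums <- colSums(df[,numCols])
  # one-row template: NA everywhere, sums filled in by column name
  result <- rep(NA,ncol(df))
  names(result) <- colnames(df)
  result[names(iris_sums)] <- iris_sums
  # original rows, then the sums row, then an all-NA row
  rbind(df,result,rep(NA,ncol(df)))
}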

Then we use purrr to map each subset:

library(purrr) # provides map_dfr() and %>%; map_dfr() also needs dplyr installed
split(iris,iris$Species) %>% map_dfr(func,numCols=numericCols)
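
If purrr is not available, the same split-apply-combine step can be sketched in base R. The function below repeats the one above under the hypothetical name append_sums so the block runs on its own; do.call(rbind, lapply(...)) is a stand-in for map_dfr(), not the answer's method.

```r
# Same logic as func above, repeated so this block is self-contained
append_sums <- function(df, numCols) {
  sums <- colSums(df[, numCols])
  result <- rep(NA, ncol(df))
  names(result) <- colnames(df)
  result[names(sums)] <- sums
  rbind(df, result, rep(NA, ncol(df)))  # data, sums row, all-NA row
}

numericCols <- sapply(iris, is.numeric)

# Split by species, append the two extra rows to each piece, stack the pieces
pieces <- lapply(split(iris, iris$Species), append_sums, numCols = numericCols)
combined <- do.call(rbind, pieces)

nrow(combined)   # 150 original rows plus 2 extra rows per species
combined[51, ]   # sums row for the first species
```

One difference worth knowing: base rbind() on a named list prefixes row names with the list names, whereas map_dfr() produces plain row numbers.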

Advice on a loop function to subset data according to variables

Based on your description, I assume your data looks something like this:

country_year <- c("Australia_2013", "Australia_2014", "Bangladesh_2013")
health <- matrix(nrow = 3, ncol = 3, data = runif(9))
dataset <- data.frame(rbind(country_year, health), row.names = NULL, stringsAsFactors = FALSE)

dataset
#                 X1                X2                 X3
#1    Australia_2013    Australia_2014    Bangladesh_2013
#2 0.665947273839265 0.677187719382346  0.716064820764586
#3 0.499680359382182 0.514755881391466  0.178317369660363
#4 0.730102791683748 0.666969108628109 0.0719663293566555

First, move your row 1 (e.g., Australia_2013, Australia_2014 etc.) to the column names, and then apply the loop to create country-based data frames.

library(dplyr)

# move header
dataset2 <- dataset %>% 
    `colnames<-`(unlist(dataset[1,])) %>%  # uses row 1 as column names
    slice(-1) %>% # removes row 1 from data
    mutate_all(type.convert, as.is = TRUE) # converts data to appropriate type

# apply loop
for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
    assign(country, select(dataset2, starts_with(country))) # makes subsets
}

Regarding the loop,

gsub("_\\d+", "", colnames(dataset2)) extracts the country names by replacing "_[year]" with nothing (i.e., removing it), and unique() then keeps one copy of each country name.

assign(country, select(dataset2, starts_with(country))) creates a variable named after the country and this country variable only contains the columns from dataset2 that start with the country name.
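
As an alternative to creating one global variable per country, the subsets can be collected in a named list. This is only a sketch in base R: the small dataset2 below is made up for illustration, and the name by_country is hypothetical.

```r
# Hypothetical stand-in for dataset2 after the header has been moved
set.seed(1)
dataset2 <- data.frame(
  Australia_2013 = runif(3),
  Australia_2014 = runif(3),
  Bangladesh_2013 = runif(3)
)

# Country name for every column, e.g. "Australia" "Australia" "Bangladesh"
countries <- gsub("_\\d+", "", colnames(dataset2))

# Group the column names by country, then pull each group out of dataset2
by_country <- lapply(
  split(colnames(dataset2), countries),
  function(cols) dataset2[, cols, drop = FALSE]
)

names(by_country)
by_country$Australia
```

A list keeps the subsets together, so they can be passed to lapply() later instead of being looked up by name with get().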

Edit: Responding to Comment

The question in the comment was asking how to add row-wise summaries (e.g., rowSums(), rowMeans()) as new columns in the country-based data frames, while using this for-loop.

Here is one solution that requires minimal changes:

for(country in unique(gsub("_\\d+", "", colnames(dataset2)))) {
    assign(country, 
        select(dataset2, starts_with(country)) %>% # makes subsets
            mutate( # creates new columns
                rowSums = rowSums(select(., starts_with(country))),
                rowMeans = rowMeans(select(., starts_with(country)))
            )
    )
}

mutate() adds new columns to a dataset.

select(., starts_with(country)) selects columns that start with the country name from the current object (represented as . in the function).
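
The row-wise step can be checked on its own with a tiny example. This sketch uses base R and made-up numbers; the data frame australia and the helper value_cols are assumptions for illustration only.

```r
# Hypothetical country subset with two yearly columns
australia <- data.frame(
  Australia_2013 = c(1, 2, 3),
  Australia_2014 = c(3, 4, 5)
)

# Decide which columns to summarise BEFORE adding the summary columns,
# so that rowSums is not accidentally included in rowMeans
value_cols <- startsWith(names(australia), "Australia")

australia$rowSums <- rowSums(australia[, value_cols, drop = FALSE])
australia$rowMeans <- rowMeans(australia[, value_cols, drop = FALSE])

print(australia)
```

In the dplyr version above, the same protection comes from `.`, which refers to the data as it entered mutate(), before the new columns exist.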

Subset data frame based on number of rows per group

First, two base alternatives. One relies on table, and the other on ave and length. Then, two data.table ways.

1. `table`

tt <- table(df$name)

df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

If you want to walk it through step by step:

# count each 'name', assign result to an object 'tt'
tt <- table(df$name)

# which 'name' in 'tt' occur fewer than three times?
# Result is a logical vector that can be used to subset the table 'tt'
tt < 3

# from the table, select 'name' that occur < 3 times
tt[tt < 3]

# ...their names
names(tt[tt < 3])

# rows of the data frame whose 'name' matches one of "the < 3 names"
# the result is a logical vector that can be used to subset the data frame 'df'
df$name %in% names(tt[tt < 3])

# subset data frame by a logical vector
# 'TRUE' rows are kept, 'FALSE' rows are removed.
# assign the result to a data frame with a new name
df2 <- subset(df, name %in% names(tt[tt < 3]))
# or
df2 <- df[df$name %in% names(tt[tt < 3]), ]

2. `ave` and `length`

As suggested by @flodel:

df[ave(df$x, df$name, FUN = length) < 3, ]
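
To check that the two base approaches agree, here is a small self-contained sketch. The data frame is made up for illustration (the question's df is not shown here); it only assumes a grouping column name and a numeric column x.

```r
# Hypothetical data: "a" occurs three times, "b" twice, "c" once
df <- data.frame(
  name = c("a", "a", "a", "b", "b", "c"),
  x = 1:6,
  stringsAsFactors = FALSE
)

# Approach 1: count with table(), keep names that occur fewer than 3 times
tt <- table(df$name)
by_table <- df[df$name %in% names(tt[tt < 3]), ]

# Approach 2: ave() writes each group's size next to every row of that group
by_ave <- df[ave(df$x, df$name, FUN = length) < 3, ]

print(by_table)
identical(by_table, by_ave)
```

Both keep the rows for "b" and "c" and drop the three "a" rows.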

3. `data.table`: `.N` and `.SD`:

library(data.table)
setDT(df)[, if (.N < 3) .SD, by = name]

4. `data.table`: `.N` and `.I`:

setDT(df)
df[df[, .I[.N < 3], name]$V1]

See also the related Q&A "Count number of observations/rows per group and add result to data frame".

Collapsing data frame by selecting one row per group

Maybe duplicated() can help:

R> d[ !duplicated(d$x), ]
  x  y  z
1 1 10 20
3 2 12 18
4 4 13 17
R>

Edit: Shucks, never mind. This picks the first in each block of repetitions; you wanted the last. So here is another attempt using plyr:

R> ddply(d, "x", function(z) tail(z,1))
  x  y  z
1 1 11 19
2 2 12 18
3 4 13 17
R>

Here plyr does the hard work of finding unique subsets, looping over them and applying the supplied function -- which simply returns the last set of observations in a block z using tail(z, 1).
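
For completeness, the last row per group can also be obtained with duplicated() alone by scanning from the end. This is a sketch in base R; the data frame d is reconstructed from the printed output above, so treat it as an assumption.

```r
# d as implied by the two printed results above
d <- data.frame(
  x = c(1, 1, 2, 4),
  y = c(10, 11, 12, 13),
  z = c(20, 19, 18, 17)
)

# fromLast = TRUE marks every occurrence except the LAST one as a duplicate
last_rows <- d[!duplicated(d$x, fromLast = TRUE), ]
print(last_rows)
```

Unlike the ddply() result, this keeps the original row names (2, 3, 4) rather than renumbering them.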

Creating a subset of a data frame with specific years

You just used the wrong argument; to select rows you want subset=.

subset(dat, subset=c(year == 2000 | year == 2005 | year == 2010))

Or more concise:

subset(dat, subset=year %in% c(2000, 2005, 2010))
#    year          x          z
# 1  2000 -0.4703161 0.62147778
# 6  2005 -0.6667708 0.03479132
# 11 2010 -0.8059292 0.43732005

select= is for the columns.

subset(dat, subset=year %in% c(2000, 2005, 2010), select=c(year, z))
#    year          z
# 1  2000 0.62147778
# 6  2005 0.03479132
# 11 2010 0.43732005

Note that if you provide the arguments in the right order, you may leave out the argument names and just do:

subset(dat, year %in% c(2000, 2005, 2010), c(year, z))

Data:

set.seed(42)
dat <- data.frame(year=2000:2022, x=rnorm(23), z=runif(23))
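
The same selection can be written with plain bracket indexing, which may be preferable inside functions because subset() evaluates its arguments non-standardly. The sketch below rebuilds the data so it runs on its own; the names keep, by_bracket and by_subset are just for illustration.

```r
set.seed(42)
dat <- data.frame(year = 2000:2022, x = rnorm(23), z = runif(23))

# Logical row mask and explicit column names
keep <- dat$year %in% c(2000, 2005, 2010)
by_bracket <- dat[keep, c("year", "z")]

# The subset() call from above, for comparison
by_subset <- subset(dat, year %in% c(2000, 2005, 2010), c(year, z))

print(by_bracket)
all.equal(by_bracket, by_subset)
```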
