Find How Many Times Duplicated Rows Repeat in R Data Frame

Here is a solution using the function ddply() from the plyr package:

library(plyr)
ddply(df,.(a,b),nrow)

  a   b V1
1 1 2.5  1
2 1 3.5  2
3 2 2.0  2
4 3 1.0  1
5 4 2.2  1
6 4 7.0  1
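
For readers without plyr, the same count can be sketched in base R with aggregate(). The data frame below is a hypothetical input chosen to reproduce the counts shown above (the original df is not given), and the column name n is an illustrative choice.

```r
# Hypothetical input; the original df is not shown in the source.
df <- data.frame(
  a = c(1, 1, 1, 2, 2, 3, 4, 4),
  b = c(2.5, 3.5, 3.5, 2, 2, 1, 2.2, 7)
)

# Sum a column of ones within each unique (a, b) combination.
counts <- aggregate(
  data.frame(n = rep(1L, nrow(df))),
  by = df,
  FUN = sum
)

# aggregate() does not promise the same row order as ddply(), so sort explicitly.
counts <- counts[order(counts$a, counts$b), ]
print(counts)
```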

Counting Number of Times Each Row is Duplicated in R

With dplyr, we could group by all columns:

library(dplyr)

dat %>%
  group_by(across(everything())) %>%
  mutate(n = n())
# # A tibble: 5 x 5
# # Groups: SSN, Name, Age, Gender [3]
# SSN Name Age Gender n
# <dbl> <chr> <dbl> <dbl> <int>
# 1 204 Blossum 7 0 2
# 2 401 Buttercup 8 0 2
# 3 204 Blossum 7 0 2
# 4 666 MojoJojo 43 1 1
# 5 401 Buttercup 8 0 2

(mutate(n = n()) has a shortcut, add_tally(), if you prefer. Use summarize(n = n()) or count() if you want to collapse the data frame to the unique rows while adding counts.)
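
To illustrate the collapsing variant, here is a minimal sketch using count(). The data frame dat is reconstructed from the printed output and should be treated as an assumption about the original input.

```r
library(dplyr)

# Input reconstructed from the printed tibble (an assumption).
dat <- data.frame(
  SSN = c(204, 401, 204, 666, 401),
  Name = c("Blossum", "Buttercup", "Blossum", "MojoJojo", "Buttercup"),
  Age = c(7, 8, 7, 43, 8),
  Gender = c(0, 0, 0, 1, 0)
)

# One row per unique combination of all columns, plus a count column n.
collapsed <- dat %>%
  count(SSN, Name, Age, Gender)

print(collapsed)
```

Naming every column in count() is fine for a small table. For wide data, grouping with group_by(across(everything())) and then calling summarize(n = n()) avoids listing them.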

Getting a count of how many times a value in a column is duplicated

If we need to create a count column, use add_count

df %>%
  add_count(name, name = "new_count")

Output

       address     name other count new_count
1  123 fake st     joey     1     2         2
2  124 fake st   rachel     1     1         1
3  125 fake st     ross     1     3         3
4  126 fake st chandler     2     1         1
5 123 jerry st   monika     2     1         1
6  124 road rd     joey     3     2         2
7  125 tiny rd     ross     4     3         3
8   126 cool r     ross     4     3         3

group_size() returns only the group sizes, one value per group:

group_size(group_by(df,name))
[1] 1 2 1 1 3
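
Both results can also be approximated in base R, which may help when dplyr is not available. This is only a sketch: the data frame is reconstructed from the printed output above (an assumption), and ave() and table() are swapped in for add_count() and group_size().

```r
# Data reconstructed from the printed output (an assumption).
df <- data.frame(
  address = c("123 fake st", "124 fake st", "125 fake st", "126 fake st",
              "123 jerry st", "124 road rd", "125 tiny rd", "126 cool r"),
  name = c("joey", "rachel", "ross", "chandler", "monika", "joey", "ross", "ross"),
  other = c(1, 1, 1, 2, 2, 3, 4, 4)
)

# Per-row count, like add_count(): the group length is recycled
# to every row that belongs to the group.
df$new_count <- ave(seq_along(df$name), df$name, FUN = length)

# Per-group sizes, like group_size(): one value per name, in sorted name order.
sizes <- as.vector(table(df$name))

print(df)
print(sizes)
```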

Count the number of duplicates for a column

If we need to count the total number of duplicates

sum(table(df1$date)-1)
#[1] 5
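
As a cross-check, the same total can be obtained with duplicated(), which flags every occurrence after the first. The data below are hypothetical (the original df1 is not shown) and were chosen to contain five repeats.

```r
# Hypothetical data; the original df1 is not shown in the source.
df1 <- data.frame(
  date = c("2015-01-01", "2015-01-01", "2015-01-01",
           "2015-01-02", "2015-01-02",
           "2015-01-03",
           "2015-01-04", "2015-01-04", "2015-01-04")
)

# Every occurrence beyond the first, summed over all dates.
total_from_table <- sum(table(df1$date) - 1)

# duplicated() marks exactly those repeat occurrences.
total_from_duplicated <- sum(duplicated(df1$date))

print(c(total_from_table, total_from_duplicated))
```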

If we instead need the count for each date, one option is to group by 'date' and get the number of rows. This can be done with data.table:

library(data.table)
setDT(df1)[, .N, date]
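
A slightly fuller sketch of the data.table approach follows, using hypothetical data. It uses as.data.table() rather than setDT() so that df1 itself is left unmodified, and chains a filter to keep only the dates that repeat.

```r
library(data.table)

# Hypothetical data; the original df1 is not shown in the source.
df1 <- data.frame(
  date = c("2015-01-01", "2015-01-01", "2015-01-02",
           "2015-01-03", "2015-01-03", "2015-01-03")
)

# as.data.table() makes a copy, whereas setDT() converts df1 in place.
dt <- as.data.table(df1)

# Count rows per date, then keep only dates that occur more than once.
per_date <- dt[, .N, by = date]
repeated <- per_date[N > 1]

print(per_date)
print(repeated)
```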

Finding duplicates in a dataframe and returning count of each duplicate record

We can use group_by_all to group by all columns and then remove the ones which are not duplicates by selecting rows which have count > 1.

library(dplyr)

df %>%
  group_by_all() %>%
  count() %>%
  filter(n > 1)

# col1 col2 col3 n
# <fct> <fct> <fct> <int>
#1 A B B 2
#2 A B C 3

Find duplicate values in R

You could use table, i.e.

n_occur <- data.frame(table(vocabulary$id))

gives you a data frame with a list of ids and the number of times they occurred.

n_occur[n_occur$Freq > 1,]

tells you which ids occurred more than once.

vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]

returns the records with more than one occurrence.
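
An alternative sketch uses duplicated() twice, once from each end, to flag every row whose id repeats. The vocabulary data frame and its word column are hypothetical; the table()-based selection from the text is included for comparison.

```r
# Hypothetical data; the column word is illustrative.
vocabulary <- data.frame(
  id = c(1, 2, 2, 3, 4, 4, 4),
  word = c("ant", "bee", "bee", "cat", "dog", "dog", "dog")
)

# TRUE for every row whose id appears more than once,
# including the first occurrence.
is_repeated <- duplicated(vocabulary$id) |
  duplicated(vocabulary$id, fromLast = TRUE)
via_duplicated <- vocabulary[is_repeated, ]

# The table()-based version from the text, for comparison.
n_occur <- data.frame(table(vocabulary$id))
via_table <- vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1], ]

print(via_duplicated)
```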

R: Repeating row of dataframe with respect to multiple count columns

Here is a tidyverse option. We can use uncount from tidyr to duplicate the rows according to the count in value (i.e., from the var columns) after pivoting to long format.

library(tidyverse)

df %>%
  pivot_longer(starts_with("var"), names_to = "class") %>%
  filter(value != 0) %>%
  uncount(value) %>%
  mutate(class = str_extract(class, "\\d+"))

Output

  f1    f2    class
  <chr> <chr> <chr>
1 a     c     1
2 a     c     3
3 a     c     3
4 a     c     3
5 b     d     1
6 b     d     2
7 b     d     2

Another slight variation is to use expandRows from splitstackshape in conjunction with tidyverse.

library(splitstackshape)

df %>%
  pivot_longer(starts_with("var"), names_to = "class") %>%
  filter(value != 0) %>%
  expandRows("value") %>%
  mutate(class = str_extract(class, "\\d+"))
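
For comparison, the same expansion can be sketched in base R with rep(). The input df is not shown in the source, so the one below is an assumption reconstructed from the output; the helper names var_cols, times, row_idx, and class_lab are illustrative.

```r
# Assumed input, reconstructed from the printed output.
df <- data.frame(
  f1 = c("a", "b"),
  f2 = c("c", "d"),
  var1 = c(1, 1),
  var2 = c(0, 2),
  var3 = c(3, 0)
)

var_cols <- grep("^var", names(df), value = TRUE)

# Flatten the counts row by row (row 1: var1, var2, var3; then row 2),
# which mirrors the order produced by pivot_longer().
times <- as.vector(t(as.matrix(df[var_cols])))
row_idx <- rep(seq_len(nrow(df)), each = length(var_cols))
class_lab <- rep(sub("^var", "", var_cols), times = nrow(df))

# rep() with a vector of times drops the zero-count cells
# and repeats the others.
out <- data.frame(
  f1 = rep(df$f1[row_idx], times = times),
  f2 = rep(df$f2[row_idx], times = times),
  class = rep(class_lab, times = times)
)

print(out)
```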

Repeat rows of a data.frame

df <- data.frame(a = 1:2, b = letters[1:2]) 
df[rep(seq_len(nrow(df)), each = 2), ]
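
The each argument controls the order of the repeats. A short sketch contrasting it with times on the same data frame (the names by_each and by_times are illustrative):

```r
df <- data.frame(a = 1:2, b = letters[1:2])

# each = 2 repeats every row in place: rows 1, 1, 2, 2.
by_each <- df[rep(seq_len(nrow(df)), each = 2), ]

# times = 2 repeats the whole data frame: rows 1, 2, 1, 2.
by_times <- df[rep(seq_len(nrow(df)), times = 2), ]

# Repeated rows get suffixed row names such as "1.1";
# reset them if that matters.
rownames(by_each) <- NULL

print(by_each)
print(by_times)
```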

Repeat rows making each repeated rows following the original rows and assign new variables for each row

You can repeat each row twice and repeat c('origin', 'destination') for each row.

In base R, you can do -

transform(df[rep(seq(nrow(df)), each = 2), ], type = c('origin', 'destination'))

Or in tidyverse -

library(dplyr)
library(tidyr)

df %>%
  uncount(2) %>%
  mutate(type = rep(c('origin', 'destination'), length.out = n()))

# a b type
#1 1 1 origin
#2 1 1 destination
#3 2 2 origin
#4 2 2 destination
#5 3 3 origin
#6 3 3 destination

