Dummify Character Column and Find Unique Values

Create dummy variables for every unique value in a column based on a condition from a second column in R

Here is a crude way to do this:

df <- data.frame(
  country = c("Australia", "Australia", "Australia", "Angola", "Angola", "Angola", "US", "US", "US"),
  year = c("1945", "1946", "1947"),  # recycled across the three countries
  leader = c("David", "NA", "NA", "NA", "Henry", "NA", "Tom", "NA", "Chris"),
  natural.death = c(0, NA, NA, NA, 1, NA, 1, NA, 0),
  gdp.growth.rate = c(1, 4, 3, 5, 6, 1, 5, 7, 9)
)

tmp <- which(df$natural.death == 1)  # row indices of natural deaths
lng <- length(tmp)                   # number of natural deaths

# create a matrix of zeros with lng columns and append it to df
df <- cbind(df, data.frame(matrix(0, nrow = nrow(df), ncol = lng)))
# rename the newly added columns id1, id2, ...
colnames(df)[(ncol(df) - lng + 1):ncol(df)] <- paste0("id", seq_len(lng))

for (i in seq_len(lng)) {  # loop over the new columns
  df[tmp[i], paste0("id", i)] <- 1  # row of the i-th death, column id<i>
}

    country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945  David             0               1   0   0
2 Australia 1946     NA            NA               4   0   0
3 Australia 1947     NA            NA               3   0   0
4    Angola 1945     NA            NA               5   0   0
5    Angola 1946  Henry             1               6   1   0
6    Angola 1947     NA            NA               1   0   0
7        US 1945    Tom             1               5   0   1
8        US 1946     NA            NA               7   0   0
9        US 1947  Chris             0               9   0   0
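
The loop can also be replaced by matrix indexing. This is only a minimal sketch, assuming the same natural.death values as above; the names death_rows and dummies are illustrative and not part of the original answer.

```r
natural.death <- c(0, NA, NA, NA, 1, NA, 1, NA, 0)

death_rows <- which(natural.death == 1)  # rows 5 and 7

# One zero-filled column per death, named id1, id2, ...
dummies <- matrix(0, nrow = length(natural.death), ncol = length(death_rows),
                  dimnames = list(NULL, paste0("id", seq_along(death_rows))))

# A two-column (row, column) index matrix sets exactly one cell per death
dummies[cbind(death_rows, seq_along(death_rows))] <- 1
dummies
```

The result can be attached to the data frame with cbind(df, dummies), which gives the same id1 and id2 columns as the loop.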

Create Dummies for Multiple Columns on Unique Value in a Column

I believe you can get this by using both pd.get_dummies() and df.groupby().any(). groupby().any() returns True/False, so finish by converting to int:

import pandas as pd
# df is what you have in your first example; passing columns restricts the dummies to just those columns
df2 = pd.get_dummies(df, columns=['CTI', 'RESOLUTION'])
df2.groupby('ACCOUNT').any().astype(int)

Separate each unique value of a column into separate columns and remove original column?

This will do all that you're after:

library(fastDummies)

# Numerically encode the gear column as dummy variables
mt_cars_with_gear_dummy_variables <- fastDummies::dummy_cols(mtcars, select_columns = "gear")

# Remove the original gear column (assign the result so the removal is kept)
mt_cars_with_gear_dummy_variables <- mt_cars_with_gear_dummy_variables[, !names(mt_cars_with_gear_dummy_variables) %in% c("gear")]

mt_cars_with_gear_dummy_variables
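
If installing fastDummies is not an option, a similar result can be sketched in base R with model.matrix(). This is an alternative technique, not the package's own method, and the names gear_dummies and mtcars_dummified are illustrative.

```r
# Base R sketch; mtcars ships with R
# "- 1" drops the intercept so every gear level gets its own indicator column
gear_dummies <- model.matrix(~ factor(gear) - 1, data = mtcars)
colnames(gear_dummies) <- paste0("gear_", levels(factor(mtcars$gear)))

# Drop the original column and bind the indicator columns on
mtcars_dummified <- cbind(mtcars[, names(mtcars) != "gear"], gear_dummies)
head(mtcars_dummified)
```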

How to search for and extract unique values from one column in another column?

I think this works for you:

library(dplyr)
mutate(df, Col_C = stringr::str_extract(
  Col_A,
  paste0("\\b(", paste0(unique(Col_B), collapse = "|"), ")\\b")))
# Col_A Col_B Col_C
# 1 blue shovel 1024 blue blue
# 2 red shovel 1022 red red
# 3 green bucket 3021 green green
# 4 green rake 3021 blue green
# 5 yellow shovel 1023 yellow yellow

Breakdown:

  • paste0(unique(Col_B), collapse="|") takes the words in Col_B, de-duplicates them, and concatenates them with | symbols; that is, c("blue","red","green") --> "blue|red|green". In regex, the | symbol is an "OR" operator.
  • \\b( and )\\b add word boundaries, meaning that there is no word-like character immediately before (first) or after (second) the pattern. Wrapping the words this way prevents a partial match of blu on blue (in case that ever happens). It does not visibly change anything here, but it is a more defensive, more specific pattern. The parens add grouping, which is more evident in the next bullet.
  • With all of that, our overall pattern looks something like "\\b(blue|red|green)\\b" (abbreviated). This translates into "find blue or red or green such that there is a word-boundary on both ends of whichever one(s) you find".
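
The effect of the word boundaries described in the breakdown can be checked with a small base R sketch. The vectors col_a and col_b are made-up illustration data, not the data from the question.

```r
# Hypothetical data: "bluebird" contains "blue" but is a different word
col_a <- c("blue shovel 1024", "bluebird feeder", "green rake 3021")
col_b <- c("blue", "red", "green", "blue")

alternatives <- paste0(unique(col_b), collapse = "|")  # "blue|red|green"
bounded   <- paste0("\\b(", alternatives, ")\\b")
unbounded <- paste0("(", alternatives, ")")

grepl(bounded, col_a, perl = TRUE)    # TRUE FALSE TRUE
grepl(unbounded, col_a, perl = TRUE)  # TRUE TRUE TRUE
```

Without the boundaries, "blue" matches inside "bluebird"; with them, only whole words match.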

Generate all possible dummies according to values of var in R

Here is a solution which uses strsplit() to split up the character strings and dcast() to reshape from long to wide format:

library(data.table)
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)]
   rn a b c d e f
1: 1 1 1 1 1 1 1
2: 2 1 1 1 0 0 0
3: 3 0 0 0 0 1 1
4: 4 0 1 0 1 0 0
5: 5 1 0 0 0 1 0

If V1 is to be included, it can be joined afterwards:

library(data.table) # version 1.11.4 used
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)][
df, on = "rn"][
, setcolorder(.SD, "V1")]
            V1 rn a b c d e f
1: a,b,c,d,e,f 1 1 1 1 1 1 1
2: a,b,c 2 1 1 1 0 0 0
3: e,f 3 0 0 0 0 1 1
4: b,d 4 0 1 0 1 0 0
5: a,e 5 1 0 0 0 1 0

setcolorder() is used to move the V1 column to the front.
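
The split-then-reshape idea can also be sketched without data.table. This is a base R approximation, not the answer's method: the input is reconstructed from the V1 column printed above, and the names parts, long and wide are illustrative.

```r
# Input reconstructed from the V1 column printed above
df <- data.frame(V1 = c("a,b,c,d,e,f", "a,b,c", "e,f", "b,d", "a,e"))

parts <- strsplit(as.character(df$V1), ",")  # one character vector per row

# Long format: one (row number, value) pair per element
long <- data.frame(rn = rep(seq_along(parts), lengths(parts)),
                   value = unlist(parts))

# Wide format: cross-tabulate row number against value
wide <- as.data.frame.matrix(table(long$rn, long$value))
wide
```

Here table() plays the role of dcast(..., length): it counts how often each value occurs in each row.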

creating a dummy matrix from a concatenated column

You can do:

relative <- c("aunt", "mother,grandmother", "sister,mother", "", "other")
R <- strsplit(relative, ',')
r <- unique(unlist(R))
result <- t(sapply(R, function(Ri) if (length(Ri)==0) rep(FALSE, length(r)) else r %in% Ri))
colnames(result) <- r
result
# > result
# aunt mother grandmother sister other
# [1,] TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE TRUE TRUE FALSE FALSE
# [3,] FALSE TRUE FALSE TRUE FALSE
# [4,] FALSE FALSE FALSE FALSE FALSE
# [5,] FALSE FALSE FALSE FALSE TRUE

or, for integers, apply unary + to coerce the logical matrix to 0/1:

+result
# > +result
# aunt mother grandmother sister other
# [1,] 1 0 0 0 0
# [2,] 0 1 1 0 0
# [3,] 0 1 0 1 0
# [4,] 0 0 0 0 0
# [5,] 0 0 0 0 1

Storing unique values of each column (of a df) in list

Your for loop is almost right; it needs just one fix to work:

# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
  x = unique(df[[i]])
  unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4

i is just a character string, the name of a column within df, so unique(i) doesn't make sense; index the column with df[[i]] instead.

Anyhow, the most standard way for this task is lapply(), as shown by demirev.
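
A minimal sketch of the lapply() version follows. The data frame is an assumption, chosen only to be consistent with the output printed above; the original df is not shown on this page.

```r
# Hypothetical input consistent with the printed output
df <- data.frame(a = factor(c("A", "B", "C", "D", "A")),
                 b = c(1, 2, 3, 4, 1))

# A data frame is a list of columns, so lapply() visits each column in turn
unique_values_by_col <- lapply(df, unique)
unique_values_by_col
```

The result is a named list with one element per column, the same shape the for loop builds.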

R: Unbalanced panel, create dummy for unique observations

Using dplyr, you could avoid the loop and try this:

set.seed(123)
df <- data.frame(id = sample(1:10, 20, replace = TRUE),
happy = sample(c("yes", "no"), 20, replace = TRUE))

library(dplyr)
df <- df %>%
group_by(id) %>%
mutate(dummy = ifelse(length(id)>=2, 1, 0))

> df
# A tibble: 20 x 3
# Groups: id [10]
id happy dummy
<int> <fct> <dbl>
1 3 no 1
2 8 no 0
3 5 no 1
4 9 no 1
5 10 no 1
6 1 no 1
7 6 no 1
8 9 no 1
9 6 yes 1
10 5 yes 1
11 10 no 1
12 5 no 1
13 7 no 0
14 6 no 1
15 2 yes 0
16 9 yes 1
17 3 no 1
18 1 yes 1
19 4 yes 0
20 10 yes 1

Essentially, this approach divides up df by unique values of id and then creates a column dummy that takes the value 1 if that id occurs two or more times and 0 if not.
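
The same flag can be sketched in base R with ave(), without dplyr. The id values are copied from the printed output above; the names occurrences and dummy are illustrative. The sketch flags ids that occur at least twice, which is what length(id) >= 2 tests.

```r
# ids copied from the printed output above
id <- c(3, 8, 5, 9, 10, 1, 6, 9, 6, 5, 10, 5, 7, 6, 2, 9, 3, 1, 4, 10)

# ave() applies FUN within each group of id and returns a vector aligned with the input
occurrences <- ave(id, id, FUN = length)
dummy <- as.integer(occurrences >= 2)
dummy
```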


