Create dummy variables for every unique value in a column based on a condition from a second column in R
Here is a crude way to do this:
df <- data.frame(
  country = c("Australia", "Australia", "Australia", "Angola", "Angola", "Angola", "US", "US", "US"),
  year = c("1945", "1946", "1947"),
  leader = c("David", "NA", "NA", "NA", "Henry", "NA", "Tom", "NA", "Chris"),
  natural.death = c(0, NA, NA, NA, 1, NA, 1, NA, 0),
  gdp.growth.rate = c(1, 4, 3, 5, 6, 1, 5, 7, 9))
tmp <- which(df$natural.death == 1) # row indices of the natural deaths
lng <- length(tmp) # number of natural deaths
# create a matrix of zeros with lng columns and append it to df
df <- cbind(df, data.frame(matrix(0, nrow = nrow(df), ncol = lng)))
# rename the newly added columns
colnames(df)[(ncol(df) - lng + 1):ncol(df)] <- paste0("id", 1:lng)
for (i in 1:lng) { # loop over the new columns
  df[tmp[i], paste0("id", i)] <- 1 # in the row of the i-th death, set column id<i> to 1
}
    country year leader natural.death gdp.growth.rate id1 id2
1 Australia 1945  David             0               1   0   0
2 Australia 1946     NA            NA               4   0   0
3 Australia 1947     NA            NA               3   0   0
4    Angola 1945     NA            NA               5   0   0
5    Angola 1946  Henry             1               6   1   0
6    Angola 1947     NA            NA               1   0   0
7        US 1945    Tom             1               5   0   1
8        US 1946     NA            NA               7   0   0
9        US 1947  Chris             0               9   0   0
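The loop can also be replaced by a single matrix-indexing assignment. This is a minimal sketch, assuming the same natural.death column as in the example; the data frame is cut down to that one column, and the names death_rows and dummies are illustrative rather than part of the original answer.

```r
# Reduced version of the example data: only the column the dummies depend on
df <- data.frame(natural.death = c(0, NA, NA, NA, 1, NA, 1, NA, 0))

# Rows where a natural death occurred (which() drops the NA comparisons)
death_rows <- which(df$natural.death == 1)

# One all-zero column per death, named id1, id2, ...
dummies <- matrix(0, nrow = nrow(df), ncol = length(death_rows),
                  dimnames = list(NULL, paste0("id", seq_along(death_rows))))

# A two-column index matrix addresses (row, column) pairs:
# the k-th death gets a 1 in column k
dummies[cbind(death_rows, seq_along(death_rows))] <- 1

df <- cbind(df, dummies)
```

Indexing with a two-column matrix sets all the required cells at once, so no explicit loop over the new columns is needed.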
Create Dummies for Multiple Columns on Unique Value in a Column
I believe you can get this by combining pd.get_dummies() and df.groupby().any(). The groupby().any() call returns True/False, so you finish by converting the result to int:
df2 = pd.get_dummies(df,columns=['CTI','RESOLUTION']) # df is what you have in your first example. Putting in the columns here restricts dummies to just those columns.
df2.groupby('ACCOUNT').any().astype(int)
Separate each unique value of a column into separate columns and remove original column?
This will do all that you're after
library(fastDummies)
# Numerically encode gear column as dummy variables
mt_cars_with_gear_dummy_variables <- fastDummies::dummy_cols(mtcars, select_columns = "gear")
# Remove original gear column (assign the result back, otherwise gear is still there)
mt_cars_with_gear_dummy_variables <- mt_cars_with_gear_dummy_variables[, !names(mt_cars_with_gear_dummy_variables) %in% c("gear")]
mt_cars_with_gear_dummy_variables
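If installing a package is not an option, roughly the same result can be built in base R. This is a sketch of a swapped-in technique (model.matrix() instead of fastDummies), and the names gear_factor, gear_dummies and mtcars_dummies are made up for illustration.

```r
# Treat gear as a categorical variable
gear_factor <- factor(mtcars$gear)

# "- 1" drops the intercept, so every level gets its own 0/1 column
gear_dummies <- model.matrix(~ gear_factor - 1)
colnames(gear_dummies) <- paste0("gear_", levels(gear_factor))

# Drop the original gear column and append the dummy columns
mtcars_dummies <- cbind(mtcars[, names(mtcars) != "gear"], gear_dummies)
head(mtcars_dummies)
```

Each row has exactly one 1 across the gear_* columns, because every car has exactly one gear value.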
How to search for and extract unique values from one column in another column?
I think this works for you:
library(dplyr)
mutate(df, Col_C = stringr::str_extract(
  Col_A,
  paste0("\\b(", paste0(unique(Col_B), collapse = "|"), ")\\b")))
#                Col_A  Col_B  Col_C
# 1   blue shovel 1024   blue   blue
# 2    red shovel 1022    red    red
# 3  green bucket 3021  green  green
# 4    green rake 3021   blue  green
# 5 yellow shovel 1023 yellow yellow
Breakdown:
- paste0(unique(Col_B), collapse="|") takes the words in Col_B, de-duplicates them, and concatenates them all together with | symbols; that is, c("blue","red","green") --> "blue|red|green". In regex, the | symbol is an "OR" operator.
- \\b( and )\\b are word boundaries, meaning that there isn't a word-like character immediately before (first) or after (second) the pattern; adding them around the words prevents a partial match of blu on blue (in case that ever happens). While it is not apparent that this changes anything here, it's a more defensive/specific pattern. The parens add grouping, more evident in the next bullet.
- With all of that, our overall pattern looks something like "\\b(blue|red|green)\\b" (abbreviated). This translates into "find blue or red or green such that there is a word boundary on both ends of whichever one(s) you find".
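To see the effect of the word boundaries in isolation, here is a small base R sketch. It uses grepl() with perl = TRUE rather than stringr, and the vector col_b and the test strings are made-up illustrations, not data from the question.

```r
# Hypothetical stand-in for the Col_B column
col_b <- c("blue", "red", "green", "blue", "yellow")

# Build the alternation and wrap it in word boundaries
pattern <- paste0("\\b(", paste0(unique(col_b), collapse = "|"), ")\\b")
print(pattern)  # "\\b(blue|red|green|yellow)\\b"

# A whole word matches ...
grepl(pattern, "blue shovel 1024", perl = TRUE)  # TRUE
# ... but the same letters inside a longer word do not
grepl(pattern, "bluebird 1024", perl = TRUE)     # FALSE
```

Without the \\b anchors, the second call would return TRUE, because "blue" occurs inside "bluebird".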
Generate all possible dummies according to values of var in R
Here is a solution which uses strsplit() to split up the character strings and dcast() to reshape from long to wide format:
library(data.table)
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)]
   rn a b c d e f
1:  1 1 1 1 1 1 1
2:  2 1 1 1 0 0 0
3:  3 0 0 0 0 1 1
4:  4 0 1 0 1 0 0
5:  5 1 0 0 0 1 0
If V1 is to be included, it can be joined afterwards:
library(data.table) # version 1.11.4 used
setDT(df)[, rn := .I][
, strsplit(as.character(V1), ","), by = rn][
, dcast(.SD, rn ~ V1, length)][
df, on = "rn"][
, setcolorder(.SD, "V1")]
            V1 rn a b c d e f
1: a,b,c,d,e,f  1 1 1 1 1 1 1
2:       a,b,c  2 1 1 1 0 0 0
3:         e,f  3 0 0 0 0 1 1
4:         b,d  4 0 1 0 1 0 0
5:         a,e  5 1 0 0 0 1 0
setcolorder() is used to move the V1 column to the front.
creating a dummy matrix from a concatenated column
You can do:
relative <- c("aunt", "mother,grandmother", "sister,mother", "", "other")
R <- strsplit(relative, ',')
r <- unique(unlist(R))
result <- t(sapply(R, function(Ri) if (length(Ri)==0) rep(FALSE, length(r)) else r %in% Ri))
colnames(result) <- r
result
# > result
#       aunt mother grandmother sister other
# [1,]  TRUE  FALSE       FALSE  FALSE FALSE
# [2,] FALSE   TRUE        TRUE  FALSE FALSE
# [3,] FALSE   TRUE       FALSE   TRUE FALSE
# [4,] FALSE  FALSE       FALSE  FALSE FALSE
# [5,] FALSE  FALSE       FALSE  FALSE  TRUE
or (for integers):
+result
# > +result
#      aunt mother grandmother sister other
# [1,]    1      0           0      0     0
# [2,]    0      1           1      0     0
# [3,]    0      1           0      1     0
# [4,]    0      0           0      0     0
# [5,]    0      0           0      0     1
Storing unique values of each column (of a df) in list
Your for loop is almost right; it needs just one fix to work:
# for loop
cols = names(df)
unique_values_by_col = list()
for (i in cols) {
x = unique(df[[i]])
unique_values_by_col[[i]] = x
}
unique_values_by_col
# $a
# [1] A B C D
# Levels: A B C D
#
# $b
# [1] 1 2 3 4
i is just a character string, the name of a column within df, so unique(i) doesn't make sense.
Anyhow, the most standard way to do this task is lapply(), as shown by demirev.
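For reference, the lapply() version might look like the sketch below. The data frame is a small stand-in assumed to resemble the one in the question (a factor column a and a numeric column b), and it is not necessarily identical to the referenced answer.

```r
# Hypothetical stand-in for the question's data frame
df <- data.frame(a = factor(c("A", "B", "C", "D", "A")),
                 b = c(1, 2, 3, 4, 1))

# A data frame is a list of columns, so lapply() applies unique() to each
# column and returns a list named after the columns
unique_values_by_col <- lapply(df, unique)
unique_values_by_col
```

The result has the same shape as the list built by the for loop: one element per column, each holding that column's distinct values.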
R: Unbalanced panel, create dummy for unique observations
Using dplyr, you could avoid the loop and try this:
set.seed(123)
df <- data.frame(id = sample(1:10, 20, replace = TRUE),
happy = sample(c("yes", "no"), 20, replace = TRUE))
library(dplyr)
df <- df %>%
group_by(id) %>%
mutate(dummy = ifelse(length(id)>=2, 1, 0))
> df
# A tibble: 20 x 3
# Groups:   id [10]
      id happy dummy
   <int> <fct> <dbl>
 1     3 no        1
 2     8 no        0
 3     5 no        1
 4     9 no        1
 5    10 no        1
 6     1 no        1
 7     6 no        1
 8     9 no        1
 9     6 yes       1
10     5 yes       1
11    10 no        1
12     5 no        1
13     7 no        0
14     6 no        1
15     2 yes       0
16     9 yes       1
17     3 no        1
18     1 yes       1
19     4 yes       0
20    10 yes       1
Essentially, this approach divides up df by the unique values of id and then creates a column dummy that takes the value 1 if there are at least two occurrences of that id and 0 if not.
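The same grouping logic can be sketched in base R with ave(), should dplyr not be available. The toy id vector and the name group_size are assumptions made for illustration.

```r
# Toy data: id 1 occurs once, id 2 twice, id 3 three times
df <- data.frame(id = c(1, 2, 2, 3, 3, 3))

# ave() applies FUN within each group and returns a vector aligned with the rows,
# so every row receives the size of its own id group
group_size <- ave(df$id, df$id, FUN = length)

# 1 if the id occurs at least twice, 0 otherwise
df$dummy <- as.integer(group_size >= 2)
df
```

Here only the first row gets dummy = 0, since id 1 is the only id that occurs a single time.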