Add new value to new column based on if value exists in other dataframe in R
We can use %in% to compare the values and wrap the result in as.integer to convert the logical values to integers.
purchases$buyers <- as.integer(purchases$ID %in% users$ID)
purchases
# ID buyers
#1 6456 1
#2 4436 0
#3 88945 0
This can also be written with a unary plus, which coerces the logical vector to integer:
purchases$buyers <- +(purchases$ID %in% users$ID)
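To see why both forms give the same result, here is a minimal, self-contained sketch. The contents of `users` are an assumption made for illustration, since the original data is not shown; only the `purchases` IDs come from the output above.

```r
# Hypothetical data: the purchases IDs match the output above; users is assumed.
purchases <- data.frame(ID = c(6456, 4436, 88945))
users <- data.frame(ID = c(6456, 1111, 2222))

# %in% returns one logical value per element of the left-hand side
is_buyer <- purchases$ID %in% users$ID

# as.integer() and unary + both coerce TRUE/FALSE to 1/0
purchases$buyers <- as.integer(is_buyer)
buyers_plus <- +is_buyer

print(purchases)
```

Because %in% never returns NA, the new column is always 0 or 1, even when an ID is missing.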
Add a new column to a dataframe using matching values of another dataframe
merge(table1, table2[, c("pid", "val2")], by="pid")
Add the all.x=TRUE argument to keep all of the pids in table1 that don't have matches in table2.
You were on the right track. Here's a way using match...
table1$val2 <- table2$val2[match(table1$pid, table2$pid)]
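The two approaches can be compared side by side. This is a small sketch on made-up tables (the values and the `val1` column are assumptions; only the names `table1`, `table2`, `pid` and `val2` come from the answers above).

```r
# Hypothetical tables; the column names pid and val2 follow the answers above.
table1 <- data.frame(pid = c(3, 1, 2), val1 = c("c", "a", "b"),
                     stringsAsFactors = FALSE)
table2 <- data.frame(pid = c(1, 3), val2 = c(10, 30))

# match() gives, for each table1$pid, its position in table2$pid (NA if absent)
idx <- match(table1$pid, table2$pid)
by_match <- table1
by_match$val2 <- table2$val2[idx]

# merge() with all.x = TRUE also keeps unmatched rows, but sorts by pid by default
by_merge <- merge(table1, table2[, c("pid", "val2")], by = "pid", all.x = TRUE)

print(by_match)
print(by_merge)
```

The match version keeps the row order of table1, while merge reorders rows by the key unless sort = FALSE is passed. If table2 contains duplicate pids, match uses only the first one, whereas merge returns one row per matching pair.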
Create new column based on matches in another table
In base R, you could do a merge:
data3 <- merge( data1, data2, all.y = TRUE )
and then replace the NAs with your string of choice:
data3[ is.na( data3[ 5 ] ), 5 ] <- "Not Determined"
which gives you
> data3
Col1 Col2 Col3 Count Method
1 ABC AA Al 1 Sample
2 ABC AA B 4 Dry
3 ABC CC C 5 Not Determined
4 EFG AA Al 6 Sample
5 XYZ BB Al 2 Sample
6 XYZ CC C 1 Not Determined
Attention: If you are on an older version of R (< 4.0), strings are read in as factors by default, so you may need to add the new factor level first with
levels( data3$Method ) <- c( levels( data3$Method ), "Not Determined" )
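As a variation, the replacement can address the column by name rather than by position, and convert it to character first so that the factor issue does not arise. The sketch below uses small stand-in tables; their contents are assumptions, and only the column names follow the output above.

```r
# Hypothetical stand-ins for data1 and data2; only the column names follow the output above.
data1 <- data.frame(Col3 = c("Al", "B"), Method = c("Sample", "Dry"),
                    stringsAsFactors = FALSE)
data2 <- data.frame(Col1 = c("ABC", "ABC", "XYZ"),
                    Col3 = c("Al", "B", "C"),
                    Count = c(1, 4, 1),
                    stringsAsFactors = FALSE)

# all.y = TRUE keeps every row of data2; Method is NA where Col3 has no match
data3 <- merge(data1, data2, all.y = TRUE)

# Converting to character makes the assignment safe even if Method were a factor
data3$Method <- as.character(data3$Method)
data3$Method[is.na(data3$Method)] <- "Not Determined"

print(data3)
```

Referring to the column as data3$Method keeps the code working if the column order changes, which an index such as 5 would not.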
R - Create new column based on substring from another column with conditions
There is probably a more efficient way to do this, but we could do a series of ifelse-style statements using case_when from tidyverse. First, I strip any trailing ;s__ (an empty species field) from the taxon strings. Then, in the series of statements, I check whether a given taxonomic level is present and, if so, return it in the desired format. That check is repeated across all taxonomic levels.
library(tidyverse)
output <- input_data %>%
mutate(taxon = trimws(taxon, whitespace = ";s__")) %>%
mutate(taxon_main = case_when(str_detect(taxon, "s__") ~ trimws(str_replace_all(str_extract(taxon, "(?<=g__).*"), ";s_", ""), whitespace = '_'),
!str_detect(taxon, "s__") & str_detect(taxon, "g__")~ str_replace_all(str_extract(taxon, "g__.*"), "__", "_"),
!str_detect(taxon, "g__") & str_detect(taxon, "f__") ~ str_replace_all(str_extract(taxon, "f__.*"), "__", "_"),
!str_detect(taxon, "f__") & str_detect(taxon, "o__")~ str_replace_all(str_extract(taxon, "o__.*"), "__", "_"),
!str_detect(taxon, "o__") & str_detect(taxon, "c__")~ str_replace_all(str_extract(taxon, "c__.*"), "__", "_"),
!str_detect(taxon, "c__") & str_detect(taxon, "p__")~ str_replace_all(str_extract(taxon, "p__.*"), "__", "_"),
!str_detect(taxon, "p__") & str_detect(taxon, "k__")~ str_replace_all(str_extract(taxon, "k__.*"), "__", "_"),
TRUE ~ NA_character_))
Output
output %>% select(taxon_main)
taxon_main
1 Lactobacillus_crispatus
2 g_Anaerococcus
3 f_Comamonadaceae
4 f_Lachnospiraceae
5 Bosea_massiliensis
6 Acinetobacter_baumannii
7 f_Methylophilaceae
Or you could use separate first, which makes the code less reliant on stringr. We can clean up before using separate, such as reducing the double underscores to a single one and removing the extra s__. Then we can go through the ifelse-style statements, bind the result back to the original taxon column, and drop all the other columns except taxon_main.
input_data %>%
mutate(taxon = trimws(taxon, whitespace = ";s__"),
taxon = str_replace_all(taxon, ";s__", ";"),
taxon = str_replace_all(taxon, "__", "_")) %>%
separate(taxon, sep = ";", into = c("Kingdom", "Phylum", "Class", "Order", "Family", "Genus", "Species")) %>%
mutate(taxon_main = case_when(!is.na(Species) ~ paste(str_extract(Genus, "(?<=g_).*"), Species, sep = "_"),
is.na(Species) & !is.na(Genus) ~ Genus,
is.na(Genus) & !is.na(Family) ~ Family,
is.na(Family) & !is.na(Order) ~ Order,
is.na(Order) & !is.na(Class) ~ Class,
is.na(Class) & !is.na(Phylum) ~ Phylum,
is.na(Phylum) & !is.na(Kingdom) ~ Kingdom
)) %>%
bind_cols(input_data,.) %>%
select(taxon_main, taxon)
Output
taxon_main taxon
1 Lactobacillus_crispatus k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus
2 g_Anaerococcus k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__
3 f_Comamonadaceae k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Burkholderiales;f__Comamonadaceae
4 f_Lachnospiraceae k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae
5 Bosea_massiliensis k__Bacteria;p__Proteobacteria;c__Alphap__Proteobacteria;o__Rhizobiales;f__Bradyrhizobiaceae;g__Bosea;s__massiliensis
6 Acinetobacter_baumannii k__Bacteria;p__Proteobacteria;c__Gammap__Proteobacteria;o__Pseudomonadales;f__Moraxellaceae;g__Acinetobacter;s__baumannii
7 f_Methylophilaceae k__Bacteria;p__Proteobacteria;c__Betap__Proteobacteria;o__Nitrosomonadales;f__Methylophilaceae
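For comparison, the same idea (take the deepest rank that is filled in) can be sketched in base R without tidyverse. This is not the method used above, only an alternative under stated assumptions: `deepest_taxon` is a hypothetical helper name, every string is assumed to contain at least one non-empty rank, and a species is assumed to be preceded by its genus.

```r
# deepest_taxon() is a hypothetical helper, not part of any package.
deepest_taxon <- function(taxon) {
  vapply(taxon, function(x) {
    ranks <- strsplit(x, ";", fixed = TRUE)[[1]]
    ranks <- ranks[!grepl("^[a-z]__$", ranks)]  # drop empty ranks such as "s__"
    last <- ranks[length(ranks)]
    if (startsWith(last, "s__")) {
      # species present: join genus and species names without their prefixes
      genus <- sub("^g__", "", ranks[length(ranks) - 1])
      paste(genus, sub("^s__", "", last), sep = "_")
    } else {
      # otherwise keep the rank prefix, with a single underscore
      sub("__", "_", last, fixed = TRUE)
    }
  }, character(1), USE.NAMES = FALSE)
}

taxa <- c(
  "k__Bacteria;p__Firmicutes;c__Bacilli;o__Lactobacillales;f__Lactobacillaceae;g__Lactobacillus;s__crispatus",
  "k__Bacteria;p__Firmicutes;c__Tissierellia;o__Tissierellales;f__Peptoniphilaceae;g__Anaerococcus;s__",
  "k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Lachnospiraceae"
)
result <- deepest_taxon(taxa)
print(result)
```

Splitting on ";" and working from the end avoids one condition per rank, at the cost of relying on the ranks always appearing in order.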
How to check if values in one dataframe exist in another dataframe in R?
Try this using %in% and a vector holding all the values to match against:
#Code
df1$reply <- df1$user_name %in% c(df2$name,df2$organisation)
Output:
df1
id reply user_name
1 1 TRUE John
2 2 TRUE Amazon
3 3 FALSE Bob
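The step that matters here is pooling the lookup columns into one vector. The sketch below rebuilds the same small example with data.frame() and also shows unlist() as one possible way to pool every column of df2 at once; that variant is an addition for illustration, not part of the original answer.

```r
# Same values as the example data, built with data.frame() for brevity
df1 <- data.frame(id = 1:3, user_name = c("John", "Amazon", "Bob"),
                  stringsAsFactors = FALSE)
df2 <- data.frame(name = c("John", "Pat"), organisation = c("Amazon", "Apple"),
                  stringsAsFactors = FALSE)

# c() pools the two columns into one character vector of candidate values
lookup <- c(df2$name, df2$organisation)
df1$reply <- df1$user_name %in% lookup

# unlist() pools every column of df2, which scales to more than two columns
reply_all_cols <- df1$user_name %in% unlist(df2)

print(df1)
```

Note that this only works as intended when the columns are character vectors; combining factors with c() can behave differently depending on the R version.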
Some data used:
#Data1
df1 <- structure(list(id = 1:3, reply = c(NA, NA, NA), user_name = c("John",
"Amazon", "Bob")), class = "data.frame", row.names = c(NA, -3L
))
#Data2
df2 <- structure(list(name = c("John", "Pat"), organisation = c("Amazon",
"Apple")), class = "data.frame", row.names = c(NA, -2L))
How to check if a value exists within a set of columns?
Create a vector with the columns of interest and use rowSums(), i.e.
i1 <- grep('i10_', names(d1))
rowSums(d1[i1] == 'C7931' | d1[i1] == 'C7932', na.rm = TRUE) > 0
where,
d1 <- structure(list(v1 = c("A", "B", "C", "D", "E", "F"), i10_a = c(NA,
"C7931", NA, NA, "S272XXA", "R55"), i10_1 = c("C7931", "C7931",
"R079", "S272XXA", "S234sfs", "N179")), class = "data.frame", row.names = c(NA,
-6L))
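Running the snippet on d1 makes the NA handling visible. The sketch below repeats the data so it is self-contained, and adds an equivalent formulation with %in% and Reduce() as an illustrative alternative (the names `codes`, `flag_rowsums` and `flag_in` are made up for this example).

```r
d1 <- data.frame(
  v1 = c("A", "B", "C", "D", "E", "F"),
  i10_a = c(NA, "C7931", NA, NA, "S272XXA", "R55"),
  i10_1 = c("C7931", "C7931", "R079", "S272XXA", "S234sfs", "N179"),
  stringsAsFactors = FALSE
)
codes <- c("C7931", "C7932")  # values to look for

i1 <- grep("i10_", names(d1))  # positions of the columns of interest

# Comparisons with NA give NA; na.rm = TRUE drops them from the row count
flag_rowsums <- rowSums(d1[i1] == codes[1] | d1[i1] == codes[2], na.rm = TRUE) > 0

# Alternative: %in% treats NA as "not found", so no na.rm is needed
flag_in <- Reduce(`|`, lapply(d1[i1], `%in%`, codes))

print(unname(flag_rowsums))
print(flag_in)
```

The %in% variant grows more gracefully when the list of codes gets longer, since the codes live in one vector instead of one comparison each.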
subset a column in data frame based on another data frame/list
We can use %in% to get a logical vector and subset the rows of 'table1' based on that.
subset(table1, gene_ID %in% accessions40$V1)
Another option, typically faster on large character vectors, is data.table with %chin%:
library(data.table)
setDT(table1)[gene_ID %chin% accessions40$V1]
Or use filter from dplyr
library(dplyr)
table1 %>%
filter(gene_ID %in% accessions40$V1)
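All three variants rely on the same logical vector. A minimal base R sketch with bracket indexing shows it explicitly; the table contents and the `score` column are assumptions, and only the names `table1`, `gene_ID`, `accessions40` and `V1` come from the answer.

```r
# Hypothetical data; only the object and column names follow the answer above.
table1 <- data.frame(gene_ID = c("g1", "g2", "g3", "g4"),
                     score = c(0.5, 1.2, 0.1, 2.3),
                     stringsAsFactors = FALSE)
accessions40 <- data.frame(V1 = c("g2", "g4", "g9"), stringsAsFactors = FALSE)

# One TRUE/FALSE per row of table1
keep <- table1$gene_ID %in% accessions40$V1

matched <- table1[keep, ]     # rows whose gene_ID is listed
unmatched <- table1[!keep, ]  # negating the vector gives the complement

print(matched)
print(unmatched)
```

Storing the logical vector once makes it easy to get both the matching rows and the rest without repeating the lookup.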
Add column if value is in another column of another dataframe
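In pandas, we first split df1['list'] into lists and flatten it with explode so that each name gets its own row, then we can use map to look up the topic for each name: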
df1['list'] = df1['list'].str.split(',')
s = df1.explode('list')
df['present'] = df.name.map(dict(zip(s['list'],s['topic'])))
df
Out[550]:
name value present
0 Harry a topic1
1 Kenny b NaN
2 Zoey h NaN