Add an index (numeric ID) column to large data frame
You can add a sequence of numbers very easily with
data$ID <- seq.int(nrow(data))
If you are already using library(tidyverse), you can use
data <- tibble::rowid_to_column(data, "ID")
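A minimal base R sketch of the same idea, assuming a small made-up data frame (the columns x and y are placeholders, not from the question). It uses seq_len() rather than seq.int(), because seq.int(0) evaluates to c(1, 0) and would therefore fail on a zero-row data frame, and it then moves the new ID column to the front.

```r
# Hypothetical example data; any data frame works the same way.
data <- data.frame(x = c("a", "b", "c"), y = c(10, 20, 30))

# seq_len(n) returns integer(0) when n is 0, so this is safe for empty data frames.
data$ID <- seq_len(nrow(data))

# Move ID to the first position, keeping the other columns in their original order.
data <- data[, c("ID", setdiff(names(data), "ID")), drop = FALSE]
```

The reordering step is optional; it only matters if the ID is expected as the first column, which is what rowid_to_column() produces.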
Is there an R function for adding an index (numeric or character ID) column to large data frame by position of groups of values in blocks?
Using data.table you can do this:
library(data.table)
df <- as.data.table(data.frame(seqnames,start,end,strand,block,cont_ID))
sp <- split(df,f = cont_ID)
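grouping <- function(x){
  # previous block number within this cont_ID (NA for the first row)
  x[, block_lag := shift(block, type = "lag")]
  x[is.na(block_lag), cumsum_block_lag := 1]
  # start a new group whenever block is not the previous block + 1
  x[!is.na(block_lag), cumsum_block_lag := 1 + cumsum(block != (block_lag + 1))]
  x[, index := paste0(LETTERS[cumsum_block_lag], "_", cont_ID)]
  # drop the helper columns
  x[, cumsum_block_lag := NULL]
  x[, block_lag := NULL]
}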
index_sp <- rbindlist(lapply(sp, grouping))
index_sp <- index_sp[order(block)]
#    seqnames start end strand block cont_ID index
# 1:       H7     0  10      *     1     001 A_001
# 2:       H7    11  20      *     2     001 A_001
# 3:       H7     0  10      *     3     004 A_004
# 4:       H7    11  20      *     4     004 A_004
# 5:       H7     0  10      *     5     003 A_003
# 6:       H7    21  30      *     6     001 B_001
# 7:       H7    31  40      *     7     001 B_001
# 8:       H7    11  20      *     8     003 B_003
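The same run detection can be sketched in base R without splitting the table. This is an alternative under stated assumptions, not the answer's own method: it uses only the block and cont_ID values copied from the output above, and it assumes the rows are already ordered by block.

```r
# Only the two columns the logic needs, copied from the output above.
df <- data.frame(
  block   = 1:8,
  cont_ID = c("001", "001", "004", "004", "003", "001", "001", "003")
)

# Within each cont_ID, start a new run whenever block is not the previous block + 1.
run_id <- ave(df$block, df$cont_ID,
              FUN = function(b) cumsum(c(TRUE, diff(b) != 1)))

# Letter for the run, joined to the cont_ID (LETTERS covers at most 26 runs per ID).
df$index <- paste0(LETTERS[run_id], "_", df$cont_ID)
```

ave() applies the function per group and returns the results in the original row order, so no final re-sorting by block is needed.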
Create an ID (row number) column
You could use cbind:
d <- data.frame(V1=c(23, 45, 56), V2=c(45, 45, 67))
## enter id here, you could also use 1:nrow(d) instead of rownames
id <- rownames(d)
d <- cbind(id=id, d)
## set colnames to OP's wishes
colnames(d) <- paste0("V", 1:ncol(d))
EDIT: Here is a comparison with @dickoa's suggestion. d$id <- seq_len(nrow(d)) is slightly faster, but the column order is different (id is the last column), and reordering the columns seems to be slower than using cbind:
library("microbenchmark")
set.seed(1)
d <- data.frame(V1=rnorm(1e6), V2=rnorm(1e6))
cbindSeqLen <- function(x) {
  return(cbind(id=seq_len(nrow(x)), x))
}
dickoa <- function(x) {
  x$id <- seq_len(nrow(x))
  return(x)
}
dickoaReorder <- function(x) {
  x$id <- seq_len(nrow(x))
  nc <- ncol(x)
  x <- x[, c(nc, 1:(nc-1))]
  return(x)
}
microbenchmark(cbindSeqLen(d), dickoa(d), dickoaReorder(d), times=100)
# Unit: milliseconds
# expr min lq median uq max neval
# cbindSeqLen(d) 23.00683 38.54196 40.24093 42.60020 47.73816 100
# dickoa(d) 10.70718 36.12495 37.58526 40.22163 72.92796 100
# dickoaReorder(d) 19.25399 68.46162 72.45006 76.51468 88.99620 100
Spark DataFrame: how to add an index column (aka distributed data index)
With Scala you can use:
import org.apache.spark.sql.functions._
df.withColumn("id", monotonically_increasing_id())
You can refer to this example and the Scala docs.
With PySpark you can use the following; the generated IDs are unique and increasing, but not guaranteed to be consecutive:
from pyspark.sql.functions import monotonically_increasing_id
df_index = df.select("*").withColumn("id", monotonically_increasing_id())
Pandas (python): How to add column to dataframe for index?
How about this:
import pandas as pd
# Build the index explicitly, then attach the data to it
idx = pd.Index([171, 174, 173])
df = pd.DataFrame(index=idx, data=[1, 2, 3])
print(df)
It gives me:
     0
171  1
174  2
173  3
Is this what you are looking for?
Add a unique identifier to the same column value in R data frame
Using dplyr:
library(dplyr)
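df %>%
  # group on a copy, since a grouping column cannot be modified inside mutate()
  group_by(grp = sample_id) %>%
  mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-")) %>%
  ungroup() %>%
  select(-grp)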
index val sample_id
<int> <dbl> <chr>
1 1 14 5-A
2 2 22 6-A
3 3 1 6-B
4 4 25 7-A
5 5 3 7-B
6 6 34 7-C
How to use R 3.3.2 to add index column to dataframe based on column value?
dplyr has a dedicated function for that, row_number:
df %>%
group_by(cat) %>%
mutate(rank = row_number())
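If dplyr is not available, a within-group counter can also be sketched in base R with ave(). The data frame below is hypothetical: only the column name cat comes from the answer above, while val and all the values are made up for illustration.

```r
# Hypothetical data: 'cat' is the grouping column named in the answer above.
df <- data.frame(cat = c("a", "b", "a", "c", "b", "a"),
                 val = c(3, 8, 1, 5, 2, 9))

# Number the rows within each value of cat, in order of appearance.
df$rank <- ave(seq_along(df$cat), df$cat, FUN = seq_along)
```

The counter follows the current row order, so sort the data frame first if the numbering should reflect another column.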