Add an Index (Numeric Id) Column to Large Data Frame

You can add a sequence of numbers very easily with

data$ID <- seq.int(nrow(data))

If you are already using library(tidyverse), you can use

data <- tibble::rowid_to_column(data, "ID")

Is there an R function for adding an index (numeric or character ID) column to a large data frame based on the position of groups of values in blocks?

Using data.table you can do this:

library(data.table)

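# seqnames, start, end, strand, block and cont_ID are the vectors from the question
df <- as.data.table(data.frame(seqnames, start, end, strand, block, cont_ID))

# one table per cont_ID
sp <- split(df, f = cont_ID)

grouping <- function(x) {
  # previous block number within this cont_ID (NA for the first row)
  x[, block_lag := shift(block, type = "lag")]
  x[is.na(block_lag), cumsum_block_lag := 1]
  # start a new run whenever the block number is not the previous one plus 1
  x[!is.na(block_lag), cumsum_block_lag := 1 + cumsum(block != (block_lag + 1))]
  x[, index := paste0(LETTERS[cumsum_block_lag], "_", cont_ID)]
  # drop the helper columns
  x[, cumsum_block_lag := NULL]
  x[, block_lag := NULL]
}

index_sp <- rbindlist(lapply(sp, grouping))
index_sp <- index_sp[order(block)]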


#   seqnames start end strand block cont_ID index
#1:       H7     0  10      *     1     001 A_001
#2:       H7    11  20      *     2     001 A_001
#3:       H7     0  10      *     3     004 A_004
#4:       H7    11  20      *     4     004 A_004
#5:       H7     0  10      *     5     003 A_003
#6:       H7    21  30      *     6     001 B_001
#7:       H7    31  40      *     7     001 B_001
#8:       H7    11  20      *     8     003 B_003
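
The run-labelling logic itself does not depend on data.table. As a rough sketch only, here is the same idea in plain Python: within each cont_ID, a new letter starts whenever the block number is not the previous block number plus one. The function name label_runs is made up for illustration, and the input lists are copied from the block and cont_ID columns of the output above. It assumes at most 26 runs per cont_ID.

```python
from collections import defaultdict
from string import ascii_uppercase


def label_runs(blocks, cont_ids):
    """Return one label per row (e.g. 'A_001'), in the original row order.

    Hypothetical helper: mirrors the lag/cumsum logic of the R answer.
    """
    # Collect the row positions that belong to each cont_ID, in order of appearance.
    rows_by_id = defaultdict(list)
    for pos, cont_id in enumerate(cont_ids):
        rows_by_id[cont_id].append(pos)

    labels = [None] * len(blocks)
    for cont_id, positions in rows_by_id.items():
        run = 0          # index of the current run within this cont_ID
        previous = None  # block number of the previous row in this group
        for pos in positions:
            # A new run starts whenever the block number is not previous + 1.
            if previous is not None and blocks[pos] != previous + 1:
                run += 1
            previous = blocks[pos]
            labels[pos] = f"{ascii_uppercase[run]}_{cont_id}"
    return labels


# Values taken from the block and cont_ID columns shown above.
blocks = [1, 2, 3, 4, 5, 6, 7, 8]
cont_ids = ["001", "001", "004", "004", "003", "001", "001", "003"]
labels = label_runs(blocks, cont_ids)
print(labels)
```

This walks the rows once per group instead of splitting and re-binding tables, so the rows never need to be re-sorted afterwards.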

Create an ID (row number) column

You could use cbind:

d <- data.frame(V1=c(23, 45, 56), V2=c(45, 45, 67))

## enter id here, you could also use 1:nrow(d) instead of rownames
id <- rownames(d)
d <- cbind(id=id, d)

## set colnames to OP's wishes
colnames(d) <- paste0("V", 1:ncol(d))

EDIT: Here is a comparison with @dickoa's suggestion. d$id <- seq_len(nrow(d)) is slightly faster, but the column order is different (id becomes the last column), and reordering the columns afterwards seems to be slower than using cbind:

library("microbenchmark")

set.seed(1)
d <- data.frame(V1=rnorm(1e6), V2=rnorm(1e6))

cbindSeqLen <- function(x) {
  return(cbind(id=seq_len(nrow(x)), x))
}

dickoa <- function(x) {
  x$id <- seq_len(nrow(x))
  return(x)
}

dickoaReorder <- function(x) {
  x$id <- seq_len(nrow(x))
  nc <- ncol(x)
  x <- x[, c(nc, 1:(nc-1))]
  return(x)
}

microbenchmark(cbindSeqLen(d), dickoa(d), dickoaReorder(d), times=100)

# Unit: milliseconds
#             expr      min       lq   median       uq      max neval
#   cbindSeqLen(d) 23.00683 38.54196 40.24093 42.60020 47.73816   100
#        dickoa(d) 10.70718 36.12495 37.58526 40.22163 72.92796   100
# dickoaReorder(d) 19.25399 68.46162 72.45006 76.51468 88.99620   100

Spark DataFrame: how to add an index column (aka distributed data index)

With Scala you can use:

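import org.apache.spark.sql.functions._

// monotonicallyIncreasingId is deprecated since Spark 2.0; use monotonically_increasing_id()
df.withColumn("id", monotonically_increasing_id())

The generated IDs are unique and increasing, but not consecutive. You can refer to this example and the Scala docs.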

With Pyspark you can use:

from pyspark.sql.functions import monotonically_increasing_id

df_index = df.withColumn("id", monotonically_increasing_id())

Pandas (python): How to add column to dataframe for index?

How about this:

import pandas as pd

# Int64Index was removed in pandas 2.0; a plain Index holds the same labels
idx = pd.Index([171, 174, 173])
df = pd.DataFrame(index=idx, data=[1, 2, 3])
print(df)

It gives me:

     0
171  1
174  2
173  3

Is this what you are looking for?
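
If the goal is instead to have the index available as an ordinary column, something along these lines might work. This is only a sketch: the column names value, idx and row_id are made up for illustration, and the index labels are the ones from the snippet above.

```python
import pandas as pd

df = pd.DataFrame({"value": [1, 2, 3]}, index=pd.Index([171, 174, 173]))

# Option 1: copy the existing index labels into a regular column.
with_labels = df.copy()
with_labels["idx"] = with_labels.index

# Option 2: add a 1-based row number as the first column,
# independent of whatever the index labels are.
with_rownum = df.copy()
with_rownum.insert(0, "row_id", list(range(1, len(df) + 1)))

print(with_labels)
print(with_rownum)
```

Option 1 keeps the original labels (171, 174, 173), while option 2 numbers the rows by position, which is closer to what seq_len(nrow(d)) does in R.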

Add a unique identifier to the same column value in R data frame

Using dplyr:

library(dplyr)

dplyr::group_by(df, sample_id) %>% 
  dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))

  index   val sample_id
  <int> <dbl> <chr>    
1     1    14 5-A      
2     2    22 6-A      
3     3     1 6-B      
4     4    25 7-A      
5     5     3 7-B      
6     6    34 7-C
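
For readers who want to see what the grouped mutate computes, here is a minimal plain-Python sketch of the same idea: count how often each value has been seen so far and turn that count into a letter. The function name suffix_duplicates is hypothetical, and the input is the sample_id column implied by the output above. It assumes no value repeats more than 26 times.

```python
from collections import defaultdict
from string import ascii_uppercase


def suffix_duplicates(ids):
    """Append -A, -B, ... to each value according to its position within its group."""
    seen = defaultdict(int)  # how many times each value has occurred so far
    out = []
    for value in ids:
        letter = ascii_uppercase[seen[value]]  # 0 -> "A", 1 -> "B", ...
        out.append(f"{value}-{letter}")
        seen[value] += 1
    return out


# sample_id values implied by the output shown above
sample_ids = [5, 6, 6, 7, 7, 7]
labelled = suffix_duplicates(sample_ids)
print(labelled)
```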

How to use R 3.3.2 to add index column to dataframe based on column value?

dplyr has a dedicated function for that, row_number:

df %>%
    group_by(cat) %>%
    mutate(rank = row_number())
