Add an Index (Numeric Id) Column to Large Data Frame

You can add a sequence of numbers very easily with

data$ID <- seq.int(nrow(data))

If you are already using library(tidyverse), you can use

data <- tibble::rowid_to_column(data, "ID")

Is there an R function for adding an index (numeric or character ID) column to a large data frame by the position of groups of values in blocks?

Using data.table you can do this:

library(data.table)

df <- as.data.table(data.frame(seqnames, start, end, strand, block, cont_ID))

# One data.table per cont_ID
sp <- split(df, f = cont_ID)

grouping <- function(x) {
  # Previous block number within this cont_ID (NA for the first row)
  x[, block_lag := shift(block, type = "lag")]
  x[is.na(block_lag), cumsum_block_lag := 1]
  # Start a new group whenever the block number is not consecutive
  x[!is.na(block_lag), cumsum_block_lag := 1 + cumsum(block != (block_lag + 1))]
  x[, index := paste0(LETTERS[cumsum_block_lag], "_", cont_ID)]
  # Drop the helper columns
  x[, cumsum_block_lag := NULL]
  x[, block_lag := NULL]
}

index_sp <- rbindlist(lapply(sp, grouping))
index_sp <- index_sp[order(block)]


#    seqnames start end strand block cont_ID index
# 1:       H7     0  10      *     1     001 A_001
# 2:       H7    11  20      *     2     001 A_001
# 3:       H7     0  10      *     3     004 A_004
# 4:       H7    11  20      *     4     004 A_004
# 5:       H7     0  10      *     5     003 A_003
# 6:       H7    21  30      *     6     001 B_001
# 7:       H7    31  40      *     7     001 B_001
# 8:       H7    11  20      *     8     003 B_003

Create an ID (row number) column

You could use cbind:

d <- data.frame(V1=c(23, 45, 56), V2=c(45, 45, 67))

## enter id here, you could also use 1:nrow(d) instead of rownames
id <- rownames(d)
d <- cbind(id=id, d)

## set colnames to OP's wishes
colnames(d) <- paste0("V", 1:ncol(d))

EDIT: Here is a comparison with @dickoa's suggestion. d$id <- seq_len(nrow(d)) is slightly faster, but the order of the columns is different (id is the last column; reordering the columns seems to be slower than using cbind):

library("microbenchmark")

set.seed(1)
d <- data.frame(V1=rnorm(1e6), V2=rnorm(1e6))

cbindSeqLen <- function(x) {
  return(cbind(id=seq_len(nrow(x)), x))
}

dickoa <- function(x) {
  x$id <- seq_len(nrow(x))
  return(x)
}

dickoaReorder <- function(x) {
  x$id <- seq_len(nrow(x))
  nc <- ncol(x)
  x <- x[, c(nc, 1:(nc-1))]
  return(x)
}

microbenchmark(cbindSeqLen(d), dickoa(d), dickoaReorder(d), times=100)

# Unit: milliseconds
#              expr      min       lq   median       uq      max neval
#    cbindSeqLen(d) 23.00683 38.54196 40.24093 42.60020 47.73816   100
#         dickoa(d) 10.70718 36.12495 37.58526 40.22163 72.92796   100
#  dickoaReorder(d) 19.25399 68.46162 72.45006 76.51468 88.99620   100

Spark DataFrame: How to add an index column (aka distributed data index)

With Scala you can use:

import org.apache.spark.sql.functions._ 

// monotonicallyIncreasingId was deprecated in Spark 2.0; use monotonically_increasing_id()
df.withColumn("id", monotonically_increasing_id())

You can refer to this example and the Scala docs.

With Pyspark you can use:

from pyspark.sql.functions import monotonically_increasing_id 

df_index = df.select("*").withColumn("id", monotonically_increasing_id())
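Note that the IDs produced this way are unique and increasing but not consecutive. The Spark documentation describes the current implementation as putting the partition ID in the upper 31 bits and the record number within each partition in the lower 33 bits. The following is a plain-Python sketch of that layout, for illustration only; `sketch_monotonic_id` is a hypothetical helper, not part of Spark.

```python
# Illustration only: mimics the documented bit layout of Spark's
# monotonically_increasing_id (partition ID in the upper 31 bits,
# per-partition record number in the lower 33 bits). This is not Spark code.

def sketch_monotonic_id(partition_id: int, row_in_partition: int) -> int:
    """Hypothetical helper: combine a partition ID and a row offset."""
    return (partition_id << 33) | row_in_partition


# Two partitions with three rows each.
ids = [sketch_monotonic_id(p, r) for p in range(2) for r in range(3)]
print(ids)
# [0, 1, 2, 8589934592, 8589934593, 8589934594]
```

The jump between the two partitions is why the column cannot be used as a gap-free row number. If consecutive numbers are required, commonly used alternatives are `row_number()` over a window or `rdd.zipWithIndex()`, at the cost of extra shuffling or a conversion to an RDD.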

Pandas (python): How to add column to dataframe for index?

How about this:

import pandas as pd

idx = pd.Index([171, 174, 173])
df = pd.DataFrame(index=idx, data=[1, 2, 3])
print(df)

It gives me:

     0
171  1
174  2
173  3

Is this what you are looking for?
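If the goal is instead to have the index values available as a regular column, here is a minimal sketch, assuming a recent pandas version. The column names `value` and `idx` are made up for illustration.

```python
import pandas as pd

# Same index values as in the answer above; "value" is an illustrative column name.
df = pd.DataFrame({"value": [1, 2, 3]}, index=[171, 174, 173])

# Option 1: copy the index into a new column and keep the index unchanged.
with_copy = df.assign(idx=df.index)

# Option 2: name the index, then move it into a column.
# reset_index() replaces the index with a default 0..n-1 range.
with_reset = df.rename_axis("idx").reset_index()

print(with_copy)
print(with_reset)
```

Option 1 keeps label-based lookups such as `df.loc[171]` working, while Option 2 is the usual choice when the old index should become ordinary data.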

Add a unique identifier to the same column value in R data frame

Using dplyr:

library(dplyr)

dplyr::group_by(df, sample_id) %>%
  dplyr::mutate(sample_id = paste(sample_id, LETTERS[seq_along(sample_id)], sep = "-"))

  index   val sample_id
  <int> <dbl> <chr>
1     1    14 5-A
2     2    22 6-A
3     3     1 6-B
4     4    25 7-A
5     5     3 7-B
6     6    34 7-C

How to use R 3.3.2 to add index column to dataframe based on column value?

dplyr has a dedicated function for that, row_number:

df %>%
  group_by(cat) %>%
  mutate(rank = row_number())

