Adding a Repeated Index for Factors in a Data Frame

Adding a repeated index for factors in a data frame

One way is:

unlist(lapply(split(x, x), seq_along))

where x is your factor as a vector.

R> x <- factor(rep(letters[1:3], times = c(5,5,4))) ## your data
R> data.frame(factor = x, index = unlist(lapply(split(x, x), seq_along),
+ use.names = FALSE))
   factor index
1       a     1
2       a     2
3       a     3
4       a     4
5       a     5
6       b     1
7       b     2
8       b     3
9       b     4
10      b     5
11      c     1
12      c     2
13      c     3
14      c     4

Another way, on a similar theme, is to use table() and seq_len():

unlist(sapply(table(x), seq_len), use.names = FALSE)

And another way is to use the run-length encoding via rle():

R> rle(as.character(x))$lengths
[1] 5 5 4

which we can plug into the sapply() code instead of the table() call:

R> unlist(sapply(rle(as.character(x))$lengths, seq_len), use.names = FALSE)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4
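
One caveat: the split() and table() variants return the indices in group order, and rle() counts runs rather than groups, so all three assume the rows are already sorted by the factor, as they are in the example above. A minimal sketch for unsorted data, using ave() instead (the vector x_unsorted is made up for illustration and is not part of the original answer):

```r
# Hypothetical, deliberately unsorted factor (not from the original answer)
x_unsorted <- factor(c("b", "a", "b", "a", "a"))

# ave() applies seq_along() within each level and returns the results
# in the original row order
idx <- ave(seq_along(x_unsorted), x_unsorted, FUN = seq_along)

print(data.frame(factor = x_unsorted, index = idx))
```

Here each row gets its running position within its own level, whatever order the rows arrive in.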

Indexing duplicated cases of a data frame in R

We can use ave to create a sequence column using 'id' and 'date' as grouping variables.

 df1$datnno <- with(df1, ave(seq_along(id), id, date, FUN=seq_along))
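
As a rough illustration of what that line produces, here is a sketch on made-up data. Only the column names id and date come from the answer; the values are assumptions chosen so that some (id, date) pairs repeat:

```r
# Hypothetical data; only the column names id and date come from the answer
df1 <- data.frame(
  id   = c(1, 1, 1, 2, 2),
  date = c("2020-01-01", "2020-01-01", "2020-01-02", "2020-01-01", "2020-01-01")
)

# Number the rows within each (id, date) combination
df1$datnno <- with(df1, ave(seq_along(id), id, date, FUN = seq_along))

print(df1)
```

Rows sharing the same id and date are numbered 1, 2, and so on, and the counter restarts whenever either grouping variable changes.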

How to add row index to a data frame, based on combination of factors

This is probably going to look like cheating since I am passing a vector into a function which I then totally ignore except to get its length:

 df$Index <- ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=function(x) 1:length(x) )

The ave function returns a vector of the same length as its first argument, computed within the categories defined by all of the factors between the first argument and the argument named FUN. (I often forget to put the "FUN=" in for my function and get a cryptic error message along the lines of "unique() applies only to vectors": ave treats the unnamed function as one more grouping variable, tries to determine how many unique values an anonymous function possesses, and fails.)

There's actually an even more compact way of expressing function(x) 1:length(x): the seq_along function. It is probably safer too, since it behaves properly when passed a vector of length zero, returning integer(0), whereas the anonymous function form would return 1:0, that is c(1, 0):

ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=seq_along )
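
The zero-length difference is easy to check directly. A minimal sketch (the variable names empty, bad, and good are just for illustration):

```r
empty <- integer(0)

# 1:length(empty) is 1:0, which counts down and has two elements
bad <- 1:length(empty)

# seq_along(empty) has zero elements, matching its input
good <- seq_along(empty)

print(length(bad))   # 2
print(length(good))  # 0
```

This matters with ave() because combining several grouping factors can produce combinations that contain no rows at all.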

Adding counts of a factor to a dataframe

Using jmsigner's data, you could do:

# An integer first argument keeps the counts integer, whatever type dt$school is
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)
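
A small sketch of this counting idiom on made-up data (dt here is a hypothetical stand-in for jmsigner's data, which is not reproduced on this page). It passes seq_along(dt$school) as the first argument to ave(), a choice made here so that the counts are integers regardless of the column's type; with a character column as the first argument, ave() writes the counts back into that character vector and they come back as strings:

```r
# Hypothetical stand-in for the data in the question
dt <- data.frame(school = c("A", "B", "A", "A", "C"), stringsAsFactors = FALSE)

# An integer first argument keeps the counts integer
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)

# With a character first argument, the counts are coerced to character
count_chr <- ave(dt$school, dt$school, FUN = length)

print(dt)
print(is.character(count_chr))
```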

Insert multiple rows into a data frame based on index position

Here is one way, using pd.factorize on the first index level to capture that level's order of appearance once you concat both dataframes.

import numpy as np
import pandas as pd

# df1 and df2 are the two MultiIndex frames from the question (not shown here)
np.random.seed(1)
df3 = pd.concat([df1, df2])

df3 = (
    df3.set_index(  # add two index levels for sorting
        [list(range(len(df3))),  # to keep the current order of rows
         pd.factorize(df3.index.get_level_values('first'))[0]],  # to keep the order of the first index level
        append=True)  # to not replace the original index
    .sort_index(level=[-1, -2])  # sort as wanted
    .droplevel([-2, -1])  # delete the extra index levels
)
print(df3)
                     0
first second
bar   one     1.624345
      two    -0.611756
      one     0.319039
      two    -0.249370
      three   1.462108
baz   one    -0.528172
      two    -1.072969
foo   one     0.865408
      two    -2.301539
qux   one     1.744812
      two    -0.761207
      one    -2.060141
      two    -0.322417
      three  -0.384054
      four    1.133769

Note that you could do the same by adding the two sorting levels as columns and using sort_values.

Add an index (numeric ID) column to large data frame

You can add a sequence of numbers very easily with

data$ID <- seq.int(nrow(data))

If you are already using library(tidyverse), you can use

data <- tibble::rowid_to_column(data, "ID")
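
For the base R approach, seq_len() is a common alternative to seq.int() for building the ID column, since it yields an empty sequence when the data frame has no rows. A minimal sketch, with a made-up data frame standing in for the real one:

```r
# Hypothetical data frame for illustration
data <- data.frame(value = c(10, 20, 30))

# seq_len(n) yields 1..n, and an empty sequence when n is 0
data$ID <- seq_len(nrow(data))

# The same call on a zero-row frame gives an empty ID column
empty <- data[0, , drop = FALSE]
empty$ID <- seq_len(nrow(empty))

print(data)
print(nrow(empty))  # 0
```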

Repeat dataframe rows based on cumsum index

This is closer to what I was looking for:

df %>%
  mutate(str_split_content = str_split(content, " ")) %>%
  unnest(str_split_content)

Someone posted this a while ago, then revised or removed it.

The original str_split() call actually split on punctuation, so it was not purely splitting by number of words.