Adding a repeated index for factors in a data frame
One way is:
unlist(lapply(split(x, x), seq_along))
where x is your factor as a vector.
R> x <- factor(rep(letters[1:3], times = c(5,5,4))) ## your data
R> data.frame(factor = x, index = unlist(lapply(split(x, x), seq_along),
+ use.names = FALSE))
   factor index
1       a     1
2       a     2
3       a     3
4       a     4
5       a     5
6       b     1
7       b     2
8       b     3
9       b     4
10      b     5
11      c     1
12      c     2
13      c     3
14      c     4
Another way, on a similar theme, is to use table() and seq_len():
unlist(sapply(table(x), seq_len), use.names = FALSE)
And another way is to use the run-length encoding via rle():
R> rle(as.character(x))$lengths
[1] 5 5 4
which we can plug into the sapply() code instead of the table() call:
R> unlist(sapply(rle(as.character(x))$lengths, seq_len), use.names = FALSE)
[1] 1 2 3 4 5 1 2 3 4 5 1 2 3 4
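As a quick sanity check, the three approaches above can be compared on the same example factor. This is only a sketch: it assumes x is already grouped into contiguous runs in level order (as in the example), and it uses lapply() in place of sapply() so the result stays a list even when every group has the same size. The names idx_split, idx_table and idx_rle are illustrative.

```r
x <- factor(rep(letters[1:3], times = c(5, 5, 4)))  # same example data as above

# split() approach: one sequence per factor level
idx_split <- unlist(lapply(split(x, x), seq_along), use.names = FALSE)

# table() approach: one sequence per level count
idx_table <- unlist(lapply(table(x), seq_len), use.names = FALSE)

# rle() approach: one sequence per run of identical values
idx_rle <- unlist(lapply(rle(as.character(x))$lengths, seq_len), use.names = FALSE)

print(idx_split)
print(identical(idx_split, idx_table))
print(identical(idx_split, idx_rle))
```

If x is not sorted, the split() and table() variants return the indices in level order rather than row order, so they would no longer line up with the rows of the data frame.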
Indexing duplicated cases of a data frame in R
We can use ave to create a sequence column using 'id' and 'date' as grouping variables.
df1$datnno <- with(df1, ave(seq_along(id), id, date, FUN=seq_along))
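To see what that call produces, here is a small sketch on made-up data. The contents of df1 below are purely hypothetical (the original question's data is not shown here); only the column names id and date come from the answer.

```r
# Hypothetical data: repeated id/date combinations
df1 <- data.frame(
  id   = c(1, 1, 1, 2, 2, 3),
  date = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02",
                   "2020-01-05", "2020-01-05", "2020-01-05"))
)

# Number the rows within each id/date combination
df1$datnno <- with(df1, ave(seq_along(id), id, date, FUN = seq_along))

print(df1)
```

Rows that share both id and date are numbered 1, 2, ... in their original row order, and the counter restarts whenever either grouping variable changes.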
R - How to add a row index to a data frame, based on a combination of factors
This is probably going to look like cheating since I am passing a vector into a function which I then totally ignore except to get its length:
df$Index <- ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=function(x) 1:length(x) )
The ave function returns a vector of the same length as its first argument, but computed within categories defined by all of the factors between the first argument and the argument named FUN. (I often forget to put the "FUN=" in for my function and get a cryptic error message along the lines of "unique() applies only to vectors", since it was trying to determine how many unique values an anonymous function possesses, and it fails.)
There's actually another, even more compact way of expressing function(x) 1:length(x): the seq_along function, which is probably safer since it would fail properly if passed a vector of length zero, whereas the anonymous function form would fail improperly by returning 1:0 instead of integer(0):
ave( 1:nrow(df), df$Dim1, factor( df$Dim2), FUN=seq_along )
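The zero-length difference mentioned above can be checked directly. A minimal sketch; the helper name index_colon is made up for illustration and simply wraps the anonymous function from the answer.

```r
# The anonymous function from the answer, given a (hypothetical) name
index_colon <- function(x) 1:length(x)

empty <- integer(0)

colon_result <- index_colon(empty)  # 1:0 counts down, giving the two elements 1 and 0
along_result <- seq_along(empty)    # zero-length integer vector

print(colon_result)
print(length(along_result))
```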
Adding counts of a factor to a dataframe
Using jmsigner's data, you could do:
dt$count <- ave(dt$school, dt$school, FUN = length)
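One thing to watch for, as far as I understand ave(): the result is written back into a copy of its first argument, so if dt$school is a character or factor column the counts may not come back as plain numbers. A hedged variant is to pass a numeric vector as the first argument and keep school only as the grouping variable. The data below is hypothetical, not jmsigner's.

```r
# Hypothetical data standing in for the original example
dt <- data.frame(school = c("A", "A", "B", "A"), stringsAsFactors = FALSE)

# Use a numeric first argument so the counts stay numeric;
# school is used only to define the groups
dt$count <- ave(seq_along(dt$school), dt$school, FUN = length)

print(dt)
```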
insert multiple rows in to data frame based on index position
Here is one way, using pd.factorize on the first index level to get a kind of order for this level once you concat both dataframes.
np.random.seed(1)
df3 = pd.concat([df1, df2])
df3 = (
    df3.set_index(  # add two index levels for sorting
        [list(range(len(df3))),  # to keep the current order of rows
         pd.factorize(df3.index.get_level_values('first'))[0]],  # to get the order of the first index level
        append=True)  # to not replace the original index
    .sort_index(level=[-1, -2])  # sort as wanted
    .droplevel([-2, -1])  # delete the extra index levels
)
print(df3)
                     0
first second
bar   one     1.624345
      two    -0.611756
      one     0.319039
      two    -0.249370
      three   1.462108
baz   one    -0.528172
      two    -1.072969
foo   one     0.865408
      two    -2.301539
qux   one     1.744812
      two    -0.761207
      one    -2.060141
      two    -0.322417
      three  -0.384054
      four    1.133769
Note that you could do the same by adding the two levels for sorting as columns and using sort_values.
Add an index (numeric ID) column to large data frame
You can add a sequence of numbers very easily with
data$ID <- seq.int(nrow(data))
If you are already using library(tidyverse), you can use
data <- tibble::rowid_to_column(data, "ID")
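A small base-R sketch of the first approach on made-up data. It swaps in seq_len(), which is my own substitution rather than part of the answer: it gives the same IDs as seq.int(nrow(data)) for a non-empty data frame, but also returns a zero-length vector when there are no rows.

```r
# Hypothetical data frame
data <- data.frame(value = c("x", "y", "z"))

# Sequential ID column; seq_len(0) is empty, so this also works for zero rows
data$ID <- seq_len(nrow(data))

print(data)
```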
Repeat dataframe rows based on cumsum index
This is closer to what I was looking for:
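library(tidyverse)  # loads dplyr, tidyr and stringr
df %>%
  mutate(str_split_content = str_split(content, " ")) %>%  # list-column of words
  unnest(str_split_content)  # one row per word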
Someone posted this, then revised or removed it a while ago.
The original str_split of content was by punctuation, actually, so not exactly purely splitting by number of words.