Repeat Rows in Data Frame N Times

Use a combination of pd.DataFrame.loc and pd.Index.repeat

test.loc[test.index.repeat(test.times)]

id times
0 a 2
0 a 2
1 b 3
1 b 3
1 b 3
2 c 1
3 d 5
3 d 5
3 d 5
3 d 5
3 d 5

To mimic your exact output, use reset_index

test.loc[test.index.repeat(test.times)].reset_index(drop=True)

id times
0 a 2
1 a 2
2 b 3
3 b 3
4 b 3
5 c 1
6 d 5
7 d 5
8 d 5
9 d 5
10 d 5
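
For anyone who wants to run the two snippets above end to end, here is a minimal self-contained sketch. The construction of test is an assumption, reconstructed from the printed output; the asker's original frame is not shown here.

```python
import pandas as pd

# Assumed input, reconstructed from the output shown above.
test = pd.DataFrame({"id": ["a", "b", "c", "d"], "times": [2, 3, 1, 5]})

# Repeat each index label as many times as its `times` value,
# then select those labels with .loc to duplicate the rows.
repeated = test.loc[test.index.repeat(test["times"])]

# Optional: replace the duplicated labels with a fresh 0..n-1 index.
repeated_reset = repeated.reset_index(drop=True)

print(repeated_reset)
```

The key point is that Index.repeat accepts either a single integer or one count per row, so the same pattern covers both "repeat every row 3 times" and "repeat each row by its own count".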

replicate rows by n times in python

Another method could be:

df.assign(Times = df.Times.apply(lambda x: range(1, x + 1))).explode('Times')
Out[]:
String Times
0 a 1
0 a 2
1 b 1
1 b 2
1 b 3
2 c 1
2 c 2
2 c 3
2 c 4
2 c 5
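
A self-contained version of the explode approach might look like the sketch below. The input frame is an assumption inferred from the output above, and the final cast back to int is an optional extra step that is not part of the original answer.

```python
import pandas as pd

# Assumed input, inferred from the output shown above.
df = pd.DataFrame({"String": ["a", "b", "c"], "Times": [2, 3, 5]})

# Turn each count n into the list [1, 2, ..., n], then explode so that
# every list element gets its own row (the original index labels are kept).
out = df.assign(
    Times=df["Times"].apply(lambda n: list(range(1, n + 1)))
).explode("Times")

# explode leaves the column as object dtype; cast back if integers are needed.
out["Times"] = out["Times"].astype(int)

print(out)
```

Unlike the index.repeat approach, this one overwrites the Times column with a running counter, which is useful when each copy needs its own sequence number.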

How can I replicate rows of a Pandas DataFrame?

Solutions:

Use np.repeat:

Version 1:

Try using np.repeat:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)

The above code will output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

np.repeat repeats each row of df.values 3 times (axis=0 means along the rows).

Then we set the column names by assigning newdf.columns = df.columns.

Version 2:

You could also assign the column names in the first line, like below:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

Version 3:

You could shorten it with loc and only repeat the index, like below:

newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

I use reset_index(drop=True) to replace the duplicated index labels with a default monotonic index (0, 1, 2, 3, 4...).

Without np.repeat:

Version 4:

You could use the built-in pd.Index.repeat method (called here as df.index.repeat), like below:

newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

Remember to add reset_index(drop=True) so that the index runs 0, 1, 2, ... again.

Version 5:

Or by using concat with sort_index, like below:

newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female

Version 6:

You could also use loc with Python list multiplication and sorted, like below:

newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
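
The printed outputs look identical, but the versions are not fully interchangeable. The sketch below checks that the index-based versions (3 to 6) return the same frame, and shows that going through df.values (Versions 1 and 2) changes the column dtypes on a mixed-type frame. The variable names v2 to v6 are only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Person": [12345, 32917, 18273],
    "ID": [882, 271, 552],
    "ZipCode": [38182, 88172, 90291],
    "Gender": ["Female", "Male", "Female"],
})

v2 = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
v3 = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)
v4 = df.loc[df.index.repeat(3)].reset_index(drop=True)
v5 = pd.concat([df] * 3).sort_index().reset_index(drop=True)
v6 = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)

# The index-based versions give identical frames, dtypes included.
for other in (v3, v5, v6):
    pd.testing.assert_frame_equal(v4, other)

# df.values on a mixed-type frame is a single object array, so the
# numeric columns of v2 are no longer integer columns.
print(v2.dtypes)
print(v4.dtypes)
```

If the numeric dtypes matter downstream, the loc-based versions avoid an extra astype step.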

Timings:

Timing with the following code:

import timeit
import pandas as pd
import numpy as np

df = pd.DataFrame({'Person': {0: 12345, 1: 32917, 2: 18273}, 'ID': {0: 882, 1: 271, 2: 552}, 'ZipCode': {0: 38182, 1: 88172, 2: 90291}, 'Gender': {0: 'Female', 1: 'Male', 2: 'Female'}})

def version1():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
    newdf.columns = df.columns


def version2():
    newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)


def version3():
    newdf = df.loc[np.repeat(df.index, 3)].reset_index(drop=True)


def version4():
    newdf = df.loc[df.index.repeat(3)].reset_index(drop=True)


def version5():
    newdf = pd.concat([df] * 3).sort_index().reset_index(drop=True)


def version6():
    newdf = df.loc[sorted([*df.index] * 3)].reset_index(drop=True)

print('Version 1 Speed:', timeit.timeit('version1()', 'from __main__ import version1', number=20000))
print('Version 2 Speed:', timeit.timeit('version2()', 'from __main__ import version2', number=20000))
print('Version 3 Speed:', timeit.timeit('version3()', 'from __main__ import version3', number=20000))
print('Version 4 Speed:', timeit.timeit('version4()', 'from __main__ import version4', number=20000))
print('Version 5 Speed:', timeit.timeit('version5()', 'from __main__ import version5', number=20000))
print('Version 6 Speed:', timeit.timeit('version6()', 'from __main__ import version6', number=20000))

Output:

Version 1 Speed: 9.879425965991686
Version 2 Speed: 7.752138633004506
Version 3 Speed: 7.078321029010112
Version 4 Speed: 8.01169377300539
Version 5 Speed: 19.853051771002356
Version 6 Speed: 9.801617017001263

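We can see that Versions 2 and 3 are faster than the others in this run. Both use the np.repeat function, and NumPy functions are fast because they are implemented in C. Version 1 also uses np.repeat but pays for a separate column assignment, and Version 4 is close behind, so the gaps between these versions are small on a 3-row frame.

Version 3 is marginally faster than Version 2 because it indexes with loc instead of constructing a new DataFrame from the raw values.

Version 5 is significantly slower because it calls both concat and sort_index: concat has to copy the data of every frame in the list, and sort_index then reorders all of the resulting rows.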

Fastest Version: Version 3.
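
The numbers above come from a 3-row frame, where fixed per-call overhead dominates. To see how the ranking holds up on more data, a sketch like the following could be used. The helper make_df, the row count of 1000, and the repetition count of 50 are all assumptions chosen for illustration, and results will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_df(n_rows):
    # Hypothetical helper: a frame shaped like the example, with n_rows rows.
    rng = np.random.default_rng(0)
    return pd.DataFrame({
        "Person": rng.integers(10000, 99999, n_rows),
        "ID": rng.integers(100, 999, n_rows),
        "ZipCode": rng.integers(10000, 99999, n_rows),
        "Gender": rng.choice(["Female", "Male"], n_rows),
    })


df = make_df(1000)

candidates = {
    "Version 2": lambda: pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns),
    "Version 3": lambda: df.loc[np.repeat(df.index, 3)].reset_index(drop=True),
    "Version 4": lambda: df.loc[df.index.repeat(3)].reset_index(drop=True),
    "Version 5": lambda: pd.concat([df] * 3).sort_index().reset_index(drop=True),
}

# timeit.timeit accepts a zero-argument callable directly.
timings = {name: timeit.timeit(fn, number=50) for name, fn in candidates.items()}

for name, seconds in timings.items():
    print(f"{name}: {seconds:.4f} s for 50 runs")
```

No expected timings are given here on purpose; the point is only that the comparison is worth rerunning at a realistic size before settling on a version.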

Repeat rows of a data.frame

df <- data.frame(a = 1:2, b = letters[1:2]) 
df[rep(seq_len(nrow(df)), each = 2), ]

Repeat rows of a data.frame N times

EDIT: updated to a better modern R answer.

You can use replicate(), then rbind the result back together. The rownames are automatically altered to run from 1:nrows.

d <- data.frame(a = c(1,2,3),b = c(1,2,3))
n <- 3
do.call("rbind", replicate(n, d, simplify = FALSE))

A more traditional way is to use indexing, but here the rowname altering is not quite so neat (but more informative):

 d[rep(seq_len(nrow(d)), n), ]

Here are improvements on the above. The first two use purrr functional programming; first, idiomatic purrr:

purrr::map_dfr(seq_len(3), ~d)

and less idiomatic purrr (identical result, though more awkward):

purrr::map_dfr(seq_len(3), function(x) d)

and finally, via indexing rather than list apply, using dplyr (this needs library(dplyr) loaded for %>% and slice):

d %>% slice(rep(row_number(), 3))

Pandas: repeat dataframe n times

Use:

N = 3
df = pd.concat([df] * N, ignore_index=True)
print(df)
col
0 0
1 60
2 300
3 320
4 0
5 60
6 300
7 320
8 0
9 60
10 300
11 320
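
Note that this repeats the whole frame as a block, which is different from repeating each row in place. The sketch below contrasts the two; the input frame is an assumption inferred from the output above, and the names tiled and repeated are only illustrative.

```python
import pandas as pd

# Assumed input, inferred from the output shown above.
df = pd.DataFrame({"col": [0, 60, 300, 320]})
N = 3

# Whole-frame repetition: the rows cycle 0, 60, 300, 320, 0, 60, ...
tiled = pd.concat([df] * N, ignore_index=True)

# Row-wise repetition for comparison: 0, 0, 0, 60, 60, 60, ...
repeated = df.loc[df.index.repeat(N)].reset_index(drop=True)

print(tiled["col"].tolist())
print(repeated["col"].tolist())
```

Both results have the same rows and the same length; only the order differs, so the choice depends on whether the copies should stay adjacent.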

Dataframe groupby certain column and repeat the row n times

You can use GroupBy.apply per date, and pandas.concat:

N = 2
out = (df_input
       .groupby(['date'], group_keys=False)
       .apply(lambda d: pd.concat([d]*N))
      )

output:

    date type value
0 01/01 1 10
1 01/01 2 5
0 01/01 1 10
1 01/01 2 5
2 01/02 1 9
3 01/02 2 7
2 01/02 1 9
3 01/02 2 7

With "repeat" column:

N = 2
out = (df_input
       .groupby(['date'], group_keys=False)
       .apply(lambda d: pd.concat([d.assign(repeat=n+1) for n in range(N)]))
      )

output:

    date type value  repeat
0 01/01 1 10 1
1 01/01 2 5 1
0 01/01 1 10 2
1 01/01 2 5 2
2 01/02 1 9 1
3 01/02 2 7 1
2 01/02 1 9 2
3 01/02 2 7 2
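
As an alternative sketch, the same result can be built by iterating over the groups directly instead of calling apply. This is a swapped-in technique, not the method from the answer above; it may be worth considering because how GroupBy.apply treats the grouping column has changed across pandas versions. The input frame is an assumption inferred from the output above.

```python
import pandas as pd

# Assumed input, inferred from the output shown above.
df_input = pd.DataFrame({
    "date": ["01/01", "01/01", "01/02", "01/02"],
    "type": [1, 2, 1, 2],
    "value": [10, 5, 9, 7],
})
N = 2

# For each date group, emit N copies and tag each copy with its repeat number.
out = pd.concat(
    [group.assign(repeat=n + 1)
     for _, group in df_input.groupby("date")
     for n in range(N)]
)

print(out)
```

Iterating over a groupby yields (key, sub-frame) pairs with the grouping column still present, so the date column survives without any extra arguments.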

How do I repeat the last row of a data frame n times, while changing 1 or 2 variables?

You can repeat the last row n times by indexing with rep(nrow(df), n), and then add seq(n) to Age and Year so that each copy increases by 1 (here n = 3), i.e.

rbind(df, transform(df[rep(nrow(df), 3),], Age = Age + seq(3), Year = Year + seq(3)))

# Year Age x y
#1 2000 0 1 0.3
#2 2001 1 2 0.7
#3 2002 2 3 0.5
#31 2003 3 3 0.5
#3.1 2004 4 3 0.5
#3.2 2005 5 3 0.5

