How can I replicate rows in Pandas?
Use np.repeat.
Version 1:
Repeat the underlying values with np.repeat, then restore the column names:
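import numpy as np
import pandas as pd
# df is the original DataFrame; axis=0 repeats whole rows, 3 times each
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)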
The above code will output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
np.repeat repeats each row of df.values 3 times along axis 0.
We then restore the column names by assigning newdf.columns = df.columns.
Version 2:
You could also assign the column names in the first line, like below:
newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)
The above code will also output:
Person ID ZipCode Gender
0 12345 882 38182 Female
1 12345 882 38182 Female
2 12345 882 38182 Female
3 32917 271 88172 Male
4 32917 271 88172 Male
5 32917 271 88172 Male
6 18273 552 90291 Female
7 18273 552 90291 Female
8 18273 552 90291 Female
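One thing worth checking with both versions is the column dtypes. The sketch below rebuilds df from the output shown above (the construction itself is an assumption, since the original frame is not shown) and compares the np.repeat approach with repeating the index instead. Because df.values turns a mixed-type frame into a single object array, the first result should have object columns, while the index-based variant keeps the integer columns as integers.

```python
import numpy as np
import pandas as pd

# Rebuilt from the printed output above; the original construction is not shown.
df = pd.DataFrame({
    "Person": [12345, 32917, 18273],
    "ID": [882, 271, 552],
    "ZipCode": [38182, 88172, 90291],
    "Gender": ["Female", "Male", "Female"],
})

# Approach from the answer: repeat the raw values, then re-attach column names.
via_values = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)

# Alternative: repeat the index labels and select them, which keeps dtypes.
via_index = df.loc[df.index.repeat(3)].reset_index(drop=True)

print(via_values.dtypes)
print(via_index.dtypes)
```

If the numeric columns are needed for arithmetic afterwards, the index-based variant avoids a later conversion step.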
Repeat rows of a data.frame
df <- data.frame(a = 1:2, b = letters[1:2])
df[rep(seq_len(nrow(df)), each = 2), ]
Repeat rows of a data.frame N times
EDIT: updated to a better modern R answer.
You can use replicate(), then rbind the result back together. The row names are automatically altered to run from 1:nrows.
d <- data.frame(a = c(1,2,3),b = c(1,2,3))
n <- 3
do.call("rbind", replicate(n, d, simplify = FALSE))
A more traditional way is to use indexing, but here the rowname altering is not quite so neat (but more informative):
d[rep(seq_len(nrow(d)), n), ]
Here are improvements on the above, the first two using purrr functional programming. Idiomatic purrr:
purrr::map_dfr(seq_len(3), ~d)
and less idiomatic purrr (identical result, though more awkward):
purrr::map_dfr(seq_len(3), function(x) d)
and finally via indexing rather than list apply, using dplyr:
d %>% slice(rep(row_number(), 3))
Repeat Rows in Data Frame n Times
Use a combination of pd.DataFrame.loc and pd.Index.repeat:
test.loc[test.index.repeat(test.times)]
id times
0 a 2
0 a 2
1 b 3
1 b 3
1 b 3
2 c 1
3 d 5
3 d 5
3 d 5
3 d 5
3 d 5
To mimic your exact output, use reset_index:
test.loc[test.index.repeat(test.times)].reset_index(drop=True)
id times
0 a 2
1 a 2
2 b 3
3 b 3
4 b 3
5 c 1
6 d 5
7 d 5
8 d 5
9 d 5
10 d 5
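For a version that can be run end to end, here is a minimal sketch. The test frame is rebuilt from the output above, so its construction is an assumption; the loc and Index.repeat calls are the ones used in the answer.

```python
import pandas as pd

# Rebuilt from the printed output above; the original construction is not shown.
test = pd.DataFrame({"id": ["a", "b", "c", "d"], "times": [2, 3, 1, 5]})

# Repeat each index label `times` times, then select those labels with loc.
repeated = test.loc[test.index.repeat(test["times"])]
print(repeated.index.tolist())

# Replace the duplicated labels with a fresh 0..n-1 index.
flat = repeated.reset_index(drop=True)
print(flat)
```

A row whose count is 0 would be dropped entirely, since its label is repeated zero times.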
How to repeat rows until a certain number of rows is reached in R
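We may use rep with length.out to recycle the row indices of the shorter data frame up to the length of the longer one, and sample to shuffle them (so the output below varies between runs):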
if(nrow(df2) > nrow(df1)) {
i1 <- sample(rep(seq_len(nrow(df1)), length.out = nrow(df2)))
out <- cbind(df1[i1,], df2)
} else {
i1 <- sample(rep(seq_len(nrow(df2)), length.out = nrow(df1)))
out <- cbind(df1, df2[i1,])
}
row.names(out) <- NULL
-output
> out
A B C D
1 12 13 19 20
2 12 13 20 30
3 15 16 10 13
4 12 13 54 32
5 15 16 34 10
data
df1 <- structure(list(A = c(12L, 15L), B = c(13L, 16L)),
class = "data.frame", row.names = c("x",
"y"))
df2 <- structure(list(C = c(19L, 20L, 10L, 54L, 34L), D = c(20L, 30L,
13L, 32L, 10L)), class = "data.frame", row.names = c("z", "w",
"r", "k", "f"))
Repeat rows in pandas data frame with a sequential change in a column value
I took a different approach: pivoting and melting.
It seems to work. Does anybody see an issue?
import pandas as pd
data = {'year': ['2000', '2000', '2005', '2005', '2007', '2007', '2007', '2009'],
        'country': ['UK', 'US', 'FR', 'US', 'UK', 'FR', 'US', 'UK'],
        'sales': [10, 21, 20, 10, 12, 20, 10, 12],
        'rep': ['john', 'john', 'claire', 'claire', 'kyle', 'kyle', 'kyle', 'amy']}
df = pd.DataFrame(data)
year country sales rep
0 2000 UK 10 john
1 2000 US 21 john
2 2005 FR 20 claire
3 2005 US 10 claire
4 2007 UK 12 kyle
5 2007 FR 20 kyle
6 2007 US 10 kyle
7 2009 UK 12 amy
First, do a pivot:
dfp=pd.pivot_table(df,index=['country','rep'],values=['sales'],columns=['year']).fillna(0)
dfp=dfp.xs('sales', axis=1, drop_level=True)
year 2000 2005 2007 2009
country rep
FR claire 0.0 20.0 0.0 0.0
kyle 0.0 0.0 20.0 0.0
UK amy 0.0 0.0 0.0 12.0
john 10.0 0.0 0.0 0.0
kyle 0.0 0.0 12.0 0.0
US claire 0.0 10.0 0.0 0.0
john 21.0 0.0 0.0 0.0
kyle 0.0 0.0 10.0 0.0
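Then a little logic to copy each year's column into the missing years before the next existing year:
cols = dfp.columns.astype(int).values
dft = dfp.copy()
i = 0
for col in cols:
    # The last year has no following year, so there is nothing to fill after it.
    if col != cols[-1]:
        # cols[i + 1] is the next year present in the data.
        for newcol in range(col + 1, cols[i + 1]):
            dft[str(newcol)] = dft[str(col)]
    i += 1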
year 2000 2005 2007 2009 2001 2002 2003 2004 2006 2008
country rep
FR claire 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0 0.0
kyle 0.0 0.0 20.0 0.0 0.0 0.0 0.0 0.0 0.0 20.0
UK amy 0.0 0.0 0.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0
john 10.0 0.0 0.0 0.0 10.0 10.0 10.0 10.0 0.0 0.0
kyle 0.0 0.0 12.0 0.0 0.0 0.0 0.0 0.0 0.0 12.0
US claire 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0 0.0
john 21.0 0.0 0.0 0.0 21.0 21.0 21.0 21.0 0.0 0.0
kyle 0.0 0.0 10.0 0.0 0.0 0.0 0.0 0.0 0.0 10.0
Then I did a melt to get them back into the original format:
dfm=dft.reset_index()
dfm=dfm.melt(id_vars=['country','rep'],value_vars=dfm.columns.values[2:],var_name='Year',value_name='sales')
dfm=dfm.loc[dfm.sales>0].reset_index(drop=True)
country rep Year sales
0 UK john 2000 10.0
1 US john 2000 21.0
2 FR claire 2005 20.0
3 US claire 2005 10.0
4 FR kyle 2007 20.0
5 UK kyle 2007 12.0
6 US kyle 2007 10.0
7 UK amy 2009 12.0
8 UK john 2001 10.0
9 US john 2001 21.0
10 UK john 2002 10.0
11 US john 2002 21.0
12 UK john 2003 10.0
13 US john 2003 21.0
14 UK john 2004 10.0
15 US john 2004 21.0
16 FR claire 2006 20.0
17 US claire 2006 10.0
18 FR kyle 2008 20.0
19 UK kyle 2008 12.0
20 US kyle 2008 10.0
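For comparison, the same expansion can be sketched without pivoting by repeating rows directly. This is not the approach of the answer above but a swapped-in alternative built on Index.repeat. It assumes the years are stored as integers rather than strings, and that each row should be copied once for every year up to (but not including) the next year present in the data.

```python
import pandas as pd

# Same data as above, but with integer years (an assumption of this sketch).
df = pd.DataFrame({
    "year": [2000, 2000, 2005, 2005, 2007, 2007, 2007, 2009],
    "country": ["UK", "US", "FR", "US", "UK", "FR", "US", "UK"],
    "sales": [10, 21, 20, 10, 12, 20, 10, 12],
    "rep": ["john", "john", "claire", "claire", "kyle", "kyle", "kyle", "amy"],
})

# Map each year to the next year present; the last year maps to itself plus
# one so that its rows are kept exactly once.
years = sorted(df["year"].unique())
next_year = dict(zip(years, years[1:] + [years[-1] + 1]))

# Number of copies per row = gap to the next year present in the data.
reps = df["year"].map(next_year) - df["year"]
out = df.loc[df.index.repeat(reps)].copy()

# Within each block of copies, count 0, 1, 2, ... and add that to the year.
offset = out.groupby(level=0).cumcount().to_numpy()
out["year"] = out["year"].to_numpy() + offset
out = out.reset_index(drop=True)
print(out)
```

Unlike the pivot and melt route, this keeps rows whose sales value is 0, because nothing is filtered on sales > 0.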
How do you repeat each row for a dataframe for each value in a separate dataframe and then combine the two into a single dataframe?
You can assign a redundant key column to each DataFrame (without mutating the original DataFrames) and join on it, then drop it before returning the final result:
import pandas as pd
df1 = pd.DataFrame({
    'id': list(range(1, 5))
})
df2 = pd.DataFrame({
    'month': ['2010-01', '2010-02', '2010-03']
})
df_merged = pd.merge(
    df1.assign(key=1),
    df2.assign(key=1),
    on='key'
).drop('key', axis=1)
+----+----+---------+
| | id | month |
+----+----+---------+
| 0 | 1 | 2010-01 |
| 1 | 1 | 2010-02 |
| 2 | 1 | 2010-03 |
| 3 | 2 | 2010-01 |
| 4 | 2 | 2010-02 |
| 5 | 2 | 2010-03 |
| 6 | 3 | 2010-01 |
| 7 | 3 | 2010-02 |
| 8 | 3 | 2010-03 |
| 9 | 4 | 2010-01 |
| 10 | 4 | 2010-02 |
| 11 | 4 | 2010-03 |
+----+----+---------+
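On pandas 1.2 or later, merge also accepts how="cross", which should give the same Cartesian product without the temporary key column. The sketch below runs both variants side by side; the variable names key_based and cross are only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"id": list(range(1, 5))})
df2 = pd.DataFrame({"month": ["2010-01", "2010-02", "2010-03"]})

# Key-based approach from the answer above.
key_based = pd.merge(
    df1.assign(key=1),
    df2.assign(key=1),
    on="key",
).drop("key", axis=1)

# Cross join, available from pandas 1.2: every row of df1 with every row of df2.
cross = df1.merge(df2, how="cross")

print(cross)
```

Either way the result has len(df1) * len(df2) rows, so it can grow quickly for large inputs.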
Repeat rows in a pandas DataFrame based on column value
reindex + repeat:
df.reindex(df.index.repeat(df.persons))
Out[951]:
code . role ..1 persons
0 123 . Janitor . 3
0 123 . Janitor . 3
0 123 . Janitor . 3
1 123 . Analyst . 2
1 123 . Analyst . 2
2 321 . Vallet . 2
2 321 . Vallet . 2
3 321 . Auditor . 5
3 321 . Auditor . 5
3 321 . Auditor . 5
3 321 . Auditor . 5
3 321 . Auditor . 5
PS: you can add .reset_index(drop=True) to get a new index.
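A small runnable sketch of the same idea follows. The frame here is a simplified, hypothetical stand-in for the one above, keeping only the code, role and persons columns; it then checks that each role appears as many times as its persons value.

```python
import pandas as pd

# Simplified stand-in for the frame above (only three of its columns).
df = pd.DataFrame({
    "code": [123, 123, 321, 321],
    "role": ["Janitor", "Analyst", "Vallet", "Auditor"],
    "persons": [3, 2, 2, 5],
})

# Repeat each index label `persons` times, reindex to those labels,
# then replace the duplicated labels with a fresh index.
expanded = df.reindex(df.index.repeat(df["persons"])).reset_index(drop=True)

# Each role should now occur as often as its persons value.
counts = expanded["role"].value_counts().to_dict()
print(counts)
```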
Pandas data frame repeat each row a certain number of times
Create a dictionary giving the number of repeats for each Minutiae value, apply it with Series.map, repeat the index with Index.repeat, and finally use DataFrame.loc to repeat the rows:
print (df)
Minutiae LR
0 1 1.975476
1 2 1.082983
2 3 0.269608
3 4 0.878350
d = {1:2, 2:1, 3:5, 4:3}
df1 = df.loc[df.index.repeat(df['Minutiae'].map(d))]
print (df1)
Minutiae LR
0 1 1.975476
0 1 1.975476
1 2 1.082983
2 3 0.269608
2 3 0.269608
2 3 0.269608
2 3 0.269608
2 3 0.269608
3 4 0.878350
3 4 0.878350
3 4 0.878350
Detail:
print (df['Minutiae'].map(d))
0 2
1 1
2 5
3 3
Name: Minutiae, dtype: int64
print (df.index.repeat(df['Minutiae'].map(d)))
Int64Index([0, 0, 1, 2, 2, 2, 2, 2, 3, 3, 3], dtype='int64')
Or create a new column holding the repeat counts:
df['repeat'] = [2,1,5,3]
print (df)
Minutiae LR repeat
0 1 1.975476 2
1 2 1.082983 1
2 3 0.269608 5
3 4 0.878350 3
df2 = df.loc[df.index.repeat(df['repeat'])]
print (df2)
Minutiae LR repeat
0 1 1.975476 2
0 1 1.975476 2
1 2 1.082983 1
2 3 0.269608 5
2 3 0.269608 5
2 3 0.269608 5
2 3 0.269608 5
2 3 0.269608 5
3 4 0.878350 3
3 4 0.878350 3
3 4 0.878350 3
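One case the dictionary approach leaves open is a Minutiae value that has no entry in the dictionary: Series.map then yields NaN, which cannot be used as a repeat count. The sketch below shows one way this might be handled, assuming (purely for illustration) that unmapped values should be kept once. The incomplete dictionary and the default of 1 are both hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Minutiae": [1, 2, 3, 4],
    "LR": [1.975476, 1.082983, 0.269608, 0.878350],
})

# Hypothetical incomplete mapping: the value 4 is deliberately missing.
d = {1: 2, 2: 1, 3: 5}

# Unmapped values become NaN; fall back to one copy and restore integer dtype.
repeats = df["Minutiae"].map(d).fillna(1).astype(int)

df1 = df.loc[df.index.repeat(repeats)]
print(df1)
```

Using fillna(0) instead would drop the unmapped rows rather than keep them.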