Replicating Rows in a Pandas Data Frame by a Column Value

Python: How to replicate rows in Dataframe with column value but changing the column value to its range

Repeat each row as many times as its Table value with Index.repeat, then overwrite Table with a per-row counter from groupby().cumcount():

out = df.loc[df.index.repeat(df['Table'])]
out['Table'] = out.groupby(level=0).cumcount() + 1

Output:

   Store  Aisle  Table
0     11     59      1
0     11     59      2
1     11     61      1
1     11     61      2
1     11     61      3
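
For reference, here is a self-contained sketch of the approach above. The input frame is an assumption reconstructed from the output shown (two rows with Table values 2 and 3); the asker's actual data is not reproduced here.

```python
import pandas as pd

# Assumed input, reconstructed from the output above.
df = pd.DataFrame({"Store": [11, 11], "Aisle": [59, 61], "Table": [2, 3]})

# Repeat each index label as many times as the row's Table value.
out = df.loc[df.index.repeat(df["Table"])].copy()

# Number the copies of each original row 1..n; the groups are the repeated index labels.
out["Table"] = out.groupby(level=0).cumcount() + 1

print(out)
```

Grouping on level=0 relies on the repeated index labels still identifying the original rows, so this sketch assumes the original index is unique.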

Python - Replicate rows in Pandas Dataframe based on condition

import pandas as pd

First, create a boolean mask for your condition with the isin() method:

mask = df[columns].isin(values).any(axis=1)

Then use reindex() to repeat the matching rows rep_times times, and pd.concat() to add back the rows that do not satisfy the condition (DataFrame.append() was removed in pandas 2.0):

df = pd.concat([df.reindex(df[mask].index.repeat(rep_times)), df[~mask]])
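
As a rough, self-contained illustration of the two steps, the sketch below uses made-up data. The frame and the values given to columns, values and rep_times are hypothetical. The final sort_index call is an optional extra, not part of the answer above, that puts the repeated rows back next to their original positions.

```python
import pandas as pd

# Hypothetical data and parameters, chosen only for illustration.
df = pd.DataFrame({"fruit": ["apple", "pear", "kiwi"], "qty": [1, 2, 3]})
columns = ["fruit"]          # columns to test
values = ["apple", "kiwi"]   # rows containing any of these values are repeated
rep_times = 2                # number of copies for each matching row

# True for rows where any of the tested columns holds one of the values.
mask = df[columns].isin(values).any(axis=1)

# Repeat the matching rows, then add the non-matching rows back.
repeated = df.loc[df[mask].index.repeat(rep_times)]
result = pd.concat([repeated, df[~mask]])

# Optional: a stable sort on the index restores the original row order.
result = result.sort_index(kind="mergesort")

print(result)
```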

How to replicate rows based on value of a column in same pandas dataframe

Try reindex combined with Index.repeat:

out = df.reindex(df.index.repeat(df['count']))
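
A small sketch of this one-liner on a hypothetical frame (the item column and the count values are invented for illustration). It also shows a caveat worth knowing: reindex needs unique index labels on the source frame, so the second half assumes a frame with a duplicated label and checks that pandas raises.

```python
import pandas as pd

# Hypothetical frame; 'count' holds the number of copies wanted for each row.
df = pd.DataFrame({"item": ["a", "b", "c"], "count": [2, 0, 3]})

# Each index label is repeated 'count' times; a count of 0 drops the row.
out = df.reindex(df.index.repeat(df["count"]))
print(out)

# Same data with a duplicated index label: reindex refuses to work on it.
dup = df.copy()
dup.index = [0, 0, 1]
try:
    dup.reindex(dup.index.repeat(dup["count"]))
    raised = False
except ValueError:
    raised = True
print(raised)
```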

Replicating rows in a pandas data frame by a column value

You can use Index.repeat to get repeated index values based on the column then select from the DataFrame:

df2 = df.loc[df.index.repeat(df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

Or you could use np.repeat to get the repeated indices and then use that to index into the frame:

df2 = df.loc[np.repeat(df.index.values, df.n)]

  id  n   v
0  A  1  10
1  B  2  13
1  B  2  13
2  C  3   8
2  C  3   8
2  C  3   8

After which there's only a bit of cleaning up to do:

df2 = df2.drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

Note that if the index may contain duplicate labels, you can use .iloc instead:

df.iloc[np.repeat(np.arange(len(df)), df["n"])].drop("n", axis=1).reset_index(drop=True)

  id   v
0  A  10
1  B  13
2  B  13
3  C   8
4  C   8
5  C   8

which uses the positions, and not the index labels.
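
To see why positions are safer, here is a minimal sketch that reuses the example data but assumes a duplicated index label (0 appears twice). That index is an assumption made for the demonstration, not part of the original question.

```python
import numpy as np
import pandas as pd

# Example data from above, with an assumed duplicated index label.
df = pd.DataFrame({"id": ["A", "B", "C"], "n": [1, 2, 3], "v": [10, 13, 8]},
                  index=[0, 0, 1])

# Label-based: each requested label 0 matches two rows, so extra rows appear.
by_label = df.loc[df.index.repeat(df["n"])]

# Position-based: every row is repeated exactly n times.
by_position = df.iloc[np.repeat(np.arange(len(df)), df["n"])]

print(len(by_label), len(by_position))
```

With a unique index the two selections should agree, so .loc remains fine in the common case.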

Pandas - replicate rows with new column value from a list for each replication

Here is a way using the keys parameter of pd.concat():

(pd.concat([df] * len(New_Cost_List),
           keys=New_Cost_List,
           names=['New_Cost', None])
   .reset_index(level=0))

Output:

   New_Cost State  Cost
0         1     A     2
1         1     B     9
2         1     C     8
3         1     D     4
0         5     A     2
1         5     B     9
2         5     C     8
3         5     D     4
0        10     A     2
1        10     B     9
2        10     C     8
3        10     D     4
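
A runnable version of the snippet, for anyone who wants to try it. The contents of df and New_Cost_List are assumptions reconstructed from the output above.

```python
import pandas as pd

# Assumed inputs, reconstructed from the output above.
df = pd.DataFrame({"State": list("ABCD"), "Cost": [2, 9, 8, 4]})
New_Cost_List = [1, 5, 10]

# One copy of df per list element. keys labels each copy with its element,
# and names calls that outer index level 'New_Cost' so that
# reset_index(level=0) turns it into a regular column.
out = (pd.concat([df] * len(New_Cost_List),
                 keys=New_Cost_List,
                 names=["New_Cost", None])
         .reset_index(level=0))

print(out)
```

The original row labels (0 to 3) repeat once per copy; chaining .reset_index(drop=True) would give a fresh 0..n-1 index if that is preferred.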

Replicate row in Pandas dataframe based on condition and change values for a specific column

You can use pandas.Index.repeat to repeat each row Duration + 1 times, and then use pandas.core.groupby.GroupBy.cumcount to add an increasing counter to the Start_Year column.

Reading data

data = [[1500, 1501, ['A','B'], ['C','D'], 1],
        [1500, 1510, ['P','Q','R'], ['X','Y'], 10],
        [1520, 1520, ['A','X'], ['C'], 0],
        [1809, 1820, ['M'], ['F','H','Z'], 11]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

Repeating the values

mask = df['Duration'].gt(0)
df1 = df[mask].copy()
df1 = df1.loc[df1.index.repeat(df1['Duration'] + 1)]

Assigning increasing values to each group

df1['Start_Year'] += df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount()

Generating output

df1['Duration'] = df1['End_Year'] - df1['Start_Year']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)

This gives the expected output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0
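
A possibly simpler variant, offered as a sketch rather than as the answer's own method: it groups on the repeated index labels with groupby(level=0) instead of casting the list columns to strings. Because Duration + 1 is at least 1 for this data, it also skips the mask and the concat step. It assumes Duration is never negative and that the original index is unique.

```python
import pandas as pd

data = [[1500, 1501, ['A', 'B'], ['C', 'D'], 1],
        [1500, 1510, ['P', 'Q', 'R'], ['X', 'Y'], 10],
        [1520, 1520, ['A', 'X'], ['C'], 0],
        [1809, 1820, ['M'], ['F', 'H', 'Z'], 11]]
df = pd.DataFrame(data, columns=['Start_Year', 'End_Year', 'Opp1', 'Opp2', 'Duration'])

# Duration + 1 >= 1 for every row here, so no row is dropped by the repeat.
out = df.loc[df.index.repeat(df['Duration'] + 1)].copy()

# Counter 0, 1, 2, ... within each original row (grouped by repeated index label).
step = out.groupby(level=0).cumcount().to_numpy()

out['Start_Year'] = out['Start_Year'] + step
out['Duration'] = out['End_Year'] - out['Start_Year']
out = out.reset_index(drop=True)

print(out)
```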

Alternatively

You can also work the other way around: after repeating the rows, first assign Duration as a decreasing cumulative count, then recompute Start_Year from it.

df1['Duration'] = df1[['Start_Year', 'End_Year', 'Opp1', 'Opp2']].astype(str).groupby(['Start_Year', 'End_Year', 'Opp1', 'Opp2']).cumcount(ascending=False)
df1['Start_Year'] = df1['End_Year'] - df1['Duration']
df = pd.concat([df1, df[~mask]]).sort_index(kind = 'mergesort').reset_index(drop=True)

This gives the same output:

    Start_Year  End_Year       Opp1       Opp2  Duration
0         1500      1501     [A, B]     [C, D]         1
1         1501      1501     [A, B]     [C, D]         0
2         1500      1510  [P, Q, R]     [X, Y]        10
3         1501      1510  [P, Q, R]     [X, Y]         9
4         1502      1510  [P, Q, R]     [X, Y]         8
5         1503      1510  [P, Q, R]     [X, Y]         7
6         1504      1510  [P, Q, R]     [X, Y]         6
7         1505      1510  [P, Q, R]     [X, Y]         5
8         1506      1510  [P, Q, R]     [X, Y]         4
9         1507      1510  [P, Q, R]     [X, Y]         3
10        1508      1510  [P, Q, R]     [X, Y]         2
11        1509      1510  [P, Q, R]     [X, Y]         1
12        1510      1510  [P, Q, R]     [X, Y]         0
13        1520      1520     [A, X]        [C]         0
14        1809      1820        [M]  [F, H, Z]        11
15        1810      1820        [M]  [F, H, Z]        10
16        1811      1820        [M]  [F, H, Z]         9
17        1812      1820        [M]  [F, H, Z]         8
18        1813      1820        [M]  [F, H, Z]         7
19        1814      1820        [M]  [F, H, Z]         6
20        1815      1820        [M]  [F, H, Z]         5
21        1816      1820        [M]  [F, H, Z]         4
22        1817      1820        [M]  [F, H, Z]         3
23        1818      1820        [M]  [F, H, Z]         2
24        1819      1820        [M]  [F, H, Z]         1
25        1820      1820        [M]  [F, H, Z]         0

You can reset the index using pandas.DataFrame.reset_index.

Summary :

In short, we duplicated rows according to the value in the Duration column, subject to a condition.

We first set aside the rows with a Duration of 0, which would vanish if pandas.Index.repeat repeated each row exactly Duration times. For the rows with Duration > 0, we repeated each row and replaced the column values with increasing or decreasing cumulative counts from pandas.core.groupby.GroupBy.cumcount. Finally, we concatenated the two dataframes and sorted them by index with pandas.DataFrame.sort_index. Because pandas.Index.repeat repeats the index labels along with the rows, a stable sort on the index returns the rows in the same order as in the original dataframe.
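
The point about vanishing rows can be checked in isolation. The sketch below uses a hypothetical three-label index and made-up durations: a repeat count of 0 removes the label, while adding 1 to every count keeps each label at least once.

```python
import pandas as pd

idx = pd.Index([0, 1, 2])
durations = pd.Series([1, 0, 2])  # hypothetical Duration values

# A repeat count of 0 drops label 1 entirely.
dropped = idx.repeat(durations).tolist()

# With Duration + 1, every label survives at least once.
kept = idx.repeat(durations + 1).tolist()

print(dropped)
print(kept)
```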

How to replicate pandas DataFrame rows and change periodically one column

Add values to column col_d by DataFrame.assign with numpy.tile:

L = ['P','Q','R']
new_df = (pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
.assign(col_d = np.tile(L, len(df))))

print(new_df)
  col_a col_b col_c col_d
0    A1    B1    C1     P
1    A1    B1    C1     Q
2    A1    B1    C1     R
3    A2    B2    C2     P
4    A2    B2    C2     Q
5    A2    B2    C2     R
6    A3    B3    C3     P
7    A3    B3    C3     Q
8    A3    B3    C3     R

A similar idea is to repeat the index labels and select the duplicated rows with DataFrame.loc:

L = ['P','Q','R']
new_df = (df.loc[df.index.repeat(3)]
.assign(col_d = np.tile(L, len(df)))
.reset_index(drop=True))

print(new_df)
  col_a col_b col_c col_d
0    A1    B1    C1     P
1    A1    B1    C1     Q
2    A1    B1    C1     R
3    A2    B2    C2     P
4    A2    B2    C2     Q
5    A2    B2    C2     R
6    A3    B3    C3     P
7    A3    B3    C3     Q
8    A3    B3    C3     R
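
Both versions pair np.repeat (for the rows) with np.tile (for the new column), and the difference between the two is easy to mix up. A minimal sketch on plain arrays, with values picked only for illustration:

```python
import numpy as np

L = ['P', 'Q', 'R']

# np.repeat repeats each element in place: this is what duplicates the rows.
rows = np.repeat(['A1', 'A2'], len(L))

# np.tile repeats the whole sequence: this is what cycles the new column values.
labels = np.tile(L, 2)

print(rows)
print(labels)
```

Lined up side by side, each repeated row gets one full cycle of the list, which is why the tile count is len(df) and the repeat count is len(L).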

EDIT: To use a longer list and set col_c to 'T' wherever col_d is 'S', add Series.mask inside assign:

L = ['P','Q','R','S']
new_df = (pd.DataFrame(np.repeat(df.values, len(L), axis=0), columns=df.columns)
.assign(col_d = np.tile(L, len(df)),
col_c = lambda x: x['col_c'].mask(x['col_d'].eq('S'), 'T')))

print(new_df)
   col_a col_b col_c col_d
0     A1    B1    C1     P
1     A1    B1    C1     Q
2     A1    B1    C1     R
3     A1    B1     T     S
4     A2    B2    C2     P
5     A2    B2    C2     Q
6     A2    B2    C2     R
7     A2    B2     T     S
8     A3    B3    C3     P
9     A3    B3    C3     Q
10    A3    B3    C3     R
11    A3    B3     T     S

How can I replicate rows in Pandas?

Use np.repeat.

Version 1:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0))
newdf.columns = df.columns
print(newdf)

The above code will output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female

np.repeat repeats each row of df.values 3 times (axis=0 repeats along the rows).

Then we restore the column names by assigning newdf.columns = df.columns.

Version 2:

You could also assign the column names in the first line, like below:

newdf = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)
print(newdf)

The above code will also output:

  Person   ID ZipCode  Gender
0  12345  882   38182  Female
1  12345  882   38182  Female
2  12345  882   38182  Female
3  32917  271   88172    Male
4  32917  271   88172    Male
5  32917  271   88172    Male
6  18273  552   90291  Female
7  18273  552   90291  Female
8  18273  552   90291  Female
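
One thing to watch with both versions: going through df.values on a frame that mixes numbers and strings produces a single object array, so the numeric columns of the new frame no longer have a numeric dtype. The sketch below compares this with repeating the index, which keeps each column's dtype. The input frame is an assumption reconstructed from the output above, with the numeric-looking columns taken to be integers.

```python
import numpy as np
import pandas as pd

# Assumed input, reconstructed from the output above.
df = pd.DataFrame({"Person": [12345, 32917, 18273],
                   "ID": [882, 271, 552],
                   "ZipCode": [38182, 88172, 90291],
                   "Gender": ["Female", "Male", "Female"]})

# Via df.values: the mixed int/str frame becomes one object array.
via_values = pd.DataFrame(np.repeat(df.values, 3, axis=0), columns=df.columns)

# Via the index: each column keeps its original dtype.
via_index = df.loc[df.index.repeat(3)].reset_index(drop=True)

print(via_values.dtypes)
print(via_index.dtypes)
```

If the object columns are a problem, the index-based form avoids the extra conversion step afterwards.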

