Shuffling/Permutating a Dataframe in Pandas

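In [15]: import numpy as np; import pandas as pd

In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         # permute each column (axis=0) or each row (axis=1) independently
    ...:         df = df.apply(np.random.permutation, axis=axis, result_type='broadcast')
    ...:     return df
    ...: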

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: df = shuffle(df)  # shuffle() works on a copy, so keep the returned frame

In [19]: df
Out[19]:
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9

Shuffle DataFrame rows

The idiomatic way to do this in pandas is to use the DataFrame's .sample method to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
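As a small illustration of the point above, here is a minimal sketch on a made-up five-row frame (the column names and the seed are only for the example). Passing random_state makes the shuffle repeatable, which is handy when you need the same order on every run:

```python
import pandas as pd

df = pd.DataFrame({"A": range(5), "B": list("abcde")})

# random_state fixes the seed, so the same order is produced on every run
shuffled = df.sample(frac=1, random_state=0)

# every original row is still present exactly once; only the order differs
assert sorted(shuffled.index) == list(df.index)

# rows stay intact: index label 3 still carries the values of the original row 3
assert shuffled.loc[3, "B"] == "d"
```

Note that the original index labels travel with their rows, which is why the next note resets the index.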


Note:
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
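To make the effect of drop=True concrete, here is a short sketch on a toy frame (the data and seed are illustrative assumptions, not from the question):

```python
import pandas as pd

df = pd.DataFrame({"A": [10, 20, 30]})
shuffled = df.sample(frac=1, random_state=1)

kept = shuffled.reset_index()               # old labels are kept in a new "index" column
dropped = shuffled.reset_index(drop=True)   # old labels are discarded

print(list(kept.columns))     # ['index', 'A']
print(list(dropped.columns))  # ['A']
print(list(dropped.index))    # [0, 1, 2]
```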

Follow-up note: The operation above is not in-place in the strict sense. .sample returns a new DataFrame, and the assignment rebinds the name df to it (so id(df_old) is not the same as id(df_new)). The previous object is released once nothing else references it, so the net memory footprint after the statement is essentially unchanged, although both objects exist briefly while the right-hand side is evaluated. A line-by-line memory profile shows the net effect:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
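One way to inspect what happens to the original frame is to keep a second name bound to it. This is only a sketch on a tiny made-up frame (df_old and df_new are illustrative names), not a memory measurement:

```python
import numpy as np
import pandas as pd

df_old = pd.DataFrame(np.arange(12).reshape(4, 3), columns=list("abc"))
df_new = df_old.sample(frac=1, random_state=0).reset_index(drop=True)

print(df_new is df_old)        # False: a different Python object
# the frame still referenced by df_old keeps its rows in the original order
print(df_old["a"].tolist())    # [0, 3, 6, 9]
```

As long as df_old is still referenced, both frames stay alive; once the old name is rebound or deleted, the old data can be freed.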

Randomizing/Shuffling rows in a dataframe in pandas

Edit: I misunderstood the question, which was only about shuffling rows and not the whole table (right?)

I don't think using DataFrames makes much sense here, because the column names become meaningless. So you can just use 2D NumPy arrays:

In [1]: A
Out[1]:
array([[11, 'Blue', 'Mon'],
       [8, 'Red', 'Tues'],
       [10, 'Green', 'Wed'],
       [15, 'Yellow', 'Thurs'],
       [11, 'Black', 'Fri']], dtype=object)

In [2]: _ = [np.random.shuffle(i) for i in A] # shuffle in-place, so return None

In [3]: A
Out[3]:
array([['Mon', 11, 'Blue'],
       [8, 'Tues', 'Red'],
       ['Wed', 10, 'Green'],
       ['Thurs', 15, 'Yellow'],
       [11, 'Black', 'Fri']], dtype=object)

And if you want to keep a DataFrame:

In [4]: pd.DataFrame(A, columns=data.columns)
Out[4]:
  Number  color     day
0    Mon     11    Blue
1      8   Tues     Red
2    Wed     10   Green
3  Thurs     15  Yellow
4     11  Black     Fri
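If a list comprehension over the rows feels clumsy, a possible alternative is NumPy's Generator.permuted, which shuffles each slice along an axis independently. This is a sketch assuming NumPy 1.20 or newer and, to keep it simple, a purely numeric array rather than the mixed one above:

```python
import numpy as np

rng = np.random.default_rng(0)       # seeded only so the sketch is repeatable
A = np.arange(12).reshape(3, 4)      # rows are [0..3], [4..7], [8..11]

# axis=1 shuffles the entries of every row independently; A itself is not modified
B = rng.permuted(A, axis=1)

# each row of B holds the same values as the matching row of A
assert (np.sort(B, axis=1) == A).all()
```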

Here is a function that shuffles all values across both rows and columns:

import numpy as np
import pandas as pd

def shuffle(df):
    col = df.columns
    val = df.values
    shape = val.shape
    val_flat = val.flatten()  # flatten() returns a copy, so df itself is untouched
    np.random.shuffle(val_flat)
    return pd.DataFrame(val_flat.reshape(shape), columns=col)

In [2]: data
Out[2]:
   Number   color    day
0      11    Blue    Mon
1       8     Red   Tues
2      10   Green    Wed
3      15  Yellow  Thurs
4      11   Black    Fri

In [3]: shuffle(data)
Out[3]:
  Number  color     day
0    Fri    Wed  Yellow
1  Thurs  Black     Red
2  Green   Blue      11
3     11      8      10
4    Mon   Tues      15

Hope this helps

Shuffling one Column of a DataFrame By Group Efficiently

We can use sample. Note that this assumes df is already sorted by group, i.e. df = df.sort_values('group'):

df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values

Or we can shuffle all rows, sort them back into group order, and take the label values:

df['New']=df.sample(len(df)).sort_values('group').label.values
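If the frame is not sorted by group, one option is groupby(...).transform, which aligns the result back to the original index. A minimal sketch, assuming the same group and label column names as above and some made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],   # deliberately not sorted by group
    "label": [1, 2, 3, 4, 5],
})

# each group's labels are permuted among the rows of that group only
df["New"] = df.groupby("group")["label"].transform(np.random.permutation)

# within every group, New holds exactly the same values as label
for _, g in df.groupby("group"):
    assert sorted(g["New"]) == sorted(g["label"])
```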

Shuffle one column in pandas dataframe

The immediate error is a symptom of using an inadvisable approach when working with dataframes.

np.random.shuffle works in-place and returns None, so assigning to the output of np.random.shuffle will not work. In fact, in-place operations are rarely required, and often yield no material benefits.

Here, for example, you can use np.random.permutation and use NumPy arrays via pd.Series.values rather than series:

if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
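The difference between the two NumPy calls can be seen on a toy frame. This is a sketch with made-up column names and data; the seeded Generator is used only to make it repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"name": ["ann", "bob", "cy", "dee"], "id": [1, 2, 3, 4]})

values = df["name"].to_numpy().copy()
result = np.random.shuffle(values)    # shuffles `values` in place
print(result)                         # None, so there is nothing useful to assign

rng = np.random.default_rng(42)
# permutation returns a shuffled copy and leaves the source column alone
df["name_shuffled"] = rng.permutation(df["name"].to_numpy())
```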

vectorized shuffling per row in pandas

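Let us try np.random.rand with argsort to generate shuffled column indices for each row:

# argsort of random draws gives an independent permutation of positions per row
i = np.random.rand(*df.shape).argsort(1)
# assign through df[:] rather than df.values, which may be a copy or read-only
df[:] = np.take_along_axis(df.to_numpy(), i, axis=1)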


print(df)

   foo  bar  baz
0    3    1    2
1    4    5    6
2    7    9    8
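To see why the argsort trick works, here is the same idea on a plain NumPy array (a sketch with made-up values): sorting a row of independent uniform draws yields a random permutation of the column positions for that row.

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# one row of random numbers per row of `a`; argsort turns each into a permutation
idx = rng.random(a.shape).argsort(axis=1)
shuffled = np.take_along_axis(a, idx, axis=1)

# every row of idx is a permutation of 0..2, and every row keeps its own values
assert (np.sort(idx, axis=1) == np.arange(a.shape[1])).all()
assert (np.sort(shuffled, axis=1) == a).all()
```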

Shuffle dataframe values on condition that no element appears in its original position (derangement)

Use the following function to find a way to remap your column:

def derange(x):
    res = x
    while np.any(res == x):
        res = np.random.permutation(x)
    return res

Then just apply it to any column:

df['b'] = derange(df['b'])

The method is to generate permutations until one is good enough. The expected number of attempts is n!/!n, where !n is the number of derangements of n items; this converges to e very quickly, and (n/(n-1))^n is a rough estimate of it.

Note that for n=1 the expectation is infinite (the loop never terminates), which makes sense as you cannot derange such a list.
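A quick numerical check of the figures above: the sketch below compares the estimate (n/(n-1))^n with the exact expected number of attempts, n!/!n, where !n is the number of derangements. The helper name derangements is just for illustration.

```python
import math

def derangements(n):
    """Number of permutations of n items with no fixed point (the subfactorial !n)."""
    if n == 0:
        return 1
    a, b = 1, 0                      # !0 = 1, !1 = 0
    for k in range(2, n + 1):
        a, b = b, (k - 1) * (a + b)  # !k = (k - 1) * (!(k-1) + !(k-2))
    return b

for n in (2, 3, 5, 10):
    exact = math.factorial(n) / derangements(n)
    estimate = (n / (n - 1)) ** n
    print(f"n={n:2d}  exact={exact:.5f}  estimate={estimate:.5f}  e={math.e:.5f}")
```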

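A derangement can also be produced without retries, in a single pass. This loop is Sattolo's algorithm, which always yields a cyclic permutation (so it never leaves an element in place, but it does not sample all derangements). Here it is, for completeness:

import random

def derange2(x):
    n = len(x)
    for i in range(n - 1):
        j = random.randrange(i + 1, n)  # always swap with a strictly later position
        x[i], x[j] = x[j], x[i]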

This function actually transforms the list in-place.
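If you would rather not touch the input, the same loop (Sattolo's algorithm) can work on a copy. A small sketch, assuming the elements are distinct; the name sattolo is only for this example:

```python
import random

def sattolo(items):
    """Return a shuffled copy in which no element stays at its original position."""
    out = list(items)
    n = len(out)
    for i in range(n - 1):
        j = random.randrange(i + 1, n)   # j is always strictly greater than i
        out[i], out[j] = out[j], out[i]
    return out

original = list(range(8))
for _ in range(1000):
    result = sattolo(original)
    # no position keeps its original value, and nothing is lost or duplicated
    assert all(r != o for r, o in zip(result, original))
    assert sorted(result) == original
```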

You can also have a version that modifies pandas columns in-place:

def derange3(df, col):
    # col is the integer position of the column, as required by .iat
    n = df.shape[0]
    for i in range(n - 1):
        j = random.randrange(i + 1, n)
        df.iat[i, col], df.iat[j, col] = df.iat[j, col], df.iat[i, col]

How to shuffle a pandas dataframe randomly by row

You can use the sample method with axis=1.
Note that this reorders whole columns, so every row gets the same permutation:

df = df.sample(frac=1, axis=1).reset_index(drop=True)

However, your desired DataFrame looks more thoroughly randomised, which you can approximate by shuffling the column order and then the row order:

df = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)

Edit:

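import numpy as np
# result_type='broadcast' keeps the original shape, index and column labels
df = df.apply(np.random.permutation, axis=1, result_type='broadcast')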
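To see the difference between reordering whole columns and shuffling inside each row, here is a short sketch on a made-up 3x3 frame (the data and column names are assumptions for the example):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"foo": [1, 4, 7], "bar": [2, 5, 8], "baz": [3, 6, 9]})

# sample(axis=1) moves whole columns: labels and their values travel together
by_column = df.sample(frac=1, axis=1, random_state=0)
assert sorted(by_column.columns) == sorted(df.columns)
assert all(by_column[c].equals(df[c]) for c in df.columns)

# a per-row permutation shuffles the values of each row independently
per_row = df.apply(np.random.permutation, axis=1, result_type="broadcast")
assert per_row.shape == df.shape
assert (np.sort(per_row.to_numpy(), axis=1) == df.to_numpy()).all()
```

In the per-row version the column labels no longer describe the values under them, so it only makes sense when the columns are interchangeable.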

