Shuffling/Permutating a Dataframe in Pandas

shuffling/permutating a DataFrame in pandas

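In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         # np.random.shuffle works in place and returns None, so use permutation,
    ...:         # which returns a shuffled copy of each column (axis=0) or row (axis=1)
    ...:         df = df.apply(np.random.permutation, axis=axis, result_type='broadcast')
    ...:     return df
    ...:

In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})

In [18]: shuffled = shuffle(df)

In [19]: shuffled
Out[19]: 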
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
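
As an alternative that avoids apply altogether, here is a minimal sketch built on NumPy's Generator.permuted, which shuffles every column (axis=0) or every row (axis=1) independently and returns a new array. The name shuffle_independent and its seed parameter are hypothetical, and permuted assumes NumPy 1.20 or later:

```python
import numpy as np
import pandas as pd

def shuffle_independent(df, axis=0, seed=None):
    """Return a copy of df with each column (axis=0) or row (axis=1) shuffled independently."""
    rng = np.random.default_rng(seed)
    # permuted returns a new array; the original frame is left untouched
    values = rng.permuted(df.to_numpy(), axis=axis)
    return pd.DataFrame(values, index=df.index, columns=df.columns)

df = pd.DataFrame({'A': range(10), 'B': range(10)})
shuffled = shuffle_independent(df, seed=0)  # seed 0 is arbitrary
```

Because the columns are shuffled separately, values that shared a row in df generally no longer do so in shuffled.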

Shuffle DataFrame rows

The idiomatic way to do this with pandas is to use the .sample method of your DataFrame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).

Note:
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
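
A small sketch of how this behaves, assuming a toy frame; the random_state value is arbitrary and only there to make the shuffle reproducible:

```python
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': list('abcde')})

# Rows are shuffled as whole units: each A value keeps its original B value
shuffled = df.sample(frac=1, random_state=42)

# The old index labels travel with the rows until reset_index discards them
shuffled_reset = shuffled.reset_index(drop=True)
```

Sorting shuffled by its index recovers the original frame, which is a quick way to confirm that only the row order changed.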

Follow-up note: The operation above is not in-place in the strict sense: df.sample returns a new object, so id(df_old) is not the same as id(df_new). Once the name df is rebound, however, the original frame is no longer referenced and its memory can be reclaimed, so memory usage after the line is essentially unchanged. A simple line-by-line memory profile illustrates this (it reports usage after each line has run, not any transient peak while it runs):

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)

Randomizing/Shuffling rows in a dataframe in pandas

Edit: I misunderstood the question, which was just to shuffle rows and not the whole table (right?)

I don't think using dataframes makes much sense here, because the column names become meaningless. So you can just use 2D NumPy arrays:

In [1]: A
Out[1]: 
array([[11, 'Blue', 'Mon'],
       [8, 'Red', 'Tues'],
       [10, 'Green', 'Wed'],
       [15, 'Yellow', 'Thurs'],
       [11, 'Black', 'Fri']], dtype=object)

In [2]: _ = [np.random.shuffle(i) for i in A] # shuffles each row in place and returns None

In [3]: A
Out[3]: 
array([['Mon', 11, 'Blue'],
       [8, 'Tues', 'Red'],
       ['Wed', 10, 'Green'],
       ['Thurs', 15, 'Yellow'],
       [11, 'Black', 'Fri']], dtype=object)

And if you want to keep the dataframe:

In [4]: pd.DataFrame(A, columns=data.columns)
Out[4]: 
  Number  color     day
0    Mon     11    Blue
1      8   Tues     Red
2    Wed     10   Green
3  Thurs     15  Yellow
4     11  Black     Fri

Here is a function that shuffles all values across both rows and columns:

import numpy as np
import pandas as pd

def shuffle(df):
    col = df.columns
    val = df.values
    shape = val.shape
    val_flat = val.flatten()
    np.random.shuffle(val_flat)
    return pd.DataFrame(val_flat.reshape(shape),columns=col)

In [2]: data
Out[2]: 
   Number   color    day
0      11    Blue    Mon
1       8     Red   Tues
2      10   Green    Wed
3      15  Yellow  Thurs
4      11   Black    Fri

In [3]: shuffle(data)
Out[3]: 
  Number  color     day
0    Fri    Wed  Yellow
1  Thurs  Black     Red
2  Green   Blue      11
3     11      8      10
4    Mon   Tues      15

Hope this helps

Shuffling one Column of a DataFrame By Group Efficiently

We can use sample. Note that this assumes the frame has already been sorted with df = df.sort_values('group'):

df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values

Or we can do it by shuffling the whole frame and then sorting by group:

df['New']=df.sample(len(df)).sort_values('group').label.values
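
If the frame is not sorted by group, a sketch along the following lines may be easier to reason about, because groupby(...).transform writes the shuffled values back in the original row order. The toy data is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group': ['a', 'b', 'a', 'b', 'a', 'b'],
    'label': [1, 10, 2, 20, 3, 30],
})

# Each group's labels are permuted among the rows of that group only;
# transform keeps the original index, so no prior sort is needed
df['New'] = df.groupby('group')['label'].transform(np.random.permutation)
```

Every value in New stays inside its own group, so group 'a' still holds 1, 2 and 3 in some order.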

Shuffle one column in pandas dataframe

The immediate error is a symptom of using an inadvisable approach when working with dataframes.

np.random.shuffle works in-place and returns None, so assigning to the output of np.random.shuffle will not work. In fact, in-place operations are rarely required, and often yield no material benefits.

Here, for example, you can use np.random.permutation and use NumPy arrays via pd.Series.values rather than series:

if devprod == 'prod':
    #do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
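
To illustrate why the array (rather than a Series) matters here, a small sketch with made-up column names: assigning a shuffled Series aligns on the index and silently restores the original order, while assigning a plain array is positional.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'name': ['Ann', 'Bob', 'Cid', 'Dee'], 'score': [1, 2, 3, 4]})

# A shuffled Series is re-aligned on its index, which undoes the shuffle
aligned = df.copy()
aligned['name'] = df['name'].sample(frac=1, random_state=0)

# A plain NumPy array has no index, so the shuffled order is kept
shuffled = df.copy()
shuffled['name'] = np.random.permutation(df['name'].values)
```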

vectorized shuffling per row in pandas

Let us try np.random.rand with argsort to generate shuffled column indices for each row:

i = np.random.rand(*df.shape).argsort(1)
# assign through df[:]; writing to df.values may only modify a temporary copy
df[:] = np.take_along_axis(df.to_numpy(), i, axis=1)

print(df)

   foo  bar  baz
0    3    1    2
1    4    5    6
2    7    9    8
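
A self-contained sketch of the same argsort trick that builds a new frame instead of overwriting the old one. The input frame is assumed from the output above, and the seed is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})

rng = np.random.default_rng(0)
# Sorting random keys row by row gives each row its own random
# permutation of the column positions
idx = rng.random(df.shape).argsort(axis=1)

shuffled = pd.DataFrame(
    np.take_along_axis(df.to_numpy(), idx, axis=1),
    index=df.index,
    columns=df.columns,
)
```

Each row of shuffled contains the same values as the matching row of df, only in a different order.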

Shuffle dataframe values on condition that no element appears in its original position (derangement)

Use the following function to find a way to remap your column:

def derange(x):
  res = x
  while np.any(res == x):
    res = np.random.permutation(x)
  return res

Then just apply it to any column:

df['b'] = derange(df['b'])

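The method is to generate permutations until one has no element left in its original position. For distinct values, the expected number of attempts is n!/D_n, where D_n is the number of derangements of n items; this is approximately (n/(n-1))^n and converges to e very quickly.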

Note that for n=1 the expectation is infinite (the loop never terminates), which makes sense as you cannot derange a single-element list.
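
As a quick check of how fast the expectation approaches e, here is a sketch that computes it exactly for distinct values: a uniform random permutation of n items is a derangement with probability D_n/n!, so the expected number of draws is n!/D_n. The function names are hypothetical.

```python
import math

def derangement_count(n):
    """Number of derangements D_n, via D_n = (n - 1) * (D_{n-1} + D_{n-2})."""
    if n == 0:
        return 1
    if n == 1:
        return 0
    prev2, prev1 = 1, 0  # D_0 and D_1
    for k in range(2, n + 1):
        prev2, prev1 = prev1, (k - 1) * (prev1 + prev2)
    return prev1

def expected_attempts(n):
    """Expected number of random permutations drawn until one is a derangement (n >= 2)."""
    return math.factorial(n) / derangement_count(n)

ratios = {n: expected_attempts(n) for n in (2, 3, 5, 10)}
```

For n=10 the result already agrees with e to about six decimal places.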

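A derangement can also be produced without retries, in a single pass over the data (this is essentially Sattolo's algorithm, which yields a random cyclic permutation), so here it is, for completeness: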

import random

def derange2(x):
  n = len(x)
  for i in range(n - 1):
    # swap only with a later position, so no element can stay where it was
    j = random.randrange(i + 1, n)
    x[i], x[j] = x[j], x[i]

This function actually transforms the list in-place.

You can also have a version that modifies pandas columns in-place:

def derange3(df, col):
  # col is the integer position of the column, as required by .iat
  n = df.shape[0]
  for i in range(n - 1):
    j = random.randrange(i + 1, n)
    df.iat[i, col], df.iat[j, col] = df.iat[j, col], df.iat[i, col]
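
For larger frames, one possible variation is to run the same swap pass on a list of row positions and then assign the column once, instead of doing many .iat writes. This is only a sketch: derange_column is a hypothetical name, col is a column label here, and the values are assumed to be distinct.

```python
import random
import pandas as pd

def derange_column(df, col):
    """Reorder df[col] in place so that no row keeps its original value."""
    n = len(df)
    order = list(range(n))
    # Swapping position i only with a later position builds a single n-cycle,
    # so order[i] != i for every i
    for i in range(n - 1):
        j = random.randrange(i + 1, n)
        order[i], order[j] = order[j], order[i]
    # One positional assignment replaces n separate .iat writes
    df[col] = df[col].to_numpy()[order]

df = pd.DataFrame({'a': range(6), 'b': list('uvwxyz')})
before = df['b'].tolist()
derange_column(df, 'b')
```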

How to shuffle a pandas dataframe randomly by row

You can achieve this by using the sample method with axis=1.
Note that this shuffles the order of the columns, so every row gets the same rearrangement:

df = df.sample(frac=1, axis=1).reset_index(drop=True)

However, your desired dataframe looks completely randomised. A first step towards that is to shuffle the column order and then the row order:

df = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
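
A short sketch of what sample does along axis=1, using a toy frame and an arbitrary random_state: the column labels move together with their values, so the result is the same table with its columns in a different order.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6]})

# Columns are reordered as whole units; values stay under their own labels
cols_shuffled = df.sample(frac=1, axis=1, random_state=1)

# Selecting the columns in the original order recovers the original frame
restored = cols_shuffled[['a', 'b', 'c']]
```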

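Edit: to shuffle the values within each row independently, apply np.random.permutation along axis=1 and broadcast the result back to the original shape:

import numpy as np
df = df.apply(np.random.permutation, axis=1, result_type='broadcast')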
