Shuffling/permuting a DataFrame in pandas
In [16]: def shuffle(df, n=1, axis=0):
    ...:     df = df.copy()
    ...:     for _ in range(n):
    ...:         df.apply(np.random.shuffle, axis=axis)
    ...:     return df
    ...:
In [17]: df = pd.DataFrame({'A':range(10), 'B':range(10)})
In [18]: shuffle(df)
Out[18]:
   A  B
0  8  5
1  1  7
2  7  3
3  6  2
4  3  4
5  0  1
6  9  0
7  4  6
8  2  8
9  5  9
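Whether df.apply(np.random.shuffle, ...) actually changes the frame depends on apply handing the function a view of the underlying data, which may not hold in every pandas version. A minimal sketch of the same idea (permuting every column independently) that does not rely on in-place mutation is shown below. The name shuffle_columns and its seed parameter are hypothetical, and the sketch assumes unique column labels.

```python
import numpy as np
import pandas as pd

def shuffle_columns(df, seed=None):
    """Return a copy of df with every column permuted independently."""
    rng = np.random.default_rng(seed)
    # Build each new column from a permuted copy of its values; the index and
    # column labels are kept, only the values move. Assumes unique labels.
    return pd.DataFrame(
        {col: rng.permutation(df[col].to_numpy()) for col in df.columns},
        index=df.index,
    )

df = pd.DataFrame({"A": range(10), "B": range(10)})
shuffled = shuffle_columns(df, seed=0)
print(shuffled)
```

Because the function returns a new frame, the caller has to use the return value; df itself is left untouched.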
Shuffle DataFrame rows
The idiomatic way to do this with pandas is to use the .sample method of your data frame to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
Note: If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
Follow-up note: Although it may not look like the above operation is in-place, Python/pandas is smart enough not to do another malloc for the shuffled object. That is, even though the reference object has changed (by which I mean id(df_old) is not the same as id(df_new)), the underlying C object is still the same. To show that this is indeed the case, you could run a simple memory profiler:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
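As a small illustration of the row shuffle described in this answer, the sketch below uses a made-up five-row frame and passes random_state so that the order is reproducible. The data and the seed value 42 are assumptions chosen only for the example.

```python
import pandas as pd

df = pd.DataFrame({"A": range(5), "B": list("abcde")})

# frac=1 returns every row exactly once, in random order;
# random_state makes that order repeatable between runs.
shuffled = df.sample(frac=1, random_state=42)

# The original index labels travel with their rows until
# reset_index(drop=True) replaces them with 0..n-1.
renumbered = shuffled.reset_index(drop=True)

print(shuffled)
print(renumbered)
```

Each row stays intact (the value in B still belongs to the same A); only the order of the rows changes.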
Randomizing/Shuffling rows in a dataframe in pandas
Edit: I misunderstood the question, which was just to shuffle the rows and not the whole table (right?)
I think using dataframes does not make much sense here, because the column names become meaningless. So you can just use 2D NumPy arrays:
In [1]: A
Out[1]:
array([[11, 'Blue', 'Mon'],
       [8, 'Red', 'Tues'],
       [10, 'Green', 'Wed'],
       [15, 'Yellow', 'Thurs'],
       [11, 'Black', 'Fri']], dtype=object)
In [2]: _ = [np.random.shuffle(i) for i in A]  # shuffle works in-place and returns None
In [3]: A
Out[3]:
array([['Mon', 11, 'Blue'],
       [8, 'Tues', 'Red'],
       ['Wed', 10, 'Green'],
       ['Thurs', 15, 'Yellow'],
       [11, 'Black', 'Fri']], dtype=object)
And if you want to keep the dataframe:
In [4]: pd.DataFrame(A, columns=data.columns)
Out[4]:
  Number  color     day
0    Mon     11    Blue
1      8   Tues     Red
2    Wed     10   Green
3  Thurs     15  Yellow
4     11  Black     Fri
Here is a function to shuffle rows and columns:
import numpy as np
import pandas as pd
def shuffle(df):
    col = df.columns
    val = df.values
    shape = val.shape
    # flatten() returns a copy, so the original frame is not modified
    val_flat = val.flatten()
    np.random.shuffle(val_flat)
    return pd.DataFrame(val_flat.reshape(shape), columns=col)
In [2]: data
Out[2]:
   Number   color    day
0      11    Blue    Mon
1       8     Red   Tues
2      10   Green    Wed
3      15  Yellow  Thurs
4      11   Black    Fri
In [3]: shuffle(data)
Out[3]:
  Number  color     day
0    Fri    Wed  Yellow
1  Thurs  Black     Red
2  Green   Blue      11
3     11      8      10
4    Mon   Tues      15
Hope this helps
Shuffling one Column of a DataFrame By Group Efficiently
We can use sample.
Note that this assumes df = df.sort_values('group'):
df['New'] = df.groupby('group').label.apply(lambda x: x.sample(len(x))).values
Or we can do it with:
df['New'] = df.sample(len(df)).sort_values('group').label.values
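If sorting by group first is inconvenient, one possible alternative is groupby(...).transform, which aligns its result with the original index. The following is only a sketch under assumptions: the frame contents are invented, and the column names group, label and New simply follow the answer above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a", "c"],
    "label": [1, 2, 3, 4, 5, 6],
})

# transform returns values aligned to the original index, so the rows do
# not need to be sorted by group beforehand. Each group's labels are
# permuted among the rows of that group only.
df["New"] = df.groupby("group")["label"].transform(
    lambda s: rng.permutation(s.to_numpy())
)
print(df)
```

Within every group, New holds the same labels as label, just in a different order.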
Shuffle one column in pandas dataframe
The immediate error is a symptom of using an inadvisable approach when working with dataframes.
np.random.shuffle works in-place and returns None, so assigning to the output of np.random.shuffle will not work. In fact, in-place operations are rarely required, and often yield no material benefits.
Here, for example, you can use np.random.permutation and use NumPy arrays via pd.Series.values rather than series:
if devprod == 'prod':
    # do not shuffle data
    df1['HS_FIRST_NAME'] = df[4]
    df1['HS_LAST_NAME'] = df[6]
    df1['HS_SSN'] = df[8]
else:
    df1['HS_FIRST_NAME'] = np.random.permutation(df[4].values)
    df1['HS_LAST_NAME'] = np.random.permutation(df[6].values)
    df1['HS_SSN'] = np.random.permutation(df[8].values)
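The difference between the two NumPy functions mentioned in this answer can be seen on a tiny array. This is a minimal sketch with made-up values, not the asker's data.

```python
import numpy as np

values = np.array([10, 20, 30, 40])

# shuffle reorders its argument in place and returns None, so assigning
# its return value to a column would store None.
in_place = values.copy()
result = np.random.shuffle(in_place)
print(result)         # None
print(in_place)       # same elements, new order

# permutation leaves its argument alone and returns a shuffled copy,
# which is safe to assign to a DataFrame column.
shuffled_copy = np.random.permutation(values)
print(values)         # unchanged
print(shuffled_copy)  # same elements, new order
```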
Vectorized shuffling per row in pandas
Let us try with np.random.rand and argsort to generate shuffled indices:
i = np.random.rand(*df.shape).argsort(1)
# assign through the frame itself; df.values can be a copy or read-only
df[:] = np.take_along_axis(df.to_numpy(), i, axis=1)
print(df)
   foo  bar  baz
0    3    1    2
1    4    5    6
2    7    9    8
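For comparison, newer NumPy releases (1.20 and later) offer Generator.permuted, which shuffles along an axis independently for each row. The sketch below is an alternative to the argsort approach, not the answer's own method, and it assumes the input frame was foo=[1, 4, 7], bar=[2, 5, 8], baz=[3, 6, 9], which is a guess based on the printed output.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Assumed input frame (not shown in the answer above).
df = pd.DataFrame({"foo": [1, 4, 7], "bar": [2, 5, 8], "baz": [3, 6, 9]})

# permuted(axis=1) shuffles the values of each row independently and
# returns a new array; the original frame is not modified.
shuffled = pd.DataFrame(
    rng.permuted(df.to_numpy(), axis=1),
    index=df.index,
    columns=df.columns,
)
print(shuffled)
```

Every row of the result contains the same values as the corresponding input row, in an order chosen separately for that row.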
Shuffle dataframe values on condition that no element appears in its original position (derangement)
Use the following function to find a way to remap your column:
def derange(x):
    res = x
    while np.any(res == x):
        res = np.random.permutation(x)
    return res
Then just apply it to any column:
df['b'] = derange(df['b'])
The method is to generate permutations until one is good enough. For distinct values, the expected number of attempts is n!/!n, where !n is the number of derangements of n items. The estimate (n/(n-1))^n comes from treating the positions as independent; both tend to e, and the exact value does so very quickly.
Note that for n=1 the expectation is infinite, which makes sense as you cannot derange such a list.
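To make the expectation concrete: assuming distinct values and uniformly random permutations, the chance that one draw is a derangement is !n/n!, so the mean number of draws is n!/!n. The sketch below computes that ratio with the standard recurrence for !n; the function names are hypothetical.

```python
import math

def count_derangements(n):
    """Number of permutations of n distinct items with no fixed point."""
    # Recurrence: !0 = 1, !1 = 0, !k = (k - 1) * (!(k-1) + !(k-2)).
    prev, curr = 1, 0  # !0 and !1
    if n == 0:
        return prev
    for k in range(2, n + 1):
        prev, curr = curr, (k - 1) * (curr + prev)
    return curr

def expected_attempts(n):
    """Mean number of random permutations drawn until one is a derangement."""
    d = count_derangements(n)
    return math.inf if d == 0 else math.factorial(n) / d

for n in (1, 2, 3, 4, 10):
    print(n, expected_attempts(n))
print("e =", math.e)
```

The printed values approach e within a handful of items, while n=1 gives infinity because a single element has no derangement.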
A derangement can also be generated without retries, in a fixed number of steps (this is Sattolo's algorithm, which always produces a single cycle), so here it is, for completeness:
import random
def derange2(x):
    n = len(x)
    for i in range(n - 1):
        # j > i, so position i never keeps its current element
        j = random.randrange(i + 1, n)
        x[i], x[j] = x[j], x[i]
This function actually transforms the list in-place.
You can also have a version that modifies pandas columns in-place:
def derange3(df, col):
    # col is the integer position of the column, as df.iat requires
    n = df.shape[0]
    for i in range(n - 1):
        j = random.randrange(i + 1, n)
        df.iat[i, col], df.iat[j, col] = df.iat[j, col], df.iat[i, col]
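A quick way to convince yourself that the swap-based approach never leaves an element in place is to run it many times and check. The sketch below uses a non-mutating variant; the name derange_copy and its rng parameter are hypothetical, and the check assumes distinct items.

```python
import random

def derange_copy(items, rng=random):
    """Return a shuffled copy of items in which no position keeps its element."""
    out = list(items)
    n = len(out)
    for i in range(n - 1):
        # j is always strictly greater than i, so position i never keeps
        # the element it currently holds.
        j = rng.randrange(i + 1, n)
        out[i], out[j] = out[j], out[i]
    return out

random.seed(0)
base = list(range(8))
for _ in range(200):
    d = derange_copy(base)
    # Same elements, and no element sits at its original position.
    assert sorted(d) == base
    assert all(a != b for a, b in zip(base, d))
print(derange_copy(base))
```

Note that this procedure only produces derangements that form a single cycle, so it does not sample uniformly from all derangements.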
How to shuffle a pandas dataframe randomly by row
You can achieve this by using the sample method and applying it along axis 1.
This shuffles the order of the columns, so every row is rearranged by the same permutation:
df = df.sample(frac=1, axis=1).reset_index(drop=True)
However, your desired dataframe looks completely randomised, which can be done by shuffling by column and then by row:
df = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
Edit: to shuffle the values within each row independently:
import numpy as np
# result_type='broadcast' keeps the original shape, index and columns
df = df.apply(np.random.permutation, axis=1, result_type='broadcast')