Shuffle DataFrame rows
The idiomatic way to do this with Pandas is to use the .sample
method of your data frame to sample all rows without replacement:
df.sample(frac=1)
The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
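As a small illustration of the point above, the sketch below shuffles a toy frame (the column names and the seed are made up for the example) and checks that every row is still present. The random_state argument of .sample is optional; it is passed here only so the result is repeatable.

```python
import pandas as pd

# Toy data; the column names are placeholders for illustration.
df = pd.DataFrame({"id": range(5), "name": list("abcde")})

# frac=1 samples 100% of the rows without replacement, i.e. a permutation.
shuffled = df.sample(frac=1, random_state=0)

# The shuffled frame carries the same index labels, just in another order.
same_rows = sorted(shuffled.index) == list(df.index)
print(shuffled)
print("all rows kept:", same_rows)
```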
Note:
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.
df = df.sample(frac=1).reset_index(drop=True)
Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
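To make the effect of drop=True concrete, here is a minimal sketch on a toy frame (the data is invented for the example). Without drop=True the old labels come back as an extra column named index; with it they are discarded.

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "mike", "ann", "zoe"]})
shuffled = df.sample(frac=1, random_state=1)

kept = shuffled.reset_index()               # old labels stored in an "index" column
dropped = shuffled.reset_index(drop=True)   # old labels discarded

print(list(kept.columns))     # ['index', 'name']
print(list(dropped.columns))  # ['name']
print(list(dropped.index))    # [0, 1, 2, 3]
```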
Follow-up note: The operation above is not in-place in the strict sense: .sample returns a new DataFrame, so id(df_old) is not the same as id(df_new), and the name df is simply rebound to the new object. Once df is rebound, the original frame is no longer referenced and its memory can be released, which is why a line-by-line memory profile shows almost no net increase after the shuffle. Keep in mind that such a profile reports memory after each line has finished, so it does not capture any temporary peak while both frames exist:
$ python3 -m memory_profiler .\test.py
Filename: .\test.py
Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
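If you want to probe the memory question yourself rather than rely on a profiler, one option is to ask NumPy whether the shuffled frame's buffer overlaps the original's. This is only a sketch: it assumes a single-dtype frame, for which to_numpy() typically returns a view rather than a copy, and it simply prints what it finds. The names df_old and df_new are chosen to match the note above.

```python
import numpy as np
import pandas as pd

df_old = pd.DataFrame(np.random.randn(100, 1000))
df_new = df_old.sample(frac=1).reset_index(drop=True)

# Are the two names bound to the same Python object?
print("same object:", df_old is df_new)

# Do the two frames share any underlying memory?
shares = bool(np.shares_memory(df_old.to_numpy(), df_new.to_numpy()))
print("shares memory:", shares)
```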
Trying to shuffle rows in a Pandas DataFrame
Something like this, where you just return the shuffled df and use pd.concat on a list of these:
import pandas as pd
sales_to_do = pd.DataFrame({'id': [1, 2], 'name': ['bob', 'mike']})

def randomize(df):
    # Return all rows of df in random order
    return df.sample(frac=1)

# Stack 15 independently shuffled copies on top of each other
df_shuffled = pd.concat([randomize(sales_to_do) for _ in range(15)])
df_shuffled.to_excel(r'C:\Users\Alex\Desktop\Output1.xlsx', index=False, header=True)
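A variation on the snippet above, sketched without the Excel export so it runs anywhere: passing ignore_index=True to pd.concat gives the combined frame a fresh 0..n-1 index instead of repeating the labels 0 and 1 fifteen times. The seeded generator is an addition for reproducibility, not part of the original answer.

```python
import numpy as np
import pandas as pd

sales_to_do = pd.DataFrame({"id": [1, 2], "name": ["bob", "mike"]})
rng = np.random.default_rng(0)  # seeded generator, only to make runs repeatable

def randomize(df, rng):
    # All rows of df in random order
    return df.sample(frac=1, random_state=rng)

df_shuffled = pd.concat(
    [randomize(sales_to_do, rng) for _ in range(15)],
    ignore_index=True,  # renumber the rows 0..29
)
print(df_shuffled.shape)  # (30, 2)
print(df_shuffled["id"].value_counts().to_dict())
```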
Shuffle rows in dataframe by specific column value
IIUC, you can shuffle the even indices and then add the odd indices back using numpy:
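import numpy as np
# Positions of the first row of each pair: 0, 2, 4, ... (assumes an even number of rows)
order = np.arange(0, len(df), 2)
np.random.shuffle(order)
# Interleave each shuffled even position with the odd position that follows it
order = np.vstack([order, order+1]).ravel('F')
df2 = df.iloc[order]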
Example output:
    Video      Frames  Feature1  Feature2  Label
2       0  frame2.jpg  feature1  feature2      0
3       0  frame3.jpg  feature1  feature2      0
0       0  frame0.jpg  feature1  feature2      0
1       0  frame1.jpg  feature1  feature2      0
6       1  frame2.jpg  feature1  feature2      1
7       1  frame3.jpg  feature1  feature2      1
8       2  frame0.jpg  feature1  feature2      0
9       2  frame1.jpg  feature1  feature2      0
10      2  frame2.jpg  feature1  feature2      0
11      2  frame3.jpg  feature1  feature2      0
4       1  frame0.jpg  feature1  feature2      1
5       1  frame1.jpg  feature1  feature2      1
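The same idea can be written for blocks of any size. The helper below is a hypothetical generalisation (shuffle_blocks, block_size and seed are names introduced here, not part of the answer above), and the sample frame is invented. It also fails loudly when the row count is not a multiple of the block size, a case in which order+1 in the snippet above would point past the end of the frame.

```python
import numpy as np
import pandas as pd

def shuffle_blocks(df, block_size=2, seed=None):
    """Shuffle consecutive blocks of rows, keeping each block's rows together."""
    n_blocks, remainder = divmod(len(df), block_size)
    if remainder:
        raise ValueError("len(df) must be a multiple of block_size")
    rng = np.random.default_rng(seed)
    # Random order of the block start positions: 0, block_size, 2*block_size, ...
    starts = rng.permutation(n_blocks) * block_size
    # Expand every start into the positions of its whole block
    order = (starts[:, None] + np.arange(block_size)).ravel()
    return df.iloc[order]

df = pd.DataFrame({"Video": np.repeat([0, 1, 2], 4),
                   "Frames": [f"frame{i % 4}.jpg" for i in range(12)]})
df2 = shuffle_blocks(df, block_size=2, seed=0)
print(df2)
```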
Shuffle rows in a dataframe based on a condition using R
You could try this:
library(purrr)
library(tidyr)
library(dplyr)
df %>%
  split(f = as.factor(.$ClassNr)) %>%
  map_dfr(~sample(.x$Name)) %>%
  pivot_longer(everything(),
               names_to = "ClassNr",
               values_to = "Name")
returning (for example)
# A tibble: 6 x 2
  ClassNr Name
  <chr>   <chr>
1 1       Ana
2 2       Ella
3 3       Sarah
4 1       Maria
5 2       Hanne
6 3       Liam
- We first split the data into groups based on the ClassNr. That's the split part. Now we have three lists (one list for every class).
- Next we take every list and sample the elements, which is basically shuffling each list independently, and bind the results together as a dataframe.
- Finally we bring this dataframe into a long format.
Note: This approach will most likely fail if there are different numbers of names in each class.
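For readers working in pandas rather than R, a rough counterpart is sketched below. It is not a translation of the pipeline above: it shuffles Name within each ClassNr via groupby(...).transform, which also copes with classes of different sizes. The sample data and the seed are invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ClassNr": [1, 1, 2, 2, 2, 3],
    "Name": ["Ana", "Maria", "Ella", "Hanne", "Tom", "Liam"],
})
rng = np.random.default_rng(0)

# Permute the names inside each class; rows never move between classes.
shuffled = df.copy()
shuffled["Name"] = (
    df.groupby("ClassNr")["Name"]
      .transform(lambda s: rng.permutation(s.to_numpy()))
)
print(shuffled)
```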
How to shuffle a pandas dataframe randomly by row
You can achieve this by using the sample method and applying it along axis 1.
This shuffles the order of the columns, so every row is rearranged by the same permutation:
df = df.sample(frac=1, axis=1).reset_index(drop=True)
However, your desired dataframe looks completely randomised, which can be done by shuffling the columns and then the rows:
df = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)
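Edit: to shuffle the values within each row independently, apply a permutation row by row:
import numpy as np
# In recent pandas versions, returning an array from apply(axis=1) gives a Series of arrays;
# result_type='broadcast' keeps the original shape, index and column labels instead
df = df.apply(np.random.permutation, axis=1, result_type='broadcast')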
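The difference between shuffling columns and shuffling within rows is easy to miss, so here is a small sketch on an invented numeric frame. It uses NumPy's Generator.permuted as an alternative to apply for the per-row case; that is a substitution made for this example, not the approach in the answer.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

# Column shuffle: whole columns move, so every row gets the same permutation.
by_column = df.sample(frac=1, axis=1, random_state=0)

# Per-row shuffle: each row's values are permuted independently.
# The column labels are kept, but they no longer describe the values.
per_row = pd.DataFrame(rng.permuted(df.to_numpy(), axis=1),
                       index=df.index, columns=df.columns)

print(by_column)
print(per_row)
```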
Shuffle rows of a large csv
Because you read in your data using Pandas, you can also do the randomisation in a different way, using DataFrame.sample:
df = pd.read_csv('sentiment_train.csv', header= 0, delimiter=",", usecols=[0,5])
df.columns=['target', 'text']
df1 = df.sample(n=100000)
If this fails, it might be good to check the number of unique values and how frequently they appear. If the first 1,599,999 values are 0 and only the last one is 4, then chances are that you won't get any 4 in a sample of 100,000 rows.
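A quick way to run that check, and one possible workaround, is sketched below on synthetic data standing in for the CSV (the 1000/10 class split, the text values and the seed are made up). GroupBy.sample draws the same fraction from every label separately, so a rare label is not lost by chance; it needs pandas 1.1 or later.

```python
import pandas as pd

# Synthetic stand-in for the real file: many 0s, a few 4s.
df = pd.DataFrame({"target": [0] * 1000 + [4] * 10,
                   "text": ["placeholder"] * 1010})

# How often does each label occur?
print(df["target"].value_counts())

# Draw 50% of the rows from each label separately, then shuffle the result.
df1 = (df.groupby("target")
         .sample(frac=0.5, random_state=0)
         .sample(frac=1, random_state=0))
print(df1["target"].value_counts().to_dict())
```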
Pandas - How do you randomize the rows of a dataframe
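You can shuffle the row positions and index the dataframe with them (with .iloc this works whatever the index type):
import numpy as np
import pandas as pd
df = pd.DataFrame(['A','B','C','D','E','F','G','H','I','j'], columns=['Data'])
arr = np.arange(len(df))
out = np.random.permutation(arr) # random shuffle
# .ix was removed in pandas 1.0; .iloc selects rows by position
df.iloc[out]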