Shuffle DataFrame Rows

Shuffle DataFrame rows

The idiomatic way to do this with Pandas is to use the .sample method of your data frame to sample all rows without replacement:

df.sample(frac=1)

The frac keyword argument specifies the fraction of rows to return in the random sample, so frac=1 means to return all rows (in random order).
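As a minimal sketch of the call above, the following builds a small, made-up frame and shuffles it. The column names and the seed value are illustrative assumptions. random_state is an optional parameter of DataFrame.sample, used here only to make the result repeatable.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({"id": range(5), "value": list("abcde")})

# frac=1 samples 100% of the rows without replacement, i.e. a shuffle.
# random_state is optional; fixing it makes the shuffle reproducible.
shuffled = df.sample(frac=1, random_state=42)

# Every row is still present exactly once; only the order changed,
# and each row keeps its original index label.
print(shuffled)
print(sorted(shuffled.index) == list(df.index))
```

Leaving out random_state gives a different order on each call.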


Note:
If you wish to shuffle your dataframe in-place and reset the index, you could do e.g.

df = df.sample(frac=1).reset_index(drop=True)

Here, specifying drop=True prevents .reset_index from creating a column containing the old index entries.
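To illustrate the effect of drop=True, here is a small sketch comparing the two variants of .reset_index on a shuffled frame. The data and the variable names with_old_index and renumbered are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({"name": ["bob", "mike", "ann"]})  # hypothetical data

shuffled = df.sample(frac=1, random_state=0)

# Default: the old index labels are kept as a new column named "index".
with_old_index = shuffled.reset_index()

# drop=True: the old labels are discarded and the rows are renumbered 0..n-1.
renumbered = shuffled.reset_index(drop=True)

print(list(with_old_index.columns))   # ['index', 'name']
print(list(renumbered.columns))       # ['name']
```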

Follow-up note: The operation above is not in-place in the strict sense: .sample returns a new DataFrame with its own copy of the data, so id(df_old) is not the same as id(df_new). Because the result is assigned back to the same name, however, the original frame is no longer referenced and Python can free it, so memory usage after the line is essentially unchanged (both copies can still coexist briefly while the line runs). A simple memory profiler run shows the net effect:

$ python3 -m memory_profiler .\test.py
Filename: .\test.py

Line #    Mem usage    Increment   Line Contents
================================================
     5     68.5 MiB     68.5 MiB   @profile
     6                             def shuffle():
     7    847.8 MiB    779.3 MiB       df = pd.DataFrame(np.random.randn(100, 1000000))
     8    847.9 MiB      0.1 MiB       df = df.sample(frac=1).reset_index(drop=True)
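One way to inspect this directly, rather than through a profiler, is to compare object identity and ask NumPy whether the two frames' data overlap in memory. This is only a sketch. It assumes a single-dtype frame, for which to_numpy() typically returns a view of the stored values rather than a copy, and it uses a much smaller frame than the profiler run. The names df_old and df_new mirror the ones used in the note above.

```python
import numpy as np
import pandas as pd

# Small single-dtype frame so that to_numpy() can expose the underlying buffer.
df_old = pd.DataFrame(np.random.randn(100, 1000))
df_new = df_old.sample(frac=1).reset_index(drop=True)

# Are the two names bound to the same Python object?
print(df_old is df_new)

# Do the two frames' values overlap in memory?
print(np.shares_memory(df_old.to_numpy(), df_new.to_numpy()))
```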

Trying to shuffle rows in a Pandas DataFrame

One option is a helper that simply returns the shuffled DataFrame; call it as many times as you need and combine the results with pd.concat:

import pandas as pd

sales_to_do = pd.DataFrame({'id':[1,2], 'name':['bob','mike']})

def randomize(df):
    # Return all rows of df in random order
    return df.sample(frac=1)

# Stack 15 independently shuffled copies on top of each other
df_shuffled = pd.concat([randomize(sales_to_do) for x in range(15)])

df_shuffled.to_excel(r'C:\Users\Alex\Desktop\Output1.xlsx', index=False, header=True)

Shuffle rows in dataframe by specific column value

IIUC, you can shuffle the even positions and then place each following odd position right after its partner using numpy, so that consecutive pairs of rows stay together:

import numpy as np

order = np.arange(0,len(df), 2)
np.random.shuffle(order)
order = np.vstack([order, order+1]).ravel('F')

df2 = df.iloc[order]

example output:

    Video      Frames  Feature1  Feature2  Label
2       0  frame2.jpg  feature1  feature2      0
3       0  frame3.jpg  feature1  feature2      0
0       0  frame0.jpg  feature1  feature2      0
1       0  frame1.jpg  feature1  feature2      0
6       1  frame2.jpg  feature1  feature2      1
7       1  frame3.jpg  feature1  feature2      1
8       2  frame0.jpg  feature1  feature2      0
9       2  frame1.jpg  feature1  feature2      0
10      2  frame2.jpg  feature1  feature2      0
11      2  frame3.jpg  feature1  feature2      0
4       1  frame0.jpg  feature1  feature2      1
5       1  frame1.jpg  feature1  feature2      1
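The interleaving step (np.vstack followed by ravel('F')) is the least obvious part of the snippet above, so here is a small worked example of it in isolation. The array starts is a fixed, hypothetical stand-in for the shuffled even positions.

```python
import numpy as np

# A fixed (already shuffled) set of pair start positions, for illustration.
starts = np.array([2, 0, 4])

stacked = np.vstack([starts, starts + 1])
# stacked is [[2, 0, 4],
#             [3, 1, 5]]

# 'F' (column-major) order reads down each column first, so every start
# position is immediately followed by its partner.
order = stacked.ravel("F")
print(order)  # [2 3 0 1 4 5]
```

Note that this pairing only works cleanly when the frame has an even number of rows; with an odd count, the last start position plus one would point past the end.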

Shuffle rows in a dataframe based on a condition using R

You could try this:

library(purrr)
library(tidyr)
library(dplyr)

df %>%
  split(f = as.factor(.$ClassNr)) %>%
  map_dfr(~sample(.x$Name)) %>%
  pivot_longer(everything(),
               names_to = "ClassNr",
               values_to = "Name")

returning (for example)

# A tibble: 6 x 2
  ClassNr Name
  <chr>   <chr>
1 1       Ana
2 2       Ella
3 3       Sarah
4 1       Maria
5 2       Hanne
6 3       Liam

  • We first split the data into groups based on ClassNr (the split step). This gives a list with three elements, one per class.
  • Next we sample the Name values within each element, which shuffles each class independently, and bind the results together as a data frame.
  • Finally, we bring this data frame into long format.

Note: This approach will most likely fail if there are different numbers of names in each class.
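For readers working in pandas rather than R, the same split, shuffle and recombine idea can be sketched as below. This is an illustration, not a translation of the R code: it permutes Name within each ClassNr and writes the result back to the same rows, so unequal class sizes are not a problem. The names are borrowed from the example output, but their assignment to classes here is hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data with unequal class sizes (2, 3 and 1 names).
df = pd.DataFrame({
    "ClassNr": [1, 1, 2, 2, 2, 3],
    "Name": ["Ana", "Maria", "Ella", "Hanne", "Liam", "Sarah"],
})

rng = np.random.default_rng()
shuffled = df.copy()

# .groups maps each ClassNr to the index labels of its rows.
for class_nr, labels in df.groupby("ClassNr").groups.items():
    # Permute the names of this class and write them back to the same rows.
    names = df.loc[labels, "Name"].to_numpy()
    shuffled.loc[labels, "Name"] = rng.permutation(names)

print(shuffled)
```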

How to shuffle a pandas dataframe randomly by row

You can achieve this by using the sample method and applying it along axis 1.
This shuffles the order of the columns, so the elements of every row are rearranged in the same way:

df = df.sample(frac=1, axis=1).reset_index(drop=True)

However, your desired DataFrame looks completely randomised, which you can approximate by shuffling the columns and then the rows:

df = df.sample(frac=1, axis=1).sample(frac=1).reset_index(drop=True)

Edit: to permute the values within each row independently of the other rows:

import numpy as np
df = df.apply(np.random.permutation, axis=1)
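The difference between reordering whole columns and permuting each row on its own is easy to miss, so here is a small sketch contrasting the two. It swaps in NumPy's Generator.permuted (available in NumPy 1.20 and later) for the per-row shuffle instead of the apply call above. The frame and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x4 frame whose rows are 0..3, 4..7 and 8..11.
df = pd.DataFrame(np.arange(12).reshape(3, 4), columns=list("abcd"))

rng = np.random.default_rng()

# Reorders whole columns: every row is rearranged in the same way.
cols_shuffled = df.sample(frac=1, axis=1)

# Permutes each row separately: values no longer stay under their column.
rows_permuted = pd.DataFrame(
    rng.permuted(df.to_numpy(), axis=1),
    index=df.index,
    columns=df.columns,
)

print(cols_shuffled)
print(rows_permuted)
```

In both results each row still contains its original values; only rows_permuted can break the column alignment between rows.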

Shuffle rows of a large csv

Because you read in your data using Pandas, you can also do the randomisation in a different way using DataFrame.sample:

df = pd.read_csv('sentiment_train.csv', header= 0, delimiter=",", usecols=[0,5])
df.columns=['target', 'text']
df1 = df.sample(n=100000)

If this fails, check how many unique values the target column has and how frequently each one occurs. If the first 1,599,999 rows are 0 and only the last one is 4, chances are that your sample will not contain any 4.
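The frequency check suggested above can be done with value_counts. The sketch below uses a small, made-up stand-in for the CSV (999 rows with target 0 and a single row with target 4) rather than the real file.

```python
import pandas as pd

# Hypothetical stand-in for the CSV: 999 rows with target 0, one with target 4.
df = pd.DataFrame({
    "target": [0] * 999 + [4],
    "text": ["example text"] * 1000,
})

# How many distinct targets are there, and how often does each occur?
counts = df["target"].value_counts()
print(counts)

# A 10% sample misses the single target-4 row about 90% of the time.
df1 = df.sample(n=100)
print(df1["target"].value_counts())
```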

Pandas - How do you randomize the rows of a dataframe

If the index is numeric (the default 0..n-1), you can shuffle the row positions and select by them:

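import numpy as np
import pandas as pd

df = pd.DataFrame(['A','B','C','D','E','F','G','H','I','j'],columns = ['Data'])

arr = np.arange(len(df))
out = np.random.permutation(arr) # random shuffle

df.iloc[out]  # .ix was removed in pandas 1.0; .iloc selects by position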

