Splitting a Large Data Frame into Smaller Segments

 > str(split(df, (as.numeric(rownames(df))-1) %/% 200))
List of 6
 $ 0:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.592 1.664 -1.231 0.269 0.912 ...
  ..$ two  : num [1:200] 0.639 -0.525 0.642 1.347 1.142 ...
  ..$ three: num [1:200] -0.45 -0.877 0.588 1.188 -1.977 ...
 $ 1:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -0.0017 1.9534 0.0155 -0.7732 -1.1752 ...
  ..$ two  : num [1:200] -0.422 0.869 0.45 -0.111 0.073 ...
  ..$ three: num [1:200] -0.2809 1.31908 0.26695 0.00594 -0.25583 ...
 $ 2:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.578 0.433 0.277 1.297 0.838 ...
  ..$ two  : num [1:200] 0.913 0.378 0.35 -0.241 0.783 ...
  ..$ three: num [1:200] -0.8402 -0.2708 -0.0124 -0.4537 0.4651 ...
 $ 3:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] 1.432 1.657 -0.72 -1.691 0.596 ...
  ..$ two  : num [1:200] 0.243 -0.159 -2.163 -1.183 0.632 ...
  ..$ three: num [1:200] 0.359 0.476 1.485 0.39 -1.412 ...
 $ 4:'data.frame':  200 obs. of  3 variables:
  ..$ one  : num [1:200] -1.43 -0.345 -1.206 -0.925 -0.551 ...
  ..$ two  : num [1:200] -1.343 1.322 0.208 0.444 -0.861 ...
  ..$ three: num [1:200] 0.00807 -0.20209 -0.56865 1.06983 -0.29673 ...
 $ 5:'data.frame':  123 obs. of  3 variables:
  ..$ one  : num [1:123] -1.269 1.555 -0.19 1.434 -0.889 ...
  ..$ two  : num [1:123] 0.558 0.0445 -0.0639 -1.934 -0.8152 ...
  ..$ three: num [1:123] -0.0821 0.6745 0.6095 1.387 -0.382 ...

If earlier code might have changed the row names, it is safer to build the grouping key from the row positions instead:

 split(df, (seq_len(nrow(df))-1) %/% 200)
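
For intuition about why the integer division works, here is a minimal sketch in plain Python rather than R. The row count of 1123 is inferred from the str() output above (five pieces of 200 plus one of 123), and the name chunk_keys is made up for illustration.

```python
from collections import Counter

def chunk_keys(n_rows, chunk_size):
    """Return one group key per row: (row_number - 1) // chunk_size, rows counted from 1."""
    return [(row - 1) // chunk_size for row in range(1, n_rows + 1)]

keys = chunk_keys(1123, 200)   # 1123 rows is an assumption read off the output above
sizes = Counter(keys)          # how many rows fall under each key
print(dict(sizes))             # {0: 200, 1: 200, 2: 200, 3: 200, 4: 200, 5: 123}
```

Rows 1 to 200 get key 0, rows 201 to 400 get key 1, and so on, so the last key simply collects whatever is left over.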

Splitting data frame into segments for each factor based on a cutoff value in a column in R

In data.table:

dt[, V1 := paste0("A.", 1+cumsum(V4 >= 0.4))]

In dplyr:

df %>%
  mutate(V1 = paste0("A.", 1+cumsum(V4 >= 0.4)))
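
To see what the cumsum trick does, here is a small sketch in plain Python. The sample values and the helper name segment_labels are made up; only the 0.4 cutoff and the "A." prefix come from the code above.

```python
from itertools import accumulate

def segment_labels(values, cutoff=0.4, prefix="A."):
    """Label each row with prefix + (1 + number of values >= cutoff seen so far)."""
    running = accumulate(int(v >= cutoff) for v in values)  # running count, like cumsum
    return [f"{prefix}{1 + count}" for count in running]

v4 = [0.1, 0.2, 0.5, 0.3, 0.9, 0.1]  # made-up sample values
labels = segment_labels(v4)
print(labels)  # ['A.1', 'A.1', 'A.2', 'A.2', 'A.3', 'A.3']
```

Note that the row which reaches the cutoff is the first row of the new segment, not the last row of the old one.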

Splitting large data frame by column into smaller data frames (not lists) using loops

Try

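# chunk = number of columns per piece; it must be defined before the loop
for (i in 1:3) {
  xname <- paste("ch29", i, sep = "_")
  col.min <- (i - 1) * chunk + 1
  col.max <- min(i * chunk, ncol(df))
  # drop = FALSE keeps a data frame even when only one column is selected
  assign(xname, df[, col.min:col.max, drop = FALSE])
}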

In other words, use the notation df[,a:b], where a < b, to get the subset of the dataframe df consisting only of columns a to b.
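
The column bounds computed in the loop can be checked with a short plain-Python sketch. The 29 columns and chunk size of 10 are made-up numbers, and column_ranges is a hypothetical helper; unlike the loop above, it works out the number of pieces instead of hard-coding 3.

```python
def column_ranges(n_cols, chunk):
    """Return 1-based inclusive (first, last) column bounds for each piece."""
    n_pieces = -(-n_cols // chunk)  # ceiling division
    return [((i - 1) * chunk + 1, min(i * chunk, n_cols))
            for i in range(1, n_pieces + 1)]

ranges = column_ranges(29, 10)  # made-up sizes
print(ranges)  # [(1, 10), (11, 20), (21, 29)]
```

The min() is what stops the last piece from running past the final column.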

R - Splitting a large dataframe into several smaller dataframes, performing fuzzyjoin on each one and outputting to a single dataframe

If you split (e.g. with base::split or dplyr::group_split) your Address data frame into a list of data frames, then you can call purrr::map on the list.

purrr::map(list_of_dfs, ~fuzzy_join(x=., y=UPRN, by = "Street"))

Your result will be a list of data frames each fuzzyjoined with UPRN. You can then call bind_rows (or you could do map_dfr) to get all the results in the same data frame again.
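
The split, map, bind pattern itself is language-independent. Below is a rough sketch in plain Python; join_chunk is only a stand-in for the fuzzy join (it matches streets case-insensitively, which is an assumption and not what fuzzyjoin does), and the data is made up.

```python
from itertools import chain

def join_chunk(addresses, reference):
    """Stand-in for the per-chunk join: pair each address with the reference
    entries whose street matches case-insensitively (hypothetical rule)."""
    return [(addr, ref) for addr in addresses for ref in reference
            if addr.lower() == ref.lower()]

def map_and_bind(chunks, reference):
    """Apply join_chunk to every chunk, then concatenate the per-chunk results."""
    return list(chain.from_iterable(join_chunk(c, reference) for c in chunks))

chunks = [["High St", "Mill Lane"], ["Park Road"]]  # made-up, already split
reference = ["high st", "park road"]                # made-up lookup table
combined = map_and_bind(chunks, reference)
print(combined)  # [('High St', 'high st'), ('Park Road', 'park road')]
```

Each chunk is joined on its own, so only one chunk's intermediate result needs to be held at a time before everything is bound back together.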

Split a large pandas dataframe

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation.  The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})

In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]: 
[     A    B         C         D
0  foo  one -0.174067 -0.608579
1  bar  one -0.860386 -1.210518
2  foo  two  0.614102  1.689837,
      A      B         C         D
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861,
      A      B         C         D
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468]
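
The piece sizes (3, 3, 2 for 8 rows) follow the rule in the NumPy documentation: for length l and n sections, the first l % n pieces get l // n + 1 rows and the rest get l // n. A plain-Python sketch of just that size rule, with array_split_sizes as a made-up helper name:

```python
def array_split_sizes(n_rows, n_sections):
    """Sizes of the pieces np.array_split produces for an integer section count."""
    base, extra = divmod(n_rows, n_sections)
    # the first `extra` pieces are one row longer than the remaining ones
    return [base + 1] * extra + [base] * (n_sections - extra)

sizes_8_3 = array_split_sizes(8, 3)
print(sizes_8_3)  # [3, 3, 2]
```

So the pieces never differ in length by more than one row, and the longer ones always come first.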

Is there a function to split a large dataframe into n smaller dataframes of equal size (by row) and have an n+1 dataframe of smaller size?

How about something like this:

df <- data.frame(x = 1:7235000, y = runif(7235000))
# 70000 rows per piece; the last piece holds the remainder
chunk_size <- round(NROW(df) / 100, -4)
split(df, (seq_len(NROW(df)) - 1) %/% chunk_size)

Or abstracting some more:

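num_dfs <- 100  # approximate target; the rounding decides the actual number of pieces
chunk_size <- round(NROW(df) / num_dfs, -4)
split(df, (seq_len(NROW(df)) - 1) %/% chunk_size)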

You may want to consider something from the caret package such as: caret::createFolds(df$x)

Proportionally Splitting a Data Frame in R

Consider the negative index:

set.seed(123)
sample_rows <- sample(nrow(df), round(.8*nrow(df)))

new_df_80 <- df[sample_rows,]
new_df_20 <- df[-sample_rows,]
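
The same idea, drawing a random subset of row numbers and taking everything else as the complement, can be sketched in plain Python. The helper name split_indices and the 10-row example are made up; the 0.8 fraction and seed 123 mirror the R code.

```python
import random

def split_indices(n_rows, frac=0.8, seed=123):
    """Return (picked, rest): a random frac of the row indices and its complement."""
    rng = random.Random(seed)  # seeded so the split is reproducible
    picked = set(rng.sample(range(n_rows), round(frac * n_rows)))
    rest = [i for i in range(n_rows) if i not in picked]  # like the negative index in R
    return sorted(picked), rest

train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))  # 8 2
```

The two index sets are disjoint and together cover every row, which is exactly what df[sample_rows,] and df[-sample_rows,] give you.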

How to split a data frame into multiple data frames based on row sequence

Using lapply:

list_output <- lapply(seq_len(nrow(df) - 99), function(x) df[x:(x+99), ])
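
Note that this produces overlapping windows (rows 1-100, 2-101, 3-102, ...), nrow(df) - 99 of them, not disjoint blocks. A tiny plain-Python sketch of the same windowing, with a made-up helper name and a width of 3 instead of 100:

```python
def sliding_windows(rows, width):
    """Return every run of `width` consecutive rows, advancing one row at a time."""
    return [rows[start:start + width] for start in range(len(rows) - width + 1)]

windows = sliding_windows(list(range(1, 7)), 3)  # 6 made-up rows, width 3
print(windows)  # [[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6]]
```

Six rows with a width of 3 give 6 - 3 + 1 = 4 windows, matching the nrow(df) - 99 count for a width of 100.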

How to randomly split a DataFrame into several smaller DataFrames?

Use np.array_split

shuffled = df.sample(frac=1)
result = np.array_split(shuffled, 5)

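df.sample(frac=1) shuffles the rows of df. np.array_split then splits the shuffled frame into parts of (nearly) equal size; when the row count is not divisible by the number of parts, the last parts are one row shorter.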

It gives you:

for part in result:
    print(part,'\n')

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
5          6  5  0  0  0  0  0  0  5   0   0   0     10
4          5  3  0  0  0  0  0  0  0   0   0   0      3
7          8  1  0  0  0  4  5  0  0   0   4   0     14
16        17  3  0  0  4  0  0  0  0   0   0   0      7
22        23  4  0  0  0  4  3  0  0   5   0   0     16 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
13        14  5  4  0  0  5  0  0  0   0   0   0     14
14        15  5  0  0  0  3  0  0  0   0   5   5     18
21        22  4  0  0  0  3  5  5  0   5   4   0     26
1          2  3  0  0  3  0  0  0  0   0   0   0      6
20        21  1  0  0  3  3  0  0  0   0   0   0      7 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
10        11  2  0  4  0  0  3  3  0   4   2   0     18
9         10  3  2  0  0  0  4  0  0   0   0   0      9
11        12  5  0  0  0  4  5  0  0   5   2   0     21
8          9  5  0  0  0  4  5  0  0   4   5   0     23
12        13  5  4  0  0  2  0  0  0   3   0   0     14 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
18        19  5  3  0  0  4  0  0  0   0   0   0     12
3          4  3  0  0  0  0  5  0  0   4   0   5     17
0          1  5  4  0  4  4  0  0  0   4   0   0     21
23        24  3  0  0  4  0  0  0  0   0   3   0     10
6          7  4  0  0  0  2  5  3  4   4   0   0     22 

    movie_id  1  2  4  5  6  7  8  9  10  11  12  borda
17        18  4  0  0  0  0  0  0  0   0   0   0      4
2          3  4  0  0  0  0  0  0  0   0   0   0      4
15        16  5  0  0  0  0  0  0  0   4   0   0      9
19        20  4  0  0  0  0  0  0  0   0   0   0      4

Subset a dataset using a loop - split a file into multiple smaller datasets

Try this (all data frames will be placed in the global environment):

#Split
Liste <- split(df,df$group)
#Format names
names(Liste) <- paste0('group_',names(Liste))
#Set to envir
list2env(Liste,envir = .GlobalEnv)
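
For comparison, the split-and-rename steps look like this in plain Python, where keeping the pieces in a dict is the usual alternative to creating one variable per group. The rows and the helper name split_by_group are made up.

```python
from collections import defaultdict

def split_by_group(rows, key):
    """Group row dicts by rows[key], naming each piece 'group_<value>'."""
    groups = defaultdict(list)
    for row in rows:
        groups[f"group_{row[key]}"].append(row)
    return dict(groups)

rows = [{"group": "a", "x": 1}, {"group": "b", "x": 2}, {"group": "a", "x": 3}]  # made-up
pieces = split_by_group(rows, "group")
print(sorted(pieces))  # ['group_a', 'group_b']
```

Looking pieces up by name (pieces["group_a"]) avoids filling the global namespace, which is the main trade-off against the list2env approach.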