Splitting a Dataframe into Several Dataframes

Splitting dataframe into multiple dataframes

First, your approach is inefficient: appending to the list row by row is slow, because the list must periodically be regrown when it runs out of space for a new entry. A list comprehension is usually better in this respect, since it builds the list in a single pass without a separate append call for each row.

However, I think fundamentally your approach is a little wasteful as you have a dataframe already so why create a new one for each of these users?

I would sort the dataframe by the 'name' column, set that column as the index and, if required, keep it as a regular column as well (drop=False).

Then generate a list of all the unique entries. You can perform lookups using these entries and, crucially, if you are only querying the data, use the selection criteria to pull out just the rows you need on demand instead of building a separate dataframe for every user up front (pandas may return a view or a copy, depending on the selection).

Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:

# sort the dataframe
df.sort_values(by='name', axis=0, inplace=True)

# set the index to be this and don't drop
df.set_index(keys=['name'], drop=False,inplace=True)

# get a list of names
names=df['name'].unique().tolist()

# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name=='joe']

# now you can query all 'joes'
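
A minimal, self-contained sketch of the idea above, plus a groupby-based alternative for the case where one dataframe per name really is needed. The sample data and the variable name `per_name` are made up for illustration. The sketch skips the set_index step, on the assumption that keeping 'name' as both index and column makes a later groupby('name') ambiguous in recent pandas versions.

```python
import pandas as pd

# Hypothetical sample data; only the 'name' column matters here.
df = pd.DataFrame({
    "name": ["joe", "amy", "joe", "bob", "amy"],
    "score": [1, 2, 3, 4, 5],
})

# Sort rows by name (axis=0 is the default and sorts rows, not columns).
df = df.sort_values(by="name")

# Select every row for one user with a boolean mask.
joe = df.loc[df["name"] == "joe"]

# If one dataframe per user is genuinely needed, groupby yields them directly.
per_name = {name: group for name, group in df.groupby("name")}

print(sorted(per_name))   # the names found in the data
print(joe)
```

Looking up `per_name["amy"]` then returns only the rows for that name, without a hand-written loop over the rows.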

How can I split a pandas DataFrame into multiple dataframes?

You can use np.array_split to split the dataframe:

import numpy as np

dfs = np.array_split(df, 161) # split the dataframe into 161 separate tables

Edit (to assign a new column based on the sequential number of each df in dfs):

dfs = [df.assign(new_col=i) for i, df in enumerate(dfs, 1)]
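
As a hedged variation, the sketch below splits the row positions with np.array_split and then slices with iloc, so that each piece is an ordinary DataFrame. The motivation is that some recent pandas versions may emit a deprecation warning when a DataFrame is passed to np.array_split directly. The sample frame, the chunk count of 3 and the column name `chunk_id` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": range(10)})  # made-up data

n_chunks = 3  # illustrative; the answer above uses 161

# Split the row positions into n_chunks nearly equal groups.
position_groups = np.array_split(np.arange(len(df)), n_chunks)

# Slice each group out of the frame and tag it with its 1-based sequence number.
dfs = [
    df.iloc[positions].assign(chunk_id=i)
    for i, positions in enumerate(position_groups, start=1)
]

print([len(chunk) for chunk in dfs])
```

Concatenating the pieces with pd.concat(dfs) gives back the original rows in their original order.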

How to divide a dataframe into several dataframes

Yes, one way is to enumerate all rows with the same categories:

cat_cols = ['cat_col1', 'cat_col2']

groups = df.groupby(cat_cols).cumcount() // 3

sub_df = {g: d for g,d in df.groupby(groups)}
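
To make the effect of the cumcount trick concrete, here is a small sketch on made-up data (the `value` column and its contents are assumptions). Rows are numbered within each category, integer division by 3 turns that number into a chunk id, and grouping on the chunk id gives dataframes that hold at most three rows per category.

```python
import pandas as pd

# Made-up data: four rows in category ('a', 'x'), two rows in ('b', 'x').
df = pd.DataFrame({
    "cat_col1": ["a", "a", "a", "a", "b", "b"],
    "cat_col2": ["x", "x", "x", "x", "x", "x"],
    "value": [10, 11, 12, 13, 20, 21],
})

cat_cols = ["cat_col1", "cat_col2"]

# cumcount numbers rows 0, 1, 2, ... within each category;
# integer division by 3 turns that into a chunk number.
groups = df.groupby(cat_cols).cumcount() // 3

# One dataframe per chunk number, collected in a dict.
sub_df = {g: d for g, d in df.groupby(groups)}

for chunk_number, chunk in sub_df.items():
    print(chunk_number, chunk["value"].tolist())
```

Here chunk 0 holds the first three 'a' rows together with both 'b' rows, and chunk 1 holds the fourth 'a' row.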

Split a large pandas dataframe

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import pandas as pd
   ...: from numpy.random import randn

In [2]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C' : randn(8), 'D' : randn(8)})

In [3]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [4]: import numpy as np
In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
 0  foo  one -0.174067 -0.608579
 1  bar  one -0.860386 -1.210518
 2  foo  two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]
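
The difference the docstring describes can be checked on a plain array. This is a small sketch in which `rows` stands in for the eight row positions of the frame above; the variable names are illustrative.

```python
import numpy as np

rows = np.arange(8)  # stand-in for the 8 row positions of the frame above

# array_split tolerates a section count that does not divide the length:
# the leading sections simply get one extra element each.
parts = np.array_split(rows, 3)
print([len(p) for p in parts])

# split insists on equal sections and raises ValueError otherwise.
try:
    np.split(rows, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)
```

This matches the output above, where the first two pieces have three rows and the last has two.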

Split dataframe into several data frames within a list, each column separately

Try this tidyverse approach. First reshape your data to long format, which turns the columns into rows. Then use split() to create a list keyed on the column name. Finally, apply a function that reshapes each dataframe in the list back to wide format to reach the desired output. Here is the code:

library(tidyverse)
# Data
df <- data.frame(my_names = sample(LETTERS, 4, replace = FALSE),
                 column2 = sample(1.3:100.3, 4, replace = TRUE),
                 column3 = sample(1.3:100.3, 4, replace = TRUE),
                 column4 = sample(1.3:100.3, 4, replace = TRUE),
                 column5 = sample(1.3:100.3, 4, replace = TRUE))
# Reshape to long
df2 <- df %>% pivot_longer(cols = -1)
# Split into a list, one element per original column name
List <- split(df2, df2$name)
# Reshape each element back to wide format
List2 <- lapply(List, function(x) pivot_wider(x, names_from = name, values_from = value))
names(List2) <- paste0('df', 1:length(List2))

Output:

List2
$df1
# A tibble: 4 x 2
  my_names column2
  <fct>      <dbl>
1 N           21.3
2 H           35.3
3 X           42.3
4 U           89.3

$df2
# A tibble: 4 x 2
  my_names column3
  <fct>      <dbl>
1 N           94.3
2 H           54.3
3 X            2.3
4 U           38.3

$df3
# A tibble: 4 x 2
  my_names column4
  <fct>      <dbl>
1 N           75.3
2 H           94.3
3 X           87.3
4 U          100.

$df4
# A tibble: 4 x 2
  my_names column5
  <fct>      <dbl>
1 N           60.3
2 H           88.3
3 X           14.3
4 U           99.3
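
For readers working in pandas rather than R, a rough analogue is sketched below. It is not part of the answer above: the dict name `frames` and the `key` variable are illustrative, and the values are copied from the first two tables of the output. Because each piece is just the key column plus one value column, no reshaping is needed.

```python
import pandas as pd

# Data shaped like the R example: one key column plus value columns.
df = pd.DataFrame({
    "my_names": ["N", "H", "X", "U"],
    "column2": [21.3, 35.3, 42.3, 89.3],
    "column3": [94.3, 54.3, 2.3, 38.3],
})

key = "my_names"

# One two-column dataframe per value column, keyed 'df1', 'df2', ...
frames = {
    f"df{i}": df[[key, col]]
    for i, col in enumerate(df.columns.drop(key), start=1)
}

print(list(frames))
print(frames["df2"])
```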

Split a dataframe into multiple dataframes based on specific row value in R

You are probably looking for the split function. Here is a small example in which a new group starts every time the b column is equal to 'a':

(d <- data.frame(a = 1:10, b = sample(letters[1:3], replace = TRUE, size = 10)))
#>     a b
#> 1   1 a
#> 2   2 a
#> 3   3 c
#> 4   4 b
#> 5   5 c
#> 6   6 b
#> 7   7 c
#> 8   8 b
#> 9   9 c
#> 10 10 a
d$f <- cumsum(d$b == 'a')
lst <- split(d, d$f)
lst
#> $`1`
#>   a b f
#> 1 1 a 1
#>
#> $`2`
#>   a b f
#> 2 2 a 2
#> 3 3 c 2
#> 4 4 b 2
#> 5 5 c 2
#> 6 6 b 2
#> 7 7 c 2
#> 8 8 b 2
#> 9 9 c 2
#>
#> $`3`
#>     a b f
#> 10 10 a 3

Created on 2021-10-05 by the reprex package (v2.0.1)
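
The same cumulative-sum idea carries over to pandas. The sketch below is an analogue, not part of the R answer: it reuses the values from the example above and keeps the names `d`, `f` and `lst` only for easy comparison.

```python
import pandas as pd

# Same values as the R example above.
d = pd.DataFrame({
    "a": range(1, 11),
    "b": ["a", "a", "c", "b", "c", "b", "c", "b", "c", "a"],
})

# Running count of rows where b == 'a'; it increases by one at each 'a',
# so every 'a' row opens a new group.
d["f"] = (d["b"] == "a").cumsum()

# One dataframe per group id.
lst = {group_id: part for group_id, part in d.groupby("f")}

print({group_id: len(part) for group_id, part in lst.items()})
```

As in the R output, the first and last groups hold a single row and the middle group holds rows 2 through 9.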


