Split a Large Dataframe into a List of Data Frames Based on Common Value in Column

You can just as easily access each element in the list using, e.g., out[[1]]. You can't put a set of matrices into an atomic vector and access each element; a matrix is an atomic vector with dimension attributes. I would use the list structure returned by split: it's what it was designed for. Each list element can hold data of different types and sizes, so it's very versatile, and you can use the *apply functions to operate further on each element of the list. Example below.

# For reproducible data
set.seed(1)

# Make some data
userid <- rep(1:2, times = 4)
data1 <- replicate(8, paste(sample(letters, 3), collapse = ""))
data2 <- sample(10, 8)
df <- data.frame(userid, data1, data2)

# Split on userid
out <- split(df, f = df$userid)
out
#$`1`
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5

#$`2`
#  userid data1 data2
#2      2   xfv     4
#4      2   bfe    10
#6      2   mrx     2
#8      2   fqd     9

Access each element using the [[ operator like this:

out[[1]]
#  userid data1 data2
#1      1   gjn     3
#3      1   yqp     1
#5      1   rjs     6
#7      1   jtw     5

Or use an *apply function to do further operations on each list element. For instance, to take the mean of the data2 column you could use sapply like this:

sapply(out, function(x) mean(x$data2))
#   1    2
#3.75 6.25

How to split a dataframe into a list of dataframes based on distinct value ranges

Assuming a data frame df with a weight column and a vector of break points limits, a base R one-liner can split the data with findInterval:

split(df, findInterval(df$weight, limits))
#$`0`
#   subject weight
#3        C    179
#5        E    195
#8        H    118
#10       J    229
#
#$`1`
#  subject weight
#1       A    415
#2       B    463
#9       I    299
#
#$`2`
#  subject weight
#4       D    526
#
#$`3`
#  subject weight
#6       F    938
#7       G    818

Splitting data frame into smaller data frames based on unique column values

As suggested, you can use groupby() on your dataframe to split it by the values of one column:

import pandas as pd

cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323, '08951201', 0.0],
        [800, '08A85800', 0.0]]

df = pd.DataFrame(data, columns=cols)

groups = df.groupby('Code')

Then you can recover the indices via groups.indices, which returns a dict with the 'Code' values as keys and the corresponding row indices as values. Finally, if you want every sub-dataframe, you can call group_list = list(groups). I suggest doing the work in two steps (first the groupby, then the list call), because this way you can still call other methods on the GroupBy object (groups) first.


EDIT

Then if you want a particular dataframe you can call

 df_i = group_list[i][1]

group_list[i] is the i-th group, but it's a tuple (group_val, group_df), where group_val is the 'Code' value associated with that dataframe ('08951201' or '08A85800') and group_df is the dataframe itself.
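Putting the pieces together, a minimal runnable sketch (same columns and values as above):

```python
import pandas as pd

cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323, '08951201', 0.0],
        [800, '08A85800', 0.0]]
df = pd.DataFrame(data, columns=cols)

groups = df.groupby('Code')

# dict mapping each 'Code' value to the row indices of its group
indices = groups.indices

# list of (group_val, group_df) tuples, one per 'Code' value
group_list = list(groups)
code, sub_df = group_list[0]
```

groupby sorts the keys by default, so group_list[0] holds the '08951201' group here.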

Split a dataframe into a list of nested data frames and matrices

Do you mean something like this?

result <- lapply(diamonds_g, function(x)
  list(factors = x[2:4], mat = as.matrix(x[6:10])))

Splitting dataframe into multiple dataframes

Firstly, your approach is inefficient: appending to the list row by row will be slow, because the list has to be periodically regrown when there is insufficient space for a new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.

However, I think your approach is also fundamentally a little wasteful: you already have a dataframe, so why create a new one for each of these users?

I would sort the dataframe by column 'name', set the index to be this column and, if required, not drop it.

Then generate a list of all the unique entries and perform a lookup using these entries. Crucially, if you are only querying the data, the selection criteria return a view on the dataframe without incurring a costly data copy.

Use pandas.DataFrame.sort_values and pandas.DataFrame.set_index:

# sort the dataframe by 'name'
df.sort_values(by='name', inplace=True)

# set the index to this column and don't drop it
df.set_index(keys=['name'], drop=False, inplace=True)

# get a list of unique names
names = df['name'].unique().tolist()

# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']

# now you can query all 'joes'
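If you want all the per-name sub-dataframes at once rather than one lookup at a time, the same selection can be run over every unique name; a sketch with a made-up 'name' column:

```python
import pandas as pd

df = pd.DataFrame({'name': ['joe', 'ann', 'joe', 'bob', 'ann'],
                   'score': [1, 2, 3, 4, 5]})

# sort, index by 'name' (kept as a column), collect unique names
df.sort_values(by='name', inplace=True)
df.set_index(keys=['name'], drop=False, inplace=True)
names = df['name'].unique().tolist()

# one sub-dataframe per unique name, keyed by that name
frames = {name: df.loc[df.name == name] for name in names}
```

frames['joe'] then holds only joe's rows, and no per-row appending ever takes place.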

Split a large pandas dataframe

Use np.array_split:

Docstring:
Split an array into multiple sub-arrays.

Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
   ...:                          'foo', 'bar', 'foo', 'foo'],
   ...:                    'B': ['one', 'one', 'two', 'three',
   ...:                          'two', 'two', 'one', 'three'],
   ...:                    'C': np.random.randn(8), 'D': np.random.randn(8)})

In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468

In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
 0  foo  one -0.174067 -0.608579
 1  bar  one -0.860386 -1.210518
 2  foo  two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]

Split dataframe by unique values in column and create objects in global environment

We can use group_split from dplyr to return a list of data.frame/tibbles

library(dplyr)
dfone %>%
  group_split(subproduct)

It may be better to split into a list and do the transformations within the list. But global objects can be created by looping over the sequence of unique 'subproduct' elements and then using assign to create new objects ('subprod1', 'subprod2', ...) from the subset of data for that particular 'subproduct':

un1 <- unique(dfone$subproduct)
for(i in seq_along(un1))
  assign(paste0('subprod', i), subset(dfone, subproduct == un1[i]))

