Split a large dataframe into a list of data frames based on a common value in a column
You can just as easily access each element in the list using e.g. path[[1]]
. You can't put a set of matrices into an atomic vector and access each element: a matrix is an atomic vector with dimension attributes. I would use the list structure returned by split
; that is exactly what lists were designed for. Each list element can hold data of different types and sizes, so it is very versatile, and you can use *apply
functions to operate further on each element in the list. Example below.
# For reproducible data
set.seed(1)
# Make some data
userid <- rep(1:2, times = 4)
data1 <- replicate(8, paste(sample(letters, 3), collapse = ""))
data2 <- sample(10, 8)
df <- data.frame(userid, data1, data2)
# Split on userid
out <- split(df, f = df$userid)
#$`1`
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
#$`2`
# userid data1 data2
#2 2 xfv 4
#4 2 bfe 10
#6 2 mrx 2
#8 2 fqd 9
Access each element using the [[
operator like this:
out[[1]]
# userid data1 data2
#1 1 gjn 3
#3 1 yqp 1
#5 1 rjs 6
#7 1 jtw 5
Or use an *apply
function to do further operations on each list element. For instance, to take the mean of the data2
column you could use sapply like this:
sapply(out, function(x) mean(x$data2))
# 1 2
#3.75 6.25
How to split a dataframe into a list of dataframes based on distinct value ranges
A base R one-liner can split the data at the cut points in limits
(the numeric vector defined in the question):
split(df, findInterval(df$weight, limits))
#$`0`
# subject weight
#3 C 179
#5 E 195
#8 H 118
#10 J 229
#
#$`1`
# subject weight
#1 A 415
#2 B 463
#9 I 299
#
#$`2`
# subject weight
#4 D 526
#
#$`3`
# subject weight
#6 F 938
#7 G 818
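For readers working in pandas, the same range-based split can be sketched with numpy.searchsorted, which plays the role of findInterval here; the data and the limits vector below mirror the R example and are assumptions, not taken from the original question.

```python
import numpy as np
import pandas as pd

# Example data mirroring the R answer (assumed, not from the original post)
df = pd.DataFrame({
    "subject": list("ABCDEFGHIJ"),
    "weight":  [415, 463, 179, 526, 195, 938, 818, 118, 299, 229],
})
limits = [250, 500, 750]  # bin edges, analogous to the R `limits` vector

# searchsorted with side="right" counts how many limits each weight
# meets or exceeds -- the same 0..3 bin labels findInterval produces in R
bins = np.searchsorted(limits, df["weight"], side="right")

# dict of sub-dataframes keyed by bin number
out = {k: g for k, g in df.groupby(bins)}
```

A dict keyed by bin number stands in for the named list that split() returns in R.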
Splitting data frame into smaller data frames based on unique column values
As suggested, you can use groupby()
on your dataframe to split it by the values of one column:
import pandas as pd
cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
[1100, '08A85800', 0.0],
[2500, '08A85800', 0.0],
[323, '08951201', 0.0],
[800, '08A85800', 0.0]]
df = pd.DataFrame(data, columns=cols)
groups = df.groupby(['Code'])
Then you can recover the row indices from groups.indices
, which returns a dict with the 'Code' values as keys and the corresponding index arrays as values. Finally, if you want every sub-dataframe, you can call group_list = list(groups)
. I suggest doing the work in two steps (first group by, then call list), because that way you can still call other methods on the GroupBy object (groups
)
EDIT
Then, if you want a particular dataframe, you can call
df_i = group_list[i][1]
group_list[i]
is the i-th element of the list, but it is a tuple containing (group_val, group_df)
, where group_val
is the value associated with that sub-dataframe ('08951201'
or '08A85800'
) and group_df
is the sub-dataframe itself.
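Pulling the steps above together into one runnable sketch (the frames dict at the end is an extra convenience, not part of the original answer):

```python
import pandas as pd

cols = ['Quantity', 'Code', 'Value']
data = [[1757, '08951201', 717.0],
        [1100, '08A85800', 0.0],
        [2500, '08A85800', 0.0],
        [323,  '08951201', 0.0],
        [800,  '08A85800', 0.0]]
df = pd.DataFrame(data, columns=cols)

groups = df.groupby('Code')

# dict: 'Code' value -> array of row indices for that group
indices = groups.indices

# list of (group_val, group_df) tuples, as described above
group_list = list(groups)

# a dict keyed by the group value avoids positional indexing entirely
frames = {val: sub for val, sub in groups}
```

Looking a group up by its value (frames['08951201']) is usually less error-prone than remembering its position in group_list.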
Split a dataframe into a list of nested data frames and matrices
Do you mean something like this?
# diamonds_g is assumed here to be a list of data frames,
# e.g. diamonds_g <- split(diamonds, diamonds$cut)
result <- lapply(diamonds_g, function(x)
  list(factors = x[2:4], mat = as.matrix(x[6:10])))
Splitting dataframe into multiple dataframes
Firstly, your approach is inefficient: appending to the list row by row is slow, because the list has to be periodically grown when there is insufficient space for the new entry. List comprehensions are better in this respect, as the size is determined up front and allocated once.
However, I think your approach is fundamentally a little wasteful: you already have a dataframe, so why create a new one for each of these users?
I would sort the dataframe by column 'name'
, set the index to that column and, if required, not drop the column.
Then generate a list of all the unique entries and perform a lookup using those entries; crucially, if you are only querying the data, the selection criteria return a view on the dataframe without incurring a costly data copy.
Use pandas.DataFrame.sort_values
and pandas.DataFrame.set_index
:
# sort the dataframe by the 'name' column (axis=0, the default)
df.sort_values(by='name', inplace=True)
# set the index to be this and don't drop the column
df.set_index(keys=['name'], drop=False, inplace=True)
# get a list of unique names
names = df['name'].unique().tolist()
# now we can perform a lookup on a 'view' of the dataframe
joe = df.loc[df.name == 'joe']
# now you can query all 'joes'
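As a self-contained sketch of this approach, with made-up data (the 'score' column is an assumption for illustration):

```python
import pandas as pd

df = pd.DataFrame({'name': ['joe', 'amy', 'joe', 'sam', 'amy'],
                   'score': [1, 2, 3, 4, 5]})

# sort by name (axis=0, the default), then index by it, keeping the column
df.sort_values(by='name', inplace=True)
df.set_index(keys=['name'], drop=False, inplace=True)

# unique names, then one lookup per name; each lookup is a selection
# on the existing dataframe rather than a freshly built one
names = df['name'].unique().tolist()
frames = {n: df.loc[df.name == n] for n in names}
```

Each value in frames is a selection from the original dataframe, so no per-row appending ever happens.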
Split a large pandas dataframe
Use np.array_split
:
Docstring:
Split an array into multiple sub-arrays.
Please refer to the ``split`` documentation. The only difference
between these functions is that ``array_split`` allows
`indices_or_sections` to be an integer that does *not* equally
divide the axis.
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',
   ...:                           'foo', 'bar', 'foo', 'foo'],
   ...:                    'B' : ['one', 'one', 'two', 'three',
   ...:                           'two', 'two', 'one', 'three'],
   ...:                    'C' : np.random.randn(8), 'D' : np.random.randn(8)})
In [4]: print(df)
     A      B         C         D
0  foo    one -0.174067 -0.608579
1  bar    one -0.860386 -1.210518
2  foo    two  0.614102  1.689837
3  bar  three -0.284792 -1.071160
4  foo    two  0.843610  0.803712
5  bar    two -1.514722  0.870861
6  foo    one  0.131529 -0.968151
7  foo  three -1.002946 -0.257468
In [5]: np.array_split(df, 3)
Out[5]:
[     A    B         C         D
 0  foo  one -0.174067 -0.608579
 1  bar  one -0.860386 -1.210518
 2  foo  two  0.614102  1.689837,
      A      B         C         D
 3  bar  three -0.284792 -1.071160
 4  foo    two  0.843610  0.803712
 5  bar    two -1.514722  0.870861,
      A      B         C         D
 6  foo    one  0.131529 -0.968151
 7  foo  three -1.002946 -0.257468]
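A minimal sketch of that behaviour with deterministic data: array_split accepts a section count that does not evenly divide the row count, which plain np.split would reject.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(8)})

# 8 rows into 3 chunks: array_split tolerates the uneven division
chunks = np.array_split(df, 3)

sizes = [len(c) for c in chunks]

# each chunk is itself a DataFrame, so normal pandas operations apply
first = chunks[0]
```

The earlier chunks absorb the remainder, so 8 rows over 3 chunks gives sizes 3, 3 and 2.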
Split dataframe by unique values in column and create objects in global environment
We can use group_split
from dplyr
to return a list
of data.frames/tibbles:
library(dplyr)
dfone %>%
group_split(subproduct)
It may be better to split into a list
and do the transformations within that list
. But global objects can be created by looping over the sequence of unique
'subproduct' values and calling assign
to create a new object ('subprod1', 'subprod2', ...) from the subset
of data for that particular 'subproduct':
un1 <- unique(dfone$subproduct)
for(i in seq_along(un1))
assign(paste0('subprod', i), subset(dfone, subproduct == un1[i]))