Splitting a data.frame by a variable
split() seems to be appropriate here.
If you start with the following data frame:
df <- data.frame(ids=c(1,1,2,2,3),x=1:5,y=letters[1:5])
Then you can do:
split(df, df$ids)
And you will get a list of data frames:
R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e
Split a pandas DataFrame column into a variable number of columns
You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:
import re
import numpy as np

def get_header_properties(header):
    # Type: everything before the first "."
    pf_type = re.match(r".*?(?=\.)", header).group()
    # ID: everything between "Type." and the first "," (or the end of the string)
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so that every row has two coordinate fields
    return [pf_type, pf_id] + coords + [np.nan] * (2 - len(coords))

df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]
That said, instead of the function, it seems simpler and more efficient to use str.split once on the "index" column and join the result to df:
df = (df['index'].str.split('[.,]', expand=True)
      .fillna(np.nan)
      .rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
      .join(df[['value']]))
Output:
        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
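To see why a single split on '[.,]' is enough, here is a minimal plain-Python sketch of what happens to each value of the "index" column. The sample headers are assumptions reconstructed from the output above, and split_header is a hypothetical helper for illustration, not part of pandas; it uses None where pandas would show a missing value.

```python
import re

# Assumed sample values for the "index" column, reconstructed from the output above.
headers = ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,2,3"]

def split_header(header, width=4):
    """Split on '.' or ',' and pad with None up to a fixed number of fields."""
    parts = re.split(r"[.,]", header)
    # A negative multiplier yields an empty list, so longer rows are left unpadded.
    return parts + [None] * (width - len(parts))

rows = [split_header(h) for h in headers]
for row in rows:
    print(row)
```

Each row ends up with the same number of fields, which is roughly what expand=True does when the pieces are laid out as columns.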
Split a data frame after the date on which another variable reaches its max/min
You can use which.max to get the index of the max value and use it to subset the data frame.
ind <- which.max(df$value)
df1 <- df[seq_len(ind - 1), ]
df2 <- df[ind:nrow(df), ]
df1
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-01     0
#2 2020-02-02     1
#3 2020-02-03     2
df2
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-04     7
#2 2020-02-05     3
#3 2020-02-06     4
If there are a lot of IDs and this has to be done for each ID, we could create a list of data frames.
library(dplyr)

result <- df %>%
  group_split(ID) %>%
  purrr::map(~{.x %>%
      group_split(row_number() < which.max(value), .keep = FALSE)})
## In case someone is interested, you could make a data frame from the list above as follows:
result_df <- result %>%
  bind_rows()
Splitting data frame according to (dichotomous) values in a column
# Using data frames
DF1 <- OriginalDF[OriginalDF$SEX == 0, ]
DF2 <- OriginalDF[OriginalDF$SEX == 1, ]
# If the data is very large, I recommend data.table
library(data.table)
OriginalDT <- data.table(OriginalDF)
DT1 <- OriginalDT[SEX == 0]
DT2 <- OriginalDT[SEX == 1]
Splitting a dataframe (csv)
I've never done this on a random basis, but the basic approach would be:
- import pandas
- read in your csv
- drop empty/null columns (to avoid issues with these)
- create a new dataframe to put the split values into
- assign names to your new columns
- split values and combine the values (using apply/combine/lambda)
Code sample:
# import the pandas module
import pandas as pd
# read in the csv file
data = pd.read_csv("https://mydata.csv")
# drop rows that contain null values (pass axis=1 to drop columns instead)
data.dropna(inplace=True)
# create a new data frame by splitting one column into two at the first space
new = data["ColumnName"].str.split(" ", n=1, expand=True)
# assign the first part to a new column
data["A"] = new[0]  # 8 concatenated values will go here
# assign the remainder to a second new column
data["B"] = new[1]  # last two [combined] values go here
## other/different code is required to concatenate column values - look at this linked SO question ##
# display the data frame
data
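As a rough illustration of what n = 1 does in the split above, here is a plain-Python sketch that applies the same "split once at the first space" rule to a few strings. The sample values and the split_once helper are hypothetical, chosen only for illustration; None stands in for the missing value pandas would produce when there is nothing after the split.

```python
# Hypothetical sample values standing in for data["ColumnName"].
values = ["alpha beta gamma", "one two", "single"]

def split_once(value):
    """Split at the first space only; pad to two fields if there is no space."""
    parts = value.split(" ", 1)  # maxsplit=1 plays the role of n=1 in pandas
    return parts + [None] * (2 - len(parts))

pairs = [split_once(v) for v in values]
col_a = [first for first, _ in pairs]  # would become column "A"
col_b = [rest for _, rest in pairs]    # would become column "B"
print(col_a)
print(col_b)
```

Everything after the first space stays together in the second field, which is why a value with several spaces still yields exactly two columns.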
Hope this helps
How can I split a large dataset and remove the variable that it was split by [R]
A base R option is to first use subset to remove the grouping column from the data, and then split the data frame by the original grouping column.
split(subset(my_data, select = -group), my_data$group)
However, if the grouping column is always in the first position, then you can just use the index, rather than subset, to remove the grouping column for the output.
split(my_data[-1], my_data$group)
Output
$`1`
         x         y
1 3.421037 0.2846179
2 9.219159 5.0449367
3 4.157628 1.3970608
4 3.412703 2.2196774
5 9.948763 6.5528746

$`2`
           x         y
6  0.3746215 3.4387533
7  3.0722134 0.5371084
8  3.0580508 0.4649525
9  3.6308661 6.5796197
10 6.4435513 3.0641620
Another base R option is to use subset inside lapply. That way you can split the data and remove the grouping variable from each piece in a single line.
lapply(split(my_data, my_data$group, drop=TRUE), subset, select = -group)