Splitting a data.frame by a variable

The base R function split() is appropriate here.

If you start with the following data frame:

df <- data.frame(ids=c(1,1,2,2,3),x=1:5,y=letters[1:5])

Then you can do:

split(df, df$ids)

And you will get a list of data frames, one per value of ids:

R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e
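
For readers doing the same thing in pandas rather than R, a rough analogue might look like the sketch below. The data mirrors the R example, and the name `pieces` is made up for illustration.

```python
import pandas as pd

# Same toy data as the R example above.
df = pd.DataFrame({"ids": [1, 1, 2, 2, 3], "x": [1, 2, 3, 4, 5], "y": list("abcde")})

# groupby yields (key, sub-frame) pairs; collect them into a dict keyed by id.
pieces = {key: group for key, group in df.groupby("ids")}

print(pieces[1])
```

A dict plays the role of R's named list here: `pieces[2]` is the sub-frame for `ids == 2`.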

Split a pandas DataFrame column into a variable number of columns

You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:

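import re
import numpy as np

def get_header_properties(header):
    # Type is everything before the first dot.
    pf_type = re.match(r".*?(?=\.)", header).group()
    # ID follows "Type." and runs up to the first comma or the end of the string.
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    # Whatever follows the ID holds the comma-separated coordinates.
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so every row has exactly two coordinate fields.
    return [pf_type, pf_id] + coords + [np.nan] * (2 - len(coords))

df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]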
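
The same parsing can probably be done without lookarounds. The sketch below assumes every header looks like "Type.ID" followed by at most two comma-separated coordinates; that format is inferred from the output further down, not stated anywhere. The name `parse_header` is hypothetical, and it pads with None rather than NaN.

```python
def parse_header(header, n_coords=2):
    """Split 'Type.ID,dim1,dim2' into [Type, ID, dim1, dim2], padding with None."""
    # Split on the first dot only, so any later dots are left alone.
    pf_type, rest = header.split(".", 1)
    # The ID is the first comma-separated field; the remainder are coordinates.
    pf_id, *coords = rest.split(",")
    # Pad short rows so every result has the same length.
    return [pf_type, pf_id] + coords + [None] * (n_coords - len(coords))

print(parse_header("LastType.LastID,1,2"))
```

It can be dropped into the same list comprehension in place of get_header_properties.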

That said, instead of the function, it is simpler and likely more efficient to call str.split once on the "index" column and join the result to df:

df = (df['index'].str.split('[.,]', expand=True)
      .fillna(np.nan)
      .rename(columns={i: col for i,col in enumerate(['Type','ID','dim1','dim2'])})
      .join(df[['value']]))

Output:

        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
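
One thing to watch with the str.split approach: if no row happens to have two coordinates, the expanded frame has only three columns and "dim2" never appears. A hedged sketch of one way around that is below. The input strings are reconstructed from the output above and are an assumption, as are the names `cols` and `parts`.

```python
import pandas as pd

# Hypothetical input reconstructed from the output table above.
df = pd.DataFrame({
    "index": ["FirstType.FirstID", "OtherType.OtherID,1", "LastType.LastID,1,2"],
    "value": [0.23, 50.0, 199.0],
})

cols = ["Type", "ID", "dim1", "dim2"]

# Split on either a dot or a comma; missing pieces come back as missing values.
parts = df["index"].str.split("[.,]", expand=True)

# Guarantee all four columns exist even if no row has two coordinates.
parts = parts.reindex(columns=range(len(cols)))
parts.columns = cols

out = parts.join(df[["value"]])
print(out)
```

The reindex step is the only real difference from the one-liner above; everything else is the same idea spelled out.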

Split a data frame at the date where the value of another variable reaches its max/min

You can use which.max to get the index of the maximum value (or which.min for the minimum) and use it to subset the data frame.

ind <- which.max(df$value)
df1 <- df[seq_len(ind - 1), ]
df2 <- df[ind:nrow(df), ]

df1
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-01     0
#2 2020-02-02     1
#3 2020-02-03     2

df2
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-04     7
#2 2020-02-05     3
#3 2020-02-06     4
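
If you are in pandas instead, roughly the same split can be sketched as follows. The data just mirrors the printed tibbles, and `pos` is a made-up name; note that it is 0-based, unlike the 1-based index from which.max.

```python
import pandas as pd

# Toy data mirroring the printed tibbles above.
df = pd.DataFrame({
    "date": ["2020-02-01", "2020-02-02", "2020-02-03",
             "2020-02-04", "2020-02-05", "2020-02-06"],
    "value": [0, 1, 2, 7, 3, 4],
})

# Position (not label) of the first maximum.
pos = int(df["value"].to_numpy().argmax())

df1 = df.iloc[:pos]  # rows before the maximum
df2 = df.iloc[pos:]  # the maximum row and everything after it

print(df1)
print(df2)
```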

If there are a lot of IDs and this has to be done for each ID, we can create a list of data frames instead.

library(dplyr)

result <- df %>%
  group_split(ID) %>%
  purrr::map(~{.x %>%
    group_split(row_number() < which.max(value), .keep = FALSE)})

# If needed, the list above can be combined back into a single data frame:
result_df <- result %>%
  bind_rows()

Splitting data frame according to (dichotomous) values in a column

# Using data frames
DF1 <- OriginalDF[OriginalDF$SEX == 0, ]
DF2 <- OriginalDF[OriginalDF$SEX == 1, ]

# If the data is very large, I recommend data.table
library(data.table)
OriginalDT <- data.table(OriginalDF)
DT1 <- OriginalDT[SEX == 0]
DT2 <- OriginalDT[SEX == 1]

Splitting a dataframe (csv)

I've never done this on a random basis but the basic approach would be:

1) import pandas
2) read in your csv
3) drop empty/null columns (avoids issues with these)
4) create a new dataframe to put the split values into
5) assign names to your new columns
6) split values and combine the values (using apply/combine/lambda)

Code sample:

# importing pandas module 
import pandas as pd 

# read in csv file 
data = pd.read_csv("https://mydata.csv") 

# drop rows that contain null values
data.dropna(inplace=True)

# create new data frame
# this 'split' code applies to splitting one column into two
new = data["ColumnName"].str.split(" ", n=1, expand=True)

# assign the first piece to a new column
data["A"] = new[0]  # 8 concatenated values will go here

# assign the second piece to a new column
data["B"] = new[1]  # last two [combined] values go here

## other/different code required for concatenation of column values - look at this linked SO question ##

# df display 
data
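
For something you can run without a real file, here is a small sketch of the same steps. The CSV text, the "Other" column and the sample values are all invented for illustration; only "ColumnName", "A" and "B" come from the code above. It also drops rows only where the column being split is missing, which is a slightly narrower choice than the blanket dropna.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "ColumnName,Other\nalpha beta gamma,1\ndelta epsilon,2\n,3\n"
data = pd.read_csv(io.StringIO(csv_text))

# Drop rows where the column to split is missing; copy so later
# assignments act on an independent frame.
data = data.dropna(subset=["ColumnName"]).copy()

# n=1 splits on the first space only, so at most two pieces are produced.
new = data["ColumnName"].str.split(" ", n=1, expand=True)
data["A"] = new[0]
data["B"] = new[1]

print(data)
```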

Hope this helps

How can I split a large dataset and remove the variable that it was split by [R]

A base R option is to first remove the grouping column with subset, and then split the resulting data frame by the original grouping column.

split(subset(my_data, select = -group), my_data$group)

However, if the grouping column is always in the first position, you can drop it by index instead of using subset.

split(my_data[-1], my_data$group)

Output

$`1`
         x         y
1 3.421037 0.2846179
2 9.219159 5.0449367
3 4.157628 1.3970608
4 3.412703 2.2196774
5 9.948763 6.5528746

$`2`
           x         y
6  0.3746215 3.4387533
7  3.0722134 0.5371084
8  3.0580508 0.4649525
9  3.6308661 6.5796197
10 6.4435513 3.0641620

Another base R option is to use subset inside lapply. That way you split first and then remove the grouping variable from each piece, all in one expression.

lapply(split(my_data, my_data$group, drop=TRUE), subset, select = -group)
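
For comparison, a pandas version of "split, then drop the grouping column" could be sketched like this. The numbers are made up; only the column names follow the R example, and `pieces` is a hypothetical name.

```python
import pandas as pd

# Hypothetical data with the same column names as the R example.
my_data = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "x": [3.4, 9.2, 0.4, 3.1],
    "y": [0.3, 5.0, 3.4, 0.5],
})

# Split by group, then drop the grouping column from each piece.
pieces = {key: part.drop(columns="group") for key, part in my_data.groupby("group")}

print(pieces[1])
```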
