Splitting a data.frame by a variable

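The base R function split fits this task: it divides a data frame into a list of data frames, one per value of the grouping variable.

If you start with the following data frame:

df <- data.frame(ids = c(1, 1, 2, 2, 3), x = 1:5, y = letters[1:5])

Then you can call:

split(df, df$ids)

And you will get a list of data frames, named by the values of ids: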

R> split(df, df$ids)
$`1`
  ids x y
1   1 1 a
2   1 2 b

$`2`
  ids x y
3   2 3 c
4   2 4 d

$`3`
  ids x y
5   3 5 e
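
For readers working in pandas rather than R, a rough counterpart to split is to iterate over groupby and collect the groups in a dict. This is only a sketch that mirrors the R example above; the variable name pieces is an assumption, not part of the original answer.

```python
import pandas as pd

# Same data as the R example above.
df = pd.DataFrame({"ids": [1, 1, 2, 2, 3],
                   "x": range(1, 6),
                   "y": list("abcde")})

# Iterating over a groupby yields (key, sub-frame) pairs,
# so a dict comprehension gives one data frame per value of ids.
pieces = {key: group for key, group in df.groupby("ids")}

print(pieces[1])
```

Unlike R's split, the keys here keep their original type (integers) instead of becoming strings.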

Split a pandas DataFrame column into a variable number of columns

You could slightly change the function and use it in a list comprehension; then assign the nested list to columns:

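import re
import numpy as np

def get_header_properties(header):
    # Everything before the first "." is the type
    pf_type = re.match(r".*?(?=\.)", header).group()
    # The ID runs from after "type." up to the first "," (or the end of the string)
    pf_id = re.search(rf"(?<={re.escape(pf_type)}\.).*?(?=(,|$))", header).group()
    # Whatever follows the ID holds the comma-separated coordinates
    pf_coords = re.search(rf"(?<={re.escape(pf_id)}).*", header).group()
    coords = pf_coords.split(",")[1:]
    # Pad with NaN so that short headers still yield two dimension values
    return [pf_type, pf_id] + coords + [np.nan] * max(0, 2 - len(coords))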

df[['Type','ID','dim1','dim2']] = [get_header_properties(i) for i in df['index']]
out = df.drop(columns='index')[['Type','ID','dim1','dim2','value']]

That said, it is simpler and more efficient to skip the function: call str.split once on the "index" column, splitting on either "." or ",", and join the result to df:

df = (df['index'].str.split('[.,]', expand=True)
      .fillna(np.nan)
      .rename(columns={i: col for i, col in enumerate(['Type', 'ID', 'dim1', 'dim2'])})
      .join(df[['value']]))

Output:

        Type       ID dim1 dim2   value
0  FirstType  FirstID  NaN  NaN    0.23
1  OtherType  OtherID    1  NaN   50.00
2  OtherType  OtherID    4  NaN   60.00
3   LastType   LastID    1    1  110.00
4   LastType   LastID    1    2  199.00
5   LastType   LastID    2    3  123.00
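
The answer does not show its input, so here is a self-contained sketch of the str.split approach. The "index" strings below are an assumption, reconstructed from the output above; the reindex step is an extra safeguard (not in the original answer) that keeps four columns even when no row carries both dimensions.

```python
import pandas as pd

# Hypothetical input, reconstructed from the output shown above.
df = pd.DataFrame({
    "index": [
        "FirstType.FirstID",
        "OtherType.OtherID,1",
        "OtherType.OtherID,4",
        "LastType.LastID,1,1",
        "LastType.LastID,1,2",
        "LastType.LastID,2,3",
    ],
    "value": [0.23, 50.0, 60.0, 110.0, 199.0, 123.0],
})

names = ["Type", "ID", "dim1", "dim2"]

# A multi-character pattern is treated as a regex: split on "." or ",".
parts = df["index"].str.split(r"[.,]", expand=True)

# Guarantee exactly four columns, then give them readable names.
parts = parts.reindex(columns=range(len(names)))
parts.columns = names

out = parts.join(df[["value"]])
print(out)
```

Rows with fewer pieces than columns end up with missing values in dim1 and dim2, which is what produces the variable number of filled columns.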

Split a data frame at the date where another variable reaches its max/min

You can use which.max to get the row index of the maximum value (or which.min for the minimum) and use it to subset the data frame.

ind <- which.max(df$value)
df1 <- df[seq_len(ind - 1), ]
df2 <- df[ind:nrow(df), ]

df1
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-01     0
#2 2020-02-02     1
#3 2020-02-03     2

df2
# A tibble: 3 x 2
#  date       value
#  <chr>      <dbl>
#1 2020-02-04     7
#2 2020-02-05     3
#3 2020-02-06     4
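
The same idea can be sketched in pandas, assuming the data shown in the output above. Series.argmax returns the 0-based position of the first maximum, so it plays the role of which.max here; the names before and from_max are hypothetical.

```python
import pandas as pd

# Data taken from the output shown above.
df = pd.DataFrame({
    "date": ["2020-02-01", "2020-02-02", "2020-02-03",
             "2020-02-04", "2020-02-05", "2020-02-06"],
    "value": [0, 1, 2, 7, 3, 4],
})

# 0-based position of the first maximum (R's which.max is 1-based).
pos = int(df["value"].argmax())

before = df.iloc[:pos]     # rows strictly before the maximum
from_max = df.iloc[pos:]   # the maximum row and everything after it

print(before)
print(from_max)
```

Use argmin instead of argmax to split at the minimum.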

If there are many IDs and the split has to be done for each one, you can build a list of data frames instead.

library(dplyr)

result <- df %>%
  group_split(ID) %>%
  purrr::map(~{.x %>%
    group_split(row_number() < which.max(value), .keep = FALSE)})

## To combine the list above back into a single data frame:
result_df <- result %>%
  bind_rows()


Splitting data frame according to (dichotomous) values in a column


# Using data frames
DF1 <- OriginalDF[OriginalDF$SEX == 0, ]
DF2 <- OriginalDF[OriginalDF$SEX == 1, ]

# For a very large data set, data.table is a faster option
library(data.table)
OriginalDT <- data.table(OriginalDF)
DT1 <- OriginalDT[SEX == 0]
DT2 <- OriginalDT[SEX == 1]
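
If the data lives in pandas instead, the same two-way split can be sketched with boolean masks. The sample values and the AGE column are made up for illustration; only the SEX column name comes from the answer above.

```python
import pandas as pd

# Hypothetical sample data; SEX is coded 0/1 as in the R example.
original = pd.DataFrame({"SEX": [0, 1, 1, 0],
                         "AGE": [34, 29, 51, 42]})

# Each comparison builds a boolean mask that selects the matching rows.
df0 = original[original["SEX"] == 0]
df1 = original[original["SEX"] == 1]

print(df0)
print(df1)
```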

Splitting a dataframe (csv)

I've never done this on a random basis but the basic approach would be:

  1. import pandas
  2. read in your csv
  3. drop rows with empty/null values (to avoid issues with these)
  4. create a new dataframe to put the split values into
  5. assign names to your new columns
  6. split values and combine the values (using apply/combine/lambda)

Code sample:

# import the pandas module
import pandas as pd

# read in the csv file
data = pd.read_csv("https://mydata.csv")

# drop rows that contain null values
data.dropna(inplace=True)

# create a new data frame by splitting one column into two at the first space
new = data["ColumnName"].str.split(" ", n=1, expand=True)

# assign a name to the first new column
data["A"] = new[0]  # the 8 concatenated values go here

# assign a name to the second new column
data["B"] = new[1]  # the last two (combined) values go here

## concatenating column values requires different code; see the linked SO question ##

# display the data frame
data
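
Since the URL above is only a placeholder, here is a version of the same steps that runs as-is by reading a small in-memory CSV. The CSV text and the Score column are invented for illustration, and assign is used instead of column assignment purely as a style choice.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = (
    "ColumnName,Score\n"
    "Ada Lovelace,1\n"
    "Grace Brewster Hopper,2\n"
    ",3\n"
)

data = pd.read_csv(io.StringIO(csv_text))

# Drop rows with missing values so the string split has nothing to trip on.
data = data.dropna()

# n=1 splits only at the first space, giving exactly two pieces per row.
new = data["ColumnName"].str.split(" ", n=1, expand=True)

# Attach the two pieces as new columns A and B.
data = data.assign(A=new[0], B=new[1])

print(data)
```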

Hope this helps

How can I split a large dataset and remove the variable that it was split by [R]

A base R option is to first remove the grouping column with subset, then split the remaining columns by the original grouping column.

split(subset(my_data, select = -group), my_data$group)

However, if the grouping column is always in the first position, you can drop it by index instead of using subset.

split(my_data[-1], my_data$group) 

Output

$`1`
         x         y
1 3.421037 0.2846179
2 9.219159 5.0449367
3 4.157628 1.3970608
4 3.412703 2.2196774
5 9.948763 6.5528746

$`2`
           x         y
6  0.3746215 3.4387533
7  3.0722134 0.5371084
8  3.0580508 0.4649525
9  3.6308661 6.5796197
10 6.4435513 3.0641620

Another base R option is to use subset inside lapply, which splits the data and removes the grouping variable in one step.

lapply(split(my_data, my_data$group, drop=TRUE), subset, select = -group)
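
For comparison, a pandas sketch of the same "split, then drop the grouping column" pattern could look like this. The sample values are made up; only the names my_data and group follow the R code above.

```python
import pandas as pd

# Hypothetical sample data with a grouping column named "group".
my_data = pd.DataFrame({
    "group": [1, 1, 2, 2],
    "x": [3.4, 9.2, 0.4, 3.1],
    "y": [0.3, 5.0, 3.4, 0.5],
})

# Split by group, dropping the grouping column from each piece.
pieces = {key: part.drop(columns="group")
          for key, part in my_data.groupby("group")}

print(pieces[1])
```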

