Split into training and testing set in R?
Using base R you can do the following:
set.seed(12345)
# drawing the training indices once: 20% of the data (in this case 20 out of 100)
train <- sample(1:100, 20)
# simulating random data
x <- rnorm(100)
y <- rnorm(100)
# sub-setting the x data
training.x.data <- x[train]
testing.x.data <- x[-train]
# sub-setting the y data with the same indices, so x and y stay paired
training.y.data <- y[train]
testing.y.data <- y[-train]
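For comparison, here is a minimal sketch of the same index-based split in Python, using only the standard library. The variable names are illustrative and the Gaussian data just stands in for rnorm(100). The point it shows is that one set of indices is drawn and then applied to both vectors.

```python
import random

rng = random.Random(12345)  # fixed seed, playing the role of set.seed(12345)

n = 100
# Simulated paired data, standing in for x <- rnorm(100) and y <- rnorm(100)
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rng.gauss(0, 1) for _ in range(n)]

# Draw 20 distinct positions once and reuse them for both vectors
train_idx = sorted(rng.sample(range(n), 20))
train_set = set(train_idx)
test_idx = [i for i in range(n) if i not in train_set]

training_x = [x[i] for i in train_idx]
training_y = [y[i] for i in train_idx]
testing_x = [x[i] for i in test_idx]
testing_y = [y[i] for i in test_idx]

print(len(training_x), len(testing_x))  # 20 80
```

Because the test indices are built as the complement of the training indices, no observation can end up in both sets.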
How do I create test and train samples from one dataframe with pandas?
I would just use numpy's rand to build a random boolean mask (this assumes numpy and pandas are imported as np and pd):
In [11]: df = pd.DataFrame(np.random.randn(100, 2))
In [12]: msk = np.random.rand(len(df)) < 0.8
In [13]: train = df[msk]
In [14]: test = df[~msk]
And just to check that this has worked (the mask is random, so the sizes are only approximately 80/20):
In [15]: len(test)
Out[15]: 21
In [16]: len(train)
Out[16]: 79
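If you need the sizes to be exact rather than approximate, one option is to shuffle the row positions and cut once. The following is only a sketch in plain Python: split_exact and the rows list are hypothetical names, with the list of tuples standing in for the rows of a DataFrame.

```python
import random

def split_exact(rows, train_frac=0.8, seed=0):
    """Shuffle row positions and cut once, so the split sizes are fixed."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    cut = int(round(train_frac * len(rows)))
    train = [rows[i] for i in idx[:cut]]
    test = [rows[i] for i in idx[cut:]]
    return train, test

rows = [(i, i * i) for i in range(100)]  # hypothetical stand-in for DataFrame rows
train, test = split_exact(rows)
print(len(train), len(test))  # 80 20
```

In pandas itself, df.sample(frac=0.8, random_state=...) followed by df.drop(train.index) should play the same role, assuming the index is unique.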
Split a data set into a training and test set with runif
You have a dataset named USArrests with nrow(USArrests) rows, let's say 100 for the sake of simplicity. So runif(nrow(USArrests)) creates 100 uniformly distributed random numbers, i.e. one number for every row in your dataset.
Next, your expression runif(nrow(USArrests)) < 0.5 checks whether each number is below 0.5, returning TRUE or FALSE. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether a row belongs to the training or to the test dataset.
It's not shown, but assuming the logical vector is stored in inTrain, you finally select your training data by
USArrests[inTrain,]
and your test data by
USArrests[!inTrain,]
Note that the complement has to be taken with ! and not with -. Arithmetic negation turns a logical vector into 0 and -1, so USArrests[-inTrain,] would not select the test rows (it typically just drops the first row).
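The mask idea described above can be sketched in a few lines of plain Python. This is an illustration only: rows is a hypothetical list of row ids, and in_train plays the role of the logical vector.

```python
import random

rng = random.Random(1)
rows = list(range(100))  # hypothetical row ids, one per row of the dataset

# One uniform draw per row; True marks a training row
in_train = [rng.random() < 0.5 for _ in rows]

# Training rows are where the mask is True, test rows are its logical complement
train = [r for r, keep in zip(rows, in_train) if keep]
test = [r for r, keep in zip(rows, in_train) if not keep]

# Every row lands in exactly one of the two sets
assert len(train) + len(test) == len(rows)
```

The two set sizes vary from run to run, since each row is assigned independently with probability 0.5.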
Split data in 5 subsets with choose(k,n) & NOT with sample()
I don't know what you mean by every possible combination of 5 subsets; that seems like an incredibly large number of possibilities. I assume you mean that you want 5 subsets that together contain all of the samples in your dataset. I would probably do something like this. We first make a vector of group labels by repeating 1 to k up to the length of the dataset. We then shuffle these labels randomly and split the dataset by the resulting groups.
library(tidyverse)
set.seed(3465)
test_data <- tibble(A = runif(58),
                    B = runif(58))
k_split <- function(dat, k, seed = 1) {
  set.seed(seed)
  grp <- rep(1:k, length.out = nrow(dat))
  dat |>
    mutate(grp = sample(grp, nrow(dat), replace = FALSE)) |>
    group_split(grp) |>
    map(\(d) select(d, -grp))
}
k_split(test_data, 5)
#> [[1]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.476 0.468
#> 2 0.636 0.639
#> 3 0.334 0.0269
#> 4 0.668 0.220
#> 5 0.398 0.919
#> 6 0.343 0.748
#> 7 0.799 0.526
#> 8 0.710 0.759
#> 9 0.737 0.927
#> 10 0.819 0.441
#> 11 0.852 0.656
#> 12 0.416 0.541
#>
#> [[2]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.0107 0.905
#> 2 0.109 0.539
#> 3 0.715 0.778
#> 4 0.523 0.416
#> 5 0.609 0.357
#> 6 0.152 0.0972
#> 7 0.919 0.450
#> 8 0.866 0.510
#> 9 0.0347 0.0890
#> 10 0.862 0.465
#> 11 0.364 0.765
#> 12 0.789 0.601
#>
#> [[3]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.580 0.228
#> 2 0.201 0.0418
#> 3 0.0359 0.417
#> 4 0.521 0.758
#> 5 0.534 0.974
#> 6 0.580 0.563
#> 7 0.844 0.781
#> 8 0.756 0.271
#> 9 0.211 0.533
#> 10 0.851 0.764
#> 11 0.885 0.150
#> 12 0.262 0.371
#>
#> [[4]]
#> # A tibble: 11 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.556 0.313
#> 2 0.353 0.821
#> 3 0.0959 0.861
#> 4 0.759 0.261
#> 5 0.207 0.772
#> 6 0.668 0.527
#> 7 0.150 0.788
#> 8 0.0939 0.257
#> 9 0.0913 0.817
#> 10 0.294 0.790
#> 11 0.0224 0.253
#>
#> [[5]]
#> # A tibble: 11 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.0893 0.665
#> 2 0.966 0.142
#> 3 0.672 0.0849
#> 4 0.641 0.155
#> 5 0.490 0.187
#> 6 0.00394 0.295
#> 7 0.126 0.813
#> 8 0.202 0.474
#> 9 0.0740 0.107
#> 10 0.412 0.709
#> 11 0.509 0.253
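The same label-and-shuffle idea can be sketched in plain Python. This is a rough equivalent rather than a translation of the tidyverse code: the function name is reused for readability and the integers stand in for data rows.

```python
import random

def k_split(rows, k, seed=1):
    """Assign each row one of k labels, balanced in size, in random order."""
    rng = random.Random(seed)
    n = len(rows)
    # 0, 1, ..., k-1, 0, 1, ... up to length n, like rep(1:k, length.out = n)
    labels = [i % k for i in range(n)]
    rng.shuffle(labels)
    groups = [[] for _ in range(k)]
    for row, g in zip(rows, labels):
        groups[g].append(row)
    return groups

folds = k_split(list(range(58)), 5)
print([len(f) for f in folds])  # [12, 12, 12, 11, 11]
```

Since 58 is not divisible by 5, the first three groups get one extra row, which matches the 12/12/12/11/11 sizes in the output above.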