How to Split Data into Training/Testing Sets Using Sample Function

Split into training and testing set in R?

Using base R you can do the following:

set.seed(12345)
#drawing one set of 20 indices out of 100 (a training set of size .20);
#the same indices are used for x and y so the pairs stay aligned
train<-sample(1:100, 20)

#simulating random data
x<-rnorm(100)
y<-rnorm(100)

#sub-setting the x data
training.x.data<-x[train]
testing.x.data<-x[-train]

#sub-setting the y data
training.y.data<-y[train]
testing.y.data<-y[-train]

How do I create test and train samples from one dataframe with pandas?

I would just use numpy's rand to build a random Boolean mask:

In [9]: import numpy as np

In [10]: import pandas as pd

In [11]: df = pd.DataFrame(np.random.randn(100, 2))

In [12]: msk = np.random.rand(len(df)) < 0.8

In [13]: train = df[msk]

In [14]: test = df[~msk]

And just to check that this has worked (the split is only approximately 80/20, because each row is assigned independently):

In [15]: len(test)
Out[15]: 21

In [16]: len(train)
Out[16]: 79
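
If you need exactly 80 rows in the training set rather than roughly 80, one alternative to the mask is to draw the rows without replacement and take the remainder as the test set. The following is only a minimal sketch of that idea: the seed values and the use of `default_rng` to build the toy frame are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Reproducible toy frame with the same shape as above (100 rows, 2 columns).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 2)))

# Draw exactly 80% of the rows without replacement; random_state fixes the draw.
train = df.sample(frac=0.8, random_state=42)

# Every row that was not drawn for training goes to the test set.
test = df.drop(train.index)

print(len(train), len(test))  # 80 20
```

Because `drop` removes rows by index label, this assumes the index of `df` is unique, as it is for the default integer index.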

Split a data set into a training and test set with runif

Suppose your dataset, USArrests, has nrow(USArrests) rows; for simplicity, say 100. Then runif(nrow(USArrests)) creates 100 uniformly distributed random numbers between 0 and 1, one for every row in the dataset.

Next, the expression runif(nrow(USArrests)) < 0.5 checks whether each number is below 0.5 and returns TRUE or FALSE accordingly. This gives you a logical vector of length 100 (or nrow(USArrests)) that indicates whether each row belongs to the training or to the test dataset.

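It's not shown, but assuming that logical vector is stored in inTrain, you finally select your training data with

USArrests[inTrain,]

and your test data with

USArrests[!inTrain,]

Note the ! rather than -: inTrain is a logical vector, so -inTrain would turn it into 0s and -1s and would not select the complement.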

Split data in 5 subsets with choose(k,n) & NOT with sample()

I don't know what you mean by every possible combination of 5 subsets. That seems like an incredibly large number of possibilities. I assume you mean that you want 5 subsets that together contain all of the samples in your dataset. I would probably do something like this. We first make a vector of group labels 1 to k, recycled to the length of the dataset. We then shuffle the labels randomly and split the dataset by these groupings.

library(tidyverse)

set.seed(3465)
test_data <- tibble(A = runif(58),
                    B = runif(58))


k_split <- function(dat, k, seed = 1) {
  set.seed(seed)
  # group labels 1..k, recycled to the number of rows
  grp <- rep(1:k, length.out = nrow(dat))
  dat |>
    # shuffle the labels, split on them, then drop the helper column
    mutate(grp = sample(grp, nrow(dat), replace = FALSE)) |>
    group_split(grp) |>
    map(\(d) select(d, -grp))
}

k_split(test_data, 5)
#> [[1]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.476 0.468
#> 2 0.636 0.639
#> 3 0.334 0.0269
#> 4 0.668 0.220
#> 5 0.398 0.919
#> 6 0.343 0.748
#> 7 0.799 0.526
#> 8 0.710 0.759
#> 9 0.737 0.927
#> 10 0.819 0.441
#> 11 0.852 0.656
#> 12 0.416 0.541
#>
#> [[2]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.0107 0.905
#> 2 0.109 0.539
#> 3 0.715 0.778
#> 4 0.523 0.416
#> 5 0.609 0.357
#> 6 0.152 0.0972
#> 7 0.919 0.450
#> 8 0.866 0.510
#> 9 0.0347 0.0890
#> 10 0.862 0.465
#> 11 0.364 0.765
#> 12 0.789 0.601
#>
#> [[3]]
#> # A tibble: 12 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.580 0.228
#> 2 0.201 0.0418
#> 3 0.0359 0.417
#> 4 0.521 0.758
#> 5 0.534 0.974
#> 6 0.580 0.563
#> 7 0.844 0.781
#> 8 0.756 0.271
#> 9 0.211 0.533
#> 10 0.851 0.764
#> 11 0.885 0.150
#> 12 0.262 0.371
#>
#> [[4]]
#> # A tibble: 11 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.556 0.313
#> 2 0.353 0.821
#> 3 0.0959 0.861
#> 4 0.759 0.261
#> 5 0.207 0.772
#> 6 0.668 0.527
#> 7 0.150 0.788
#> 8 0.0939 0.257
#> 9 0.0913 0.817
#> 10 0.294 0.790
#> 11 0.0224 0.253
#>
#> [[5]]
#> # A tibble: 11 x 2
#> A B
#> <dbl> <dbl>
#> 1 0.0893 0.665
#> 2 0.966 0.142
#> 3 0.672 0.0849
#> 4 0.641 0.155
#> 5 0.490 0.187
#> 6 0.00394 0.295
#> 7 0.126 0.813
#> 8 0.202 0.474
#> 9 0.0740 0.107
#> 10 0.412 0.709
#> 11 0.509 0.253
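
For readers working in Python, the same label-and-shuffle idea can be sketched with the standard library alone. This is an illustrative analogue rather than a translation of the tidyverse code: the function reuses the name k_split only for readability, labels run from 0 to k-1 instead of 1 to k, and the rows are plain integers standing in for dataset rows.

```python
import random

def k_split(rows, k, seed=1):
    """Randomly partition rows into k groups whose sizes differ by at most one."""
    rng = random.Random(seed)
    # Labels 0..k-1 recycled to the length of the data,
    # playing the role of rep(1:k, length.out = n) in the R code.
    labels = [i % k for i in range(len(rows))]
    # Shuffle the labels so each row lands in a random group.
    rng.shuffle(labels)
    groups = [[] for _ in range(k)]
    for row, label in zip(rows, labels):
        groups[label].append(row)
    return groups

rows = list(range(58))
folds = k_split(rows, 5)
print([len(f) for f in folds])  # [12, 12, 12, 11, 11]
```

Shuffling the labels rather than assigning each row an independent random group is what keeps the group sizes balanced: 58 rows in 5 groups always gives three groups of 12 and two of 11, matching the tibble sizes above.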

