### How to split a data set to do 10-fold cross validation

• user22062

7 years ago

I have an `R` data frame (`training`). Can anyone tell me how to randomly split this data set to do 10-fold cross validation?

Be sure to repeat the entire process 100 times to achieve satisfactory precision, and to sample the case and control observations separately before combining them into each block. If you use `caret::train`, you don't even need to worry about this: it is done internally, and you can choose the number of folds. If you insist on doing this "by hand", use stratified sampling on the class, as implemented in `caret::createFolds`. I have locked this thread because every one of the many answers treats it as purely a coding question rather than one of general statistical interest.
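As the comments suggest, `caret::train` can handle repeated k-fold cross-validation internally via `trainControl`. A minimal sketch (the data frame `training`, its outcome column `Class`, and the choice of `method = "glm"` are placeholder assumptions):

```r
library(caret)

# 10-fold CV repeated 100 times; folds are stratified on the outcome by default
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 100)

fit <- train(Class ~ ., data = training, method = "glm", trControl = ctrl)
fit$results  # performance averaged over all 1000 resamples
```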

• Ari B. Friedman

7 years ago

`caret` has a function for this:

``````require(caret)
flds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
names(flds)[1] <- "train"
``````

Then each element of `flds` is a list of row indexes for one fold. If your dataset is called `dat`, then `dat[flds$train, ]` gets you the training set, `dat[flds[[2]], ]` gets you the second fold's set, etc.

• Jake Drew

6 years ago

Here is a simple way to perform 10-fold cross-validation using no packages:

``````#Randomly shuffle the data
yourData <- yourData[sample(nrow(yourData)), ]

#Create 10 equally sized folds
folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)

#Perform 10-fold cross validation
for(i in 1:10){
  #Segment your data by fold using the which() function
  testIndexes <- which(folds == i, arr.ind = TRUE)
  testData <- yourData[testIndexes, ]
  trainData <- yourData[-testIndexes, ]
  #Use the test and train data partitions however you desire...
}
``````

-1: caret functions do stratified sampling, which you are not doing. What's the point of reinventing the wheel if someone has made things simpler for you?

Are you kidding? The entire purpose of the answer is to perform 10-fold cross-validation without having to install the entire caret package. The only good point you make is that people should understand what their code actually does. Young grasshopper, stratified sampling is not always the best approach. For instance, it gives more importance to subgroups with more data, which is not always desirable (especially if you don't know it's happening). It's about using the best approach for your data. Troll with caution, my friend :)

@JakeDrew I realise this is an old post now, but would it be possible to ask for some guidance on how to use the test and train data to get the mean absolute error of a VAR(p) model for each iteration?

@JakeDrew IMHO both answers deserve a +1. One with a package, the other with code ...
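For readers who want the stratification the comments debate without installing caret, the same no-package approach extends naturally: assign fold numbers within each class separately, so every fold preserves the class ratio. A sketch, using a made-up imbalanced data frame (`yourData` and its `class` column are placeholder assumptions):

```r
set.seed(42)

# example data: imbalanced two-class outcome (placeholder for your own data)
yourData <- data.frame(x = rnorm(120),
                       class = rep(c("case", "control"), times = c(40, 80)))

k <- 10
folds <- integer(nrow(yourData))

# assign fold numbers within each class so every fold keeps the class ratio
for (cl in unique(yourData$class)) {
  idx <- which(yourData$class == cl)
  folds[idx] <- sample(rep_len(1:k, length(idx)))
}

# each fold now holds exactly 4 cases and 8 controls
table(folds, yourData$class)
```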

• Dan L

7 years ago

Probably not the best way, but here is one way to do it. I'm pretty sure when I wrote this code I had borrowed a trick from another answer on here, but I couldn't find it to link to.

``````# Generate some test data
x <- runif(100)*10 #Random values between 0 and 10
y <- x + rnorm(100)*.1 #y ~ x + error
dataset <- data.frame(x, y) #Create data frame
plot(dataset$x, dataset$y) #Plot the data

#install.packages("cvTools")
library(cvTools) #run the line above if you don't have this package

k <- 10 #the number of folds

folds <- cvFolds(NROW(dataset), K = k)
dataset$holdoutpred <- rep(0, nrow(dataset))

for(i in 1:k){
  train <- dataset[folds$subsets[folds$which != i], ] #Set the training set
  validation <- dataset[folds$subsets[folds$which == i], ] #Set the validation set

  newlm <- lm(y ~ x, data = train) #Fit a linear model on the training data only
  newpred <- predict(newlm, newdata = validation) #Get the predictions for the validation set

  dataset[folds$subsets[folds$which == i], ]$holdoutpred <- newpred #Store the hold-out predictions for later use
}

dataset$holdoutpred #do whatever you want with these predictions
``````
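With the hold-out predictions stored in the data frame, a cross-validated error estimate falls out directly. A sketch, reusing the `dataset` built by the code above:

```r
# cross-validated mean squared error over all 10 folds
cv_mse <- mean((dataset$y - dataset$holdoutpred)^2)
cv_mse
```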
• Wouter

7 years ago

Please find below some other code that I use (borrowed and adapted from another source). I copied it straight from a script I just used myself, and left in the rpart routine. The part probably of most interest is the lines on the creation of the folds. Alternatively, you can use the crossval function from the bootstrap package.

``````library(rpart)

#define error matrix
err <- matrix(NA, nrow = 1, ncol = 10)
errcv <- err

#creation of folds, with the whole process repeated 10 times
for(c in 1:10){

  n <- nrow(df)
  K <- 10
  sizeblock <- n %/% K
  alea <- runif(n)
  rang <- rank(alea)
  bloc <- (rang - 1) %/% sizeblock + 1
  bloc[bloc == K + 1] <- K
  bloc <- as.factor(bloc)
  print(summary(bloc))

  for(k in 1:10){

    #rpart (resp holds the true class labels)
    fit <- rpart(type ~ ., data = df[bloc != k, ], xval = 0)
    answers <- (predict(fit, df[bloc == k, ], type = "class") == resp[bloc == k])
    err[1, k] <- 1 - (sum(answers) / length(answers))

  }

  errcv[, c] <- rowMeans(err, na.rm = FALSE, dims = 1)

}
errcv
``````
• user1930111

5 years ago
``````# Evaluate a model using k-fold cross-validation
#install.packages("DAAG")
library("DAAG")

cv.lm(data = dat, form.lm = mod1, m = 10, plotit = FALSE)
``````

Everything done for you in one line of code!

See `?cv.lm` for information on inputs and outputs.
• Mr Tsjolder

4 years ago

Because I did not see my approach in this list, I thought I could share another option for people who don't feel like installing packages for a quick cross-validation.

``````# get the data from somewhere and specify the number of folds
data <- read.csv('my_data.csv')
nrFolds <- 10

# generate array containing the fold number for each sample (row)
folds <- rep_len(1:nrFolds, nrow(data))

# actual cross validation
for(k in 1:nrFolds) {
  # actual split of the data
  fold <- which(folds == k)
  data.train <- data[-fold, ]
  data.test <- data[fold, ]

  # train and test your model with data.train and data.test
}
``````

Note that the code above assumes the data is already shuffled. If that is not the case, you could consider adding something like

``````folds <- sample(folds, nrow(data))
``````

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM