How to split a data set to do 10-fold cross validation

  • Now I have an R data frame (training); can anyone tell me how to randomly split this data set to do 10-fold cross-validation?

    Be sure to repeat the entire process 100 times to achieve satisfactory precision.
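    The advice above (repeating the whole procedure to stabilize the estimate) can be sketched in base R. This is an illustrative example, not code from the thread: the synthetic data, the `cv_mse` helper, and the choice of `lm` are all assumptions.

    ```r
    # Sketch: repeat 10-fold CV 100 times and average the error estimates.
    # Synthetic data and lm are illustrative stand-ins for your own model.
    set.seed(1)
    d <- data.frame(x = runif(200))
    d$y <- 2 * d$x + rnorm(200, sd = 0.1)

    cv_mse <- function(d, K = 10) {
      folds <- sample(rep_len(1:K, nrow(d)))  # random fold assignment
      mean(sapply(1:K, function(k) {
        fit <- lm(y ~ x, data = d[folds != k, ])          # fit on 9 folds
        mean((d$y[folds == k] - predict(fit, d[folds == k, ]))^2)  # test on 1
      }))
    }

    estimates <- replicate(100, cv_mse(d))  # 100 repetitions of 10-fold CV
    mean(estimates)                         # the averaged error estimate
    ```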

    Be sure to sample cases and controls separately and then combine them into each block.
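    A base-R sketch of that stratified split: shuffle cases and controls separately, cut each into 10 blocks, then pair the blocks up. The data frame and its `status` column are illustrative assumptions.

    ```r
    # Sketch: stratified folds built from cases and controls separately.
    # 'status' (1 = case, 0 = control) is an assumed, illustrative column.
    set.seed(2)
    df <- data.frame(status = rep(c(1, 0), times = c(40, 160)))

    case_idx <- sample(which(df$status == 1))   # shuffled case row numbers
    ctrl_idx <- sample(which(df$status == 0))   # shuffled control row numbers

    case_blocks <- split(case_idx, rep_len(1:10, length(case_idx)))
    ctrl_blocks <- split(ctrl_idx, rep_len(1:10, length(ctrl_idx)))

    # combine the k-th case block with the k-th control block
    folds <- mapply(c, case_blocks, ctrl_blocks, SIMPLIFY = FALSE)
    # every fold now has the same case/control mix as the full data
    ```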

    If you use caret::train, you don't even need to care about this. It will be done internally; you can choose the number of folds. If you insist on doing this "by hand", use stratified sampling of the class as implemented in caret::createFolds.
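    In code, the comment above looks like this; caret runs the 10-fold split internally. The iris data and the rpart method are just illustrative choices.

    ```r
    # Sketch: let caret::train handle the 10-fold CV itself.
    # iris and method = "rpart" are illustrative stand-ins.
    library(caret)
    set.seed(3)
    ctrl <- trainControl(method = "cv", number = 10)  # 10-fold CV
    fit <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)
    fit$resample  # one accuracy/kappa row per fold
    ```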

    I have locked this thread because every one of the many answers is treating it as only a coding question rather than one of general statistical interest.

  • caret has a function for this:

    library(caret)
    flds <- createFolds(y, k = 10, list = TRUE, returnTrain = FALSE)
    names(flds)[1] <- "train"

    Then each element of flds is a vector of row indices for one fold. If your dataset is called dat, then dat[flds$train,] gets you the training set, dat[flds[[2]],] gets you the second fold set, etc.
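    A short usage sketch for the answer above, with iris standing in for dat:

    ```r
    # Sketch: iterate over the folds returned by caret::createFolds.
    # iris is an illustrative stand-in for your own data frame.
    library(caret)
    set.seed(4)
    flds <- createFolds(iris$Species, k = 10, list = TRUE, returnTrain = FALSE)

    for (f in flds) {
      test  <- iris[f, ]   # the held-out fold
      train <- iris[-f, ]  # the remaining nine folds
      # fit and evaluate a model here
    }

    sapply(flds, length)  # the folds are roughly equal in size
    ```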

  • Here is a simple way to perform 10-fold cross-validation using no packages:

    #Randomly shuffle the data
    yourData <- yourData[sample(nrow(yourData)),]
    #Create 10 equally sized folds
    folds <- cut(seq(1,nrow(yourData)),breaks=10,labels=FALSE)
    #Perform 10 fold cross validation
    for(i in 1:10){
        #Segment your data by fold using the which() function
        testIndexes <- which(folds==i,arr.ind=TRUE)
        testData <- yourData[testIndexes, ]
        trainData <- yourData[-testIndexes, ]
        #Use the test and train data partitions however you desire...
    }

    -1: caret functions do stratified sampling, which you are not doing. What's the point of reinventing the wheel if someone has made things simpler for you?

    Are you kidding? The entire purpose of the answer is to perform 10-fold without having to install the entire caret package. The only good point you make is that people should understand what their code actually does. Young grasshopper, stratified sampling is not always the best approach. For instance, it gives more importance to subgroups with more data which is not always desirable. (Esp if you don't know it's happening). It's about using the best approach for your data. Troll with caution my friend :)

    @JakeDrew I realise this is an old post now, but would it be possible to ask for some guidance on how to use the test and train data to get the mean average error of a VAR(p) model for each iteration?

    @JakeDrew imho both answers deserve a plus 1. One with a package, the other with code ...
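    On the MAE question in the comments: the per-fold logic is the same regardless of the model, so here is a sketch with lm standing in for the VAR(p) model (the synthetic data and lm are illustrative assumptions, not from the thread).

    ```r
    # Sketch: per-fold mean absolute error with the no-packages split above.
    # Synthetic data and lm are illustrative stand-ins for a VAR(p) model.
    set.seed(5)
    yourData <- data.frame(x = runif(100))
    yourData$y <- yourData$x + rnorm(100, sd = 0.1)
    yourData <- yourData[sample(nrow(yourData)), ]  # shuffle first

    folds <- cut(seq(1, nrow(yourData)), breaks = 10, labels = FALSE)
    mae <- numeric(10)
    for (i in 1:10) {
      testIndexes <- which(folds == i)
      testData  <- yourData[testIndexes, ]
      trainData <- yourData[-testIndexes, ]
      fit <- lm(y ~ x, data = trainData)                  # fit on 9 folds
      mae[i] <- mean(abs(testData$y - predict(fit, testData)))
    }
    mean(mae)  # averaged mean absolute error across the 10 folds
    ```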

  • Probably not the best way, but here is one way to do it. I'm pretty sure when I wrote this code I had borrowed a trick from another answer on here, but I couldn't find it to link to.

    # Generate some test data
    x <- runif(100)*10 #Random values between 0 and 10
    y <- x+rnorm(100)*.1 #y~x+error
    dataset <- data.frame(x,y) #Create data frame
    plot(dataset$x,dataset$y) #Plot the data
    #install.packages("cvTools") #run this line first if you don't have the package
    library(cvTools)
    k <- 10 #the number of folds
    folds <- cvFolds(NROW(dataset), K=k)
    dataset$holdoutpred <- rep(0,nrow(dataset))
    for(i in 1:k){
      train <- dataset[folds$subsets[folds$which != i], ] #Set the training set
      validation <- dataset[folds$subsets[folds$which == i], ] #Set the validation set
      newlm <- lm(y~x,data=train) #Get your new linear model (just fit on the train data)
      newpred <- predict(newlm,newdata=validation) #Get the predictions for the validation set (from the model just fit on the train data)
      dataset[folds$subsets[folds$which == i], ]$holdoutpred <- newpred #Put the hold out prediction in the data set for later use
    }
    dataset$holdoutpred #do whatever you want with these predictions
  • Please find below some other code that I use (borrowed and adapted from another source). I copied it straight from a script I just used myself and left in the rpart routine. The part probably of most interest is the creation of the folds. Alternatively, you can use the crossval function from the bootstrap package.

    library(rpart)
    #define error matrices
    err <- matrix(NA,nrow=1,ncol=10)
    errcv <- matrix(NA,nrow=1,ncol=10)
    for(c in 1:10){
      #creation of folds: random ranks cut into K equal blocks
      n <- nrow(df); K <- 10; sizeblock <- n %/% K
      alea <- runif(n); rang <- rank(alea)
      bloc <- (rang-1) %/% sizeblock + 1
      bloc[bloc==K+1] <- K
      bloc <- as.factor(bloc)
      print(summary(bloc))
      for(k in 1:K){
        fit <- rpart(type~., data=df[bloc!=k,], xval=0)
        #misclassification rate on the held-out block
        err[,k] <- mean(predict(fit, df[bloc==k,], type="class") != df$type[bloc==k])
      }
      errcv[,c] <- rowMeans(err, na.rm = FALSE, dims = 1)
    }
  • # Evaluate a model using k-fold cross-validation (cv.lm is from the DAAG package)
    library(DAAG)
    cv.lm(data=dat, form.lm=mod1, m=10, plotit=FALSE)

    Everything done for you in one line of code!

    ?cv.lm for information on input and output
  • Because I did not see my approach in this list, I thought I could share another option for people who don't feel like installing packages for a quick cross-validation:

    # get the data from somewhere and specify number of folds
    data <- read.csv('my_data.csv')
    nrFolds <- 10
    # generate array containing fold-number for each sample (row)
    folds <- rep_len(1:nrFolds, nrow(data))
    # actual cross validation
    for(k in 1:nrFolds) {
        # actual split of the data
        fold <- which(folds == k)
        data.train <- data[-fold,]
        data.test <- data[fold,]
        # train and test your model with data.train and data.test
    }

    Note that the code above assumes that the data is already shuffled. If that is not the case, you could consider adding something like

    folds <- sample(folds, nrow(data))

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM