What is the difference between test set and validation set?

  • I found this confusing when I used the neural network toolbox in MATLAB.
    It divided the raw data set into three parts:

    1. training set
    2. validation set
    3. test set

    I notice that in many training or learning algorithms, the data is often divided into two parts: the training set and the test set.

    My questions are:

    1. What is the difference between the validation set and the test set?
    2. Is the validation set really specific to neural networks, or is it optional?
    3. To go further, is there a difference between validation and testing in the context of machine learning?

    The question is answered in the book *The Elements of Statistical Learning*, page 222. The validation set is used for model selection, the test set for estimating the prediction error of the final model (the model chosen by the selection process).

    @mpiktas Are you referring to the chapter "Model Assessment and Selection"?

    Yes. The page number is from the 5th print edition.

    @mpiktas is spot on. Here is the actual text: `The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model. Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis.`

    The book *The Elements of Statistical Learning* is now available at: https://web.stanford.edu/~hastie/Papers/ESLII.pdf

    @mpiktas There is some logic that I am missing: if the validation set is used for model selection, i.e., choosing the model that has the best performance on the validation set (rather than the model that has the best performance on the training set), then isn't it just another kind of overfitting, i.e., overfitting on the validation set? Then how can we expect that the model with the best performance on the validation set will also have the best performance on the test set among all the models being compared? If the answer is no, then what's the point of the validation set?

    I recommend video 5 of week 1 of the third course in Andrew Ng's Deep Learning specialization.

    The updated page number @mpiktas referenced in the 12th edition is still page 222 of the book itself, or page 241 of the PDF: `If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set...`

  • Typically to perform supervised learning, you need two types of data sets:

    1. In one dataset (your "gold standard"), you have the input data together with the correct/expected output. This dataset is usually carefully prepared, either by humans or by collecting data in a semi-automated way. You must have the expected output for every data row here, because you need it for supervised learning.

    2. The data you are going to apply your model to. In many cases, this is the data for which you are interested in your model's output, and thus you don't have any "expected" output here yet.

    While performing machine learning, you do the following (a code sketch of these phases follows the list):

    1. Training phase: you present your data from your "gold standard" and train your model, by pairing the input with the expected output.
    2. Validation/test phase: you estimate how well your model has been trained (this depends on the size of your data, the value you would like to predict, the input, etc.) and estimate model properties (mean error for numeric predictors, classification error for classifiers, recall and precision for IR models, etc.).
    3. Application phase: now, you apply your freshly-developed model to the real-world data and get the results. Since you usually don't have any reference value in this type of data (otherwise, why would you need your model?), you can only speculate about the quality of your model output using the results of your validation phase.
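
    A rough sketch of these three phases in code (assuming scikit-learn; the synthetic dataset, logistic regression model, and 70/30 split below are arbitrary illustration choices):

    ```python
    # Sketch of the three phases, assuming scikit-learn and a synthetic "gold standard".
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # "Gold standard" data: inputs X together with the expected outputs y.
    X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
    X_train, X_heldout, y_train, y_heldout = train_test_split(X, y, test_size=0.3, random_state=42)

    # 1. Training phase: fit the model on input / expected-output pairs.
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # 2. Validation/test phase: estimate model properties on held-out labelled data.
    print("held-out accuracy:", model.score(X_heldout, y_heldout))

    # 3. Application phase: predict on new data with no expected output available.
    X_new = np.random.default_rng(0).normal(size=(5, 10))
    print("predictions for new data:", model.predict(X_new))
    ```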

    The validation phase is often split into two parts:

    1. In the first part, you just look at your models and select the best-performing approach using the validation data (= validation).
    2. Then you estimate the accuracy of the selected approach (=test).

    Hence the 50/25/25 split.

    If you don't need to choose an appropriate model from several rival approaches, you can just re-partition your set so that you basically have only a training set and a test set, without performing validation of your trained model. I personally partition them 70/30 in that case.
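
    For example, a minimal sketch of this workflow under the 50/25/25 split (assuming scikit-learn, synthetic data, and two arbitrary rival approaches):

    ```python
    # Sketch of the two-part validation phase with a 50/25/25 split (assumed setup:
    # scikit-learn, synthetic data, and two arbitrary rival approaches).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # 50% training, 25% validation, 25% test.
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

    # Part 1 (validation): train the rival approaches and keep the one that does best
    # on the validation set.
    rivals = {
        "logistic regression": LogisticRegression(max_iter=1000).fit(X_train, y_train),
        "5-nearest neighbours": KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train),
    }
    best = max(rivals, key=lambda name: rivals[name].score(X_val, y_val))

    # Part 2 (test): estimate the accuracy of the selected approach on the untouched test set.
    print(best, "test accuracy:", rivals[best].score(X_test, y_test))
    ```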

    See also this question.

    Why wouldn't I choose the best-performing model based on the test set, getting rid of the validation set altogether?

    Is it because of overfitting? Or because we want some independent statistics based on the test result, just for error estimation?

    @Sebastian [If you only use the test set:] "The test set error of the final chosen model will underestimate the true test error, sometimes significantly" [Hastie et al.]

    The validation set is often used to tune hyper-parameters. For example, in the deep learning community, tuning the network layer size, the number of hidden units, and the regularization term (whether L1 or L2) depends on the validation set.
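
    For instance, a minimal sketch of validation-based tuning, with the choice between an L1 and an L2 penalty in logistic regression standing in for the hyper-parameter (scikit-learn and the synthetic data are illustration assumptions):

    ```python
    # Sketch of tuning one hyper-parameter (L1 vs. L2 regularization) on a validation set.
    # Assumed setup: scikit-learn logistic regression and synthetic data.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=2000, n_features=30, random_state=1)
    X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=1)
    X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

    # One model per hyper-parameter setting; the validation score decides which to keep.
    best_penalty, best_score, best_model = None, -1.0, None
    for penalty in ("l1", "l2"):
        model = LogisticRegression(penalty=penalty, solver="liblinear").fit(X_train, y_train)
        score = model.score(X_val, y_val)
        if score > best_score:
            best_penalty, best_score, best_model = penalty, score, model

    # The test set is only used once, to assess the finally chosen setting.
    print("chosen penalty:", best_penalty, "| test accuracy:", best_model.score(X_test, y_test))
    ```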

    What is the correct way to split the sets? Should the selection be random? What if you have pictures that are similar? Won't this damage your ability to generalize? If you have two sets taken in separate locations, wouldn't it be better to take one as the training set and the other as the test set?

    @user695652 I see you quote The Elements of Statistical Learning. But I don't understand intuitively why this is true. When I train my model on the training data set, I did not use any data in the test data set. Also, if I didn't do any feature engineering, i.e., I just use the original set of features in my data set, then there shouldn't be any information leakage. So in this case, why do I still need the validation set? Why, if I just use the test set, will it underestimate the true test error?

    Is it like validation is testing against the known, and 'testing' is against the unknown?

    @YonatanSimson Models don't usually generalize well enough that you could train in only one location and have it work well in the other one, so the only reason you would do that is if you don't care about your model working as well as possible, but *do* care about testing how well your model generalizes. When your test set comes from the same distribution as the training set, it still tells you how much you overfit because the data isn't exactly the same, and overfitting is about working only on the exact data in your training set.

    @KevinKim user695652 is saying that you will underestimate the true test error if you use the test set to train hyperparameters (size of model, feature selection, etc) instead of using a validation set for that. If you're saying that you don't train any hyperparameters, then you also don't need a validation data set.

    Is it possible to use the validation set for testing?

    @alltom I see. But there is still some logic that I am missing: if the validation set is used for model selection, i.e., choosing the model that has the best performance on the validation set (rather than the model that has the best performance on the training set), then isn't it just another kind of overfitting, i.e., overfitting on the validation set? Then how can we expect that the model with the best performance on the validation set will also have the best performance on the test set among all the models I am comparing? If the answer is no, then what's the point of the validation set?

    @KevinKim You train a model with examples from the training set, then evaluate the model with examples from the validation set—which it has never seen—to choose the model that generalizes the best. The model that does the best on the validation data with no additional training is most likely to do the best on other data sets (such as the test set), so long as they're all drawn from the same distribution.
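
    A toy simulation of this point, using purely illustrative random-guesser "models": picking the best of many models on one data set makes the winner's score on that set optimistically biased, so the final estimate needs an untouched test set.

    ```python
    # Toy simulation (illustrative only): fifty "models" that all guess labels at random,
    # so each has a true accuracy of exactly 50%.
    import numpy as np

    rng = np.random.default_rng(0)
    n_models, n_points = 50, 100
    labels = rng.integers(0, 2, size=n_points)
    guesses = rng.integers(0, 2, size=(n_models, n_points))

    # Select the "model" that happens to score best on this set.
    scores = (guesses == labels).mean(axis=1)
    print("score of the winner on the set used for selection:", scores.max())  # typically ~0.6

    # Because every model is just a random guesser, the winner's accuracy on fresh data
    # falls back to ~0.5: the selection-set score was optimistically biased, which is why
    # an untouched test set is needed for the final, honest estimate.
    fresh_labels = rng.integers(0, 2, size=n_points)
    fresh_guesses = rng.integers(0, 2, size=n_points)
    print("score of the winner on fresh data:", (fresh_guesses == fresh_labels).mean())
    ```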

    This answer is not precise (and possibly not even correct) about the use of the `validation` set: see the answer below.
