### What are good RMSE values?

Suppose I have a dataset and fit a regression to it. I then evaluate the fitted model on a separate test dataset and compute the RMSE on the test data. How should I conclude that my learning algorithm has done well? That is, what properties of the data should I look at to conclude that the RMSE I obtained is good for the data?

I think you have two different types of questions there. One thing is what you ask in the title: "What are good RMSE values?" and another thing is how to compare models with different datasets using RMSE.

For the first, i.e., the question in the title, it is important to recall that RMSE has the same unit as the dependent variable (DV). This means there is no absolute threshold for good or bad; however, you can define one based on your DV. For a DV that ranges from 0 to 1000, an RMSE of 0.7 is small, but if the range goes from 0 to 1, it is not that small anymore. Although a smaller RMSE is always better, you can make theoretical claims about acceptable levels of RMSE by knowing what is expected from your DV in your field of research. Keep in mind that you can always normalize the RMSE.

For the second question, i.e., about comparing two models with different datasets by using RMSE, you may do that provided that the DV is the same in both models. Here, the smaller the better, but remember that small differences between those RMSE values may not be relevant or even statistically significant.

What do you mean that you can always normalize RMSE? I see your point about DV range and RMSE. But can we quantify in terms of standard deviation and mean of DV in any way?

Normalizing the RMSE (the NRMSE) may be useful to make RMSE scale-free. For instance, by transforming it into a percentage: `RMSE / (max(DV) - min(DV))`

That normalisation doesn't really produce a percentage (e.g. 1 doesn't mean anything in particular), and it isn't any more or less valid than any other form of normalisation. It depends on the distribution of that data. To me, it would make more sense to normalise by the RMSE of the mean, as this would be like saying "what improvement do I get over the dumbest model I can think of"?
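A minimal sketch of the two normalisations discussed above: dividing by the range of the DV, and dividing by the RMSE of a model that always predicts the mean. The function names and the toy numbers are hypothetical, and using the training mean as the "dumbest model" is an assumption about what the comment intends.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def nrmse_range(y_true, y_pred):
    """RMSE divided by the observed range of the dependent variable."""
    return rmse(y_true, y_pred) / (max(y_true) - min(y_true))

def rmse_vs_mean_baseline(y_train, y_test, y_pred):
    """Model RMSE divided by the RMSE of always predicting the training mean.

    Values below 1 mean the model beats the mean-only baseline.
    """
    baseline = sum(y_train) / len(y_train)
    baseline_rmse = rmse(y_test, [baseline] * len(y_test))
    return rmse(y_test, y_pred) / baseline_rmse

# Hypothetical toy data, for illustration only
y_train = [2.0, 4.0, 6.0, 8.0]
y_test = [3.0, 5.0, 7.0]
y_pred = [3.5, 5.0, 6.5]

print("RMSE / range:        ", nrmse_range(y_test, y_pred))
print("RMSE / baseline RMSE:", rmse_vs_mean_baseline(y_train, y_test, y_pred))
```

The baseline ratio has a direct reading (a value near 1 means no improvement over predicting the mean), whereas the range-normalised value has no natural reference point.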

DV means the same thing as Y?

@HammanSamuel DV means dependent variable, which could be even better called response variable. A dependent variable can have any name or notation you want. If you call your dependent variable `Y`, then yes, DV means the same thing as `Y`. If you call your dependent variable `FluffyCats`, then no, DV does not mean the same thing as `Y`.

The RMSE for your training and your test sets should be very similar if you have built a good model. If the RMSE for the test set is much higher than that of the training set, it is likely that you've badly overfit the data, i.e. you've created a model that tests well in sample, but has little predictive value when tested out of sample.
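As a rough illustration of this check, the sketch below fits a straight line by ordinary least squares on a training set and compares the RMSE in sample and out of sample. The data, the helper names, and the choice of a simple linear model are all assumptions made for the example; how large a test/train ratio counts as "much higher" is a judgment call.

```python
import math

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

# Hypothetical data: a training sample and a held-out test sample
x_train = [1.0, 2.0, 3.0, 4.0, 5.0]
y_train = [2.1, 3.9, 6.2, 7.8, 10.1]
x_test = [6.0, 7.0, 8.0]
y_test = [12.2, 13.8, 16.1]

a, b = fit_line(x_train, y_train)
train_rmse = rmse(y_train, [a + b * xi for xi in x_train])
test_rmse = rmse(y_test, [a + b * xi for xi in x_test])

print("train RMSE:", train_rmse)
print("test RMSE: ", test_rmse)
print("test / train ratio:", test_rmse / train_rmse)
```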

It is possible that RMSE values for both training and testing are similar but bad (in some sense). So how to figure out based on data properties if the RMSE values really imply that our algorithm has learned something?

Sure, they can be similar but both bad. You're always trying to minimize the error when building a model. Just because you haven't overfit doesn't mean you've built a good model, just that you've built one that performs consistently on new data. Try using a different combination of predictors or different interaction terms or quadratics. If your RMSE drops considerably and tests well out of sample, then the old model was worse than the new one. It's certainly not an exact science.

If you know that your model is not over/underfitting, but aren't sure if your model's RMSE is decent, what metric do you use to determine this? Compare the RMSE to standard deviation/variance of the target variable?

Even though this is an old thread, I am hoping that my answer helps anyone who is looking for an answer to the same question.

When we talk about time series analysis, most of the time we mean the study of ARIMA models (and their variants). Hence I will start by assuming the same in my answer.

First of all, as the earlier commenter R. Astur explains, there is no such thing as a good RMSE, because it is scale-dependent, i.e. it depends on the scale of your dependent variable. Hence one cannot claim a universal number as a good RMSE.

Even if you go for scale-free measures of fit such as MAPE or MASE, you still cannot claim a threshold of being good. This is just a wrong approach. You can't say "My MAPE is such and such, hence my fit/forecast is good".

How I believe you should approach your problem is as follows. First find a couple of "best possible" models, using a logic such as looping over the arima() function outputs in R, and select the best n estimated models based on the lowest RMSE or MAPE or MASE. Since we are talking about one specific series, and not trying to make a universal claim, you can pick any of these measures. Of course you have to do the residual diagnostics, and make sure your best models produce white noise (WN) residuals with well-behaved ACF plots. Now that you have a few good candidates, test the out-of-sample MAPE of each model, and pick the one with the best out-of-sample MAPE.

The resulting model is the best model, in the sense that it:

- Gives you a good in-sample fit, associated with low error measures and WN residuals.
- And avoids overfitting by giving you the best out-of-sample forecast accuracy.
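The out-of-sample selection step can be sketched as follows. This is only an illustration under stated assumptions: the three candidate forecasters (naive, mean, drift) are simple hypothetical stand-ins for fitted ARIMA models, the series is made up, and the residual diagnostics mentioned above are omitted entirely.

```python
def mape(actual, forecast):
    """Mean absolute percentage error in percent (undefined if an actual value is 0)."""
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def naive_forecast(history, horizon):
    """Repeat the last observed value."""
    return [history[-1]] * horizon

def mean_forecast(history, horizon):
    """Repeat the mean of the history."""
    m = sum(history) / len(history)
    return [m] * horizon

def drift_forecast(history, horizon):
    """Extend the straight line through the first and last observations."""
    slope = (history[-1] - history[0]) / (len(history) - 1)
    return [history[-1] + slope * h for h in range(1, horizon + 1)]

# Hypothetical series, split into an estimation part and a hold-out part
series = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0, 19.0, 21.0, 22.0, 24.0]
train, test = series[:7], series[7:]

# Stand-ins for the shortlisted candidate models
candidates = {
    "naive": naive_forecast,
    "mean": mean_forecast,
    "drift": drift_forecast,
}

# Score each candidate on the hold-out data only, then keep the lowest MAPE
scores = {name: mape(test, f(train, len(test))) for name, f in candidates.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: out-of-sample MAPE = {score:.2f}%")
print("selected:", best)
```

With real ARIMA candidates the loop has the same shape: fit each candidate on the estimation part, forecast the hold-out horizon, and compare the hold-out error.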

Now, one crucial point is that it is possible to estimate a time series with an ARIMA (or its variants) by including enough lags of the dependent variable or the residual term. However, that fitted "best" model may just over-fit, and give you a dramatically low out-of-sample accuracy, i.e. satisfy my bullet point 1 but not 2.

In that case what you need to do is:

- Add an exogenous explanatory variable and go for ARIMAX,
- Add an endogenous explanatory variable and go for VAR/VECM,
- Or change your approach completely to non-linear machine learning models, and fit them to your time series using a Cross-Validation approach. Fit a neural network or random forest to your time series, for example. And repeat the in-sample and out-of-sample performance comparison. This is a trending approach to time series, and the papers I've seen are applauding the machine learning models for their superior (out-of-sample) forecasting performance.

Hope this helps.

You can't fix a particular threshold value for RMSE. Instead, compare the RMSE on the test and training datasets. If your model is good, the RMSE on the test data will be quite similar to that on the training data. Otherwise, one of the conditions below holds.

RMSE of test > RMSE of train => OVERFITTING of the data.

RMSE of test < RMSE of train => UNDERFITTING of the data.

Personally I like the RMSE / standard deviation approach. Range is misleading: you could have a skewed distribution or outliers, whereas standard deviation takes care of this. Similarly, RMSE / mean is totally wrong: what if your mean is zero?

However, this does not help to tell you whether you have a good model or not. This challenge is similar to working with binary classifications and asking "is my Gini of 80% good". That depends. Maybe by doing some additional tuning or feature engineering, you could have built a better model that gave you a Gini of 90% (and still validates against the test sample). It also depends on the use case and industry. If you were developing a behaviour credit score, then a Gini of 80% is "pretty good". But if you are developing a new application credit score (which inherently has access to less data), then a Gini of 60% is pretty good.

I guess when it comes to whether your model's RMSE / std dev "score" is good or not, you need to develop your own intuition by applying this and learning from many different use cases.

Welcome to CV. Do you explicitly mean RMSE divided by standard deviation? If so, formatting it by enclosing it in dollar signs will make that clear, e.g. $RMSE/SD$. The reason I ask is that $RMSE/SD$ is a transformed correlation coefficient, and it would be useful to expand on the implications of this in more detail.

Thanks @ReneBt. Yes it is $RMSE/SD$ that I am referring to. So this is a variant of the adjusted R-squared coefficient. Huh. R-squared is also a great way to get some intuition on the skill of a model with a linear target (where 1 = perfect, 0 = random, much like a Gini coefficient for binary classification use cases). No one has mentioned this as an approach yet?
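For reference, a sketch of how $RMSE/SD$ relates to the ordinary (unadjusted) $R^2$. It assumes that $SD$ is the standard deviation of the observed values computed with divisor $n$ on the same $n$ observations used for the RMSE; the symbols $y_i$ (observed), $\hat{y}_i$ (predicted) and $\bar{y}$ (mean of the observed values) are introduced here for illustration.

$$
R^2
= 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
= 1 - \frac{\frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2}
= 1 - \left(\frac{RMSE}{SD}\right)^2
$$

Under that assumption, $RMSE/SD = 1$ corresponds to doing no better than always predicting $\bar{y}$, and values below 1 indicate an improvement over that baseline. If $SD$ uses the $n-1$ divisor instead, the relation holds only approximately.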

Licensed under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

Shishir Pandey one year ago

I asked this question 6 years ago, so the new question (asked 2 months ago) should be marked as duplicate.