### AIC guidelines in model selection

I typically use BIC as my understanding is that it values parsimony more strongly than does AIC. However, I have decided to use a more comprehensive approach now and would like to use AIC as well. I know that Raftery (1995) presented nice guidelines for BIC differences: 0-2 is weak, 2-4 is positive evidence for one model being better, etc.

I looked in textbooks and they seem strange on AIC (it looks like a larger difference is weak and a smaller difference in AIC means one model is better). This goes against what I have been taught: my understanding is that you want the lower AIC.

Does anyone know if Raftery's guidelines extend to AIC as well, or where I might cite some guidelines for "strength of evidence" for one model vs. another?

And yes, cutoffs are not great (I kind of find them irritating) but they are helpful when comparing different kinds of evidence.

Readers here may be interested to read the following excellent CV thread: Is there any reason to prefer the AIC or BIC over the other?

Which textbooks are you referring to when you say "*I looked in textbooks and they seem strange on AIC (it looks like a larger difference is weak and a smaller difference in AIC means one model is better)*" --- and what do they actually say?

Your second para is unclear. You probably mean this: "While large differences suggest that the model with the smaller values is preferable, smaller differences are difficult to evaluate. Moreover, statisticians are yet to agree on what differences are 'small' or 'large'" (Singer and Willett, 2003, p. 122).

As to your third para, if you want to adopt the categories of evidential strength advanced by Jeffreys (1961, p. 432) I can give you the full reference.

Yes, that is the paper. Page 139. I often see these guidelines paraded around for both AIC and BIC (although I could only ever find a citation for BIC). As far as textbooks go, I distinctly recall reading a textbook that stated that larger AIC values were preferred, while smaller BIC values were preferred. I thought this was bizarre, given the way it is calculated. I will have to pull out the citation for that later.

Also, I would gladly take Jeffreys categories of evidential strength! Please, cite away.

dmartin Correct answer

AIC and BIC carry the same interpretation in terms of model comparison. That is, the larger the difference in either AIC or BIC, the stronger the evidence for one model over the other (the lower the better). It's just that AIC doesn't penalize the number of parameters as strongly as BIC does. There is also a correction to the AIC (the AICc) that is used for smaller sample sizes. More information on the comparison of AIC/BIC can be found here.
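To make the penalty difference concrete, here is a minimal sketch of the three criteria mentioned above, using made-up log-likelihood, parameter-count, and sample-size values (none of these numbers come from the thread):

```python
import math

def aic(log_lik, k):
    # AIC = 2k - 2 ln L: penalty grows with the number of parameters k
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    # BIC = k ln(n) - 2 ln L: penalty also grows with the sample size n,
    # which is why BIC values parsimony more strongly than AIC for n >= 8
    return k * math.log(n) - 2 * log_lik

def aicc(log_lik, k, n):
    # Small-sample correction to AIC; converges to plain AIC as n grows
    return aic(log_lik, k) + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical fit: log-likelihood -120.0, 3 parameters, 50 observations
print(aic(-120.0, 3), bic(-120.0, 3, 50), aicc(-120.0, 3, 50))
```

With these numbers the BIC penalty (3 ln 50 ≈ 11.7) already exceeds the AIC penalty (6), illustrating the point about parsimony.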

+1. Just to add/clarify: AIC (and AICc) is based on the KL divergence. Precisely because AIC reflects "additional" information lost, the smaller it is, the better. In other words, as our sample size $N \rightarrow \infty$, the model with the minimum AIC score will possess the smallest Kullback-Leibler divergence and will therefore be the model closest to the "true" model.

You are talking about two different things and you are mixing them up. In the first case you have two models (1 and 2) and you obtain their AICs, $AIC_1$ and $AIC_2$. If you want to compare these two models based on their AICs, then the model with the lower AIC is the preferred one, i.e. if $AIC_1 < AIC_2$ then you pick model 1, and vice versa.

In the second case, you have a set of candidate models $(1, 2, \ldots, n)$ and for each model you calculate the AIC difference $\Delta_i = AIC_i - AIC_{min}$, where $AIC_i$ is the AIC for the $i$th model and $AIC_{min}$ is the minimum AIC among all the models. Models with $\Delta_i > 10$ have essentially no support and can be omitted from further consideration, as explained in Model Selection and Multi-Model Inference: A Practical Information-Theoretic Approach by Kenneth P. Burnham and David R. Anderson, page 71. So the larger the $\Delta_i$, the weaker the support for that model. Here the best model has $\Delta_i \equiv \Delta_{min} \equiv 0$.

Aha! This totally cleared up the "larger than" bit. Thanks!
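The $\Delta_i$ calculation above can be sketched in a few lines. The AIC values below are hypothetical, and the category labels follow Burnham & Anderson's rough guidelines on the differences (the guidelines leave gaps between the bands, so values falling in a gap are labeled "borderline" here, which is this sketch's own convention):

```python
# Hypothetical AIC values for four candidate models (made-up numbers)
aics = {"m1": 310.2, "m2": 312.9, "m3": 318.0, "m4": 325.4}

aic_min = min(aics.values())                        # AIC_min over all models
deltas = {m: a - aic_min for m, a in aics.items()}  # Delta_i = AIC_i - AIC_min

def support(delta):
    # Burnham & Anderson (2002): the differences matter, not absolute AICs.
    # Delta <= 2: substantial support; 4-7: considerably less; > 10: none.
    if delta <= 2:
        return "substantial"
    if 4 <= delta <= 7:
        return "considerably less"
    if delta > 10:
        return "essentially none"
    return "borderline"

labels = {m: support(d) for m, d in deltas.items()}
```

Note that the best model always has $\Delta = 0$ by construction, which is why a "larger $\Delta$" means weaker support even though lower AIC is better.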

I generally never use AIC or BIC objectively to describe adequate fit for a model. I *do* use these ICs to compare the relative fit of two predictive models. As far as whether an AIC difference of "2" or "4" matters, it's completely contextual. If you want to get a sense of how a "good" model fits, you can (and should) always use a simulation. Your understanding of the AIC is right: AIC receives a *positive* contribution from the number of parameters and a *negative* contribution from the likelihood. What you're trying to do is maximize the likelihood without loading up your model with a bunch of parameters. So, my bubble-bursting opinion is that cut-offs for AIC are no good out of context.

What if your model doesn't allow any simulation?

Tut-tut! How is that even possible? One can bootstrap the world.

Good luck with that ... simulate the world lol

@Stat I'm very serious when I say that I can't conceive of a situation in which it would be impossible to simulate data from a model. At the very least, bootstrapping from the training dataset qualifies as a valid simulation approach.
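As a concrete illustration of "simulating data from a model", here is a minimal parametric-bootstrap sketch. Everything in it is made up for illustration: the data are drawn from a normal model, the fitted model is just the sample mean and standard deviation, and the checked statistic is a crude interquartile range:

```python
import random
import statistics

random.seed(0)

# Hypothetical observed data; the "fitted model" is a normal distribution
# with the sample mean and standard deviation as its parameters
observed = [random.gauss(5.0, 2.0) for _ in range(100)]
mu, sd = statistics.mean(observed), statistics.stdev(observed)

def iqr(xs):
    # crude 25th-to-75th percentile spread for a sample of size 100
    s = sorted(xs)
    return s[74] - s[24]

# Parametric bootstrap: simulate datasets from the fitted model and compare
# the observed statistic against the simulated distribution of that statistic
sim_stats = [iqr([random.gauss(mu, sd) for _ in range(100)]) for _ in range(500)]

# Proportion of simulated IQRs at least as large as the observed one;
# values very near 0 or 1 would flag a misfit in the spread of the data
p = sum(s >= iqr(observed) for s in sim_stats) / len(sim_stats)
```

Bootstrapping from the training data, as suggested above, would replace the `random.gauss(mu, sd)` draws with resampling from `observed`.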

When bootstrapping is hard, cross-validation or even simple jackknifing should work. Also, model averaging provides a means for reconciling information from models with similar AICs.
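One common way to do such model averaging is via Akaike weights, $w_i \propto \exp(-\Delta_i/2)$. The AIC values and point predictions below are hypothetical, chosen only to show the mechanics:

```python
import math

# Hypothetical AICs and point predictions from two competing models
aics = [200.0, 201.5]
preds = [3.1, 3.6]

# Akaike weights: w_i proportional to exp(-Delta_i / 2), normalized to sum to 1
aic_min = min(aics)
raw = [math.exp(-(a - aic_min) / 2) for a in aics]
weights = [r / sum(raw) for r in raw]

# Model-averaged prediction: weighted combination of the two predictions,
# reconciling models whose AICs are too close to pick a clear winner
avg_pred = sum(w * p for w, p in zip(weights, preds))
```

With a difference of only 1.5 AIC units, the better model gets roughly two-thirds of the weight rather than all of it, which is the point of averaging over similar models.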

Here is a related question: When is it appropriate to select models by minimising the AIC? It gives you a general idea of what people of some standing in the academic world consider appropriate to write, and which references are important to leave in.

Generally, it is the differences between the likelihoods or AICs that matter, not their absolute values. You have missed the important word "difference" in your "BIC: 0-2 is weak" in the question (check Raftery's Table 6), and it's strange that no one wants to correct that.

I myself have been taught to look for the MAICE (Minimum AIC Estimate, as Akaike called it). So what? Here is what one famous person wrote to an unknown lady:

> Dear Miss -- I have read about sixteen pages of your manuscript ... I suffered exactly the same treatment at the hands of my teachers who disliked me for my independence and passed over me when they wanted assistants ... keep your manuscript for your sons and daughters, in order that they may derive consolation from it and not give a damn for what their teachers tell them or think of them. ... There is too much education altogether.

My teachers never heard of papers with titles like "A test whether two AIC's differ significantly", and I can't remember them ever calling AIC a statistic that would have a sampling distribution and other properties. I was taught that AIC is a criterion to be minimized, if possible in some automatic fashion.

Yet another important issue, which I think was expressed here a few years ago by IrishStat (from memory, so apologies if I am wrong, as I failed to find that answer), is that AIC, BIC and other criteria were derived for different purposes and under different conditions (assumptions), so you often can't use them interchangeably if your purpose is, say, forecasting. You can't just prefer something inappropriate.

My sources show that I quoted Burnham and Anderson (2002, p. 70) to write that a delta (AIC difference) within 0-2 has substantial support, a delta within 4-7 has considerably less support, and a delta greater than 10 has essentially no support. I also wrote that "the authors also discussed conditions under which these guidelines may be useful". The book is cited in the answer by Stat, which I upvoted as most relevant.

With regard to information criteria, here is what SAS says:

"Note that information criteria such as Akaike's (AIC), Schwarz's (SC, BIC), and QIC can be used to compare competing nonnested models, but do not provide a test of the comparison. Consequently, they cannot indicate whether one model is significantly better than another. The GENMOD, LOGISTIC, GLIMMIX, MIXED, and other procedures provide information criteria measures."

There are two comparative model testing procedures: (a) the Vuong test and (b) the non-parametric Clarke test. See this paper for details.

I find the mathematical notation employed in the cited "paper" (i.e. a presentation) incomprehensible without comments. In particular, what does the line of dashes symbolize? Implication?

No, I do not agree with this answer. AIC is an information criterion that "averages" over all possible scenarios, so by definition it is a real-valued number. The magnitude matters. This is different from the significance of tests.

License under CC-BY-SA with attribution

gung - Reinstate Monica 7 years ago

Is this (pdf) the Raftery paper you are referring to?