### Is there any reason to prefer the AIC or BIC over the other?

• The AIC and BIC are both methods of assessing model fit penalized for the number of estimated parameters. As I understand it, BIC penalizes models more for free parameters than does AIC. Beyond a preference based on the stringency of the criteria, are there any other reasons to prefer AIC over BIC or vice versa?

I think it is more appropriate to call this discussion "feature" selection or "covariate" selection. To me, model selection is much broader, involving specification of the error distribution, the form of the link function, and the form of the covariates. When we talk about AIC/BIC, we are typically in a situation where all aspects of model building are fixed except the selection of covariates.

Deciding which specific covariates to include in a model does commonly go by the term model selection, and there are a number of books with "model selection" in the title that are primarily about deciding which covariates/parameters to include in the model.

I don't know if your question applies specifically to phylogeny (bioinformatics), but if so, this study can provide some thoughts on this aspect: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2925852/

I've rejected the KIC edit because it doesn't match the existing question and makes the existing answers incomplete. A question about KIC may be opened separately to contrast KIC with AIC or BIC. In doing so, please also specify which KIC (as there are several information criteria that answer to that abbreviation).

@russellpierce: I'm not that OP, but as you saw it was already asked (without definition of KIC) and merged into this. I even searched for definitions of KIC but couldn't find a good one. Can you at least link some here?

@smci I've added https://stats.stackexchange.com/questions/383923/what-is-the-most-common-kic-how-does-it-work to allow people to dig into questions related to the KIC if interested.

• Your question implies that AIC and BIC try to answer the same question, which is not true. The AIC tries to select the model that most adequately describes an unknown, high-dimensional reality. This means that reality is never in the set of candidate models being considered. By contrast, BIC tries to find the TRUE model among the set of candidates. I find the assumption that reality is instantiated in one of the models the researchers built along the way quite odd. This is a real issue for BIC.

Nevertheless, there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B and then fitting both datasets with the two models. Overfitting occurs when the wrong model fits the data better than the generating model. The point of these simulations is to see how well AIC and BIC correct these overfits. Usually, the results indicate that AIC is too liberal and still frequently prefers a more complex, wrong model over a simpler, true model. At first glance these simulations seem like really good arguments, but the problem with them is that they are meaningless for AIC. As I said before, AIC does not consider any of the candidate models being tested to be actually true. According to AIC, all models are approximations to reality, and reality should never have a low dimensionality, at least not one lower than that of some of the candidate models.

My recommendation is to use both AIC and BIC. Most of the time they will agree on the preferred model; when they don't, just report both results.
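As a concrete illustration of computing both criteria, here is a minimal sketch. The log-likelihood values, parameter counts, and sample size below are invented for the example; only the two formulas (AIC = 2k − 2 log L, BIC = k log n − 2 log L) come from the standard definitions.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: k * ln(n) - 2 * log-likelihood."""
    return k * math.log(n) - 2 * log_lik

# Hypothetical fits: model A (3 parameters) vs model B (5 parameters),
# both fitted to the same n = 100 observations.
n = 100
fits = {"A": (-210.0, 3), "B": (-207.5, 5)}

for name, (ll, k) in fits.items():
    print(f"model {name}: AIC = {aic(ll, k):.1f}, BIC = {bic(ll, k, n):.1f}")
```

On this made-up pair of fits the two criteria disagree: AIC prefers the 5-parameter model B, while BIC, with its heavier log(n) penalty, prefers the 3-parameter model A; this is exactly the kind of case worth reporting in full.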

If you are unhappy with both AIC and BIC and have free time to invest, look up Minimum Description Length (MDL), a totally different approach that overcomes the limitations of AIC and BIC. There are several measures stemming from MDL, like normalized maximum likelihood or the Fisher information approximation. The problem with MDL is that it is mathematically demanding and/or computationally intensive.

Still, if you want to stick to simple solutions, a nice way of assessing model flexibility (especially when the numbers of parameters are equal, rendering AIC and BIC useless) is the parametric bootstrap, which is quite easy to implement. Here is a link to a paper on it.
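A toy sketch of the parametric bootstrap idea, assuming two nested Gaussian models (mean-only vs. a simple regression line) and fully simulated data: fit the simpler model, simulate many datasets from that fit, refit both models to each, and compare the observed log-likelihood gain of the richer model against its bootstrap null distribution.

```python
import math
import random

def loglik_normal(resid):
    """Gaussian log-likelihood with sigma^2 profiled out at its MLE."""
    n = len(resid)
    s2 = sum(r * r for r in resid) / n
    return -0.5 * n * (math.log(2 * math.pi * s2) + 1)

def fit_mean(y):
    """Residuals of the mean-only model."""
    mu = sum(y) / len(y)
    return [yi - mu for yi in y]

def fit_line(x, y):
    """Residuals of an ordinary least-squares line."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
    a = ybar - b * xbar
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def gof_stat(x, y):
    """Log-likelihood gain of the line model over the mean-only model."""
    return loglik_normal(fit_line(x, y)) - loglik_normal(fit_mean(y))

random.seed(0)
n = 50
x = [i / n for i in range(n)]
y = [1.0 + random.gauss(0, 1) for _ in x]   # data generated under the simple model

observed = gof_stat(x, y)

# Parametric bootstrap: simulate from the fitted simple model, refit both
# models, and collect the null distribution of the statistic.
mu_hat = sum(y) / n
s_hat = math.sqrt(sum((yi - mu_hat) ** 2 for yi in y) / n)
boot = [gof_stat(x, [random.gauss(mu_hat, s_hat) for _ in x]) for _ in range(500)]

p_value = sum(b >= observed for b in boot) / len(boot)
print(f"observed gain = {observed:.3f}, bootstrap p = {p_value:.3f}")
```

A large bootstrap p-value here suggests the extra flexibility of the line model is no better than what chance alone produces under the simpler model.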

Some people here advocate the use of cross-validation. I have used it personally and have nothing against it, but the issue is that the choice of sample-splitting rule (leave-one-out, K-fold, etc.) is an unprincipled one.
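To make the point about the splitting rule concrete, here is a toy sketch (entirely simulated data, plain least-squares line fit) that computes the cross-validated mean squared error of the same model under several choices of K, including leave-one-out as the K = n case:

```python
import random

def cv_mse(x, y, k):
    """k-fold cross-validated MSE of a least-squares line fit."""
    n = len(x)
    idx = list(range(n))
    random.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    sse = 0.0
    for test in folds:
        train = [i for i in idx if i not in test]
        xb = sum(x[i] for i in train) / len(train)
        yb = sum(y[i] for i in train) / len(train)
        b = (sum((x[i] - xb) * (y[i] - yb) for i in train)
             / sum((x[i] - xb) ** 2 for i in train))
        a = yb - b * xb
        sse += sum((y[i] - (a + b * x[i])) ** 2 for i in test)
    return sse / n

random.seed(1)
n = 40
x = [i / n for i in range(n)]
y = [2 * xi + random.gauss(0, 0.5) for xi in x]

for k in (2, 5, 10, n):   # k = n is leave-one-out
    print(f"{k:>2}-fold CV MSE: {cv_mse(x, y, k):.4f}")
```

The estimates differ across K (and across random fold assignments), which is precisely the arbitrariness the comment above is pointing at.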

The difference can also be viewed from a purely mathematical standpoint: BIC was derived as an asymptotic expansion of log P(data) where the true model parameters are sampled according to an arbitrary, nowhere-vanishing prior; AIC was derived similarly, but with the true parameters held fixed.
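The expansion referred to here is the standard Laplace-approximation argument behind the BIC; a sketch:

```latex
% For a k-parameter model with a smooth, nowhere-vanishing prior \pi(\theta),
% the log marginal likelihood of n observations expands as
\[
  \log P(\text{data})
  = \log P(\text{data} \mid \hat\theta) - \frac{k}{2}\log n + O(1),
\]
% where \hat\theta is the maximum-likelihood estimate. Multiplying by -2 and
% dropping the O(1) terms (which absorb the prior) gives
\[
  \mathrm{BIC} = -2 \log P(\text{data} \mid \hat\theta) + k \log n .
\]
```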

You said that "there are a lot of researchers who say BIC is better than AIC, using model recovery simulations as an argument. These simulations consist of generating data from models A and B, and then fitting both datasets with the two models." Would you be so kind as to point me to some references? I'm curious about them! :)

I do not believe the statements in this post.

I don't completely agree with Dave, especially regarding the objectives being different. I think both methods aim to find a good, in some sense optimal, set of variables for a model. In practice we never really assume that we can construct a "perfect" model. In a purely probabilistic sense, however, if we assume that there is a "correct" model, then BIC will be consistent and AIC will not. By this the mathematical statisticians mean that as the sample size grows to infinity, BIC will find the correct model with probability tending to 1.

I think that is why some people think that AIC does not provide a strong enough penalty.

(-1) Great explanation, but I would like to challenge an assertion. @Dave Kellen Could you please give a reference for the claim that the TRUE model has to be in the candidate set for BIC? I would like to investigate this, since in this book the authors give a convincing proof that this is not the case.

These slides http://myweb.uiowa.edu/cavaaugh/ms_lec_2_ho.pdf say that AIC assumes that the generating model is among the set of candidate models.

When you work through the derivation of the AIC, for the penalty term to equal the number of linearly independent parameters, the true model must hold. Otherwise the penalty equals $\operatorname{Trace}(J^{-1} I)$, where $J$ is the expectation of the negative Hessian of the log-likelihood and $I$ is the variance of the score, with both expectations evaluated under the truth while the log-likelihoods come from the misspecified model. I am unsure why many sources claim that the AIC does not depend on the truth; I had that impression too, until I actually worked through the derivation.
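For reference, the general penalty described here is usually called Takeuchi's information criterion (TIC); a sketch, using the convention that $J$ denotes the expected negative Hessian of the log-likelihood and $I$ the variance of the score, both evaluated under the true distribution (sources differ on which letter denotes which matrix):

```latex
\[
  \mathrm{TIC} = -2 \log L(\hat\theta) + 2\,\operatorname{Trace}(J^{-1} I)
\]
% If the model is correctly specified, the information-matrix equality gives
% I = J, so \operatorname{Trace}(J^{-1} I) = k (the number of parameters) and
% TIC reduces to the usual AIC = -2 \log L(\hat\theta) + 2k.
```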

Great answer, but I strongly disagree with the statement "reality should never have a low dimensionality". This depends on what "science" you are applying your models to.