How to determine which distribution fits my data best?

  • I have a dataset and would like to figure out which distribution fits my data best.

    I used the fitdistr() function from the MASS package to estimate the necessary parameters to describe the assumed distribution (e.g. Weibull, Cauchy, normal). Using those parameters, I can conduct a Kolmogorov-Smirnov test to check whether my sample data comes from the same distribution as my assumed distribution.

    If the p-value is > 0.05, I can assume that the sample data is drawn from that distribution. But the p-value doesn't provide any information about the goodness of fit, does it?

    So if the p-values of my sample data are > 0.05 for both a normal distribution and a Weibull distribution, how can I know which distribution fits my data better?

    This is basically what I have done:

    > mydata
     [1] 37.50 46.79 48.30 46.04 43.40 39.25 38.49 49.51 40.38 36.98 40.00
    [12] 38.49 37.74 47.92 44.53 44.91 44.91 40.00 41.51 47.92 36.98 43.40
    [23] 42.26 41.89 38.87 43.02 39.25 40.38 42.64 36.98 44.15 44.91 43.40
    [34] 49.81 38.87 40.00 52.45 53.13 47.92 52.45 44.91 29.54 27.13 35.60
    [45] 45.34 43.37 54.15 42.77 42.88 44.26 27.14 39.31 24.80 16.62 30.30
    [56] 36.39 28.60 28.53 35.84 31.10 34.55 52.65 48.81 43.42 52.49 38.00
    [67] 38.65 34.54 37.70 38.11 43.05 29.95 32.48 24.63 35.33 41.34
    
    # estimate shape and scale to perform KS-test for weibull distribution
    > fitdistr(mydata, "weibull")
         shape        scale   
       6.4632971   43.2474500 
     ( 0.5800149) ( 0.8073102)
    
    # KS-test for weibull distribution
    > ks.test(mydata, "pweibull", scale=43.2474500, shape=6.4632971)
    
            One-sample Kolmogorov-Smirnov test
    
    data:  mydata
    D = 0.0686, p-value = 0.8669
    alternative hypothesis: two-sided
    
    # KS-test for normal distribution
    > ks.test(mydata, "pnorm", mean=mean(mydata), sd=sd(mydata))
    
            One-sample Kolmogorov-Smirnov test
    
    data:  mydata
    D = 0.0912, p-value = 0.5522
    alternative hypothesis: two-sided
    

    The p-values are 0.8669 for the Weibull distribution and 0.5522 for the normal distribution. Thus I can assume that my data follows a Weibull as well as a normal distribution. But which distribution function describes my data better?


    Referring to elevendollar I found the following code, but I don't know how to interpret the results:

    fits <- list(no = fitdistr(mydata, "normal"),
                 we = fitdistr(mydata, "weibull"))
    sapply(fits, function(i) i$loglik)
           no        we 
    -259.6540 -257.9268 
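
    A quick note on reading this output: these are the maximized log-likelihoods of the two fits, and higher is better, so the Weibull (-257.93) edges out the normal (-259.65). Since both models have two estimated parameters, ranking by log-likelihood here is equivalent to ranking by AIC; as a minimal sketch (assuming the fits list from above):

    # AIC = 2k - 2*logLik, with k the number of estimated parameters; lower is better
    sapply(fits, function(i) 2 * length(i$estimate) - 2 * i$loglik)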
    

    Why would you like to figure out which distribution fits your data best?

    Because I want to generate pseudo-random numbers following the given distribution.

    You can't use KS to check whether a distribution with parameters found from the dataset matches the dataset. See #2 on this page for example, plus alternatives (and other ways the KS test can be misleading).

    Another discussion here with code samples on how to apply KS test when parameters are estimated from the sample.

    `I used the fitdistr() function` ... What's the `fitdistr` function? Something from Excel? Or something you wrote yourself in C?

  • COOLSerdash (correct answer, 6 years ago)

    First, here are some quick comments:

    • The $p$-values of a Kolmogorov-Smirnov test (KS test) with estimated parameters will be quite wrong. So unfortunately, you can't just fit a distribution and then use the estimated parameters in a KS test on your sample.
    • Your sample will never follow a specific distribution exactly. So even if the $p$-values from the KS test were valid and $>0.05$, it would just mean that you can't rule out that your data follow this specific distribution. Another formulation would be that your sample is compatible with a certain distribution. But the answer to the question "Does my data follow the distribution xy exactly?" is always no.
    • The goal here cannot be to determine with certainty what distribution your sample follows. The goal is what @whuber (in the comments) calls parsimonious approximate descriptions of the data. Having a specific parametric distribution can be useful as a model of the data.

    But let's do some exploration. I will use the excellent fitdistrplus package, which offers some nice functions for distribution fitting. We will use the function descdist to gain some ideas about possible candidate distributions.

    library(fitdistrplus)
    library(logspline)
    
    x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
    38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
    42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
    49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
    45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
    36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
    38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
    

    Now let's use descdist:

    descdist(x, discrete = FALSE)
    

    [Figure: Cullen and Frey graph produced by descdist]

    The kurtosis and squared skewness of your sample are plotted as a blue point named "Observation". It seems that possible distributions include the Weibull, Lognormal and possibly the Gamma distribution.
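
    If you want a sense of how stable that blue point is, descdist can also bootstrap the skewness and kurtosis; boot is a documented argument giving the number of bootstrap samples, which are plotted as additional points:

    # same Cullen and Frey graph, plus 500 bootstrapped skewness/kurtosis points
    descdist(x, discrete = FALSE, boot = 500)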

    Let's fit a Weibull distribution and a normal distribution:

    fit.weibull <- fitdist(x, "weibull")
    fit.norm <- fitdist(x, "norm")
    

    Now inspect the fit for the normal:

    plot(fit.norm)
    

    [Figure: diagnostic plots for the normal fit]

    And for the Weibull fit:

    plot(fit.weibull)
    

    [Figure: diagnostic plots for the Weibull fit]

    Both look good, but judging from the Q-Q plot, the Weibull perhaps looks a bit better, especially at the tails. Correspondingly, the AIC of the Weibull fit is lower than that of the normal fit:

    fit.weibull$aic
    [1] 519.8537
    
    fit.norm$aic
    [1] 523.3079
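
    fitdistrplus can also report several goodness-of-fit statistics (Kolmogorov-Smirnov, Cramér-von Mises, Anderson-Darling) together with the AIC and BIC in one call via gofstat, the function mentioned in the comments below; a minimal sketch using the two fits from above:

    # side-by-side goodness-of-fit statistics and information criteria
    gofstat(list(fit.weibull, fit.norm), fitnames = c("weibull", "norm"))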
    

    Kolmogorov-Smirnov test simulation

    I will use @Aksakal's procedure explained here to simulate the KS-statistic under the null.

    n.sims <- 5e4
    
    stats <- replicate(n.sims, {      
      r <- rweibull(n = length(x)
                    , shape= fit.weibull$estimate["shape"]
                    , scale = fit.weibull$estimate["scale"]
      )
      estfit.weibull <- fitdist(r, "weibull") # added to account for the estimated parameters
      as.numeric(ks.test(r
                         , "pweibull"
                         , shape= estfit.weibull$estimate["shape"]
                         , scale = estfit.weibull$estimate["scale"])$statistic
      )      
    })
    

    The ECDF of the simulated KS statistics looks as follows:

    plot(ecdf(stats), las = 1, main = "KS-test statistic simulation (CDF)", col = "darkorange", lwd = 1.7)
    grid()
    

    [Figure: ECDF of the simulated KS statistics]
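
    An approximate p-value can be read directly off this simulated null distribution as the share of simulated statistics at least as large as the observed one (a minimal sketch using stats, x and fit.weibull from above); the logspline fit below is just a smoothed version of the same idea:

    # observed KS statistic and its Monte Carlo p-value
    D.obs <- ks.test(x, "pweibull",
                     shape = fit.weibull$estimate["shape"],
                     scale = fit.weibull$estimate["scale"])$statistic
    mean(stats >= D.obs)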

    Finally, our $p$-value using the simulated null distribution of the KS-statistics is:

    fit <- logspline(stats)
    
    1 - plogspline(ks.test(x
                           , "pweibull"
                           , shape= fit.weibull$estimate["shape"]
                           , scale = fit.weibull$estimate["scale"])$statistic
                   , fit
    )
    
    [1] 0.4889511
    

    This confirms our graphical conclusion that the sample is compatible with a Weibull distribution.
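
    Since the stated goal was to generate pseudo-random numbers from the fitted distribution, this is now a one-liner with the estimated parameters (using the fit.weibull object from above):

    # draw 1000 pseudo-random numbers from the fitted Weibull distribution
    sims <- rweibull(1000, shape = fit.weibull$estimate["shape"],
                     scale = fit.weibull$estimate["scale"])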

    As explained here, we can use bootstrapping to add pointwise confidence intervals to the estimated Weibull PDF or CDF:

    xs <- seq(10, 65, len=500)
    
    true.weibull <- rweibull(1e6, shape= fit.weibull$estimate["shape"]
                             , scale = fit.weibull$estimate["scale"])
    
    boot.pdf <- sapply(1:1000, function(i) {
      xi <- sample(x, size=length(x), replace=TRUE)
      MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
      dweibull(xs, shape=MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
    }
    )
    
    boot.cdf <- sapply(1:1000, function(i) {
      xi <- sample(x, size=length(x), replace=TRUE)
      MLE.est <- suppressWarnings(fitdist(xi, distr="weibull"))  
      pweibull(xs, shape= MLE.est$estimate["shape"],  scale = MLE.est$estimate["scale"])
    }
    )   
    
    #-----------------------------------------------------------------------------
    # Plot PDF
    #-----------------------------------------------------------------------------
    
    par(bg="white", las=1, cex=1.2)
    plot(xs, boot.pdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.pdf),
         xlab="x", ylab="Probability density")
    for(i in 2:ncol(boot.pdf)) lines(xs, boot.pdf[, i], col=rgb(.6, .6, .6, .1))
    
    # Add pointwise confidence bands
    
    quants <- apply(boot.pdf, 1, quantile, c(0.025, 0.5, 0.975))
    min.point <- apply(boot.pdf, 1, min, na.rm=TRUE)
    max.point <- apply(boot.pdf, 1, max, na.rm=TRUE)
    lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
    lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
    lines(xs, quants[2, ], col="darkred", lwd=2)
    

    [Figure: bootstrap pointwise confidence bands for the Weibull PDF]
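
    Note that the true.weibull draws generated earlier are not actually used in this plot; if you want to compare the band against the maximum-likelihood fit itself, you can overlay the fitted density with one extra line (a small addition, not part of the original code):

    # overlay the ML Weibull density on the bootstrap band
    lines(xs, dweibull(xs, shape = fit.weibull$estimate["shape"],
                       scale = fit.weibull$estimate["scale"]),
          col = "blue", lwd = 2)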

    #-----------------------------------------------------------------------------
    # Plot CDF
    #-----------------------------------------------------------------------------
    
    par(bg="white", las=1, cex=1.2)
    plot(xs, boot.cdf[, 1], type="l", col=rgb(.6, .6, .6, .1), ylim=range(boot.cdf),
         xlab="x", ylab="F(x)")
    for(i in 2:ncol(boot.cdf)) lines(xs, boot.cdf[, i], col=rgb(.6, .6, .6, .1))
    
    # Add pointwise confidence bands
    
    quants <- apply(boot.cdf, 1, quantile, c(0.025, 0.5, 0.975))
    min.point <- apply(boot.cdf, 1, min, na.rm=TRUE)
    max.point <- apply(boot.cdf, 1, max, na.rm=TRUE)
    lines(xs, quants[1, ], col="red", lwd=1.5, lty=2)
    lines(xs, quants[3, ], col="red", lwd=1.5, lty=2)
    lines(xs, quants[2, ], col="darkred", lwd=2)
    #lines(xs, min.point, col="purple")
    #lines(xs, max.point, col="purple")
    

    [Figure: bootstrap pointwise confidence bands for the Weibull CDF]


    Automatic distribution fitting with GAMLSS

    The gamlss package for R offers the ability to try many different distributions and select the "best" according to the GAIC (the generalized Akaike information criterion). The main function is fitDist. An important option in this function is the type of distributions that are tried. For example, setting type = "realline" will try all implemented distributions defined on the whole real line, whereas type = "realplus" will only try distributions defined on the positive real line. Another important option is the parameter $k$, which is the penalty for the GAIC. In the example below, I set $k = 2$, which means that the "best" distribution is selected according to the classic AIC. You can set $k$ to anything you like, such as $\log(n)$ for the BIC.

    library(gamlss)
    library(gamlss.dist)
    library(gamlss.add)
    
    x <- c(37.50,46.79,48.30,46.04,43.40,39.25,38.49,49.51,40.38,36.98,40.00,
           38.49,37.74,47.92,44.53,44.91,44.91,40.00,41.51,47.92,36.98,43.40,
           42.26,41.89,38.87,43.02,39.25,40.38,42.64,36.98,44.15,44.91,43.40,
           49.81,38.87,40.00,52.45,53.13,47.92,52.45,44.91,29.54,27.13,35.60,
           45.34,43.37,54.15,42.77,42.88,44.26,27.14,39.31,24.80,16.62,30.30,
           36.39,28.60,28.53,35.84,31.10,34.55,52.65,48.81,43.42,52.49,38.00,
           38.65,34.54,37.70,38.11,43.05,29.95,32.48,24.63,35.33,41.34)
    
    fit <- fitDist(x, k = 2, type = "realplus", trace = FALSE, try.gamlss = TRUE)
    
    summary(fit)
    
    *******************************************************************
    Family:  c("WEI2", "Weibull type 2") 
    
    Call:  gamlssML(formula = y, family = DIST[i], data = sys.parent()) 
    
    Fitting method: "nlminb" 
    
    
    Coefficient(s):
                 Estimate  Std. Error  t value   Pr(>|t|)    
    eta.mu    -24.3468041   2.2141197 -10.9962 < 2.22e-16 ***
    eta.sigma   1.8661380   0.0892799  20.9021 < 2.22e-16 ***
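
    As noted above, you can change the penalty k to select by a different criterion; for example, k = log(n) corresponds to the BIC (a minimal variation of the same call):

    # same search, but the "best" family is now chosen by BIC
    fit.bic <- fitDist(x, k = log(length(x)), type = "realplus",
                       trace = FALSE, try.gamlss = TRUE)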
    

    According to the AIC, the Weibull distribution (more specifically WEI2, a special parametrization of it) fits the data best. The exact parametrization of the distribution WEI2 is detailed in this document on page 279. Let's inspect the fit by looking at the residuals in a worm plot (basically a de-trended Q-Q plot):
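
    wp is gamlss's worm-plot function; a minimal sketch, assuming it accepts the object returned by fitDist as it does a regular gamlss fit:

    wp(fit)  # worm plot of the normalized quantile residuals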

    [Figure: worm plot of the WEI2 fit]

    We expect the residuals to be close to the middle horizontal line and 95% of them to lie between the upper and lower dotted curves, which act as 95% pointwise confidence intervals. In this case, the worm plot looks fine to me, indicating that the Weibull distribution is an adequate fit.

    +1 Nice analysis. One question, though. Does a positive conclusion on compatibility with a particular major distribution (Weibull, in this case) allow us to rule out the possibility of a mixture distribution? Or do we need to perform a proper mixture analysis and check the GoF to rule out that option?

    @AleksandrBlekh It is impossible to have enough power to rule out a mixture: when the mixture is of two almost identical distributions it cannot be detected and when all but one component have very small proportions it cannot be detected, either. Typically (in the absence of a theory which might suggest a distributional form), one fits parametric distributions in order to achieve *parsimonious approximate descriptions* of data. Mixtures are none of those: they require too many parameters and are *too* flexible for the purpose.

    I have to go through your analysis in detail again, but it seems to solve my problem. Thank you very much!

    @whuber: +1 Appreciate your **excellent** explanation!

    @COOLSerdash You've written: "It seems that possible distributions include the Weibull, Lognormal, Normal and possibly the Gamma distribution". How could you infer that? (Could you please give me references?)

    @Lourenco I looked at the Cullen and Frey graph. The blue point denotes our sample. You see that the point is close to the lines of the Weibull, Lognormal and Gamma (which lies between the Weibull and the Lognormal). After fitting each of those distributions, I compared the goodness-of-fit statistics using the function `gofstat` and the AIC. There isn't a consensus about the best way to determine the "best" distribution. I like graphical methods and the AIC.

    @COOLSerdash, understood. One more question: why haven't you included the Logistic distribution? You've included the Normal only.

    @Lourenco Do you mean the lognormal? The logistic distribution (the "+" sign) is quite far away from the observed data. The lognormal would also be a candidate I'd normally look at. For this tutorial, I chose not to show it in order to keep the post short. The lognormal shows a worse fit compared to both the Weibull and the normal distribution: the AIC is 537.59 and the graphs also don't look too good.
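
    For reference, that lognormal check can be reproduced with the same fitdistrplus workflow (a sketch; the AIC of 537.59 is the value quoted in the comment above):

    fit.lnorm <- fitdist(x, "lnorm")  # fit the lognormal by maximum likelihood
    fit.lnorm$aic                     # compare with Weibull (519.85) and normal (523.31)
    plot(fit.lnorm)                   # diagnostic plots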

    @COOLSerdash, I'm considering the Logistic (+), the Normal (*), and the Observation as single points, is that right? If so, isn't the Logistic as far from the Observation as the Normal?

    @Lourenco You're right. I don't know why I considered the Normal. I will edit my answer accordingly. Thank you.

    May I know why logspline is preferred over ecdf to calculate the p-value of the KS statistic? `1 - ecdf(stats)(ks.test(x, "pweibull", shape = fit.weibull$estimate["shape"], scale = fit.weibull$estimate["scale"])$statistic)` returns 0.50864 on my computer, which is close to the p-value 0.4889511 shown above.

    @KevinZhu Logsplines are just another method to fit densities. I don't know if they are better in this case. It's certainly possible to just use the ECDF, especially if the number of simulations is large, I think. Your analysis confirms that the $p$-values are very close, so it probably doesn't make a difference.

Licensed under CC BY-SA with attribution. Content dated before 6/26/2020 9:53 AM.