Calculating optimal number of bins in a histogram

  • I'm interested in finding as optimal a method as I can for determining how many bins to use in a histogram. My data should range from 30 to 350 objects at most, and in particular I'm trying to apply thresholding (like Otsu's method), where "good" objects, of which I should have fewer and which should be more spread out, are separated from "bad" objects, which should be denser in value. As a concrete example, each object would have a score of 1-10; I'd have 5-10 objects with scores 6-10, and 20-25 objects with scores 1-4. I'd like to find a histogram binning pattern that generally allows something like Otsu's method to threshold off the low-scoring objects. However, in the implementation of Otsu's method I've seen, the number of bins was 256, and I often have many fewer data points than 256, which suggests to me that 256 is not a good bin count. With so little data, what approaches should I take to calculating the number of bins to use?

    I think Sturges' rule can be used for n < 200, where n is the number of observations.
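
    For reference, Sturges' rule sets the number of bins to $k = \lceil \log_2 n \rceil + 1$. A minimal sketch in R (the sample size and data here are just illustrative):

    n <- 30                      # illustrative sample size
    ceiling(log2(n)) + 1         # Sturges' rule computed by hand
    nclass.Sturges(rnorm(n))     # base R's built-in implementation of the same rule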

  • The Freedman-Diaconis rule is very robust and works well in practice. The bin-width is set to $h=2\times\text{IQR}\times n^{-1/3}$. So the number of bins is $(\max-\min)/h$, where $n$ is the number of observations, max is the maximum value and min is the minimum value.

    In base R, you can use:

    hist(x, breaks="FD")
    

    For other plotting libraries without this option (e.g., ggplot2), you can calculate binwidth as:

    bw <- 2 * IQR(x) / length(x)^(1/3)  # Freedman-Diaconis bin width
    
    ## for example, with ggplot2:
    library(ggplot2)
    ggplot() + geom_histogram(aes(x), binwidth = bw)
    

    As a note, by default (so, if you don't specify `breaks`) R uses the Sturges algorithm.

    @nico. The default in R is breaks="Sturges" which does not always give good results.

    How does one calculate `IQR`?

    @KurtMueller IQR means interquartile range: the difference between the 3rd quartile and the 1st quartile. The `IQR` function already comes with R, so you can use it directly.
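
    A quick check in R (the data here are just illustrative):

    x <- c(1, 2, 4, 7, 9, 12, 15)                    # illustrative data
    IQR(x)                                           # built-in interquartile range
    unname(quantile(x, 0.75) - quantile(x, 0.25))    # same value: Q3 - Q1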

    In R, the Freedman-Diaconis algorithm is implemented as the function `nclass.FD` in the grDevices package (installed by default). `hist` uses this function when `breaks="FD"`.
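
    For example (the data are illustrative; note that `hist` treats the result of `nclass.FD` as a suggestion and may adjust the break points with `pretty`):

    x <- rnorm(100)           # illustrative data
    nclass.FD(x)              # suggested number of bins
    hist(x, breaks = "FD")    # uses nclass.FD() internally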

    I think this formula may not work well if some values occur many times. Taking an equally weighted average over repeated data, which is perhaps what this formula is based on, would bias the estimate towards the more frequently occurring values.

    No. `length(x)` is the number of observations. `range(x)` is `c(min(x), max(x))`.

    You could add more info about the variables; it can save time and help people understand faster. What are n, max, and min?

    If I am not mistaken, the answer should read `num_bins <- diff(range(x)) / (2 * IQR(x) / length(x)^(1/3))`

    Have you read http://users.stat.umn.edu/~gmeeden/papers/hist.pdf ? What do you think about this approach of minimizing a function, compared to the Freedman-Diaconis rule?

    One behavior I encountered with the Freedman-Diaconis rule is that if I have two datasets, one significantly larger than the other (1000x), each with just one column, and both drawn from the same distribution (i.e., the same IQR), then the larger set will get 10x more bins. Is this behavior desirable? What would be the effects of this increase in the number of bins as the number of observations increases?
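
    That scaling follows from the formula: the bin width shrinks like $n^{-1/3}$, so with a fixed IQR and range the number of bins grows like $n^{1/3}$, and $1000^{1/3} = 10$. A quick illustrative sketch in R:

    set.seed(1)
    fd_bins <- function(x) diff(range(x)) / (2 * IQR(x) / length(x)^(1/3))
    small <- runif(1e3)    # both samples come from the same distribution,
    large <- runif(1e6)    # but the second has 1000x more observations
    fd_bins(small)         # about 10 bins here
    fd_bins(large)         # about 100 bins: 10x more, since 1000^(1/3) = 10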

    It should be `binwidth <- (2 * IQR(x)) / length(x)^(1/3)`

    why not point to the already implemented `nclass.FD`?

    `nclass.FD` did not exist nine years ago.

    Jesus Christ, have we fallen this far? What happened to judging with our eyes?

Licensed under CC BY-SA with attribution

