How do I group a list of numeric values into ranges?

  • I do have a big list of numeric values (including duplicates) and I do want to group them into ranges in order to see if how do they distribute.

    Let's say there are 1000 values ranging from 0 to 2.000.000 and I do want to group them.

    How can I achieve this, preferably in Excel or SQL.

  • Why group them? Instead, how about estimate the probability density function (PDF) of the distributions from which the data arise? Here's an R-based example:

    set.seed(123)
    dat <- c(sample(2000000, 500), rnorm(100, 1000000, 1000), 
             rnorm(150, 1500000, 100),rnorm(150, 500000, 10),
             rnorm(180, 10000, 10), rnorm(10, 1000, 5), 1:10)
    dens <- density(dat)
    plot(dens)
    

    If the data are strictly bounded (0, 2,000,000) then the kernel density estimate is perhaps not best suited. You could fudge things by asking it to only evaluate the density between the bounds:

    dens2 <- density(dat, from = 0, to = 2000000)
    plot(dens2)
    

    Alternatively there is the histogram - a coarse version of the kernel density. What you specifically talk about is binning your data. There are lots of rules/approaches to selecting equal-width bins (i.e. the number of bins) from the data. In R the default is Sturges rule, but it also includes the Freedman-Diaconis rule and Scott's rule. There are others as well - see the Wikipedia page on histograms.

    hist(dat)
    

    If you are not interested in the kernel density plot or the histogram per se, rather just the binned data, then you can compute the number of bins using the nclass.X family of functions where X is one of Sturges, scott or FD. And then use cut() to bin your data:

    cut.dat <- cut(dat, breaks = nclass.FD(dat), include.lowest = TRUE)
    table(cut.dat)
    

    which gives:

    > cut.dat
      [-2e+03,2.21e+05] (2.21e+05,4.43e+05] (4.43e+05,6.65e+05] (6.65e+05,8.88e+05] 
                    247                  60                 215                  61 
    (8.88e+05,1.11e+06] (1.11e+06,1.33e+06] (1.33e+06,1.56e+06] (1.56e+06,1.78e+06] 
                    153                  51                 205                  50 
       (1.78e+06,2e+06] 
                     58
    

    in R.

    However, binning is fraught with problems, most notably; How do you know that your choice of bins hasn't influenced the resulting impression you get of the way the data are distributed?

    (+1) Thanks for the hint about `nclass.*`. Why isn't your left-most interval left bounded (should set `include.lowest` to TRUE maybe)?

    @chl - that's what you get when doing something quickly. Will edit the code to add `include.lowest`

    A good solution is to create an interactive application that dynamically displays a graphic of the binned counts (*e.g.*, a histogram) or appropriate statistics (*e.g.*, chi-squared) in response to the user's manipulation of the cutpoints, so that this issue can be quickly explored. This is dead easy in Excel, the OP's platform of choice, and also fairly easy in some high-end software like *Mathematica*.

    @whuber I voted up the response although it is certainly not the response expected by the OP, but I like the point about binning and density. Interactive visualization can be made really easy in R too, at least for this kind of viz. app. The Tk-based `rpanel` package provides basic functionalities (see e.g., `rp.slider()` for an illustration with a density plot); `playwith` is a Gtk-based interactive plotting device, and is very handy. Now, I have no doubt that this should be far more easier with Excel (which is the reason why I've also +1 your answer).

    @chl Thanks for the pointers to R interactivity tools.

  • I'll assume that you've already determined the number of categories you'll use. Let's say you want to use 20 categories. Then they will be:

    • Category 1: [0 - 100,000)
    • Category 2: [100,000 - 200,000)
    • Category 3: [200,000 - 300,000)
    • ...
    • Category 19: [1,800,000 - 1,900,000)
    • Category 20: [1,900,000 - 2,000,000]

    Note that the label of each category can be defined as

    FLOOR (x / category_size) + 1
    

    This is trivial to define as a computed column in SQL or as a formula in Excel.

    Note that the last category is infinitesimally larger than the others, since it is closed on both sides. If you happen to get a value of exactly 2,000,000 you might erroneously classify it as falling into category 21, so you have to treat this exception with an ugly "IF" (in Excel) or "CASE" (in SQL).

    Good start! This solution *assigns* ranges; now you have to group them by following up your calculation with a summarize operation.

    @whuber Grouping is something I'm assuming lonut can do if he regularly works with Excel and SQL, and there are many ways to do it. In Excel, for example, he can use a PivotTable or the COUNTIF() function; in SQL he'll probably use COUNT(*) with a GROUP BY CATEGORY clause (where CATEGORY is the computed column).

  • You have requested an Excel or SQL solution. The easiest way in Excel is to use its "Analysis" add-in to create a histogram. It will automatically create the bins (ranges of values) but, optionally, accepts a list of bin cutpoints as input and uses them. The output includes a parallel list of bin counts. This is especially handy for irregular-width bins.

    This is a one-off calculation: if the data change or the cutpoints change, you have to go through the entire dialog again. A more flexible option is to use COUNTIF to count all values less than or equal to any given bin cutpoint. The first differences of such an array give the bin counts.

    Here is a working example. The data are in a column named "Simulation_Z" (which in this particular case is defined to be an entire column, such as $C:$C). The formulae shown below are copied from columns L2:N10 of a sheet in the same workbook. They were created by copying the first one downward (but notice the special formula for the first count in N3).

    Cut Count up                            Count
    -3.0    =COUNTIF(Simulation_Z, "<=" & L3)   =M3
    -2.0    =COUNTIF(Simulation_Z, "<=" & L4)   =M4-M3
    -1.0    =COUNTIF(Simulation_Z, "<=" & L5)   =M5-M4
     0.0    =COUNTIF(Simulation_Z, "<=" & L6)   =M6-M5
     1.0    =COUNTIF(Simulation_Z, "<=" & L7)   =M7-M6
     2.0    =COUNTIF(Simulation_Z, "<=" & L8)   =M8-M7
     3.0    =COUNTIF(Simulation_Z, "<=" & L9)   =M9-M8
    =MAX(Simulation_Z)  =COUNTIF(Simulation_Z, "<=" & L10)  =M10-M9
    

    Column L ("Cut") stipulates the upper limits of each bin.

    This procedure simultaneously defines the bins and computes their counts, which are then available for further testing (e.g., $\chi\text{-squared}$) or plotting.

  • Another option using Excel that gives you a fair amount of flexibility over the number & size of bins is the =frequency(data_array, bins_array) function. It is an array function that expects two arguments. The first argument is your data, the second is your bins that you define.

    Let's assume your data is in cells A1 - A1000 and create bins in cells B1 - B20. You would want to highlight cells C1 - C21 and then type something like =FREQUENCY(A1:A100, B1:B21). Unlike normal functions, array functions must be entered with the key combination SHIFT + CTRL + ENTER. You should see the counts fill down for all bins, if not - you most likely only hit enter and the first cell is calculated.

    There are plenty of good tutorials online explaining this in more detail, here's a decent one.

  • In Excel, a simple way to group numeric data into bins is via the Pivot Table. Pull the numeric variable into the "row labels". Now right-click on any of the values in this right column and choose "Group". You can set the min and max of the overall range and the bin size (equal bins widths for all data).

    This kinda worked but Excel did sort the ranges as text generating a total mess. I wasn't able to convince Excel (2010) to show the ranges in the logical order.

    Do you mean that Excel doesn't recognize that your data are numerical? If you look at the cell format, is it defined as numbers? Just to make sure I understand your comment.

  • Although the FLOOR answer is probably nicer, you can do this with a vlookup.

    The first column of the lookup table (shown below) is the lower bound of each of your bins. Then you enter a formula to make the labels in the second column (formula shown, and result of the formula in the third column -- obviously you'd only have two columns.) Note the last two rows of the table are wonky -- the second last has a different formula because you want it to be the unbounded "greater than" case (unless you don't want that, then change the formula - you can set this up however you like).

    You would enter this formula (assuming the values you want to bin are in Col A, and the lookup table is in cols F:G somewhere. Note the TRUE parameter, so vlookup doesn't need to find an exact match.

    =VLOOKUP(A2,$F$2:$G$24,2,TRUE)
    

    This is what the lookup table would look like:

    Floor   Formula             Result of Formula (label) 
        0   =F2 & " - " &F3-1   0 - 49999
    50000   =F3 & " - " &F4-1   50000 - 99999
    100000  =F4 & " - " &F5-1   100000 - 149999
    150000  =F5 & " - " &F6-1   150000 - 199999
    200000  =F6 & " - " &F7-1   200000 - 249999
    250000  =F7 & " - " &F8-1   250000 - 299999
    300000  =F8 & " - " &F9-1   300000 - 349999
    ....
    1100000 =" >= " &F25        >= 1100000
    
  • To answer this for SQL, you really need to have a database that understands a window function (part of SQL:2003), such as MS SQL Server or PostgreSQL.

    To group val from table data into, say, 13 equal-width bins, use the ntile window function:

    SELECT val, ntile(13) OVER (ORDER BY val)
    FROM data;
    
           val        | ntile
    ------------------+-------
     8908.96283090115 |     1
     9090.72533249855 |     1
      9620.1803535223 |     1
      11068.768799305 |     1
    ...
     1994248.56621772 |    13
     1994786.21594608 |    13
     1995945.97052783 |    13
     1997640.62557369 |    13
    (1000 rows)
    
  • that is simply to use this formula in excel:

    Ceiling(Cell(i,j),1000)/1000

    Then you have group of 1000: 1, 2, 3, ...

    Velcome to our site!

    The concern is that perhaps almost all of the numbers might lie in a few bins, such as from 0 through 999. Your solution will then destroy almost all information about the distribution. It would be helpful to be able to *adapt* the binning to the data distribution in order to reveal as much as possible about the distribution.

License under CC-BY-SA with attribution


Content dated before 6/26/2020 9:53 AM