### What is the intuition behind beta distribution?

Disclaimer: I'm not a statistician but a software engineer. Most of my knowledge in statistics comes from self-education, thus I still have many gaps in understanding concepts that may seem trivial for other people here. So I would be very thankful if answers included less specific terms and more explanation. Imagine that you are talking to your grandma :)

I'm trying to grasp the **nature** of the **beta distribution**: what it should be used for and how to interpret it in each case. If we were talking about, say, the normal distribution, one could describe it as the arrival time of a train: most frequently it arrives just in time, a bit less frequently it is 1 minute early or 1 minute late, and very rarely it arrives with a difference of 20 minutes from the mean. The uniform distribution describes, in particular, the chance of each ticket in a lottery. The binomial distribution may be described with coin flips, and so on. But is there such an **intuitive explanation** of the **beta distribution**?

Let's say $\alpha=.99$ and $\beta=.5$. The beta distribution $B(\alpha, \beta)$ in this case looks like this (generated in R):

But what does it actually mean? Y-axis is obviously a probability density, but what is on the X-axis?

I would highly appreciate any explanation, either with this example or any other.

@whuber: yeah, I understand what PDF is - that was just mistake in my description. Thanks for a valid note!

I'll try and find the reference but I know some of the more bizarre shapes for the generalized Beta distribution with form $a + (b-a)\,\mathrm{Beta}(\alpha_1,\alpha_2)$ have applications in fields such as physics. Also, you can fit it to expert data (min, mode, max) in data-poor environments and it is often better than using a Triangular distribution (unfortunately often used by IEs).

You've obviously never traveled with the railway company Deutsche Bahn. You'd be less optimistic.

David Robinson (correct answer, 8 years ago)

The short version is that the Beta distribution can be understood as representing a distribution *of probabilities*, that is, it represents all the possible values of a probability when we don't know what that probability is. Here is my favorite intuitive explanation of this:

Anyone who follows baseball is familiar with batting averages: simply the number of times a player gets a base hit divided by the number of times he goes up at bat (so it's just a percentage between `0` and `1`). `.266` is in general considered an average batting average, while `.300` is considered an excellent one.

Imagine we have a baseball player, and we want to predict what his season-long batting average will be. You might say we can just use his batting average so far, but this will be a very poor measure at the start of a season! If a player goes up to bat once and gets a single, his batting average is briefly `1.000`, while if he strikes out, his batting average is `0.000`. It doesn't get much better if you go up to bat five or six times: you could get a lucky streak and get an average of `1.000`, or an unlucky streak and get an average of `0`, neither of which are a remotely good predictor of how you will bat that season.

Why is your batting average in the first few hits not a good predictor of your eventual batting average? When a player's first at-bat is a strikeout, why does no one predict that he'll never get a hit all season? Because we're going in with *prior expectations.* We know that in history, most batting averages over a season have hovered between something like `.215` and `.360`, with some extremely rare exceptions on either side. We know that if a player gets a few strikeouts in a row at the start, that might indicate he'll end up a bit worse than average, but we know he probably won't deviate from that range.

Given our batting average problem, which can be represented with a binomial distribution (a series of successes and failures), the best way to represent these prior expectations (what we in statistics just call a prior) is with the Beta distribution: it's saying, before we've seen the player take his first swing, what we roughly expect his batting average to be. The domain of the Beta distribution is `(0, 1)`, just like a probability, so we already know we're on the right track, but the appropriateness of the Beta for this task goes far beyond that.

We expect that the player's season-long batting average will be most likely around `.27`, but that it could reasonably range from `.21` to `.35`. This can be represented with a Beta distribution with parameters $\alpha=81$ and $\beta=219$:

`curve(dbeta(x, 81, 219))`

I came up with these parameters for two reasons:

- The mean is $\frac{\alpha}{\alpha+\beta}=\frac{81}{81+219}=.270$
- As you can see in the plot, this distribution lies almost entirely within `(.2, .35)`, the reasonable range for a batting average.
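Both reasons can be checked numerically. The following is a minimal sketch in base R; the variable names are mine, and the standard deviation uses the textbook formula for a Beta distribution rather than anything stated in the answer.

```r
# Prior used in the answer: Beta(81, 219)
alpha0 <- 81
beta0  <- 219

# Reason 1: the mean is alpha / (alpha + beta)
prior_mean <- alpha0 / (alpha0 + beta0)

# Spread: sd of a Beta(a, b) is sqrt(a * b / ((a + b)^2 * (a + b + 1)))
prior_sd <- sqrt(alpha0 * beta0 / ((alpha0 + beta0)^2 * (alpha0 + beta0 + 1)))

# Reason 2: share of the probability mass that falls between .2 and .35
mass_in_range <- pbeta(0.35, alpha0, beta0) - pbeta(0.20, alpha0, beta0)

print(c(mean = prior_mean, sd = prior_sd, mass_in_range = mass_in_range))
```

If `mass_in_range` comes out close to 1, the claim that the distribution lies "almost entirely" within `(.2, .35)` holds up.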

You asked what the x axis represents in a beta distribution density plot: here it represents his batting average. Thus notice that in this case, not only is the y-axis a probability (or more precisely a probability density), but the x-axis is as well (batting average is just a probability of a hit, after all)! The Beta distribution is representing a probability distribution *of probabilities*.

But here's why the Beta distribution is so appropriate. Imagine the player gets a single hit. His record for the season is now `1 hit; 1 at bat`. We have to then *update* our probabilities: we want to shift this entire curve over just a bit to reflect our new information. While the math for proving this is a bit involved (it's shown here), the result is *very simple*. The new Beta distribution will be:

$\mbox{Beta}(\alpha_0+\mbox{hits}, \beta_0+\mbox{misses})$

Where $\alpha_0$ and $\beta_0$ are the parameters we started with, that is, 81 and 219. Thus, in this case, $\alpha$ has increased by 1 (his one hit), while $\beta$ has not increased at all (no misses yet). That means our new distribution is $\mbox{Beta}(81+1, 219)$, or:

`curve(dbeta(x, 82, 219))`

Notice that it has barely changed at all: the change is indeed invisible to the naked eye! (That's because one hit doesn't really mean anything.)
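To put a number on "barely changed", here is a small sketch of the update rule applied to that single hit. The variable names are illustrative; only the rule $\mbox{Beta}(\alpha_0+\mbox{hits}, \beta_0+\mbox{misses})$ comes from the answer.

```r
# Prior parameters from the answer
alpha0 <- 81
beta0  <- 219

# Observed so far: one at-bat, one hit
hits   <- 1
misses <- 0

# Update rule: add hits to alpha and misses to beta
alpha1 <- alpha0 + hits
beta1  <- beta0 + misses

mean_before <- alpha0 / (alpha0 + beta0)
mean_after  <- alpha1 / (alpha1 + beta1)

# How far the expected batting average moved after one hit
shift <- mean_after - mean_before
print(c(before = mean_before, after = mean_after, shift = shift))
```

The shift is a fraction of a percentage point, which is why the two curves are indistinguishable by eye.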

However, the more the player hits over the course of the season, the more the curve will shift to accommodate the new evidence, and furthermore the more it will narrow based on the fact that we have more proof. Let's say halfway through the season he has been up to bat 300 times, hitting 100 out of those times. The new distribution would be $\mbox{Beta}(81+100, 219+200)$, or:

`curve(dbeta(x, 81+100, 219+200))`

Notice the curve is now both thinner and shifted to the right (higher batting average) than it used to be: we have a better sense of what the player's batting average is.
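"Thinner" and "shifted to the right" can both be read off the parameters. A sketch under the same assumptions as the text (100 hits and 200 misses on top of the Beta(81, 219) prior); `beta_sd` is a helper I define here, and the 95% interval is my own addition for illustration.

```r
alpha0 <- 81;  beta0  <- 219   # prior from the answer
hits   <- 100; misses <- 200   # 300 at-bats, 100 hits

alpha1 <- alpha0 + hits
beta1  <- beta0 + misses

# Helper (not from the answer): standard deviation of a Beta(a, b)
beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))

prior_mean <- alpha0 / (alpha0 + beta0)
post_mean  <- alpha1 / (alpha1 + beta1)
naive      <- hits / (hits + misses)

prior_sd <- beta_sd(alpha0, beta0)
post_sd  <- beta_sd(alpha1, beta1)

# Central 95% interval of the updated distribution
interval <- qbeta(c(0.025, 0.975), alpha1, beta1)

print(c(prior_mean = prior_mean, post_mean = post_mean, naive = naive))
print(c(prior_sd = prior_sd, post_sd = post_sd))
print(interval)
```

The updated mean sits between the prior mean and the naive average, and the smaller standard deviation is the "thinner" curve.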

One of the most interesting outputs of this formula is the expected value of the resulting Beta distribution, which is basically your new estimate. Recall that the expected value of the Beta distribution is $\frac{\alpha}{\alpha+\beta}$. Thus, after 100 hits of 300 *real* at-bats, the expected value of the new Beta distribution is $\frac{81+100}{81+100+219+200}=.302$. Notice that it is lower than the naive estimate of $\frac{100}{100+200}=.333$, but higher than the estimate you started the season with ($\frac{81}{81+219}=.270$). You might notice that this formula is equivalent to adding a "head start" to the number of hits and non-hits of a player: you're saying "start him off in the season with 81 hits and 219 non-hits on his record".

Thus, the Beta distribution is best for representing a probabilistic distribution *of probabilities*: the case where we don't know what a probability is in advance, but we have some reasonable guesses.

@ffriend: Glad it helped. I hope you follow baseball (otherwise I wonder if it's understandable!)

Here's a similar example from John Cook using binary Amazon seller rankings with different number of reviews. The discussion of choosing a prior in the comments is particularly illuminating: http://www.johndcook.com/blog/2011/09/27/bayesian-amazon/#comments

You should point out that the prior need not be beta-distributed (unless you go with the Jeffreys prior, $\alpha_0=\beta_0=1/2$); only the likelihood must be beta distributed.

+ I like your explanation of how you update the distribution when you have more data.

@DavidRobinson: Any idea what to do if you have some unobserved outcomes? For baseball this does not make much sense since games are public, but for something like product or restaurants ratings this is rather important.

@DimitriyV.Masterov: Could you explain what you mean? While missing data is an issue in many classification problems, in this case (predicting a binomial or multinomial using a prior) there's nothing you can do but ignore it. (That is to say, what good does it do you to know that there were 30 other hits if you don't know anything about their result?)

@DavidRobinson: Suppose that we are dealing with restaurant reviews. You have a restaurant with 1000 transactions, but you observe binary ratings on only 500 of them. The "silent" ones are a mixture of negatives and positives. One area where ignoring this is problematic is popular restaurants, where people are reluctant to be the millionth to rate them.

@DavidRobinson - Nice explanation! Could you clarify where the initial values of α=81 and β=219 are coming from? or they are just examples?

@user27997 Those gave the desired mean of .27, and a standard deviation that is very roughly realistic for batting averages (about .025). Incidentally, I give an explanation of how to calculate α and β from a desired mean and variance here.
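For readers without the linked explanation, the standard method-of-moments conversion can be sketched as follows. The function name `beta_params_from_moments` is hypothetical, and this is the textbook formula, not necessarily the exact code behind the link.

```r
# Method of moments for a Beta distribution:
#   alpha + beta = mean * (1 - mean) / variance - 1
#   alpha = mean * (alpha + beta),  beta = (1 - mean) * (alpha + beta)
beta_params_from_moments <- function(mean, sd) {
  variance <- sd^2
  # A Beta distribution with this mean cannot have a larger variance
  stopifnot(mean > 0, mean < 1, variance < mean * (1 - mean))
  total <- mean * (1 - mean) / variance - 1
  c(alpha = mean * total, beta = (1 - mean) * total)
}

# Mean .27 and sd .025, the rough figures quoted in the comment
params <- beta_params_from_moments(0.27, 0.025)
print(params)
```

With a standard deviation of exactly .025 the result lands near, but not exactly on, 81 and 219, consistent with the comment's "very roughly".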

Why the beta distribution in particular? Is it just because it works? It seems quite similar to the binomial distribution, except I don't understand the -1 in the exponent.

@wrongusername: it's because the beta is the conjugate prior of the binomial. I link to the math that proves this result in the post.
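A compact version of that conjugacy argument, written out as a sketch: here $\theta$ is the unknown batting average, $h$ the number of hits and $m$ the number of misses (symbols I am introducing; the answer writes "hits" and "misses").

$$
\underbrace{\theta^{\alpha_0-1}(1-\theta)^{\beta_0-1}}_{\text{Beta prior, up to a constant}}
\times
\underbrace{\theta^{h}(1-\theta)^{m}}_{\text{binomial likelihood, up to a constant}}
\;=\;
\theta^{(\alpha_0+h)-1}(1-\theta)^{(\beta_0+m)-1}
$$

The right-hand side has the same form as a Beta density with parameters $\alpha_0+h$ and $\beta_0+m$, which is the update rule used in the answer. The $-1$ in the exponents is a matter of parameterization: with it, $\mbox{Beta}(1,1)$ is the flat distribution on $(0,1)$.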

@DavidRobinson great answer. How should I interpret the y-axis of that distribution? I see how to generate it from the beta function -- but what does it mean?

@bernie2436 It's a probability density. Technically a probability density has an intuitive meaning only when you integrate it; i.e. find the area under the curve. For example, the probability that a player's batting average is between .300 and .350 is equal to the area under the curve from .3 to .35. (If you try this with smaller and smaller intervals, you can see how the y-axis at each point is chosen!)

@bernie2436 For example: suppose there is a probability of 2% that the player's batting average is between .300 and .301. To get the probability density at that point, we would do .02 / (.301 - .300), which computes a density of 20. That will be the value at that point
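The same two ideas, area as probability and density as probability per unit width, can be tried on the Beta(81, 219) prior from the answer. This is only a sketch; the 2% and 20 in the comment are illustrative numbers and will not match what this prints.

```r
alpha0 <- 81
beta0  <- 219

# Probability that the batting average is between .300 and .350,
# once from the cumulative distribution and once by numerical integration
p_cdf <- pbeta(0.35, alpha0, beta0) - pbeta(0.30, alpha0, beta0)
p_int <- integrate(dbeta, lower = 0.30, upper = 0.35,
                   shape1 = alpha0, shape2 = beta0)$value

# Shrink the interval: probability / width approaches the density
width          <- 0.001
p_narrow       <- pbeta(0.300 + width, alpha0, beta0) - pbeta(0.300, alpha0, beta0)
approx_density <- p_narrow / width
exact_density  <- dbeta(0.300 + width / 2, alpha0, beta0)

print(c(p_cdf = p_cdf, p_int = p_int))
print(c(approx_density = approx_density, exact_density = exact_density))
```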

@Davidrobinson awesome. To clarify in your last comment "at that point" means on the y axis when x is just so slightly over .300?

@bernie2436 That's right!

Very nice answer, but you glossed over the concept of Bayesian estimation towards the end, which is why some readers did not understand how you obtained the posterior distribution.

Thanks for your answer - while what you said makes sense, roughly fitting mean and variance can't be the only way of modeling a prior, ya? Should I consider this as an all-purpose generic model, and if my situation is more specific, I should specifically engineer a model for the prior and update the posterior as data comes in? Are there more general-purpose parametrized models like the beta distribution?

It may be useful to, instead of using a heuristic to fit the Beta prior, fit it (e.g. with MLE) to historical batting averages, which are not too hard to find.
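A rough sketch of what such a maximum-likelihood fit could look like in base R. The `historical` vector below is simulated as a stand-in, not real batting data, and the choice of `optim` with a log-scale parameterization is my assumption rather than anything the thread prescribes.

```r
set.seed(1)

# Hypothetical stand-in for historical season batting averages (simulated!)
historical <- rbeta(2000, 81, 219)

# Negative log-likelihood; parameters live on the log scale so that
# alpha and beta stay positive during optimization
neg_log_lik <- function(log_par) {
  -sum(dbeta(historical, exp(log_par[1]), exp(log_par[2]), log = TRUE))
}

# Method-of-moments estimates as starting values
m     <- mean(historical)
v     <- var(historical)
total <- m * (1 - m) / v - 1
start <- log(c(m * total, (1 - m) * total))

fit <- optim(start, neg_log_lik)
mle <- exp(fit$par)
names(mle) <- c("alpha", "beta")
print(mle)
```

With real data, `historical` would be replaced by observed averages, ideally restricted to players with enough at-bats that each average is reasonably stable.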

@MichaelChirico true- in fact, that's what I did in the e-book that expands on my answer here http://varianceexplained.org/r/empirical-bayes-book/

Great explanation! can I ask, how did you get 81 and 219? I understand it fits after the graph plot, but how do I even go about obtaining the params?

@Wboy Thanks! I answer that here, by using the method of moments to get from a desired mean and standard deviation to the beta parameters.

@DavidRobinson you said "I came up with these parameters for two reasons", but you left out why you chose the combined magnitude which is, basically, how much weight you place on the prior. You can get about the same distribution by choosing much larger or smaller numbers resulting, respectively, in much less or more weight on new samples.

@JamesBowery If the combined magnitude were much higher or lower it wouldn't be the same distribution, because it would have a much lower or higher, respectively, standard deviation. That is, the magnitude is covered by the second reason: "this distribution lies almost entirely within `(.2, .35)`".

I see. So the narrower the distribution as specified by a and b, the more difficult it is to shift the mean of the probabilities.
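This exchange can be made concrete by holding the prior mean at .27 and varying only the combined magnitude $\alpha_0+\beta_0$. The three magnitudes below are arbitrary choices for illustration; the 100 hits and 200 misses are the mid-season figures from the answer.

```r
prior_mean <- 0.27
hits   <- 100
misses <- 200

# Combined magnitude alpha0 + beta0: weak, the answer's, and strong prior
strengths <- c(30, 300, 3000)
alpha0 <- prior_mean * strengths
beta0  <- (1 - prior_mean) * strengths

# Same mean, but the spread shrinks as the magnitude grows
prior_sd <- sqrt(alpha0 * beta0 / (strengths^2 * (strengths + 1)))

# Expected batting average after the same 300 at-bats
post_mean <- (alpha0 + hits) / (strengths + hits + misses)

result <- data.frame(strength = strengths, alpha0, beta0, prior_sd, post_mean)
print(result)
```

The weak prior ends up close to the naive .333, while the strong prior barely moves from .27.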

Hi @DavidRobinson, how did you decide $\alpha_0$ and $\beta_0$ to be 81 and 219, but not something else but with the same ratio (like 40, 115)? Is it because you're assuming a player plays (hits or misses) 300 times in a season?

Isn't this sorely lacking the point that the beta distribution is CONTINUOUS...this example is basically a binomial approximation to a Bernoulli process (which one often approximates beta distributions with)? Wouldn't a better example be more along the lines of how large a proportion of the day a player spends on training? That can take any value between 0-1, i.e. being CONTINUOUS, whereas the number of hits he makes will be DISCRETE and can't take any value between 0-1.

IIUC, this answer says "The intuition for the beta distribution is that it makes math easy when we do Bayesian updates". I feel this is not an intuition, but a technical convenience. Intuition would be "If we have this setup, the outcome has a beta distribution". For example, I just saw such an intuition in Stéphane Laurent's answer below.

Statistics is fun when people like @DavidRobinson exist. Thanks, David. It made my life easier :)

You made it so easy. Thank you.

License under CC-BY-SA with attribution

Content dated before 6/26/2020 9:53 AM

whuber 8 years ago

The y-axis is not a probability (which is obvious, because by definition a probability cannot lie outside the interval $[0,1]$). It is a probability density: a probability per unit of $x$ (and you have described $x$ as a rate).
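A quick check of this point with the parameters from the question, $\alpha=.99$ and $\beta=.5$. This is a sketch; the evaluation point 0.99 and the cut-off 0.9 are arbitrary choices of mine.

```r
# Parameters from the question
shape_a <- 0.99
shape_b <- 0.5

# Density near the right edge: a density may exceed 1
density_near_one <- dbeta(0.99, shape_a, shape_b)

# Probabilities are areas under the curve and stay within [0, 1]
total_mass     <- pbeta(1, shape_a, shape_b)
mass_above_0.9 <- 1 - pbeta(0.9, shape_a, shape_b)

print(c(density = density_near_one,
        total_mass = total_mass,
        mass_above_0.9 = mass_above_0.9))
```

The density value is well above 1, yet the total area is still 1 and the area above 0.9 is an ordinary probability.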