I really need to spend some time describing Bayesian reasoning. I have a mental image, arising from reading Yudkowsky: for a simple problem, with evident property B and hidden property A; two horizontal bars, one above the other; the upper one of length P(A), the other of length P(~A); each is split horizontally in two, with the P(X|B) part on the left and the P(X|~B) on the right, for X in {A, ~A}. Line up the B vs ~B boundaries of the two bars vertically, draw diagonal line joining the bars' left ends and likewise for the right ends; see where they meet. Is the diagramatic fact useful in visualising the correct inference ? I think it should be: if the two bars are subdivided in the same proportions horizontally, the two diagonal lines shall meet on the centre-line of the bars (i.e. under the B vs ~B position); any imbalance shall shift the meeting-point to one side or the other.

It may help to use, as example, the inquisitor's over-seer. The inquisitor asks the accused whether she is a witch: she denies it. The inquisitor infers that she is a witch, since a witch would be likely to deny it. The over-seer asks, then, what was the purpose of the question; and how its outcome can influence the inquisitor's conviction of the accused's guilt.

christiansand re-incarnation

Another example to toy with: while reporting on the Chinese government's legislation governing re-incarnation (I kid thee not), Newsweek tells me:

… could the next Dalai Lama be American-born?

You'll have to ask him,says Harrison. If so, he'll likely be welcomed into a culture that has increasingly embraced reincarnation over the years. According to a 2005 Gallup poll, 20 percent of all U.S. adults believe in reincarnation. Recent surveys by the Barna Group, a Christian research nonprofit, have found that a quarter of U.S. Christians, including 10 percent of all born-again Christians, embrace it as their favored end-of-life view.

from which I rapidly conclude that re-incarnation is more widely believed in
among folk who identify as christians than among the general adult population,
in the USA; and the born-again

crowd, though *less* apt to believe
in re-incarnation than the general adult population, aren't all fully persuaded
by orthodox christianity.

If you toss a coin (that you have no prior information about) five times and it comes up heads all five times, what's the probability that it'll come up heads the next time you toss it ?

Assume each coin toss is independent and the probability of coming up heads on any given toss is h; we may have started out with a prior expectation that h is about a half, but we may now be less confident. Being good Bayesians, let's suppose we had a prior likelihood distribution on h given by a density function H; so we thought that the likelihood of h lying between a and b is the integral of H between a and b, for all a and b between 0 and 1. To infer our posterior likelihood distribution G, we can compute:

- G(x)
the likelihood of h being x, given our data

- = power(5, x).H(x)/P(five heads)
the probability of our data, given h is x, times the prior likelihood of h being x, divided by the probability of our data

and P(five heads) is independent of x, so it's just a normalization constant. In fact, it'll be integral(: power(5, x).H(x) ←x :), but this is rather unimportant. Now, the probability of the sixth toss coming up heads is then integral(: G(h).h ←h :), which we can't compute without chosing some prior H.

Now, one could try to be clever and build a model of the real world factors
that determine how biased a coin is apt to be and formulate a prior based on
that (e.g. bounded by how biased it *can* be and maybe shaped to reflect
how much effort it takes to make it more than a little biased); then we'll
evolve that in response to the results of some coin-tosses, multiplying by x
←x for every head and by 1 −x ←x for every tail and
renormalising; once it has lots of factors of these two forms, it'll take the
form of a spike that peaks at some rational n/(m+n) for some naturals n, m
(i.e. at some rational n/p between 0 and 1, hence with n ≤ p, hence p = n+m
for some natural m); at this point, it shall be moderately well approximated by
(: power(n, x).power(m, 1 −x) ←x :) which is what we *would*
have as G had we started with H uniform and seen n heads and m tails. The n, m
in question might not match the actual numbers of heads and tails seen up to
this point; and different choices of which n, m to use to represent a given
rational (we can replace any given n, m by k.n and k.m for any natural k; or we
can eliminate such a common factor from them); the latter just gives us some
choice in n, m to improve our G's match to the actual derived distribution,
alongside the choice of which rational near the distribution's mode to
approximate as n/(n+m), with larger n+m giving a tighter peak and lower n+m a
broader peak around the given rational. Since we can expect, after enough data,
to get a derived distribution that looks like one derived from a uniform prior,
we may as well actually use a uniform prior; our eventual derived distribution
shall be roughly the same anyway – and, if the coin is very weird, we
won't have prejudiced ourselves against being able to recognise this.

Now, given the form of distribution we'll thus be dealing with, we're going to have powers of x ←x and of 1 −x ←x in our distribution; the normalisation constant we'll be needing is the integral of such a product; the probability of heads or tails on the next toss shall be a similar integral, with one of the two powers increased by one, divided by this normalisation constant; so we're going to be interested in such integrals. Let us, then, define Q(n, m) to be the integral of power(n, x).power(m, 1 −x) over the unit interval U = {positives < 1}. We can evaluate this using integration by parts:

- Q(n, m)
- = integral(: power(n, x).power(m, 1 −x) ←x |U)
- = (: f(1) −f(0) ←f :)(: power(1+n, x).power(m, 1 −x) ←x :)/(1+n) +integral(: power(1+n, x).power(m−1, 1 −x) ←x |U).m/(1+n)
- = Q(n +1, m −1).m/(n +1)
at least when m > 0; the first term's f(0) is necessarily 0 as it has a factor of power(1+n, 0); while its f(1) has a factor power(m, 0); making both parts of the first term zero.

When m = 0, we have Q(n, 0) = 1/(1 +n), which we can write as n!.m!/(1+n+m)!, with m = 0, in which k! is the factorial of k, k! = k.(k−1)…1 = product(: 1+i ←i |k). So Q(n, m) = n!.m!/(1+n+m)! for every natural n when m = 0. Inductively suppose we have some m for which this holds true for every natural n and consider Q(n, m+1) = Q(n+1, m).(m+1)/(n+1) = (n+1)!.m!.(1+m)/(n+1)/(2+n+m)! = n!.(1+m)!/(1+n+1+m)! which is exactly of the right form for our inductive hypothesis applied to m+1, so we can induce that Q(n, m) = n!.m!/(1+n+m)! for all natural n, m.

So, with uniform prior H = {: 1 ←x |U), understood as representing no prior opinion about the coin's bias, a sequence of n heads and m tails, mixed up in any order, gives us G = (: power(n, x).power(m, 1 −x)/Q(n, m) ←x |U). The probability that the next coin-toss shall be a head is then Q(n+1, m)/Q(n, m) = (1+n)/(2+n+m), while that for a tail is Q(n, m+1)/Q(n, m) = (1+m)/(2+n+m); notice that these do indeed add up to 1 (as expected). In particular, with n = 5, m = 0, we get 6/7; after five coin tosses the same (with no other prior knowledge of the coin's bias), the probability that the next shall continue the run is 6/7; and each successive continuation pushes that probability closer to one.

Note that traditional statistics would say that, under
the null hypothesis

of a fair coin, because the probability five heads in
a row is 1/32, as is that for five tails, the probability of an outcome at least
as far from equal numbers of heads and tails as what we've observed is 1/16 =
8.25%, which is more than 5%, the five heads in a row do not constitute
statistically significant evidence to reject the default of assuming the coin is
fair. Even if it had rejected that hypothesis, it still wouldn't really have
told us what model to replace it with; should we assume the coin always comes up
heads ? Or merely that it gets heads more often than

tails ?
In contrast, Bayes gives us a definite answer to my initial question, that looks
respectably compatible with our intuitions about the situation, and tells us how
to further revise our model as we gain further information by actually doing
that sixth coin-toss.

With G(x) = power(n, x).power(m, 1 −x)/Q(n, m) we get G'(x) = (n −(n +m).x).power(n−1, x).power(m−1, 1 −x)/Q(n, m) which is zero precisely when n = (n +m).x; as we'll shortly see, this is a maximum, giving the mode of our distribution. Earlier, I claimed that some rational – expressed as n/(n+m) for some n, m – near the peak of our eventual derived distribution shall give us a distribution approximating the derived distribution; this shows that any such n, m shall at least give an approximating G whose mode is about right.

Differentiating again, we get

- G''(x).Q(n, m)
- = −(n +m).power(n−1, x).power(m−1, 1 −x) +(n−1).(n −(n +m).x).power(n−2, x).power(m−1, 1 −x) −(m−1).(n −(n +m).x).power(n−1, x).power(m−2, 1 −x)
- = (−(n +m).x.(1 −x) +(n−1).(n −(n +m).x).(1 −x) −(m−1).(n −(n +m).x).x).power(n−2, x).power(m−2, 1 −x)
- = ((n +m).(x.x −x) +(n−1).n −(n−1).n.x +(n +m).(n−1).(x.x −x) −(m−1).n.x +(m−1).(n +m).x.x).power(n−2, x).power(m−2, 1 −x)
- = ((n+m−1).(n +m).x.x −2.n.(n+m−1).x +(n−1).n).power(n−2, x).power(m−2, 1 −x)

in which the first factor, evaluated when G'(x) = 0, i.e. at n/(n+m), is

- (n+m−1).(n +m).n.n/(n+m)/(n+m) −2.n.(n+m−1).n/(n+m) +(n−1).n
- = n.n −n.n/(n+m) −2.n.n +2.n.n/(n+m) +n.n −n
- = n.n/(n+m) −n
- = −n.m/(n+m) = −1/(1/m +1/n)

which is negative, or zero for m = 0. For positive m, this makes the second derivative negative so the stationery point at n/(n+m) is indeed a maximum, i.e. the mode. For m = 0, we have G(x).Q(n, 0) = power(n, x) which is increasing on the whole unit interval and attains its maximum at x = 1 = n/(n+0), so this is indeed the mode, even though it isn't actually a stationery point. (The factor of power(m−1, 1−x) in G'(x) is, in that case, an artifact of how I've expressed G'(x) above.) So our n/(n+m) mode is indeed maximal for G. Next, let's consider where G'' is zero, for n, m at least 2: this requires

- 0 = (n+m−1).(n +m).x.x −2.n.(n+m−1).x +(n−1).n
- = (n +m −1).(x.x.(n +m) −2.n.x) +(n −1).n
- = (n +m −1).(n +m).(x.x −2.n.x/(n +m)
+(n −1).n/(n +m)/(n +m −1))
whence

- power(2, x −n/(n +m))
- = n.n/(n +m)/(n +m) −(n −1).n/(n +m)/(n +m −1)
- = (n/(n +m) −(n −1)/(n +m −1)).n/(n +m)
- = (n.(n +m −1) −(n −1).(n +m)).n/(n +m −1)/(n +m)/(n +m)
- = (n.n +n.m −n −n.n +−n.m +n +m).n/(n +m −1)/(n +m)/(n +m)
- = m.n/(n +m −1)/(n +m)/(n +m)
and

- power(2, (n +m).x −n)
- = m.n/(n +m −1)
- = 1/(1/n +1/m −1/n/m)
- = 1/(1 −(1 −1/n).(1 −1/m))
which is > 1 for n, m > 1.

whence x = n/(n+m) ±1/(n+m)/√(1/n +1/m −1/n/m), so our density G has points of inflection equally spaced on either side of its mode, further out than (n±1)/(n+m). Reparameterising with rational r = n/(n+m) and natural p = n+m, so n = r.p, m = (1−r).p, the peak's half-width is 1/p/√q with q = 1/n +1/m −1/n/m so p.p.q = p/r +p/(1−r) −1/r/(1−r) = (p −1)/r/(1−r), making the half-width √(r.(1−r)/(p−1)). For given r (the proportion of coin-tosses we've seen be heads), varying p (the number of trials over which we've seen this average) makes p.p.q vary approximately as p, at least for p significantly above 1, and this means the peak's width varies as 1/√(p−1); the variance of our distribution thus decreases proportional to the number of trials performed (as one of the precursors to the central limit theorem would tell you to expect).

Since the points of inflexion are evenly spaced on either side of our mode, n/(n+m), let's parameterise about the mode and divide G by its value there to see how G varies near the mode. For any s, we have:

- G((n+s)/(n+m)) / G(n/(n+m))
- = power(n, (n+s)/n).power(m, (m−s)/m)
- = power(n, 1 +s/n).power(m, 1 −s/m)

For large k and small s, power(k, 1 +s/k) is approximately exp(s), so our ratio is approximately exp(s).exp(−s) = 1. We're interested in this where s.s = 1/(1/n +1/m −1/n/m), for our points of inflexion; so s.s is of the same order as n and m, making s/n and s/m of the same order as 1/s, by which scale exp(s) is a poor approximation to power(k, 1 +s/k), for k of the same order as n, m. When n and m are equal, we get n = m factors of 1 −s.s/n/m = 1 −1/(n +m −1) which is roughly exp(n/(2.n −1)), roughly exp(-1/2), a comfortably O(1) quantity but less than 1, as we would expect. So, at least for n and m close to equal, G hasn't died away drastically by the time it reaches its points of inflexion.

When using a non-uniform prior, once we have adapted it in response to some observed coin-tosses, when approximating its derived distribution's G(x) by a power(n, x).power(m, 1 −x)/Q(n, m), let r be the derived distribution's mode; scale the inverse square half-width of the peak about the mode (between points of inflection) by r.(1−r) and add 1 to get p. We can then use n = r.p, m = (1−r).p as the numbers of heads and tails we would need to get roughly the same derived distribution from a uniform prior; the difference between these and the actual numbers of heads and tails seen before making the approximation tell us the number of heads and tails the non-uniform prior effectively expresses as our prejudice about how fair we expected the coin to be. Everything else about the non-uniform prior is effectively forgotten once we have taken into account a reasonable number of coin-tosses.

The foregoing might, of course, give non-whole numbers for n, p; for example, we might chose as prior the distribution whose graph is a half-circle with zero density at 0 and 1; as the radius is 1/2 and we only have half of the circle, above the x-axis, the area under this is π/8, so we need to scale by 8/π. The function to scale maps each x to the square root of 1/4 −power(2, 1/2 −x) = 1/4 −(x.x −x +1/4) = x.(1−x), so our function is just (: √(x.(1−x)) ←x :), which corresponds to half a coin-toss each way. Our derived distribution would, none the less, always be some product of terms in x and 1−x, as a function of x, albeit with non-whole powers (so extending Q using the gamma function, Γ to generalise factorial).

- Eliezer Yudkowsky's intuitive introduction to Bayesian reasoning – that even manages to be funny, in places.
- Information Theory, Inference, and Learning Algorithms by David J.C. MacKay
- Observation selection theory and anthropic reasoning.

3blue1brown points
out that Bayes's rule is easier to think about when phrased in terms of odds
rather than probabilities. If someone claims the odds of some outcome are n
to m

, for some numbers n and m (usually coprime whole numbers are used;
rescaling both by any common factor makes no difference to what it means),
that's a claim that the outcome has probability n/(n+m); if we express the odds
as a ratio q = n/m, then the probability is p = 1/(1 +1/q). Indeed, Bayse's
formula can naturally be rearranged to match this form:

- Post(H|D) = Prior(H).P(D|H)/(Prior(H).P(D|H) +Prior(not H).P(D|not H))
- = 1/(1 +Prior(not H).P(D|not H)/Prior(H).P(D|H))

(here P(A|B) gives our model's probability of A, when given that B is
true, while Prior(B) is how likely we thought outcome B would be; H is a
circumstance our model allows might be true and D is some data we've observed in
a test; Post(H|D) is then how likely Bayes says outcome H is, after seeing test
result D). Reading Post(H|D) as a probability, the matching odds ratio q =
p/(1−p) is just the resulting of scaling Prior(H)/Prior(not H), the prior
odds ratio, by the Bayes factor

P(D|H)/P(D|not H), the ratio of
probabilities of the outcome the test actually produced given, respectively, the
circumstance H to that given all its alternatives.

It is, however, worth noting (as 3b1b does in a closing comment) that this is only a helpful representative if we're comparing one hypothetical circumstance H to a simple contradiction of it; if we actually have many alternative hypothetical circumstances (maybe even a continuum of them) and want to keep track of likelihoods for each of them, the Prior(not H).P(D|not H) gets replaced by sum(: Prior(E).P(D|E) ←E :{hypotheses other than H}) which, technically, is indeed Prior(not H).P(D|not H) but the latter ceases being the natural form in which to think of it, as it gives rise to a single scaling Prior(D) = sum(: Prior(E).P(D|E) ←E :{hypotheses}) for use in Post(H|D) = Prior(H).P(D|H)/Prior(D) which lets us simply scale each Prior(H) by its P(D|H) to get the shape of the distribution Post(H|D), with Prior(D) just being the normalisation needed to make that distribution have unit total.

Written by Eddy.