In classical logic, inference is the process by which we proceed from some
given truths, via a proof, to a conclusion. The classical model takes
statements to have the option of being either true or false; in the real
world, we often encounter statements which are at best probably true or
probably false, so can we adapt the classical process of inference to a
mechanism which works also for such statements?
One approach is to make a classical statement which associates a probability
with each possibility and declares that reality will constitute a random sample
drawn from the available possibilities, in accordance with the probability
distribution indicated. This has the virtue of delivering a classical
statement; however, its relation to reality is a little strained - we only see
one of a family of potentialities, yet our model claims the others could have
happened. Any experimental test of the given reading of the classical
statement must, consequently, involve repeating the experiment many times and
observing the distribution of outcomes.
So, one way or another, inference in the non-classical world needs to
discuss probability distributions or, at least, structures having a similar
form. We need to be able to combine data from experiments with prior
knowledge to refine our knowledge; the process of inference takes what we
know (prior knowledge and experimental data) and extracts other information
from it. The Bayesian school approaches this in a systematic way.
Suppose we're studying a system whose general form we know, subject to
determining some parameters of the system. This gives us a mapping, H, from
possible values for the parameters to potential models. Our prior knowledge
of the system may be very vague about the value of the parameters; in which
case it is encoded as a measure on the parameter space (:H|). [A measure on
U is a generic mapping, unite(: (V: :{(V::U)}) ← V :{linear spaces}), which
turns any mapping from U to a linear space into a member of that linear
space; the measure of a particular (V:f:U) is described as the integral of f
over U, with respect to the given measure.] It is usually taken as read that
this measure integrates the constant scalar function ({one}:|U) to some
(finite, non-zero) scalar value, enabling us to scale the measure so that
this integral (known as the total of the measure, or the measure of U) is
one. However, I shall not presume this, though the analysis shall clearly be
easiest when it holds; when U is infinite, so may its total measure be.
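To make the bracketed definition concrete, here is a minimal sketch in Python; the finite set U and the weights are invented purely for illustration. A discrete measure on U is realised as a function that turns any mapping from U into a linear space (here, the scalars) into a member of that space, by a weighted sum; integrating the constant function ({one}:|U) yields the total of the measure, which we can then use to scale the measure so that its total is one.

```python
def make_measure(weights):
    """A discrete measure on U, given as a dict from points of U to weights.

    Returns a function that integrates any mapping f from U to a linear
    space, as the weighted sum of f's values.
    """
    def integrate(f):
        total = None
        for u, w in weights.items():
            term = w * f(u)
            total = term if total is None else total + term
        return total
    return integrate

# An (arbitrary, illustrative) measure on U = {0, 1, 2}:
weights = {0: 1.0, 1: 2.0, 2: 3.0}
mu = make_measure(weights)

one = lambda u: 1.0        # the constant scalar function ({one}:|U)
total = mu(one)            # the total of the measure: 1 + 2 + 3 = 6

# Scaling the measure by its total makes it a probability distribution:
scaled = make_measure({u: w / total for u, w in weights.items()})
```

The same `integrate` would serve unchanged for f taking values in any linear space supporting scalar multiplication and addition, which is the point of the definition.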
So we have a mapping, H, from some parameter space to models of our system and a measure, p, on (:H|) describing our prior knowledge of the parameters. For any h in (:H|), H(h) is a model of our system; h encodes the values of all our parameters. Given some experiment, we can infer the probabilities that model H(h) gives to the possible outcomes of the experiment; if we conduct the experiment and get a particular outcome, we can compare possible values of h to see which models declare that outcome most probable, and how much difference h-variation makes. For any model, H(h), we obtain a probability distribution, call it q(h), on the space of possible outcomes for our experiment; let D be some collection of possible outcomes; q(h) maps ({one}:|D), the constant unit scalar function on D, to the probability, according to H(h), that the experiment's outcome shall fall in D. When our measure, p, on (:H|) is a probability distribution,
the probability of the experiment's outcome falling in D is p({scalars}: q(h)({one}:|D) ←h :(:H|)), which integrates (over (:H|), using p) the probability of the experiment's outcome as a function of the parameters to our model.
Now, q(h) and p are linear, so p(: q(h)(f) ←h :) = p(: q(h)←h :)(f) i.e. p(q,f), reducing the above to p(q, ({one}:|D)), in which ({one}:|D) could be replaced with an arbitrary (V:f:{outcomes}) with V a linear space. Indeed, q is a mapping from (:H|) to the linear domain {measures on {outcomes}} so, integrating over (:H|), p(q) is a measure on {outcomes}.
Now, suppose we repeat our experiment several times and build up a function r from outcomes to {naturals} indicating how often each outcome arose. We can scale to get ({scalars}: r/sum(r) :{outcomes}), the relative frequencies of the outcomes. We can integrate r over {outcomes} using p(q) to get the probability of seeing this distribution of outcomes, p(q,r).
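As an illustrative sketch of p(q) (everything in it beyond the formalism is invented): take a hypothetical coin whose single parameter h is its probability of heads. The prior p is a discrete probability distribution on three candidate values of h; q(h) is the model's measure on {outcomes}; and p(q), integrating q over (:H|) using p, is a measure on {outcomes}.

```python
# Hypothetical example: a coin whose bias h (its probability of heads)
# is the single parameter; the prior weights below are made up.
prior = {0.25: 0.3, 0.5: 0.4, 0.75: 0.3}   # p, a measure on (:H|)

def q(h):
    """The measure on {outcomes} given by model H(h): integrates a
    function f of the outcome against the model's probabilities."""
    probs = {'heads': h, 'tails': 1.0 - h}
    return lambda f: sum(w * f(o) for o, w in probs.items())

def pq(f):
    """p(q): integrate q(h)(f) over (:H|) using p, giving a measure
    on {outcomes}."""
    return sum(w * q(h)(f) for h, w in prior.items())

# The probability of the outcome lying in D = {'heads'} is p(q) applied
# to the function that is one on D and zero elsewhere:
in_D = lambda o: 1.0 if o == 'heads' else 0.0
prob_heads = pq(in_D)   # 0.3*0.25 + 0.4*0.5 + 0.3*0.75, i.e. about 0.5
```

Feeding p(q) a function r of relative frequencies, in place of `in_D`, gives the p(q, r) of the text.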
Now, q(h)({one}:|D) is the probability, given that the parameters take the
values encoded in h, of the outcome being in D. Now, the probability of A
given B is just the probability of A and B divided by the probability of B;
and, conversely, the probability of A and B is just the probability of A
times the probability of B given A; so the probability of A given B is just
the probability of B given A times the probability of A, divided by that of
B. The thing we want as our inferred knowledge of the system, if our outcome
really turns out to be in D, is the probability distribution, on (:H|),
which gives the probability of the parameters taking value h given that the
outcome was in D. When this is pulled back to obtain a distribution on h, it
scales p's density at h by the factor q(h,({one}:|D)) divided by the
probability of the outcome being in D. This last (the probability of an
outcome in D) is implicitly independent of h and we don't necessarily have
any way to discover it; however, its independence of h means that the scaled
density we get only depends on this unknown to the extent of an overall
scaling, so we can just scale p, pointwise, by q(h,({one}:|D)) if we get an
outcome in D, or by q(h,r) if we observe relative frequency distribution r;
our result is indeterminate up to an overall scaling, but the relative
probabilities of the different values h can take are not affected.
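The rearrangement of probabilities above is Bayes' theorem; a quick numeric check, with probabilities invented for the purpose:

```python
# Invented probabilities, consistent with one another (P(B) exceeds
# P(A and B), as it must):
p_A = 0.2                 # probability of A
p_B_given_A = 0.9         # probability of B given A
p_B = 0.45                # probability of B

p_A_and_B = p_B_given_A * p_A     # P(A and B) = P(B given A) * P(A)
p_A_given_B = p_A_and_B / p_B     # P(A given B) = P(A and B) / P(B)
# which is just P(B given A) * P(A) / P(B), here 0.9 * 0.2 / 0.45 = 0.4
```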
So Bayesians use Bayes' theorem (which re-arranges probabilities involving
given, as above) to transform a prior distribution, p, on (:H|) into an
inferred distribution, P, on (:H|). Suppose we observe ({scalars}: r
|{outcomes}) as the relative frequencies. For any linear domain V and
mapping (V:f:(:H|)), P(f) = p(V: f(h).q(h,r) ←h :(:H|)) times some overall
global scaling; i.e. P is just p scaled, pointwise at h, by q(h,r).
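The transformation from p to P can be sketched concretely for the simplest case, an outcome observed to lie in D; the hypothetical coin, its three candidate biases and their prior weights are again invented for illustration. The posterior weight at h is the prior weight scaled by q(h,({one}:|D)), the probability the model H(h) gives to an outcome in D; normalising then fixes the overall scaling the text leaves free.

```python
# Prior weights on three candidate biases h of a hypothetical coin
# (all numbers invented for illustration):
prior = {0.25: 0.3, 0.5: 0.4, 0.75: 0.3}

# Observe an outcome in D = {'heads'}; the model H(h) gives heads
# probability h, so q(h,({one}:|D)) is simply h here:
likelihood = {h: h for h in prior}

# Scale p pointwise at h by the likelihood, then normalise:
unnormalised = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalised.values())
posterior = {h: w / total for h, w in unnormalised.items()}

# Having seen heads, larger biases gain weight: posterior[0.75] exceeds
# prior[0.75], while posterior[0.25] falls below prior[0.25].
```

Observing a relative frequency distribution r instead of a single outcome would replace `likelihood[h]` with q(h,r), leaving the rest of the computation unchanged.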
This is the process of Bayesian inference: our data serves to define a transformation between distributions on the parameter space associated with our family of models.
Written by Eddy.