In classical logic, inference is the process by which we proceed from some
given truths, via a proof, to a conclusion. The classical model takes
statements to have the option of being either true or false; in the real
world, we often encounter statements which are at best probably true or
probably false, so can we adapt the classical process of inference to a
mechanism which works also for such statements?
One approach is to make a classical statement which associates a probability
with each possibility and declares that reality will constitute a random sample
drawn from the available possibilities, in accordance with the probability
distribution indicated. This has the virtue of delivering a classical
statement; however, its relation to reality is a little strained - we only see
one of a family of potentialities, yet our model claims the others could have
happened. Any experimental test of the given reading of the classical
statement must, consequently, involve repeating the experiment many times and
observing the distribution of outcomes.
So, one way or another, inference in the non-classical world needs to
discuss probability distributions or, at least, structures having a similar
form. We need to be able to combine data from experiments with prior
knowledge to refine our knowledge; the process of inference takes what we
know (prior knowledge and experimental data) and extracts other information
from it. The Bayesian school approaches this in a systematic way.
Suppose we're studying a system whose general form we know, subject to
determining some parameters of the system. This gives us a mapping, H, from
possible values for the parameters to potential models. Our prior knowledge
of the system may be very vague about the value of the parameters; in which
case it is encoded as a measure on the parameter space (:H|). [A measure on
U is a generic mapping, unite(: (V: :{(V::U)}) ← V :{linear spaces}), which
turns any mapping from U to a linear space into a member of that linear
space; the measure of a particular (V:f:U) is described as the integral of f
over U, with respect to the given measure.] It is usually taken as read that
this measure integrates the constant scalar function ({one}:|U) to some
(finite, non-zero) scalar value, enabling us to scale the measure so that
this integral (known as the total of the measure, or the measure of U) is
one. However, I shall not presume this, though the analysis shall clearly be
easiest when it holds; when U is infinite, so may its total measure be.
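To make the bracketed definition concrete, here is a minimal sketch in Python; the finite set U and the weights are invented purely for illustration. A discrete measure on U is realised as a function that turns any mapping from U into a linear space (here, the scalars) into a member of that space, by a weighted sum; integrating the constant function ({one}:|U) yields the total of the measure, which we can then use to scale the measure so that its total is one.

```python
def make_measure(weights):
    """A discrete measure on U, given as a dict from points of U to weights.

    Returns a function that integrates any mapping f from U to a linear
    space, as the weighted sum of f's values.
    """
    def integrate(f):
        total = None
        for u, w in weights.items():
            term = w * f(u)
            total = term if total is None else total + term
        return total
    return integrate

# An (arbitrary, illustrative) measure on U = {0, 1, 2}:
weights = {0: 1.0, 1: 2.0, 2: 3.0}
mu = make_measure(weights)

one = lambda u: 1.0        # the constant scalar function ({one}:|U)
total = mu(one)            # the total of the measure: 1 + 2 + 3 = 6

# Scaling the measure by its total makes it a probability distribution:
scaled = make_measure({u: w / total for u, w in weights.items()})
```

The same `integrate` would serve unchanged for f taking values in any linear space supporting scalar multiplication and addition, which is the point of the definition.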
So we have a mapping, H, from some parameter space to models of our system and a measure, p, on (:H|) describing our prior knowledge of the parameters. For any h in (:H|), H(h) is a model of our system; h encodes the values of all our parameters. Given some experiment, we can infer the probabilities that model H(h) gives to the possible outcomes of the experiment; if we conduct the experiment and get a particular outcome, we can compare possible values of h to see which models declare that outcome most probable, and how much difference h-variation makes. For any model, H(h), we obtain a probability distribution, call it q(h), on the space of possible outcomes for our experiment; let D be some collection of possible outcomes; q(h) maps ({one}:|D), the constant unit scalar function on D, to the probability, according to H(h), that the experiment's outcome shall fall in D. When our measure, p, on (:H|) is a probability distribution,
the probability of the experiment's outcome falling in D is p({scalars}: q(h)({one}:|D) ←h :(:H|)), which integrates (over (:H|), using p) the probability of the experiment's outcome as a function of the parameters to our model.
Now, q(h) and p are linear, so p(: q(h)(f) ←h :) = p(: q(h)←h :)(f) i.e. p(q,f), reducing the above to p(q, ({one}:|D)), in which ({one}:|D) could be replaced with an arbitrary (V:f:{outcomes}) with V a linear space. Indeed, q is a mapping from (:H|) to the linear domain {measures on {outcomes}} so, integrating over (:H|), p(q) is a measure on {outcomes}.
Now, suppose we repeat our experiment several times and build up a function r from outcomes to {naturals} indicating how often each outcome arose. We can scale to get ({scalars}: r/sum(r) :{outcomes}), the relative frequencies of the outcomes. We can integrate r over {outcomes} using p(q) to get the probability of seeing this distribution of outcomes, p(q,r).
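As an illustrative sketch of p(q) (everything in it beyond the formalism is invented): take a hypothetical coin whose single parameter h is its probability of heads. The prior p is a discrete probability distribution on three candidate values of h; q(h) is the model's measure on {outcomes}; and p(q), integrating q over (:H|) using p, is a measure on {outcomes}.

```python
# Hypothetical example: a coin whose bias h (its probability of heads)
# is the single parameter; the prior weights below are made up.
prior = {0.25: 0.3, 0.5: 0.4, 0.75: 0.3}   # p, a measure on (:H|)

def q(h):
    """The measure on {outcomes} given by model H(h): integrates a
    function f of the outcome against the model's probabilities."""
    probs = {'heads': h, 'tails': 1.0 - h}
    return lambda f: sum(w * f(o) for o, w in probs.items())

def pq(f):
    """p(q): integrate q(h)(f) over (:H|) using p, giving a measure
    on {outcomes}."""
    return sum(w * q(h)(f) for h, w in prior.items())

# The probability of the outcome lying in D = {'heads'} is p(q) applied
# to the function that is one on D and zero elsewhere:
in_D = lambda o: 1.0 if o == 'heads' else 0.0
prob_heads = pq(in_D)   # 0.3*0.25 + 0.4*0.5 + 0.3*0.75, i.e. about 0.5
```

Feeding p(q) a function r of relative frequencies, in place of `in_D`, gives the p(q, r) of the text.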
Now, q(h)({one}:|D) is the probability, given that the parameters take the
values encoded in h, of the outcome being in D. Now, the probability of A
given B is just the probability of A and B divided by the probability of B;
and, conversely, the probability of A and B is just the probability of A
times the probability of B given A; so the probability of A given B is just
the probability of B given A times the probability of A, divided by that of
B. The thing we want as our inferred knowledge of the system, if our outcome
really turns out to be in D, is the probability distribution, on (:H|),
which gives the probability of the parameters taking value h given that the
outcome was in D. When this is pulled back to obtain a distribution on h, it
scales p's density at h by the factor q(h,({one}:|D)) divided by the
probability of the outcome being in D. This last (the probability of an
outcome in D) is implicitly independent of h and we don't necessarily have
any way to discover it; however, its independence of h means that the scaled
density we get only depends on this unknown to the extent of an overall
scaling, so we can just scale p, pointwise, by q(h,({one}:|D)) if we get an
outcome in D, or by q(h,r) if we observe relative frequency distribution r;
our result is indeterminate up to an overall scaling, but the relative
probabilities of the different values h can take are not affected.
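The rearrangement of probabilities above is Bayes' theorem; a quick numeric check, with probabilities invented for the purpose:

```python
# Invented probabilities, consistent with one another (P(B) exceeds
# P(A and B), as it must):
p_A = 0.2                 # probability of A
p_B_given_A = 0.9         # probability of B given A
p_B = 0.45                # probability of B

p_A_and_B = p_B_given_A * p_A     # P(A and B) = P(B given A) * P(A)
p_A_given_B = p_A_and_B / p_B     # P(A given B) = P(A and B) / P(B)
# which is just P(B given A) * P(A) / P(B), here 0.9 * 0.2 / 0.45 = 0.4
```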
So Bayesians use Bayes' theorem (which re-arranges probabilities involving
given, as above) to transform a prior distribution, p, on (:H|) into an
inferred distribution, P, on (:H|). Suppose we observe ({scalars}: r
|{outcomes}) as the relative frequencies. For any linear domain V and
mapping (V:f:(:H|)), P(f) = p(V: f(h).q(h,r) ←h :(:H|)) times some overall
global scaling; i.e. P is just p scaled, pointwise at h, by q(h,r).
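The transformation from p to P can be sketched concretely for the simplest case, an outcome observed to lie in D; the hypothetical coin, its three candidate biases and their prior weights are again invented for illustration. The posterior weight at h is the prior weight scaled by q(h,({one}:|D)), the probability the model H(h) gives to an outcome in D; normalising then fixes the overall scaling the text leaves free.

```python
# Prior weights on three candidate biases h of a hypothetical coin
# (all numbers invented for illustration):
prior = {0.25: 0.3, 0.5: 0.4, 0.75: 0.3}

# Observe an outcome in D = {'heads'}; the model H(h) gives heads
# probability h, so q(h,({one}:|D)) is simply h here:
likelihood = {h: h for h in prior}

# Scale p pointwise at h by the likelihood, then normalise:
unnormalised = {h: prior[h] * likelihood[h] for h in prior}
total = sum(unnormalised.values())
posterior = {h: w / total for h, w in unnormalised.items()}

# Having seen heads, larger biases gain weight: posterior[0.75] exceeds
# prior[0.75], while posterior[0.25] falls below prior[0.25].
```

Observing a relative frequency distribution r instead of a single outcome would replace `likelihood[h]` with q(h,r), leaving the rest of the computation unchanged.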
This is the process of Bayesian inference: our data serves to define a transformation between distributions on the parameter space associated with our family of models.
Written by Eddy.