The so-called "normal" or "bell-shaped" distribution has density
proportional to the exponential of a negative suitably-scaled square of its
variate's difference from a mid-point. It arises naturally as the limiting
distribution of the average outcome of a large number of trials of any random
process that has finite mean and variance. (So don't try this with the
dx/(x.x +1)/π distribution on the reals; its mean is ill-defined and its
variance is infinite.) The selected mid-point is mean, mode and median of the
distribution; the scaling used controls the variance.
Unfortunately, it is all too common for over-simplistic treatments of statistics to naïvely suppose that it's a sensible distribution to use to model pretty much any process. Whenever a random variate is, by specification, necessarily positive (to take a common example), the gaussian distribution, at best, makes it easy to estimate the scales of values; but it (at least in principle) allows a non-zero (albeit possibly very small) probability of a negative outcome. In such cases, a gamma distribution is generally preferable; it is conceptually more suitable, because its variate is by specification positive – and suitable choice of its two parameters can make it fit as nicely as the gaussian model will. In any case, one should actually plot the distribution of a data-set, to check it has a shape reasonably close to that of the model, before modelling it with a gaussian (or any other distribution) with the same mean and variance.
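By way of a sketch of that moment-matching (the helper name, the synthetic data and the use of numpy are my own illustrative choices, not part of the argument above), here is how one might fit both models to a positive-valued data-set, using the fact that a gamma with shape k and scale θ has mean k.θ and variance k.θ.θ:

```python
import numpy as np

def moment_matched_models(data):
    """Moment-match a gaussian and a gamma model to a positive-valued data set.

    A minimal sketch: both models share the data's mean and variance, as the
    text suggests; only the gamma respects the variate's positivity.
    """
    data = np.asarray(data, dtype=float)
    mean, var = data.mean(), data.var()
    # Gaussian: parametrised directly by mean and variance.
    gauss = {"mean": mean, "variance": var}
    # Gamma with shape k and scale theta has mean k.theta and variance k.theta^2,
    # so matching moments gives:
    theta = var / mean
    k = mean / theta          # equivalently mean*mean/var
    gamma = {"shape": k, "scale": theta}
    return gauss, gamma

# Example: skewed, strictly positive data; the gamma fit assigns no weight below zero.
rng = np.random.default_rng(0)
sample = rng.gamma(shape=2.5, scale=1.3, size=10_000)
print(moment_matched_models(sample))
```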
All the same, the gaussian is a particularly interesting distribution, with some neat properties; in particular, because it dies away to zero very strongly once one gets well away from its mean, it ensures well-defined expected values for diverse functions of the random variate. Indeed, it is for this reason that it plays a crucial rôle in the justification of Fourier's transform.
One of the interesting properties of gaussian distributions is that, if you multiply two of them together and renormalize (e.g. because the originals were the likelihood distributions of some common datum, given distinct experimental data sets), you get another. (Note that this has nothing to do with the distribution of a variate obtained by multiplying together two variates, each of which has a gaussian distribution.) Leaving out the normalizations, this is because

square(x −a)/b +square(x −c)/d = square(x −(a.d +c.b)/(b +d)).(1/b +1/d) +square(a −c)/(b +d)

in which the final square(a −c) term is constant, so swallowed up by normalization. So the product, of one gaussian with mean a and variance b with another having mean c and variance d, has mean (a.d +c.b)/(b +d) and variance 1/(1/b +1/d) = b.d/(b +d). This actually also works in general linear spaces, albeit square(x−a)/b gets replaced by (x−a)/b\(x−a), with x and a now being vectors and b a quadratic form, and likewise for c and d; the product's mean is (a/b +c/d)/(1/b +1/d) and its variance (the quadratic form corresponding to b and d) is 1/(1/b +1/d), with division by a quadratic form being understood as contraction with its inverse.
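As a quick numerical sanity-check of that product rule in the scalar case (a sketch; the grid, the function name and the use of numpy are my own choices), one can multiply two gaussian densities point-wise, renormalize, and compare the resulting moments with (a.d +c.b)/(b +d) and b.d/(b +d):

```python
import numpy as np

def product_of_gaussians(a, b, c, d):
    """Mean and variance of the (renormalized) product of two gaussian densities:
    a, b are the first gaussian's mean and variance; c, d the second's."""
    return (a * d + c * b) / (b + d), b * d / (b + d)   # variance = 1/(1/b + 1/d)

# Numerical check: multiply the two densities on a grid, renormalize, take moments.
a, b, c, d = 1.0, 4.0, 3.0, 1.0
x, dx = np.linspace(-20.0, 20.0, 200001, retstep=True)
density = np.exp(-(x - a) ** 2 / (2 * b)) * np.exp(-(x - c) ** 2 / (2 * d))
density /= density.sum() * dx                           # renormalize to unit total
mean = (x * density).sum() * dx
variance = ((x - mean) ** 2 * density).sum() * dx
print(product_of_gaussians(a, b, c, d))                 # (2.6, 0.8)
print(mean, variance)                                   # should agree closely
```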
When we have several data, each of which has a gaussian distribution,
we can describe these data collectively
as a single vector-valued variate, whose components are the separate
data. Absent messy correlations, the distribution of this vector variate is
apt to also be gaussian, albeit we now have to interpret "suitably-scaled square of its variate's difference from a mid-point" a little more
carefully than in the simple one-dimensional case. The mid-point is, easily
enough, a vector, like the variate's values, making the differences in
question also vectors. Let the vector space in which the variate takes its
values be V.
Just as the variance called for a suitably-weighted average, the
distribution's density needs a suitably-scaled square
of the variate's
offset from its mean; however, this time, we must have a real number, since
we'll be negating this value and using it as input to the exponential
function. The suitable scaling
to be used should, furthermore,
be division
by twice the variance, by analogy with the one-dimensional
case. If, when we diagonalised our variance in V⊗V by suitable choice
of basis in V, no diagonal entries are zero, we can indeed construct an
inverse to it, in dual(V)⊗dual(V); this will serve as a metric on V and
is, clearly, a natural
metric for our variate, so we can indeed use it
as inner product to obtain a (scalar) squared length
of the (vector)
difference between our variate's mean and a candidate value for the
variate. No problem.
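In terms of matrices (a sketch assuming σ is represented as an ordinary positive-definite matrix in some chosen basis; the function name and example values are mine), that density computation looks like:

```python
import numpy as np

def gaussian_density(x, mean, sigma):
    """Density of a gaussian with positive-definite variance (covariance) sigma.

    The exponent is the suitably-scaled squared offset, (x-mean)/sigma\(x-mean):
    the offset contracted with the inverse of sigma, halved and negated.
    """
    v = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    quad = v @ np.linalg.solve(sigma, v)       # v . inverse(sigma) . v
    n = v.size
    norm = np.sqrt((2 * np.pi) ** n * np.linalg.det(sigma))
    return np.exp(-0.5 * quad) / norm

sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_density([1.0, -0.5], [0.0, 0.0], sigma))
```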
However, we are not, in general, guaranteed that no diagonal entries are zero. If the variance has some diagonal entries that are zero, it means that all values of our variate differ from the variate's mean by vectors that lie in some (strict) sub-space of V. In such a case, we can infer a new variate, the original minus its mean, that lies in this sub-space and conduct the analysis as above, using this sub-space in place of V. However, we can actually do better.
Consider our variance, σ, and the difference, v, between our
variate's mean and some value (with non-zero probability) of our
variate. Because v×v times some positive scaling did show up in the sum
(or integral) that gave us σ (and all terms in the sum or integral were
of this form, so there were no negative values to cancel out
the one
from v), v is indeed in the subspace on which σ is
positive-definite. In particular, this is the sub-space {σ·w: w
in dual(V)}. Although we can't construct an inverse for σ,
we can identify which u in dual(V) satisfy v =
σ·u. Because we could have done the analysis restricted to a
sub-space of V, there is some such u; however, because σ is degenerate,
there is more than one such u. Still, the difference between two
such values of u is necessarily mapped to zero by σ; in fact, any such
difference must map v (or any other difference between our variate's mean and
a candidate value of our variate) to zero, so u·v doesn't depend
on which such u we chose. Since u is, in a suitable sense, the
result of dividing v by σ, I'll write v/σ\v for this scalar result
of dividing v by the variance and contracting the result with v. For values
of v outside the sub-space with positive probability, there is no suitable u
and (as readily becomes evident when working in a basis that diagonalises
σ) the appropriate value for v/σ\v is (positive) infinity; when
negated and exponentiated, this duly maps to zero to make our gaussian's
density zero as required. When σ isn't degenerate, we can apply the
same reasoning; we merely happen to have no forbidden values for v or
ambiguity in u; and v/σ\v is simply
v·(σ^−1)·v, the contraction of v with the inverse of σ applied to v.
One further quirk arises, that complicates the degenerate case. The scale factor needed to normalize our gaussian includes √det(σ) as a factor. This isn't a number: it's a fully anti-symmetrized tensor describing a volume element, which is exactly the thing we need to define integration in our vector space – a scalar function times √det(σ) specifies a distribution. When σ is invertible, this gives a sensible distribution which agrees, with any positive-definite metric, about which volumes are non-zero. However, when σ is degenerate, √det(σ) is also zero (of the relevant tensor kind). Still, using a basis which diagonalises σ, we can take the fully-antisymmetrized product of the basis members with non-zero diagonal entries in σ and scale this by the square root of the product of their diagonal entries. The result is a tensor of the right kind to define an integration on the sub-space of V on which σ isn't degenerate; or, equivalently, on the hyper-plane, parallel to this sub-space, that passes through our variate's mid-point.
Thus, although we can use exp(−v/σ\v/2) to characterize the variation in density, we effectively have to restrict to the hyperplane in order to obtain a distribution. So, hereafter, I'll presume we've done that restriction and V is the relevant hyperplane, within the vector space of possible values of the diverse gaussian-distributed co-ordinates we initially used to produce a vector-valued variate. We can thus presume that σ is positive-definite.
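Here is a sketch of how one might compute these quantities numerically in the possibly-degenerate case (the eigen-decomposition approach, the tolerance and the names are my own choices): v/σ\v is computed only along directions in which σ is positive, taken as infinite off the support hyperplane, and the volume factor uses only the non-zero diagonal entries of the diagonalised σ:

```python
import numpy as np

def quad_and_log_volume(v, sigma, tol=1e-12):
    """Return v/sigma\\v and the log of the restricted volume factor.

    Works even when sigma is degenerate: eigenvectors with (near-)zero eigenvalue
    span directions of zero probability; if v has a component along any of them,
    v/sigma\\v is taken to be infinite, so exp(-quad/2) is zero there. The volume
    factor is the square root of the product of the non-zero eigenvalues, i.e. it
    lives on the support hyperplane through the mean.
    """
    w, q = np.linalg.eigh(sigma)            # sigma = q . diag(w) . transpose(q)
    keep = w > tol * w.max()
    coords = q.T @ v                        # components of v in the eigenbasis
    if np.any(np.abs(coords[~keep]) > tol * np.linalg.norm(v)):
        quad = np.inf                       # v leaves the support of the distribution
    else:
        quad = np.sum(coords[keep] ** 2 / w[keep])
    log_volume = 0.5 * np.sum(np.log(w[keep]))
    return quad, log_volume

# A rank-2 variance on a 3-dimensional space:
sigma = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.0]])
print(quad_and_log_volume(np.array([1.0, 0.5, 0.0]), sigma))
print(quad_and_log_volume(np.array([0.0, 0.0, 1.0]), sigma))  # off the hyperplane
```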
The natural point from which to measure the radius is the centre of the gaussian, so we effectively take this as origin. For the radius defined by any other metric than (the inverse of) the variance, we may expect the distribution to be somewhat convoluted. However, the variance provides a natural sense of scale in terms of which to consider our distribution, so let us instead look at the radius obtained from the metric it implies. As a result, we effectively reduce the variance to the canonical positive-definite quadratic form, by choosing a basis in which its matrix's diagonal entries are 1 and all other entries are 0. If our space (possibly the hyperplane on which the density is non-zero) has dimension n, our basis represents it as simply the set of ({reals}:|n) lists; the density then has the form, at ({reals}:x|n),

exp(−sum(: x(i).x(i) ←i |n)/2) = product(: exp(−x(i).x(i)/2) ←i |n)

with a suitable scaling to impose normalization. Since integral(: exp(−y.y/2).dy ←y :{reals}) is √(2.π) and our distribution's co-ordinates are independently distributed (the distribution can be written as a product of terms, each involving only one co-ordinate), the required scaling is simply power(−n/2, 2.π).
As in the two- and three-dimensional cases, we can switch from cartesian co-ordinates, ({reals}:x|n), defined by our basis to polar co-ordinates: a radial parameter r = √sum(: x(i).x(i) ←i |n) and a family of angular parameters describing the unit (n−1)-sphere. Since the distribution has spherical symmetry, the sole relevance of these angular parameters is that integrating a function that doesn't vary with angle over the unit sphere simply scales the function by the sphere's surface measure, which is 2.power(n/2, π)/Γ(n/2). The volume element product(dx) becomes dr.power(n−1, r) times the surface element whose integral is this surface measure. We thus obtain radius distribution

exp(−r.r/2).power(n −1, r).dr/(power(n/2 −1, 2).Γ(n/2))

For the half-squared-radius parameter u = r.r/2, with r.dr = du and power(n −2, r) = power(n/2 −1, 2.u), we can thus infer distribution

exp(−u).power(n/2 −1, u).du/Γ(n/2)
i.e. this half-squared-radius parameter, v/σ\v/2, is gamma-distributed, with shape parameter α = n/2, half the dimension of our vector space, and scale parameter β = 1 (because we chose the natural co-ordinates for our variance).
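One can check this empirically (a sketch; the sample size, seed and use of scipy are my own choices) by comparing the half-squared radius of standard-normal samples against the Gamma(n/2, 1) distribution:

```python
import numpy as np
from scipy import stats

n = 5                                           # dimension of the variate
rng = np.random.default_rng(1)
x = rng.standard_normal((100_000, n))           # unit-variance, independent co-ordinates
u = 0.5 * np.sum(x * x, axis=1)                 # half-squared radius, v/sigma\v/2

# Kolmogorov-Smirnov comparison against Gamma with shape n/2 and scale 1.
print(stats.kstest(u, stats.gamma(a=n / 2, scale=1).cdf))
print(u.mean(), n / 2)                          # the gamma's mean is its shape, n/2
```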
We can now work out the moments (expected values of powers) of the radius:

E(power(k, r)) = integral(: power(k +n −1, r).exp(−r.r/2).dr ←r :{positives})/(power(n/2 −1, 2).Γ(n/2))

substitute u = r.r/2, r.dr = du, so that power(k +n −2, r).r.dr = power((k +n)/2 −1, 2.u).du:

E(power(k, r)) = power((k +n)/2 −1, 2).Γ((k +n)/2)/(power(n/2 −1, 2).Γ(n/2)) = power(k/2, 2).Γ((k +n)/2)/Γ(n/2)

For even k = 2.h, this is product(: (2.i +n) ←i :h); thus, in particular, E(r.r) = n. For odd k = 2.h+1 we need formulae for the Γ function at half-integer inputs, since one of n/2 and (k +n)/2 is then half an odd number. It suffices to know:

Γ(m +1/2) = √π.(2.m)!/(m!.power(m, 4)) = √π.m!.chose(2.m, m)/power(m, 4), for natural m, along with Γ(m +1) = m!
First, let's substitute k = 2.h+1:

E(power(2.h+1, r)) = power(h +1/2, 2).Γ(h +(n +1)/2)/Γ(n/2)

whence, for n = 2.m, we get

E(power(2.h+1, r)) = √(2.π).power(h, 2).(m +h)!.chose(2.(m +h), m +h)/(power(m +h, 4).(m −1)!)

while, for n = 2.m +1, we get

E(power(2.h+1, r)) = √(2/π).power(h, 2).(m +h)!.power(m, 4)/(m!.chose(2.m, m))
For h = 0 we get E(r) = √(2.π).m.chose(2.m, m)/power(m, 4) when even n = 2.m; or E(r) = 2.power(m, 4)/chose(2.m, m)/√(2.π) when odd n = 2.m +1. For small values of n we thus obtain
n | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
E(r) | √(2/π) = 0.798 | √(π/2) = 1.25 | 2.√(2/π) = 1.60 | 3.√(π/2)/2 = 1.88 | 8.√(2/π)/3 = 2.13 | 15.√(π/2)/8 = 2.35 | 16.√(2/π)/5 = 2.55 |
variance | 1−2/π = 0.363 | 2−π/2 = 0.429 | 3−8/π = 0.454 | 4−9.π/8 = 0.466 | 5−128/9/π = 0.473 | 6−225.π/128 = 0.478 | 7−512/25/π = 0.481 |
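Those table entries can be recomputed from the moment formula E(power(k, r)) = power(k/2, 2).Γ((k +n)/2)/Γ(n/2) (a sketch; math.gamma does the work):

```python
import math

def radius_moment(k, n):
    """E(r^k) for the radius of an n-dimensional standard gaussian."""
    return 2 ** (k / 2) * math.gamma((k + n) / 2) / math.gamma(n / 2)

for n in range(1, 8):
    mean = radius_moment(1, n)
    variance = radius_moment(2, n) - mean ** 2    # E(r.r) = n, so this is n - E(r)^2
    print(n, round(mean, 3), round(variance, 3))
```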
Our radius, r, is the square root of the sum, over co-ordinates, of the square of each co-ordinate's difference from its mean. Another common measure of deviation is the "root mean square", RMS, where r is the "root sum square"; so RMS is simply r/√n; its mean is thus 1/√n times that tabulated above (which asymptotically approaches 1 from below); and its variance is just 1/n times that tabulated above, or one minus the square of the mean (so it asymptotically tends to 0 from above).
n | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|
E(RMS) | √(2/π) = 0.798 | (√π)/2 = 0.886 | 2.√(2/π/3) = 0.921 | 3.√(π/2)/4 = 0.940 | 8.√(2/π/5)/3 = 0.952 | 15.√(π/3)/16 = 0.959 | 16.√(2/π/7)/5 = 0.965 |
variance | 1−2/π = 0.363 | 1 −π/4 = 0.215 | 1−8/π/3 = 0.151 | 1−9.π/32 = 0.116 | 1−128/45/π = 0.095 | 1−75.π/256 = 0.080 | 1−512/175/π = 0.069 |
s = standard deviation | 0.603 | 0.463 | 0.389 | 0.341 | 0.308 | 0.282 | 0.262 |
E/s | 1.324 | 1.913 | 2.370 | 2.755 | 3.094 | 3.400 | 3.681 |
If all co-ordinates of a multi-dimensional gaussian-distributed variate have standard deviation s, equal across co-ordinates, we can scale the variate down by s, do the above analysis and scale back up again. The values of E(RMS) and the standard deviation are both then scaled by s, so their ratios remain the same; also, the E/s ratio for the radius is the same as that for the RMS. Notice that vectors near the variate's mean (with small RMS or r) become highly improbable as E/s grows; while the n=1 distribution is highly skew (so RMS < 0.195 = E(RMS)−s has probability about 15.5%, a little below the roughly 16% one might expect for values more than one standard deviation below the mean), the distributions for RMS (and r) get progressively less skew as n increases; so that, at n=3, RMS < 0.424 = E(RMS)−1.28×s has probability well estimated by the 10% chance of a normal random variate being 1.28 standard deviations below its mean. At n=6, the probability of RMS < 0.381 = E(RMS)−2.05×s is likewise well estimated by the 2% chance of a normal random variate being 2.05 standard deviations below its mean.
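For anyone wanting to check those lower-tail estimates (a Monte Carlo sketch; the sample size and seed are arbitrary choices of mine), one can estimate P(RMS < E(RMS) −k.s) directly and compare it with the corresponding normal-tail probability:

```python
import numpy as np

def rms_lower_tail(n, k, samples=200_000, seed=2):
    """Estimate P(RMS < E(RMS) - k*s) for an n-dimensional standard gaussian.

    A Monte Carlo sketch: compare the result with the corresponding normal-tail
    probability to see how the skewness of the RMS distribution fades as n grows.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((samples, n))
    rms = np.sqrt(np.mean(x * x, axis=1))
    mean, s = rms.mean(), rms.std()
    return np.mean(rms < mean - k * s)

print(rms_lower_tail(3, 1.28))   # compare with ~10% for a normal variate
print(rms_lower_tail(6, 2.05))   # compare with ~2% for a normal variate
```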
For a discrete distribution, such as a sum of die-rolls, modelled by a gaussian, one commonly-used notion of "near mean" is that all co-ordinates are at most some specified number of steps away from the mean. When the mean is an achievable outcome, the number of steps is whole; otherwise, the mean is typically at the centre of a hyper-cube of achievable outcomes, so the number of steps is a half plus a whole number. In either case, doubling the number of steps away from the middle yields a whole number of steps as the length of each side of the hyper-cube.
The constraint asks for the vector variate to lie within a hyper-cube; to estimate the probability of the discrete variate falling in the requested range, we need to add a half to the number of steps (so as to split the gaussian-approximated density between the last "near mean" value and the first non-near value). Provided the co-ordinates are independent, however, the probabilities can then be computed simply by working out the probability for each co-ordinate to lie within the chosen number of steps of the mean, then multiplying the probabilities for the various components to get the probability for lying inside the selected hyper-cube. Of course, since each co-ordinate's probability is less than one, the more dimensions we have, the smaller the product will be for any given number of steps.
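As a sketch of that product computation (the die-roll standard deviation used below is just an illustrative assumption of mine), each co-ordinate's probability comes from the error function and the per-co-ordinate results simply multiply:

```python
import math

def cube_probability(z, n):
    """Probability that each of n independent standard-gaussian co-ordinates
    lies within z standard deviations of its mean: (P(|X| < z))^n."""
    per_coordinate = math.erf(z / math.sqrt(2))
    return per_coordinate ** n

# For a discrete variate, convert "within so many steps of the mean" to z by adding
# the half-step correction and dividing by the step's standard deviation, e.g. a
# single fair die-roll has standard deviation sqrt(35/12) ~ 1.708 steps:
steps, step_sd = 2, math.sqrt(35 / 12)
z = (steps + 0.5) / step_sd
print(cube_probability(z, 1), cube_probability(z, 6))   # the product shrinks with n
```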
To select a hyper-cube with half the distribution inside it, when all
co-ordinates are identically distributed, we need power(1/n, 1/2), the n-th
root of half, as the probability for each co-ordinate to lie within its range;
this increases with n, asymptotically approaching 1. The number of standard
deviations each side of the mean required for each co-ordinate's range to
achieve this thus grows with dimension as (starting at dimension 1): 0.67,
1.05, 1.26, 1.41, 1.52, 1.60, 1.67, 1.73, 1.79, 1.83, 1.87, 1.91, 1.94, 1.97,
2. In fifteen dimensions, you need an interval four standard deviations wide,
centred on the mean, to capture half the probability in the "near mean" hyper-cube.
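The sequence just quoted can be recomputed by inverting the condition that each co-ordinate's probability be the n-th root of a half (a sketch; scipy's inverse error function is assumed to be available):

```python
import math
from scipy.special import erfinv

def half_probability_width(n):
    """Standard deviations each side of the mean needed for the central hyper-cube
    to capture half the probability: solve (erf(z/sqrt(2)))^n = 1/2 for z."""
    per_coordinate = 0.5 ** (1 / n)          # n-th root of a half
    return math.sqrt(2) * erfinv(per_coordinate)

print([round(half_probability_width(n), 2) for n in (1, 2, 3, 4, 5, 10, 15)])
```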
Of course, a hyper-cube is a poor test of "near", given that its corners are √n times as far from the centre as the face-centres are; it would be better to use a sphere, hence to choose a radius. To capture half the probability, we just need to look at the median radius. To find an answer comparable to "within w/2 steps of the mean" for some whole w, we can use r < w/2 or we can bound at the radius that gives a sphere whose volume is equal to that of the hyper-cube of side w. This last is simply w times power(1/n, Γ(1+n/2))/√π, the radius of the n-sphere of hyper-volume 1, which grows with dimension n (starting at 1) as: 0.5, 0.564, 0.620, 0.671, 0.717, 0.761, 0.801, 0.839, … increasing steadily (and unboundedly, roughly as √(n/2/e/π) for large n) thereafter.
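That sequence of radii follows directly from the n-ball volume formula (a sketch using math.gamma):

```python
import math

def unit_volume_radius(n):
    """Radius of the n-dimensional ball whose hyper-volume is 1:
    volume = pi^(n/2).r^n/Gamma(1 + n/2), so r = Gamma(1 + n/2)^(1/n)/sqrt(pi)."""
    return math.gamma(1 + n / 2) ** (1 / n) / math.sqrt(math.pi)

print([round(unit_volume_radius(n), 3) for n in range(1, 9)])
# For large n this grows roughly as sqrt(n/(2.pi.e)):
print(unit_volume_radius(100), math.sqrt(100 / (2 * math.pi * math.e)))
```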