The Gaussian Distribution

The so-called normal or bell-shaped distribution has density proportional to the exponential of a negative suitably-scaled square of its variate's difference from a mid-point. It arises naturally as the limiting distribution of the average outcome of a large number of trials of any random process that has finite mean and variance. (So don't try this with the dx/(x.x +1)/π distribution on the reals; its mean is ill-defined and its variance is infinite.) The selected mid-point is mean, mode and median of the distribution; the scaling used controls the variance.
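
A quick illustration (a Python sketch, not part of the original argument; the sample sizes are arbitrary): averages of finite-variance samples concentrate as the trial size grows, while averages of samples from the dx/(x.x +1)/π (Cauchy) distribution refuse to settle down.

import numpy as np

rng = np.random.default_rng(0)
trials, per_trial = 2000, 1000

# Uniform on (-1, 1) has mean 0 and variance 1/3: its averages concentrate.
uniform_means = rng.uniform(-1, 1, (trials, per_trial)).mean(axis=1)
print(uniform_means.std())    # about sqrt(1/3)/sqrt(1000), i.e. roughly 0.018

# An average of standard Cauchy samples is itself standard Cauchy: no concentration.
cauchy_means = rng.standard_cauchy((trials, per_trial)).mean(axis=1)
print(np.median(np.abs(cauchy_means)))    # stays near 1, however many samples are averaged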

Unfortunately, it is all too common for over-simplistic treatments of statistics to naïvely suppose that it is a sensible distribution with which to model pretty much any process. Whenever a random variate is, by specification, necessarily positive (to take a common example), the gaussian distribution may, at best, give a fair idea of the scale of typical values; but it allows, at least in principle, a non-zero (albeit possibly very small) probability of a negative outcome. In such cases, a gamma distribution is generally preferable: it is conceptually more suitable, because its variate is by specification positive, and suitable choice of its two parameters can make it fit the data at least as well as a gaussian model would. In any case, one should actually plot the distribution of a data-set, to check it has a shape reasonably close to that of the model, before modelling it with a gaussian (or any other distribution) with the same mean and variance.
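
By way of illustration (a hedged sketch: the data here are synthetic stand-ins for some necessarily-positive measurement), one can match a gamma distribution's two parameters to a data-set's mean and variance, where a gaussian with the same mean and variance would put some of its probability below zero.

import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(1)
data = rng.lognormal(mean=0.0, sigma=0.5, size=10000)   # necessarily positive

m, v = data.mean(), data.var()
shape, scale = m * m / v, v / m   # a gamma with these parameters has mean m, variance v
print("gamma shape and scale:", shape, scale)

# The gaussian with the same mean and variance leaks probability below zero:
print("P(negative) under the gaussian model:", NormalDist(m, v ** 0.5).cdf(0.0))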

All the same, the gaussian is a particularly interesting distribution, with some neat properties; in particular, because it dies away to zero very strongly once one gets well away from its mean, it ensures well-defined expected values for diverse functions of the random variate. Indeed, it is for this reason that it plays a crucial rôle in the justification of Fourier's transform.

Product of gaussians

One of the interesting properties of gaussian distributions is that, if you multiply two of them together and renormalize (e.g. because the originals were the likelihood distributions of some common datum, given distinct experimental data sets), you get another. (Note that this has nothing to do with the distribution of a variate obtained by multiplying together two variates, each of which has a gaussian distribution.) Leaving out the normalizations, this is because

exp(−square(x−a)/b/2).exp(−square(x−c)/d/2)
= exp(−(square(x−a)/b +square(x−c)/d)/2)
= exp(−((x−a).(x−a)/b +(x−c).(x−c)/d)/2)
= exp(−(x.x/b −2.a.x/b +a.a/b +x.x/d −2.c.x/d +c.c/d)/2)
= exp(−(x.x.(1/b +1/d) −2.x.(a/b +c/d) +a.a/b +c.c/d)/2)
= exp(−((1/b +1/d).square(x −(a/b +c/d)/(1/b +1/d)) −square(a/b +c/d)/(1/b +1/d) +a.a/b +c.c/d)/2)
= exp(−((1/b +1/d).square(x −(a.d +c.b)/(b +d)) +square(a −c)/(b +d))/2)

in which the final square(a −c) term is constant, so swallowed up by normalization. So the product, of one gaussian with mean a and variance b with another having mean c and variance d, has mean (a.d +c.b)/(b +d) and variance 1/(1/b +1/d) = b.d/(b +d). This actually also works in general linear spaces, albeit square(x−a)/b gets replaced by (x−a)/b\(x−a), with x and a now being vectors and b a quadratic form, and likewise for c and d; the product's mean is (a/b +c/d)/(1/b +1/d) and its variance (the quadratic form corresponding to b and d) is 1/(1/b +1/d), with division by a quadratic form being understood as contraction with its inverse.
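
As a numerical check of the one-dimensional result (a sketch; the particular values of a, b, c and d are arbitrary):

import numpy as np

a, b = 1.0, 0.5   # mean and variance of the first gaussian
c, d = 3.0, 2.0   # mean and variance of the second

x = np.linspace(-10.0, 14.0, 200001)
dx = x[1] - x[0]
product = np.exp(-(x - a)**2 / (2 * b)) * np.exp(-(x - c)**2 / (2 * d))
product /= product.sum() * dx                     # renormalize

mean = (x * product).sum() * dx
variance = ((x - mean)**2 * product).sum() * dx
print(mean, (a * d + c * b) / (b + d))            # both about 1.4
print(variance, b * d / (b + d))                  # both about 0.4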

Multi-dimensional gaussians

When we have several data, each of which has a gaussian distribution, we can describe these data collectively as a single vector-valued variate, whose components are the separate data. Absent messy correlations, the distribution of this vector variate is apt to also be gaussian, albeit we now have to interpret the suitably-scaled square of the variate's difference from a mid-point a little more carefully than in the simple one-dimensional case. The mid-point is, easily enough, a vector, like the variate's values, making the differences in question also vectors. Let the vector space in which the variate takes its values be V.

Just as the variance called for a suitably-weighted average, the distribution's density needs a suitably-scaled square of the variate's offset from its mean; this time, however, the result must be a real number, since we'll be negating it and feeding it to the exponential function. The suitable scaling, by analogy with the one-dimensional case, should be division by twice the variance. If, when we diagonalise our variance in V⊗V by suitable choice of basis in V, no diagonal entry is zero, we can indeed construct an inverse to it, in dual(V)⊗dual(V); this will serve as a metric on V and is, clearly, a natural metric for our variate, so we can indeed use it as inner product to obtain a (scalar) squared length of the (vector) difference between our variate's mean and a candidate value for the variate. No problem.
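
A minimal sketch of this, assuming an invented two-dimensional positive-definite variance, with the inverse of the variance used as the metric:

import numpy as np

mu = np.array([1.0, -2.0])                 # the variate's mean
sigma = np.array([[2.0, 0.3],
                  [0.3, 0.5]])             # the variance, as a matrix on V

metric = np.linalg.inv(sigma)              # inverse variance, used as inner product

def density(x):
    v = x - mu
    quad = v @ metric @ v                  # the suitably-scaled squared offset
    return np.exp(-quad / 2) / np.sqrt(np.linalg.det(2 * np.pi * sigma))

print(density(mu))                         # the peak, 1/sqrt(det(2.pi.sigma))
print(density(np.array([2.0, -1.0])))      # smaller, as it should be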

The degenerate case

However, we are not, in general, guaranteed that no diagonal entries are zero. If the variance has some diagonal entries that are zero, it means that all values of our variate differ from the variate's mean by vectors that lie in some (strict) sub-space of V. In such a case, we can infer a new variate, the original minus its mean, that lies in this sub-space and conduct the analysis as above, using this sub-space in place of V. However, we can actually do better.

Consider our variance, σ, and the difference, v, between our variate's mean and some value (with non-zero probability) of our variate. Because v×v times some positive scaling did show up in the sum (or integral) that gave us σ (and all terms in the sum or integral were of this form, so there were no negative values to cancel out the one from v), v is indeed in the sub-space on which σ is positive-definite. In particular, this is the sub-space {σ·w: w in dual(V)}. Although we can't construct an inverse for σ, we can identify which u in dual(V) satisfy v = σ·u. Because we could have done the analysis restricted to a sub-space of V, there is some such u; however, because σ is degenerate, there is more than one such u. Still, the difference between two such values of u is necessarily mapped to zero by σ; in fact, any such difference must map v (or any other difference between our variate's mean and a candidate value of our variate) to zero, so u·v doesn't depend on which such u we chose. Since u is, in a suitable sense, the result of dividing v by σ, I'll write v/σ\v for this scalar result of dividing v by the variance and contracting the result with v. For values of v outside the sub-space on which the distribution has positive probability, there is no suitable u and (as readily becomes evident when working in a basis that diagonalises σ) the appropriate value for v/σ\v is (positive) infinity; when negated and exponentiated, this duly maps to zero, making our gaussian's density zero as required. When σ isn't degenerate, we can apply the same reasoning; we merely happen to have no forbidden values for v, nor any ambiguity in u; and v/σ\v is then simply v contracted with the inverse of σ, contracted in turn with v.
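
A small sketch of this (assuming an invented, deliberately rank-deficient variance), using a pseudo-inverse to pick one of the admissible u; any admissible u gives the same v·u, and for an offset outside the supported sub-space no admissible u exists.

import numpy as np

sigma = np.array([[2.0, 2.0],
                  [2.0, 2.0]])            # rank one: a degenerate variance

v_in = sigma @ np.array([0.3, 0.0])       # an offset that can actually occur
v_out = np.array([1.0, -1.0])             # outside the supported sub-space

pinv = np.linalg.pinv(sigma)              # one way to pick a u with sigma.u = v
u = pinv @ v_in
print(np.allclose(sigma @ u, v_in))       # True: u does satisfy sigma.u = v_in
print(v_in @ u)                           # the scalar v/sigma\v

print(np.allclose(sigma @ (pinv @ v_out), v_out))   # False: no such u exists, so
                                                    # v/sigma\v is infinite there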

One further quirk arises, that complicates the degenerate case. The scale factor needed to normalize our gaussian includes √det(σ) as a factor. This isn't a number: it's a fully anti-symmetrized tensor describing a volume element, which is exactly the thing we need to define integration in our vector space – a scalar function times √det(σ) specifies a distribution. When σ is invertible, this gives a sensible distribution which agrees, with any positive-definite metric, about which volumes are non-zero. However, when σ is degenerate, √det(σ) is also zero (of the relevant tensor kind). Still, using a basis which diagonalises σ, we can take the fully-antisymmetrized product of the basis members with non-zero diagonal entries in σ and scale this by the square root of the product of their diagonal entries. The result is a tensor of the right kind to define an integration on the sub-space of V on which σ isn't degenerate; or, equivalently, on the hyper-plane, parallel to this sub-space, that passes through our variate's mid-point.
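
A sketch of that construction (with an invented rank-two variance): diagonalise, discard the zero eigenvalues, and take the square root of the product of the rest as the scale of the volume element on the supported hyper-plane.

import numpy as np

sigma = np.array([[2.0, 2.0, 0.0],
                  [2.0, 2.0, 0.0],
                  [0.0, 0.0, 1.0]])            # rank two

eigenvalues, basis = np.linalg.eigh(sigma)     # diagonalise the variance
kept = eigenvalues > 1e-12                     # directions in which sigma isn't degenerate
print(eigenvalues)                             # approximately 0, 1 and 4
print(np.sqrt(np.prod(eigenvalues[kept])))     # sqrt(1 * 4) = 2, the volume element's scale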

Thus, although we can use exp(−v/σ\v/2) to characterize the variation in density, we effectively have to restrict to the hyperplane in order to obtain a distribution. So, hereafter, I'll presume we've done that restriction and V is the relevant hyperplane, within the vector space of possible values of the diverse gaussian-distributed co-ordinates we initially used to produce a vector-valued variate. We can thus presume that σ is positive-definite.

Radius distribution

The natural point from which to measure the radius is the centre of the gaussian, so we effectively take this as origin. For the radius defined by any other metric than (the inverse of) the variance, we may expect the distribution to be somewhat convoluted. However, the variance provides a natural sense of scale in terms of which to consider our distribution, so let us instead look at the radius obtained from the metric it implies. As a result, we effectively reduce the variance to the canonical positive-definite quadratic form, by choosing a basis in which its matrix's diagonal entries are 1 and all other entries are 0. If our space (possibly the hyperplane on which the density is non-zero) has dimension n, our basis represents it as simply the set of ({reals}:|n) lists; the density then has the form, at ({reals}:x|n),

exp(−sum(:x(i).x(i) ←i |n)/2).product(: dx(i) ←i |n)
= product(: exp(−x(i).x(i)/2).dx(i) ←i |n)

with a suitable scaling to impose normalization. Since integral(: exp(−y.y/2).dy ←y :{reals}) is √(2.π) and our distribution's co-ordinates are independently distributed (the distribution can be written as a product of terms, each involving only one co-ordinate), the required scaling is simply power(−n/2, 2.π).
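
A quick numerical check of the one-co-ordinate integral behind that normalization (a sketch; the range and step of the quadrature are arbitrary):

import numpy as np

y = np.linspace(-12.0, 12.0, 400001)
dy = y[1] - y[0]
print(np.exp(-y * y / 2).sum() * dy)    # about 2.5066
print(np.sqrt(2 * np.pi))               # 2.5066..., whence the scaling power(-n/2, 2.pi)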

As in the two- and three-dimensional cases, we can switch from cartesian co-ordinates, ({reals}:x|n), defined by our basis to polar co-ordinates: a radial parameter r = √sum(: x(i).x(i) ←i |n) and a family of angular parameters describing the unit (n−1)-sphere. Since the distribution has spherical symmetry, the sole relevance of these angular parameters is that integrating a function that doesn't vary with angle over the unit sphere simply scales the function by the sphere's surface measure, which is 2.power(n/2, π)/Γ(n/2). The volume element product(dx) becomes dr.power(n−1, r) times the surface element whose integral is this surface measure. We thus obtain radius distribution

exp(−r.r/2).power(n−1, r).dr.(2.power(n/2, π)/Γ(n/2)).power(−n/2, 2.π)
= 2.exp(−r.r/2).power(n−1, r).dr/power(n/2, 2)/Γ(n/2)
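
As a sanity check (a Monte Carlo sketch; the choice of n and the sample size are arbitrary), sampled radii of an n-dimensional standard gaussian do follow this density:

import numpy as np
from math import gamma

n = 5
rng = np.random.default_rng(2)
r = np.linalg.norm(rng.standard_normal((200000, n)), axis=1)

edges = np.linspace(0.0, 5.0, 51)
histogram, _ = np.histogram(r, bins=edges, density=True)
mids = (edges[:-1] + edges[1:]) / 2
model = 2 * np.exp(-mids**2 / 2) * mids**(n - 1) / (2**(n / 2) * gamma(n / 2))
print(np.max(np.abs(histogram - model)))   # small: the histogram tracks the model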

For the half-squared-radius parameter u = r.r/2, with du = r.dr, we can thus infer distribution

exp(−u).power(n/2 −1, u).du/Γ(n/2)

i.e. this half-squared-radius parameter, v/σ\v/2, is gamma-distributed, with shape parameter α = n/2, half the dimension of our vector space, and scale parameter β = 1 (because we chose the natural co-ordinates for our variance).
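
A quick check of that claim (a sketch with an arbitrary n): the gamma distribution with shape n/2 and scale 1 has mean and variance both equal to n/2 and, up to sampling noise, so does the half-squared radius of an n-dimensional standard gaussian sample.

import numpy as np

n = 7
rng = np.random.default_rng(3)
x = rng.standard_normal((200000, n))
u = (x * x).sum(axis=1) / 2    # the half-squared radius of each sample

print(u.mean(), u.var())       # both about n/2 = 3.5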

Moments

We can now work out the moments (expected values of powers) of the radius:

E(power(k, r))
= integral(: 2.exp(−r.r/2).power(k+n−1, r).dr/power(n/2, 2)/Γ(n/2) ←r :{positives})

substitute u = r.r/2, r.dr = du:

= integral(: exp(−u).power((k+n)/2−1, u).du ←u :{positives}).power(k/2, 2)/Γ(n/2)
= power(k/2, 2).Γ((k+n)/2)/Γ(n/2)

For even k = 2.h, this is product(: (2.i+n) ←i :h); thus, in particular, E(r.r) = n. For odd k = 2.h+1 we need formulae for the Γ function at half-integer inputs, since exactly one of n/2 and (k+n)/2 is then half-way between whole numbers. It suffices to know that Γ(1/2) = √π and, for natural m, Γ(m +1/2) = (2.m)!.(√π)/m!/power(m, 4).

First, let's substitute k=2.h+1:

E(power(2.h+1, r))
= power(h +1/2, 2).Γ(h +(1+n)/2)/Γ(n/2)
= product(: 2.i+1+n ←i |h).(√2).Γ((1+n)/2)/Γ(n/2)

whence, for n = 2.m, we get

E(power(2.h+1, r))
= product(: 2.i +2.m +1 ←i |h).(√2).Γ(m +1/2)/Γ(m)
= √(2.π).product(: 2.i +2.m +1 ←i :h).m.chose(2.m, m)/power(m, 4)

while, for n = 2.m +1, we get

E(power(2.h+1, r))
= product(: 2.i +2.m +2 ←i |h).(√2).Γ(m +1)/Γ(m +1/2)
= power(h +1/2, 2).product(: i +m +1 ←i |h).power(m, 4).m!.m!/(√π)/(2.m)!
= power(2.m +h +1, 2).(m+h)!.m!/(2.m)!/√(2.π)
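
A small sketch cross-checking these closed forms against the general moment formula power(k/2, 2).Γ((k+n)/2)/Γ(n/2), for a few arbitrary choices of k and n:

from math import comb, factorial, gamma, pi, sqrt

def moment(k, n):                        # E(power(k, r)) from the Gamma formula
    return 2 ** (k / 2) * gamma((k + n) / 2) / gamma(n / 2)

def even_moment(h, n):                   # k = 2.h: product of (n + 2.i) for i < h
    out = 1
    for i in range(h):
        out *= n + 2 * i
    return out

def odd_moment_even_dim(h, m):           # k = 2.h+1, n = 2.m
    out = sqrt(2 * pi) * m * comb(2 * m, m) / 4 ** m
    for i in range(h):
        out *= 2 * i + 2 * m + 1
    return out

def odd_moment_odd_dim(h, m):            # k = 2.h+1, n = 2.m+1
    return 2 ** (2 * m + h + 1) * factorial(m + h) * factorial(m) \
        / factorial(2 * m) / sqrt(2 * pi)

print(moment(4, 5), even_moment(2, 5))             # both 35
print(moment(3, 4), odd_moment_even_dim(1, 2))     # both about 9.40
print(moment(3, 5), odd_moment_odd_dim(1, 2))      # both about 12.77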

Mean and variance

For h = 0 we get E(r) = √(2.π).m.chose(2.m, m)/power(m, 4) when even n = 2.m; or E(r) = 2.power(m, 4)/chose(2.m, m)/√(2.π) when odd n = 2.m +1. For small values of n we thus obtain

n = 1: E(r) = √(2/π) = 0.798; variance = 1 −2/π = 0.363
n = 2: E(r) = √(π/2) = 1.25; variance = 2 −π/2 = 0.429
n = 3: E(r) = 2.√(2/π) = 1.60; variance = 3 −8/π = 0.454
n = 4: E(r) = 3.√(π/2)/2 = 1.88; variance = 4 −9.π/8 = 0.466
n = 5: E(r) = 8.√(2/π)/3 = 2.13; variance = 5 −128/9/π = 0.473
n = 6: E(r) = 15.√(π/2)/8 = 2.35; variance = 6 −225.π/128 = 0.478
n = 7: E(r) = 16.√(2/π)/5 = 2.55; variance = 7 −512/25/π = 0.481

Our radius, r, is the square root of the sum, over co-ordinates, of the square of each co-ordinate's difference from its mean. Another common measure of deviation is the root mean square, RMS, where r is the root sum square; so RMS is simply r/√n; its mean is thus 1/√n times that tabulated above (which asymptotically approaches 1 from below); and its variance is just 1/n times that tabulated above, or one minus the square of the mean (so it asymptotically tends to 0 from above).

n = 1: E(RMS) = √(2/π) = 0.798; variance = 1 −2/π = 0.363; s = 0.603; E/s = 1.324
n = 2: E(RMS) = (√π)/2 = 0.886; variance = 1 −π/4 = 0.215; s = 0.463; E/s = 1.913
n = 3: E(RMS) = 2.√(2/π/3) = 0.921; variance = 1 −8/π/3 = 0.151; s = 0.389; E/s = 2.370
n = 4: E(RMS) = 3.√(π/2)/4 = 0.940; variance = 1 −9.π/32 = 0.116; s = 0.341; E/s = 2.755
n = 5: E(RMS) = 8.√(2/π/5)/3 = 0.952; variance = 1 −128/45/π = 0.095; s = 0.308; E/s = 3.094
n = 6: E(RMS) = 15.√(π/3)/16 = 0.959; variance = 1 −75.π/256 = 0.080; s = 0.282; E/s = 3.400
n = 7: E(RMS) = 16.√(2/π/7)/5 = 0.965; variance = 1 −512/175/π = 0.069; s = 0.262; E/s = 3.681

(Here s = √variance is the standard deviation of RMS and E/s is the ratio E(RMS)/s.)
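
The tabulated values can be regenerated from the moment formula (a sketch; the rounding is mine): E(r) = (√2).Γ((n+1)/2)/Γ(n/2), the variance of r is n − E(r)², and the RMS figures are the same scaled down by √n.

from math import gamma, sqrt

for n in range(1, 8):
    e_r = sqrt(2) * gamma((n + 1) / 2) / gamma(n / 2)
    var_r = n - e_r ** 2
    e_rms, var_rms = e_r / sqrt(n), var_r / n
    s = sqrt(var_rms)
    print(n, round(e_rms, 3), round(var_rms, 3), round(s, 3), round(e_rms / s, 3))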

If all co-ordinates of a multi-dimensional gaussian-distributed variate have standard deviation s, equal across co-ordinates, we can scale the variate down by s, do the above analysis and scale back up again. The values of E(RMS) and the standard deviation are both then scaled by s, so their ratios remain the same; also, the E/s ratio for the radius is the same as that for the RMS. Notice that vectors near the variate's mean (with small RMS or r) become highly improbable as E/s grows; while the n=1 distribution is highly skew (so RMS < 0.195 = E(RMS)−s has higher probability than the roughly 16% one might expect for values more than one standard deviation below the mean), the distributions for RMS (and r) get progressively less skew as n increases; so that, at n=3, RMS < 0.424 = E(RMS)−1.28×s has probability well estimated by the 10% chance of a normal random variate being 1.28 standard deviations below its mean. At n=6, the probability of RMS < 0.381 = E(RMS)−2.05×s is likewise well estimated by the 2% chance of a normal random variate being 2.05 standard deviations below its mean.

Probabilities of being near mean

For a discrete distribution, such as a sum of die-rolls, modelled by a gaussian, one commonly-used notion of near mean is that all co-ordinates are at most some specified number of steps away from the mean. When the mean is an achievable outcome, the number of steps is whole; otherwise, the mean typically sits at the centre of a hyper-cube of achievable outcomes, so the number of steps is half plus a whole number. In either case, doubling the number of steps away from the middle yields a whole number of steps as the length of each side of the hyper-cube.

The constraint asks for the vector variate to lie within a hyper-cube; to estimate the probability of the discrete variate falling in the requested range, we need to add a half to the number of steps (so as to split the gaussian-approximated density between the last near mean value and the first non-near value). Provided the co-ordinates are independent, the probabilities can then be computed simply by working out the probability for each co-ordinate to lie within the chosen number of steps of the mean, then multiplying the probabilities for the various components to get the probability of lying inside the selected hyper-cube. Of course, since each co-ordinate's probability is less than one, the more dimensions we have, the smaller the product will be for any given number of steps.
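
A sketch of that computation, assuming independent co-ordinates, each approximated as gaussian with the same standard deviation s measured in steps (the example numbers are invented):

from statistics import NormalDist

def near_mean_probability(k, s, n):
    """P(all n co-ordinates lie within k steps of the mean), steps of size one."""
    half_width = (k + 0.5) / s                       # add the half step; express in standard deviations
    per_coordinate = 2 * NormalDist().cdf(half_width) - 1
    return per_coordinate ** n

# e.g. three co-ordinates, each with standard deviation 1.7 steps, all within 2 steps:
print(near_mean_probability(k=2, s=1.7, n=3))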

To select a hyper-cube with half the distribution inside it, when all co-ordinates are identically distributed, we need power(1/n, 1/2), the n-th root of half, as the probability for each co-ordinate to lie within its range; this increases with n, asymptotically approaching 1. The number of standard deviations each side of the mean required for each co-ordinate's range to achieve this thus grows with dimension as (starting at dimension 1): 0.67, 1.05, 1.26, 1.41, 1.52, 1.60, 1.67, 1.73, 1.79, 1.83, 1.87, 1.91, 1.94, 1.97, 2. In fifteen dimensions, you need an interval four standard deviations wide, centred on the mean, to capture half the probability in the near mean hyper-cube.
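
A sketch reproducing that list (the n-th root of a half as per-co-ordinate probability, converted to a half-width in standard deviations via the normal quantile function):

from statistics import NormalDist

for n in range(1, 16):
    per_coordinate = 0.5 ** (1 / n)                    # n-th root of half
    half_width = NormalDist().inv_cdf((1 + per_coordinate) / 2)
    print(n, round(half_width, 2))                     # 0.67, 1.05, 1.26, ..., 2.0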

Of course, a hyper-cube is a poor test of near, given that its corners are √n times as far from the centre as the face-centres are; it would be better to use a sphere, hence to choose a radius. To capture half the probability, we just need to look at the median radius. To find an answer comparable to within w/2 steps of the mean for some whole w, we can use r < w/2 or we can bound at the radius that gives a sphere whose volume is equal to that of the hyper-cube of side w. This last is simply w times power(1/n, Γ(1+n/2))/√π, the radius of the n-sphere of hyper-volume 1, which grows with dimension n (starting at 1) as: 0.5, 0.564, 0.620, 0.671, 0.717, 0.761, 0.801, 0.839, … increasing steadily (and unboundedly, roughly as √(n/2/e/π) for large n) thereafter.
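
A sketch reproducing those equal-volume radii (the n-th root of Γ(1+n/2), divided by √π):

from math import gamma, pi, sqrt

for n in range(1, 9):
    radius = gamma(1 + n / 2) ** (1 / n) / sqrt(pi)    # radius of the n-ball of hyper-volume 1
    print(n, round(radius, 3))                         # 0.5, 0.564, 0.620, ...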


Written by Eddy.