Usual introductions to statistics discuss what happens when you have some
numeric quantity you can measure for each member of some population; you can
add the numbers up and divide by how many there were to obtain the mean (the
usual average); you can sort the numbers into ascending order and pick the
middle one (or some value between the middle two, if there's an even number of
them) to get a median. The quantity, that varies among members of the population, is described as a variate; when it's a simple numeric quantity, it's a scalar variate.
If what you've actually got is measured values from a moderately large, randomly-drawn sample from a larger population, the values you'll see for these computed quantities (mean and median) should be fairly stable under randomly selecting a few more members of the broader population to add to the sample set. The interesting quantities to compute from sample data are the ones that have this sort of stability under growing your sample size, or under replacing one sample with a fresh (but, again, randomly drawn) sample of comparable size. (It's OK if there are some samples you could, in principle, draw that would give wildly different values, as long as the probability of drawing such a sample is very small.) Such computed values can be obtained from values measured from practical-sized samples of a larger population and used to estimate the value we would compute, if we had data from the population as a whole.
Such a value, computed from values of a variate measured on a population
or sample and stable under choice of sample, is known as a statistic
(the same term is also sometimes used of the variate, from whose value for
each individual it is computed; hence vital statistics
of a person). A
statistic needn't be a simple number, even when the variate it characterizes
is; but those most commonly discussed are.
While some members of the population may coincide in their value of the variate, as long as the population is finite you only see a finite number of values, even if the values the variate could in principle take form a continuum. This means you can define a mapping from possible values of the variate to the fractional frequency of each – that is, the number of times it arose, divided by the size of your sample. We can then express the mean and other computed properties of our sample data in terms of this fractional frequency function – for example, the mean is the sum of the values that arose, each multiplied by its fractional frequency, sum(: x.p(x) ←x :) if p is the fractional frequency function.
When our variate can only take certain discrete values, and the number of
values it can (in principle) possibly take is small compared to the sizes of
samples we can reasonably work with, this fractional frequency function shall
be reasonably stable under changes of sample; i.e., it is a statistic. Given
that it contains nearly all the data of our sample, not just a single number,
it's quite a good statistic to know, too. However, when the number of
possible values is large compared to all samples we can practically deal with,
adding new members to a sample shall tend to add spikes at new values and
perhaps make some existing spikes a little taller, while scaling all existing
spikes down a bit. For reasonably large sample sizes, making a few spikes a
little taller while scaling all spikes down a bit isn't a significant change;
but adding new spikes all over the place is rather more of a dramatic
change. I'll refer to the case with a few (relative to practical sample
sizes) discrete values as discrete
and to the alternative
as continuous
(even if it's actually discrete, but with a huge number
of possible values); in the discrete case, the fractional frequency function is a useful statistic; in the continuous case we can still compute it (and compute things from it), but it isn't stable under sampling, so it isn't a statistic.
In the continuous case, however, we can work with a density
function
for which, for any reasonably broad interval within our range of possible values, the integral of the density function over that interval differs negligibly from the fraction of members of our population for which the variate falls in the interval. We can then express other statistics in
terms of this density function – for example, the mean shall be the
integral of the variate's possible values, each times its value of the density
– in the same way as we used the fractional frequency function in the
discrete case (but substituting integration in place of summation); the
negligible discrepancies in integrals of the density function over intervals,
as compared to sums of the fractional frequency function within them, should
produce negligible differences between answers obtained from the density
function and those obtained from the fractional frequency function.
One can compute such a density function in various ways, given the data from a sample. You can, at each possible value within your variate's range of values, sum the fractional frequency over values within some interval about this point and divide by the width of the interval, to obtain a measure of how common values near this one are. When working with a sample from a larger population, this should, at least for good choices of interval-width, be reasonably stable under adding a few more randomly-chosen members of the full population to your sample and under replacing your sample with a fresh (randomly-drawn) one.
If you use a very narrow interval, the density shall consist of a spike at each of the finitely many values your population actually exhibited, and adding new members of the larger population to your sample shall tend to add new spikes, just as happened with the fractional frequency function. If you use a very broad interval, the density function shall be nicely continuous, smoothing out the spikes, but it'll lose information about variation; the integrals of your density over intervals shall differ significantly from the sums of fractional density over values in such intervals. Somewhere in between, you'll typically get some range of interval-widths among which the resulting density depends little on the width of interval and, as a result, gives reasonably accurate integrals over intervals.
There may be some variation, across the range of the variate's values, in which interval-widths are good; for example, you'll typically need broader intervals where the density is lower, to avoid spikiness. You can use some judicious strategy for choosing good widths for intervals across the range of values, e.g. finding the interval centred on each point that contains some fixed number of members of your population, then dividing that fixed number by population size and interval-width to get your density at that point, optionally with some funky heuristic adjustments, e.g. reducing the computed value in so far as the numbers of members either side of the middle of the interval are significantly different, so that the density dies off quickly outside the range of actually observed values.
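By way of a concrete (and entirely optional) illustration of the fixed-count, variable-width strategy just described, here is a minimal Python sketch; the function name knn_density and the default k are merely illustrative choices, and it omits the heuristic adjustments mentioned above.

```python
import numpy as np

def knn_density(sample, points, k=20):
    """Estimate a density at each of the given points: about each point, find
    the narrowest symmetric interval containing k members of the sample, then
    divide k by (sample size times interval width).  The choice of k involves
    the same trade-offs as the choice of interval width discussed above."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    density = np.empty(len(points))
    for j, x in enumerate(points):
        # the k-th smallest distance from x to a sample value is the half-width
        # of the narrowest interval centred on x that holds k sample values
        half_width = np.sort(np.abs(sample - x))[k - 1]
        density[j] = k / (n * 2 * max(half_width, 1e-12))
    return density

# e.g. knn_density(np.random.normal(size=500), np.linspace(-3, 3, 61), k=25)
```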
What matters is that your process for turning a sample's values into a density should be reasonably stable under changes of reasonably large sample, drawn randomly from a larger population, while giving integrals over intervals that match reasonably well to sums of relative frequency over values within those intervals. Such a density function is then a statistic of your population; and usually more useful than any simple number.
The processes of summing a function, scaled by a relative frequency, or integrating the function, scaled by a density, are both linear maps from the linear space of functions on which they act, producing as output a value of the same kind as the outputs of the function on which they act. We have a set V of possible values of our variate; for each function (:f|V), the linear map gives us either E(f) = sum(: f(v).p(v) ←v |V), with p being the fractional frequency function, or E(f) = integral(: f(v).p(v).dv ←v |V), with p as the density function. The answer, in either case, is a value of the same kind as f's outputs times a scaling; if f is ({reals}:f|V), then E(f) is real; if (W:f|V) for some real-linear space W, then E(f) shall be in W. Thus E is, for each real-linear space W, a linear map (W: |{mappings (W:|V)}); in particular, it maps the linear space {mappings ({reals}:|V)} linearly to {reals}; as such, it acts as a member of dual({mappings ({reals}:|V)}). Furthermore, its action on {mappings (W:|V)} can be inferred (for example, by taking co-ordinates in W, letting it act on each, then recombining) from its action as a member of dual({mappings ({reals}:|V)}).
Using a member of dual({mappings ({scalars}:|V)}) in this way, for some collection V and suitable set of scalars, gives us an operation known as a distribution on V. It generalizes summation and integration, throwing in some weighting by relative frequencies or densities, to turn a function on V into a summary of it; for example, it turns the identity on V into E(: x ←x |V) = integral(: x.p(x) ←x :) or sum(: x.p(x) ←x :), which is the mean of our variate. Our distribution, E, serves to unify the discussion of fractional frequency and density; and is a statistic (i.e. we can compute it from our sample data and it's reasonably stable under change of which sample we use). I'll refer to the fractional frequency function or density function as the density of the distribution, E.
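To make E's rôle concrete, here is a minimal Python sketch of the discrete case, treating the distribution as a higher-order function that takes any function on V and returns its weighted sum; the name discrete_distribution is hypothetical, introduced only for this illustration.

```python
def discrete_distribution(values, weights):
    """Build E for the discrete case: given the possible values and their
    (unnormalised) weights, return the linear functional that maps any
    function f on the values to sum(: f(v).p(v) ←v :), with p the fractional
    frequency function obtained by normalising the weights."""
    total = float(sum(weights))
    p = [w / total for w in weights]
    def E(f):
        return sum(f(v) * q for v, q in zip(values, p))
    return E

# Example: a fair die; applying E to the identity gives the mean, 3.5.
E = discrete_distribution([1, 2, 3, 4, 5, 6], [1] * 6)
mean = E(lambda x: x)
```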
While we're on the subject, a mode of a variate is a value at which its density is maximal. (Note that it can matter whether you chose to describe the variate as discrete or continuous in this case, at least when the discrete values aren't evenly spaced.) It's entirely possible to have more than one mode – for example, on a fair die, each possible outcome is a mode; and the commonest innocent deviation, in dice, from fairness is that the thicknesses aren't quite exactly equal, so the distance from each face to the one opposite isn't the same in all directions; the pair of faces with the smallest separation then shows up more often and the two values those faces show are both modes.
It may happen that a population actually exhibits a mixture of continuous and discrete values for some variate; for example, half of the population's values for the variate take whole number values within some interval, while the rest of the population's values are continuously distributed over that same interval, with arbitrary fractional parts. This is unusual, if it ever happens, and almost certainly a case of a population better studied by splitting it into two sub-populations to be studied separately, but it is important that the theory be able to handle it naturally, when it arises. The distribution describing such a population has E(f) = sum(: w(x).f(x) ←x |D) +integral(: p(x).f(x).dx ←x :C) for some relative frequency function w on some discrete set D and density function p on some continuum C. There may, of course, be some overlap between C and D, e.g. D may be a subset of C.
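A hedged Python sketch of such a mixed distribution, with a finite sum over the discrete set D plus a trapezium-rule integral over an interval [lo, hi] standing in for the continuum C; the function name and the quadrature scheme are illustrative choices only.

```python
import numpy as np

def mixed_distribution(D, w, p, lo, hi, steps=10001):
    """E for a mixed distribution: sum(: w(y).f(y) ←y |D) plus a numerical
    integral of p(x).f(x) over [lo, hi].  The weights w and the integral of
    the density p are assumed, between them, to account for the whole
    population."""
    xs = np.linspace(lo, hi, steps)
    def E(f):
        discrete_part = sum(wy * f(y) for y, wy in zip(D, w))
        continuum_part = np.trapz([p(x) * f(x) for x in xs], xs)
        return discrete_part + continuum_part
    return E
```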
One can still model such a mixed distribution as a continuum distribution by using Dirac's delta function, δ. This is defined as the density of a distribution for which E(f) =
f(0), giving the above variate a continuum density of (: p(x) +sum(:
w(y).δ(x −y) ←y |D) ←x :). The
δ function
isn't really a function at all,
although it can be thought of as a limit of various families of well-behaved
density functions, with density maximal at 0 (hence at x = y in each term of
the sum) and dropping off rapidly to either side. It's actually just a formal
representation for use, within an integral, to indicate that the integral is
really a combination of an integral and a sum; a δ in the integrand
indicates a term that really belongs in the sum. It unifies the descriptions
of the two cases, sum and integral, by letting a sum be represented by an
integral with δs in its density.
If this happens, the fractional frequency function's values at the discrete values of the variate shall be stable under increasing the size of sample over which you collect data, while its spikes for the individual values actually observed within the continuum shall get smaller and smaller, in inverse proportion to the size of your sample; you can use this to discern which values are in each set. Then obtain your density for the continuum values as above. Interpolating it smoothly through the values where the discrete set produces spikes in the relative frequency function, you can estimate how many of your observed values of the discrete set actually arose from members of the continuum happening to hit each discrete value; subtracting this much from the relative frequencies at the discrete values, you're left with the relative frequency contributions due to the discrete part of the distribution.
If your data are obtained with some measurement error, the actually discrete values may show up spread across a range about the discrete value; you'll observe a density that looks like the one using δ, above, but with δ replaced by a function characterizing your measurement errors. In such a case, the apparently continuous part of your observed density may be an illusion caused by the tails of your error function actually being broader than you realize. All the same, you can reasonably treat this case as a simple continuum case during most of the analysis, even if you have reasons to model it in terms of a discrete or mixed distribution. You work with the measured value as your variate, even if you model it in terms of an underlying variate reflected by the measured value, with errors.
If you're able to measure your errors (e.g. by repeatedly measuring the same member of the population and observing how the answers vary; then doing the same for several other members of the population), you can characterize the shape of peaks you would get from actual discrete values; for each actual value y, you'll then get some density e(y, x) for the measured value x. This shall typically depend on x and y only via x−y (cf. the δ form above) or x/y but, in principle, may have any form.
If you think you've got a discrete or mixed distribution, you can then look for the discrete values y in D and relative frequency function (: w |D) for which the result of subtracting (: sum(: w(y).e(y, x) ←y |D) ←x :) from your observed density gives you a suitably smooth residual density to attribute to any continuum part of the population. If what you've got is really just a discrete distribution, the residual density shall be negligible; and the value of each w(y) shall typically be proportional to the value of your observed distribution's peak near y.
Otherwise, or once you've subtracted off the discrete contribution to get
a residual density, you've got the density for your continuum variate; and you
need to look for a density p for which (: integral(: p(y).e(y, x) ←y :)
←x :) is a good match to your observed (or residual) density – the
density also has error bars around each value, so the underlying variate's
density, p, is only seen via your error processes. When e has a suitably
straightforward form, the relation between observed and underlying densities
is a convolution, which can be expressed via
the Fourier transform, which can be exploited to
help infer the underlying distribution from the observed one.
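When the error kernel really does depend only on x−y, the convolution can be undone numerically; the Python sketch below does so with a fast Fourier transform, assuming both densities are sampled on the same evenly-spaced grid and that the kernel's peak sits at the middle of its array. The threshold eps is an ad hoc regularisation of my own; real deconvolution needs rather more care about noise.

```python
import numpy as np

def deconvolve_density(observed, error_kernel, eps=1e-3):
    """Estimate an underlying density from an observed one, given that the
    observed density is the convolution of the underlying density with the
    error kernel: divide their Fourier transforms, dropping any frequency at
    which the kernel's transform is too small to divide by safely."""
    O = np.fft.fft(observed)
    K = np.fft.fft(np.fft.ifftshift(error_kernel))   # move the kernel's peak to index 0
    small = np.abs(K) < eps                          # ill-conditioned frequencies
    ratio = np.where(small, 0.0, O / np.where(small, 1.0, K))
    underlying = np.real(np.fft.ifft(ratio))
    return np.clip(underlying, 0.0, None)            # a density can't be negative
```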
In reality, given a population, there are typically several quantities one can sensibly measure, and sensibly reduce to numeric values, for each member of the population; and much that's interesting about the population can be learned by studying relationships among these variates.
For example, weight shall tend to increase with height, roughly in proportion. Likewise, the girth shall be roughly proportional to the sum of the two widths. These are cases where one can readily see why the relationship arises.

Other relationships may show up without so evident a reason, and it can then be hard to distinguish happenstance factors in the model from causal factors, particularly if the model represents the happenstance in the same way as it represents causal factors. Even if there is an actual causal link, once found it may turn out to work differently than how it's represented by the initial incorporation of the happenstance into the model.
Such relationships are known as correlations among the variates. In order to study them, one must study the diverse variates together; and the way to represent several numeric quantities together is as a list of numbers understood as a vector. We thus have a vector variate, composed of scalar variates.
The range of values of our vector variate is a sub-set of some vector
space, V. Its dual is the space of linear maps from it to scalars; its
members are known as covectors. For each component variate, there's a
covector that selects this component from each vector, ignoring all
others. These component co-vectors span dual(V); any covector is some
weighted sum of the scalar variates we combined into our vector. Any such
weighted sum can be construed as a variate in its own right; while the
variates we actually measure may be construed as (in the sense that we measure
them directly) primitive, such composite variates may prove more
interesting in some cases, either by representing a correlation among our
primitive variates or by representing a controlling property, that some
primitive variates are more strongly correlated with than with one
another. The study of the variance, below, shall let us discover these more
interesting variates.
Just like numbers, we can add vectors up and divide them by how many there were, to obtain the mean. Just as for simple numeric values, the mean is reasonably stable under variation of which reasonably large sample we draw from a larger population. Indeed, each component of the mean is just the mean of the variate that was used for that component; the components of the mean can be computed independently – using a vector representation doesn't gain much if all you're interested in is the mean.
As for a simple scalar variate, we can compute the fractional frequency function for a sample's data and, when suitable, attempt to approximate this by a distribution. Now, in place of intervals about each potential value of the variate, we'll want to use small volumes about each value and we'll want to ensure that the integrals of density over reasonably sensibly-shaped volumes shall match up with the sums of fractional frequencies of values within such volumes. Deciding which volumes are of interest shall be easier once we've studied the geometry of our space a bit, in light of the variance. Once we have a density, or when the fractional frequency function is suitable for use in its own right, we can represent our sample's data by its distribution on V, just as we did when V was some subset of {scalars}. A good graphical visualization of the density's variation in V can most likely give excellent insights into the data under study.
One can likewise determine the median of each component separately and make a vector of the results. However, when working with vectors, one should avoid defining properties of a vector in terms of its components, wherever possible. Vector quantities can't generally be ordered, so it's not immediately clear how we should generalize the median to the case of vector data. The characteristic property of the median is that the population is split evenly between the two sides of it: when comparing how many members of the population have their measured value greater than the median to how many have values less than it, the difference is at most the number that have the median as their value.
Note that this implies, in a population of even size, that any value between the middle two members is a median; it's usual to take the middle of this interval and term it the median in this case, but this is merely convention. For a population with an odd number of members, or one in which the middle members coincide, there is only one median, so it does make sense to speak of the median.
In a vector space, however, a point doesn't split the space in two; what
you need to do that is a plane of codimension 1
; in general this is a
set of form {p+v: w·v = 0} for some p in V and covector w in
dual(V). (If we replace w with a mapping ({lists ({scalars}:|n)}: |V) for
some natural n, and the 0 with the zero list ({0}:|n), the set {p+v: w(v) =
zero} is a plane of codimension n.) Note that the given point p is a member
of this set; any point in the set is described as being on
the plane
and the plane is said to pass through
each point on it. Such a plane
splits V in two
(actually in three; the things on each side and the
things on the plane; but the plane has no volume
in the space, so tends
to be ignored in the description) by using w to collapse V down to {scalars}:
the v in V with w(v) < w(p) lie on one side of the plane while those with
w(v) > w(p) lie on the other side; for any two vectors on the same side,
the straight line from one to the other doesn't pass through the plane; for
any two vectors on different sides, any continuous path starting at one and
ending at the other necessarily passes through the plane. Thus
each half
into which the plane splits V is path-connected and the
plane separates
them.
This, in turn, suggests a definition of median that we can apply to vector quantities: a point p is a median of a sample of data precisely if, for every plane of co-dimension 1 through it, the numbers of members of the sample whose variate vectors lie on each side of the plane differ by, at most, the number of members whose variate vectors lie on the plane itself. As each of our component variates is a member of dual(V), any median of our vector distribution must have, as each component, a median of that component's variate. Note, however, that not every distribution has a median, in this sense.
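One can at least test a candidate point against this condition numerically; the Python sketch below tries a finite sample of random covectors, so it can refute a candidate median but never strictly confirm one. The function name and the number of trials are illustrative choices.

```python
import numpy as np

def looks_like_median(point, data, trials=1000, rng=None):
    """Test the plane-splitting condition for `point` against `trials` randomly
    chosen covectors w: for each, compare the counts of data points strictly on
    either side of the plane {point + v: w.v = 0} with the count on the plane
    itself.  `data` has one row per member of the sample."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    point = np.asarray(point, dtype=float)
    for _ in range(trials):
        w = rng.normal(size=data.shape[1])
        side = data @ w - point @ w
        above, below, on_plane = np.sum(side > 0), np.sum(side < 0), np.sum(side == 0)
        if abs(above - below) > on_plane:
            return False                              # this plane refutes the claim
    return True                                       # no tried plane refuted it
```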
The place where the vector representation really comes into its own is in
describing how the values vary about their mean = E(: x ←x :). For a
simple scalar variate, this is usually summarized by
its variance = E(: power(2, x−mean) ←x :), which
is necessarily non-negative; and is positive if there are two or more distinct
values at which the distribution's density is non-zero. The square root of
the variance is known as the standard deviation
and gives a sense of
typical differences, within the population, from the mean.
For a vector quantity, the natural thing to replace power(2) with is the tensor square of the vector, (V⊗V: v×v ←v |V). Every output of this is a positive semi-definite symmetric quadratic form on dual(V); individual outputs are singular but sums of outputs (from inputs that, collectively, span V) can be invertible. So we get a tensor variance, defined by S = E(: (x−mean)×(x−mean) ←x :).
Now, S is a positive semi-definite quadratic form on dual(V); if the values of v−mean, for v at which E's density is non-zero, span V, then S shall be invertible and positive definite, allowing us to use its inverse as a metric on V, with respect to which displacements between our variate's observed values have comparable lengths regardless of direction.
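As a concrete rendering of that definition, here is a short Python sketch building S as the average of the tensor squares of differences from the mean, one row of data per member; the function name is mine, and the result is just the familiar population covariance matrix (equivalent to numpy's cov(data, rowvar=False, bias=True)).

```python
import numpy as np

def tensor_variance(data):
    """Return (mean, S) with S = E(: (x-mean)×(x-mean) ←x :), computed as the
    average of the outer products of the differences from the mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    diffs = data - mean
    S = diffs.T @ diffs / len(data)    # sum of outer(d, d) over rows d, averaged
    return mean, S
```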
Notice that our original data were whatever we measured, in whatever units
we chose to measure them; we might have measured girth in inches, height in
metres, widths in feet and weight in stone (one stone is 14 pounds, about
6.35 kg). So typical differences in girth would get big numbers while
differences in height would get relatively small ones. So using S's inverse
to determine our notion of how big
differences are is clearly more apt
(better suited to the data we're working with) than simply summing the squares
of differences in the numbers we originally measured.
Now suppose there were some linear dependence among our data; some covector w which maps the variate vector from every member of our population to the same value. Then it'll also map the mean to that same value; hence it'll map every difference from the mean to zero. Consequently, S·w shall be zero (S is a sum of tensor products and each of its right factors is in w's kernel), so S shall be singular.
This would happen, for example, if we included people's widths, side-to-side and back-to-front, as well as the sum of these two among our component variates; w would be the sum of the component covectors for the widths minus the component covector for their sum. This example is perhaps rather silly, given that the correlation is obvious by specification; but one may all too easily include a variate that seems like it's at least not totally dependent on others, only to have it turn out to be dependent on some of the others after all. For example, girth might turn out to be entirely determined by the two widths, in some way.
Fortunately, even when S is singular, it turns out we can still induce a metric on the differences between our population's vector variate values. We can diagonalise S, by a suitable choice of basis (V:b|n) of V, with n the dimension of V, which shall put S into the form sum(: s(i).b(i)×b(i) ←i |n) and, by choice of scaling of the basis vectors b(i), we can make each s(i) be either 1 or 0 (in general, −1 might also be needed; but we know S is positive semi-definite, so this won't arise). The basis (dual(V):p|n) dual to b – defined by p(i)·b(j) = 0 unless i = j, in which case it's 1 – then has S·p(i) = 0 for each i with s(i) = 0; these {p(i): s(i) = 0} are a basis of S's kernel; they represent the linear dependencies among the primitive variates we initially measured. The remaining {p(i): s(i) = 1} are then linearly independent variates of our population's data, derived from the primitive variates.
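In matrix terms, this diagonalisation is an eigendecomposition of S followed by a rescaling of the eigenvectors; the Python sketch below returns the s(i) together with the mutually dual bases b and p (as rows). The tolerance for treating an eigenvalue as zero is a judgment call, as is the rest of the interface.

```python
import numpy as np

def unit_diagonalise(S, tol=1e-10):
    """Diagonalise the tensor variance S so that each s[i] is 1 or 0.  Returns
    (s, b, p): S is (approximately) the sum of s[i] * outer(b[i], b[i]) and the
    bases satisfy dot(p[i], b[j]) = 1 if i == j, else 0."""
    eigvals, eigvecs = np.linalg.eigh(S)        # S is symmetric, so eigh applies
    s = (eigvals > tol).astype(int)
    scale = np.where(s == 1, np.sqrt(np.maximum(eigvals, tol)), 1.0)
    b = (eigvecs * scale).T                     # b[i] is eigenvector i, scaled
    p = (eigvecs / scale).T                     # its dual basis
    return s, b, p
```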
We obtained S as a sum of v×v terms where each v was a difference between mean and a variate vector of some member of our population. For any covector w, the contribution this makes to w·S·w is power(2, w·v), which can't be negative. So if there's any member of our population whose difference from mean isn't mapped to zero by w, w·S·w must be positive, since it gets a positive contribution from this member and no negative contributions to cancel it out. Since S is positive semi-definite, a covector w has w·S·w = 0 precisely if w is in the kernel of S. Consequently, any w in the kernel of S must in fact map to zero every difference between mean and a variate vector from a member of our population; hence also every difference between such actual variate vectors. So let v be any difference between actual variate vectors, or difference between mean and an actual variate vector. The p(i) with s(i) = 0 are a basis of the kernel of S, so each maps v to zero and we can omit this i from the sum(: b(i).p(i, v) ←i |n) = v to express v as a sum of the b(i) with s(i) = 1; so these b(i) span the differences between actual variate vectors.
Now suppose we have two such differences, u and v; to obtain an inner
product of these, in terms appropriate to the actual variability of our data,
we want something that we can think of as u/S\v, u over S under v
or u divided by S, contracted with v, which would be the contraction of
u and v on either side of S's inverse if it had one; but S may be singular, so
we have to work a bit harder to give meaning to u/S\v. Let's start by
considering what u/S would be: it should be a covector w for which u =
w·S and changing w by any member of the kernel of S won't change the
value of w·S, so we have ambiguity in w. However, the ambiguity is by
members of S's kernel, each of which maps v to zero, so contracting w·v
shall get a consistent answer, regardless of the ambiguity in w. This answer
is then what we want to use for u/S\v. So it suffices that there
be some w in dual(V) for which w·S is u; can we find such a
w ? Well, try w = sum(: p(i).p(i, u) ←i |n). The i for which s(i)
is zero all have p(i, u) = 0, so contribute nothing to this sum; and S is the
sum over the remaining i of b(i)×b(i), so contracting it with this w
shall give us sum(: b(i).p(i, u) ←i |n) = u, as we wanted. So yes, we do
get a value for u/S and we can give meaning to u/S\v for all differences u, v
between our actual variate values. Thus we can use S to define a metric on
differences between actual variate values, even when it's singular.
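For vectors in the span of the observed differences, the u/S\v just constructed agrees with contracting through the Moore-Penrose pseudo-inverse of S, which gives a two-line Python sketch (the name s_metric is hypothetical):

```python
import numpy as np

def s_metric(S):
    r"""Return a function computing u/S\v for differences u, v between variate
    values, using the pseudo-inverse of S in place of an inverse; this works
    whether or not S is singular."""
    S_pinv = np.linalg.pinv(S)
    return lambda u, v: float(u @ S_pinv @ v)
```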
Now let's look at those dual bases, b and p, from our diagonalisation of S; these can be thought of as providing co-ordinates in V which simplify your population's variability. You have some choice of diagonalising basis; you can replace the b(i) with s(i) = 1 via any recombination of them that preserves the sum of their tensor squares, just like applying an isometry to the basis vectors of a metric – you can rotate your co-ordinate axes, as long as you apply the rotation consistently. Likewise, you can replace any b(i) with s(i) = 0, applying any non-zero scaling to it and adding arbitrary multiples of the other b(i) with s(i) = 0. After such rearrangements the s(i) won't be changed, but you'll also need to revise your p(i) to match.
You can use such rearrangements to make each b(i) have most of its
primitive components small compared to just a few relatively large ones. You
can also use it to isolate directions in which your variate's variation is
mostly noise
(this may be somewhat subjective: variations which, to the
best of your understanding, are simply random, not actually indicative of
anything interesting; but see the discussion of the distribution, below, for
how to make this less subjective); by aligning some b(i) with the noisy
directions, you can factor the noise out and get clearer signals out of the
other directions.
Each p(i) with s(i) = 0 represents a degeneracy in our actual variate values; it represents an interdependence among the primitive variates; such a p(i) is a variate that's constant on our population. When s(i) is 1, its b(i) looks like a difference between actual variate values; and its p(i) is a variate in which our population's variations are of order 1. In particular, these p(i) with s(i) = 1 are entirely independent of one another, so we can think of them as decomposing the primitive variates we started with into their independent parts.
If one of the b(i) with s(i) = 1 is small
when considered in terms
of our original primitive variates, this means there was little variability
within our population in its direction; the corresponding p(i) will
be big
in the sense that it'll multiply differences in our variate
values by large numbers in order to make its variate vary by of order 1 among
your population. Such a p(i) represents your population almost having an
interdependence among its variate values; all members of your population have
almost the same value for this variate. That can be worth investigating to
see whether it points to some other quantity you could measure, possibly not a
linear function of the variates you started with, that is approximated by this
variate and is actually (or at least more nearly) constant across your
population.
If one of the b(i) with s(i) = 1 is large
when considered in terms
of your original primitives, it describes a direction in which your population
has large variations. This variability may be nothing more than noise, in which case paying attention to the other b(i) eliminates this noise from
the primitive data you originally collected, which can be very helpful. In
other cases, where it has significant components of primitive variates you
know to be meaningful (non-noisy), its p(i) is a variate that's a strong
signal in your population.
Once you've identified which of the b(i) with s(i) = 1 are interesting, discarding the ones that are noisy, you can use the corresponding p(i) as the co-ordinates in which you represent your variate, ignoring its values in other directions. This isolates the information-rich part of your data, comprising the details it'll actually be interesting to study. You then have the option of rescaling these co-ordinates, so that the corresponding s(i) are no longer all 1, to fit your understanding of the relative significance of the various variates represented by the p(i).
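A minimal Python sketch of that last step, assuming you already have the mean and the dual basis p from the diagonalisation above, plus a list of the indices you judge interesting; the name and interface are illustrative only, and any rescaling of the kept co-ordinates is left to taste.

```python
import numpy as np

def project_onto(data, mean, p, keep):
    """Re-express each member's variate vector in the chosen co-ordinates:
    subtract the mean, then apply each kept p[i] to the difference.  `keep`
    lists the indices i judged interesting (s[i] = 1 and not mere noise); the
    result has one row per member and one column per kept co-ordinate."""
    data = np.asarray(data, dtype=float)
    p = np.asarray(p, dtype=float)
    return (data - mean) @ p[keep].T
```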
You can compute the variance, and perform all the analysis above, working
directly with your raw data. Doing so makes it much easier to turn your
discrete measured data into a density function, since the variance provides
the appropriate metric with respect to which to choose spherical
volumes
about points, over which to average the observed density.
In particular, when S is singular, there's a surface on which all your data actually lie; so you want a density on that surface, that you integrate over portions of that surface, rather than a density on the larger vector space, that you integrate over its volumes; volume integrals of scalar functions that are non-zero only on some surface, of lower dimension than the volumes, always give zero answers, so aren't much use to you. Restricting attention to the surface on which your data actually lie (by subtracting the mean and switching to using the p(i) with non-zero s(i) as your variates) will give you the right space in which to determine your distribution's density.
When you change basis while preserving the s(i), e.g. to isolate noisy directions or to make each direction's components mostly small, with respect to your primitive variates, compared to a few relatively large ones, you won't change the density. However, if you rescale your basis (e.g. to give greater prominence to the directions you think deserve it), you'll also need to rescale the density function (but not the fractional frequency) by the product of your scalings of the individual (S-orthogonal) components; this is the Jacobian (determinant) of your co-ordinate transformation. For linear transformations, such a scaling is the same for all positions, so doesn't affect the relative values of the distribution. More generally, if you change to new coordinates that depend non-linearly on the original variates, you'll need to scale the density (but not the fractional frequency function) by the Jacobian of the transformation at each point; but, for such a dramatic upheaval, it may simply be better to recompute the density from the raw data, as re-expressed by your transformation.
Changing to co-ordinates suggested by the variance of your data can make your distribution easier to understand. Using the geometry with respect to which your variance is unit-diagonalised, in particular, can make it easier to see how the distribution varies with direction. When you spot a pattern in your distribution, you learn something about the population under study.
For example, if your data has some noise
in it, this should
typically manifest itself as a direction S-orthogonal to your interesting
signals. Along the noisy direction, the distribution shall vary in some
manner that reflects the kind of noise – uniform on some interval,
gaussian, gamma or whatever – and it'll follow the same pattern all
over, as you move about S-orthogonal to the noisy direction. The parameters
of the pattern (e.g. mean and variance for gaussian noise) may vary; that
doesn't change the fact that it's noise, but the way in which it
varies may contain useful information.
When the dimension of your vector space is larger than you can conveniently visualize, it can help to project it down to a space you can visualize better; this can make better use of your geometric intuitions, although those same intuitions can all too easily leap to false answers, so it's important to actually check what you think you see. Decent modern software systems should surely be able to perform such projections and let you vary the directions that have been contracted out.
Interdependences among your variates aren't necessarily linear; it may be, for example, that your variate values all lie near some curved surface in your variate vector space. In such a case, there's typically some scalar function of position that's constant on the surface; and this scalar function of position is a perfectly good numeric quantity you could use as a component variate, in which your population shall have small variability. It may well also correspond to some quantity you can measure directly from your population, that merely happens to be strongly correlated with the given function of the primitive variates you were previously using. Discovering such interdependences and associated variates can greatly improve your model of the population you're studying.
Once you've identified such a non-linear variate, either as a new quantity
you can measure or as a non-linear function of the existing variates, it makes
sense to select co-ordinates that vary over the surface on which the new
variate was constant, to effectively straighten out
the surface. These
may themselves be non-linear functions of the original variates, but it'll
often suffice to just rotate
your basis, that unit-diagonalises the
variance, to make one of its components roughly line up with the new variate,
then discard this co-ordinate and combine the remainder with the new
variate.
You may, equally, find that your distribution's density varies in some consistent manner, along one direction (cf. discussion of noisy directions, above), independent of variation in the S-perpendicular directions. If the variation along this consistent direction doesn't follow any particularly common pattern (e.g. uniform, gaussian, gamma), it may be constructive to reparameterise your co-ordinate in this direction, to make the variation fit some sensible pattern, if some reasonably sensible reparameterisation suffices to achieve this effect. You can then replace your co-ordinate in this direction with the duly reparameterised version of it, recompute your variance and distribution, then resume analyzing the distribution in the search for further ways to clarify it.
One circumstance that can make a huge difference to one's interpretation of the data arises when the population would be better understood as being comprised of several distinct sub-populations. This is apt to show itself by the data being clustered around two (or more) distinct surfaces within the space of possible values. For example, given vital statistics and data on eating habits, income, wealth and frequency of various classes of illness, one may find the population split into two groups by whether they smoke, by their sex or by what type of job they do. In particular, this can distort the apparent correlations the above variance analysis aims to discover, making it important to actually look at the distribution and not simplistically assume that it's (to take a common assumption) roughly gaussian. In extreme cases, it can cause the population to exhibit a correlation that's actually contrary to what would be found in each of the sub-populations, if one studied them separately.
This last (known as Simpson's paradox, among other names) is caused by differences between the sub-populations. Suppose, for example, we have some medical condition that's been identified by its symptoms but, unbeknownst to us, there are actually two different strains, A and B, of whatever microbe causes it. We conduct a medical trial of two treatments S and T, unwittingly (through some happenstance) administering S more to patients with strain A and T more to patients with strain B. Suppose S is effective some fraction f of the time against strain A and some fraction g of the time against strain B; while T achieves fraction F against A and G against B, with F > f > G > g; strain A is easier to treat and treatment T is more effective against each strain. However, we haven't measured F, f, G and g; we've only measured some averages of F with f and G with g. If n patients with strain A got treatment S, N with strain A got T, m with strain B got S and M with strain B got T, our observed data are: a fraction (f.n +g.m)/(n+m) of those given S were successfully treated, as was a fraction (F.N +G.M)/(N+M) of those given T.
Now, my given happenstance was that S went more to patients with A than T did, so n/m > N/M. A fraction like (f.n +g.m)/(n+m) is a weighted sum of f and g; with n and m positive, it falls somewhere between f and g, closer to f when n/m is large, closer to g when n/m is close to 0. So, when n/m is large, with f > g, (f.n +g.m)/(n+m) is close below f; when N/M is close to 0, with F > G, (F.N +G.M)/(N+M) is close above G; since f > G, this can lead to (f.n +g.m)/(n+m) being greater than (F.N +G.M)/(N+M), so that treatment S appears more effective (because we happened to give it more than T to the more easily-treated strain of the disease) than T, even though F > f and G > g, i.e. T is more effective on each strain considered separately. Unless we discover that there are actually two strains present, we'll thus be misled by our medical trial as to which treatment is more effective.
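For anyone who wants to watch the arithmetic happen, here is a tiny Python example with invented numbers satisfying F > f > G > g and the skewed allocation described above; it is purely illustrative, not data from any real trial.

```python
# Invented success rates and patient counts, chosen to exhibit the trap above.
f, g = 0.80, 0.30    # treatment S against strains A and B
F, G = 0.90, 0.40    # treatment T against strains A and B
n, m = 90, 10        # patients given S: mostly strain A
N, M = 10, 90        # patients given T: mostly strain B

observed_S = (f * n + g * m) / (n + m)   # 0.75, close below f
observed_T = (F * N + G * M) / (N + M)   # 0.45, close above G
assert F > f and G > g                   # T is better on each strain...
assert observed_S > observed_T           # ...yet S looks better overall
```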
If there's really no difference discernible between the two strains of our
disease, other than the ease of treating them, the probability of trying one
treatment disproportionately more on one strain than the other is, of course,
quite modest; and repeat trials are unlikely to repeat the same happenstance,
so the error is apt to come to light. However, all manner of circumstances
may conspire to cause a difference, that does affect which treatment is
applied, to escape notice – especially when, instead of a random trial,
you collect data from the field. If there are two treatments in common
use, one more expensive than the other, but doctors are under the impression
(for whatever reason) that the more expensive one is more effective, but the
cheaper one is good enough in most cases, they're apt to administer the more
expensive one to patients with more severe symptoms. Such patients are
commonly simply those who've let the disease run its course further before
they went to the doctor; until the symptoms got severe, they thought they'd
just get better on their own. If, however, there are two strains of the
disease and the one that's harder to treat produces more severe symptoms early
on, this is apt to lead to doctors preferentially using the more effective
treatment against the strain that's harder to treat, exactly as
above.
Kindred confounding effects can show up in all manner of other situations, where those collecting the data fail to consider a difference that, none the less, is relevant.
If your population is in fact made up of several sub-populations, one of the ways this may show itself is in the distribution having more than one mode, or at least more than one local maximum in its density. This is a hint that each local maximum is in fact (roughly) the mode of one of the sub-populations; it may well be worth looking for something else you can measure, from your population, that separates the sub-populations out.
When one sub-population is significantly bigger than the rest, its variations may drown out their local maxima, so sub-populations don't always show up even as local modes. They may even have roughly the same mode, but spread out to different degrees. Such cases may show up as the distribution almost fitting some pattern (e.g. gaussian, gamma, etc.) that your existing models suggest it might, but deviating from it. While such deviations may indicate other flaws in your model, one easy flaw for them to be revealing is a hidden splitting of the population into sub-populations that would be more robustly modelled separately.
One of the benefits of measuring seemingly irrelevant data about your
population (in so far as measuring the extra details incurs little extra cost
on top of measuring what you knew you were interested in anyway) is that they
can help to expose a split into sub-populations. For example, the variability
of an incidental variate may be higher in some sub-populations than others; if
the incidental variate is largely independent of the variate you were mainly
studying, it's going to show up in your final statistics as a noise
variable, that you'll be discarding, but its variability may show a
correlation with your signal
variables, which would hint at the
presence of sub-populations.
Whether you're trying to see the form of a distribution or communicate it to an audience, it's important to make the information accessible, without letting it get hidden by all the data.
If a picture is worth
a thousand words, then surely it can also be
a means by which to express information that would otherwise have overloaded
the audience with the cost of a thousand words. Just so, depicting
the distribution of a variate (which can include marking where its mean and
median arise and showing its standard deviation in some way) can convey a far
better sense of the data than any simple summary by a small set of
numbers. If you then depict two distributions side-by-side in this way, it
can make the differences between them apparent and intelligible: indeed, you
can include all of the data in the depictions in such a way as to let readers
choose the level of detail they want to look into. Consider the distributions
of sums of, in the one case, three simple dice (cubic, with the values one
through six on the faces, as usual) and, in the other case, the highest three
of four such dice. Toggle the check-box below to flip between images
representing the two distributions (it may glitch the first time, while
loading the second image; but it should be smooth after that):
(Click on either image, when displayed, to get it full-screen on its own.) Now, I actually made these images for another purpose, so didn't quite use the same scheme to depict the details, which could be misleading – the grey parts at the ends represent one twelfth of outcomes in one case but one eighteenth in the other case – but, because what I've depicted is the relative frequencies, one can simply see the difference in shape at first sight.
Those who are so inclined can garner more information, the more they're
willing to look into the details; but stopping short of reading every last
detail won't deceive you significantly. If you do read the writing, once
you've done so and worked out what the parts of the picture mean, you
can step back from the details and view the images, ignoring the writing;
you'll see most of the relevant information quite clearly enough. You can see
the differences at a glance; you get more detail if you look into it; and, once you've done so, you can see how those details change, likewise at a glance.
Providing the ability to toggle between the images makes it particularly easy to see what's changed – and what hasn't changed, which is just as important; I have framed the data the same way in both cases. This is an example of a blink comparator, implemented using some very simple CSS.
It's been traditional to present data in tables and as graphs: tables have the advantage of providing the raw data, graphs provide the data in a form that can be visualized. The scalable vector graphics (SVG) format for images combines these virtues: the image is an XML document which contains data from which the graph is drawn; but, if the author takes a little care, this data can actually be (modulo some trivial processing, such as rescaling, that can be documented in the XML source) the raw data collected.
Using an image is all well and good, but it's also important to use an image that actually makes clear the details that interest you.
It is often said that there are lies, damned lies and statistics
(often enough that it's not clear who coined the phrase); this arises directly
from the fact that most people have a very limited grasp of statistics, which
makes it easy for liars to use statistics in deceiving folk. This is made all
the worse by the fact that statistics, as taught, is mostly about grinding a
large body of data down to a small set of summary
numbers, rather than
about studying distributions.
The problem with reducing a large body of data down to a single number
that's supposed to summarize
it is that it discards nearly all of the
data. This is, of course, the point of the exercise – to avoid data
overload – but it's exactly what opens the process up to abuse. When
you discard most of the data, you are tacitly making value judgments about
which of the data to keep, which can have a profound impact on what
conclusions you draw from the summary you obtain thereby. In particular, if
an argument involves statistics, your choice of which statistics to extract
from the raw data can significantly affect which way the argument goes. Those
who seek to deceive are all too aware of this and more than happy to exploit
it.
While most courses on statistics devote most of their time and attention
to the mechanics of computing statistics from the data and testing whether
differences in statistics, between populations, are significant
(against
some assumed hypothesis), the thing that's actually important about a
population is its distribution, from which such statistics are
computed.
Two wildly different distributions may produce the same value for their statistics; if your model claims a gaussian distribution (as far too many do) and all you test is its mean and variance, you can easily conclude that your data do not significantly disagree with the model, even when the distribution of your data is so flagrantly non-gaussian that it does contradict the model. If all you ever do is compute the numbers, you won't notice this; you should first look at the distribution to see whether its shape plausibly matches the shape anticipated by the model. If it doesn't, there probably is some statistic of it that you can compute, that'll differ significantly from the model's prediction; but you won't notice that unless you look at the distribution itself.
Looking at numbers as summaries can be wildly misleading and, in any case, hides most of the information. Studying the distribution is the path to learning anything useful about the population; and taking the trouble to record seemingly extraneous (but easily measured) details about each member of the population may save you from misinterpreting the information you collect about the subject-matter you thought you were studying.
I should note, in passing, that significance testing is an archaism that should be ditched in favour of using Bayesian inference. See XKCD 1132 for an illustration of the difference.
Written by Eddy.