Usual introductions to statistics discuss what happens when you have some
numeric quantity you can measure for each member of some population; you can
add the numbers up and divide by how many there were to obtain the mean (the
usual average); you can sort the numbers into ascending order and pick the
middle one (or some value between the middle two, if there's an even number of
them) to get a median. The quantity, that varies among members of the population, is described as a variate; when it's a simple numeric quantity, it's a scalar variate.
If what you've actually got is measured values from a moderately large, randomly-drawn sample from a larger population, the values you'll see for these computed quantities (mean and median) should be fairly stable under randomly selecting a few more members of the broader population to add to the sample set. The interesting quantities to compute from sample data are the ones that have this sort of stability under growing your sample size, or under replacing one sample with a fresh (but, again, randomly drawn) sample of comparable size. (It's OK if there are some samples you could, in principle, draw that would give wildly different values, as long as the probability of drawing such a sample is very small.) Such computed values can be obtained from values measured from practical-sized samples of a larger population and used to estimate the value we would compute, if we had data from the population as a whole.
Such a value, computed from values of a variate measured on a population
or sample and stable under choice of sample, is known as a statistic
(the same term is also sometimes used of the variate, from whose value for
each individual it is computed; hence vital statistics
of a person). A
statistic needn't be a simple number, even when the variate it characterizes
is; but those most commonly discussed are.
While some members of the population may coincide in their value of the variate, as long as the population is finite you only see a finite number of values, even if the values the variate could in principle take form a continuum. This means you can define a mapping from possible values of the variate to the fractional frequency of each – that is, the number of times it arose, divided by the size of your sample. We can then express the mean and other computed properties of our sample data in terms of this fractional frequency function – for example, the mean is the sum of the values that arose, each multiplied by its fractional frequency, sum(: x.p(x) ←x :) if p is the fractional frequency function.
When our variate can only take certain discrete values, and the number of
values it can (in principle) possibly take is small compared to the sizes of
samples we can reasonably work with, this fractional frequency function shall
be reasonably stable under changes of sample; i.e., it is a statistic. Given
that it contains nearly all the data of our sample, not just a single number,
it's quite a good statistic to know, too. However, when the number of
possible values is large compared to all samples we can practically deal with,
adding new members to a sample shall tend to add spikes at new values and
perhaps make some existing spikes a little taller, while scaling all existing
spikes down a bit. For reasonably large sample sizes, making a few spikes a
little taller while scaling all spikes down a bit isn't a significant change;
but adding new spikes all over the place is rather more of a dramatic
change. I'll refer to the case with a few (relative to practical sample
sizes) discrete values as discrete
and to the alternative
as continuous
(even if it's actually discrete, but with a huge number
of possible values); in the discrete case, the fractional frequency function is a useful statistic; in the continuous case we can still compute it (and compute things from it), but it isn't stable under sampling, so it isn't a statistic.
In the continuous case, however, we can work with a density
function
for which, for any reasonably broad interval within our range of possible values, the integral of the density function over that interval differs negligibly from the fraction of members of our population for which the variate falls in the interval. We can then express other statistics in
terms of this density function – for example, the mean shall be the
integral of the variate's possible values, each times its value of the density
– in the same way as we used the fractional frequency function in the
discrete case (but substituting integration in place of summation); the
negligible discrepancies in integrals of the density function over intervals,
as compared to sums of the fractional frequency function within them, should
produce negligible differences between answers obtained from the density
function and those obtained from the fractional frequency function.
One can compute such a density function in various ways, given the data from a sample. You can, at each possible value within your variate's range of values, sum the fractional frequency over values within some interval about this point and divide by the width of the interval, to obtain a measure of how common values near this one are. When working with a sample from a larger population, this should, at least for good choices of interval-width, be reasonably stable under adding a few more randomly-chosen members of the full population to your sample and under replacing your sample with a fresh (randomly-drawn) one.
If you use a very narrow interval, the density shall consist of a spike at each of the finitely many values your population actually exhibited, and adding new members of the larger population to your sample shall tend to add new spikes, just as happened with the fractional frequency function. If you use a very broad interval, the density function shall be nicely continuous, smoothing out the spikes, but it'll lose information about variation; the integrals of your density over intervals shall differ significantly from the sums of fractional density over values in such intervals. Somewhere in between, you'll typically get some range of interval-widths among which the resulting density depends little on the width of interval and, as a result, gives reasonably accurate integrals over intervals.
There may be some variation, across the range of the variate's values, in which interval-widths are good; for example, you'll typically need broader intervals where the density is lower, to avoid spikiness. You can use some judicious strategy for choosing good widths for intervals across the range of values, e.g. finding the interval centred on each point that contains some fixed number of members of your population, then dividing that fixed number by population size and interval-width to get your density at that point, optionally with some funky heuristic adjustments, e.g. reducing the computed value in so far as the numbers of members either side of the middle of the interval are significantly different, so that the density dies off quickly outside the range of actually observed values.
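By way of a concrete (and entirely optional) illustration of the fixed-count, variable-width strategy just described, here is a minimal Python sketch; the function name knn_density and the default k are merely illustrative choices, and it omits the heuristic adjustments mentioned above.

```python
import numpy as np

def knn_density(sample, points, k=20):
    """Estimate a density at each of the given points: about each point, find
    the narrowest symmetric interval containing k members of the sample, then
    divide k by (sample size times interval width).  The choice of k involves
    the same trade-offs as the choice of interval width discussed above."""
    sample = np.asarray(sample, dtype=float)
    n = len(sample)
    density = np.empty(len(points))
    for j, x in enumerate(points):
        # the k-th smallest distance from x to a sample value is the half-width
        # of the narrowest interval centred on x that holds k sample values
        half_width = np.sort(np.abs(sample - x))[k - 1]
        density[j] = k / (n * 2 * max(half_width, 1e-12))
    return density

# e.g. knn_density(np.random.normal(size=500), np.linspace(-3, 3, 61), k=25)
```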
What matters is that your process for turning a sample's values into a density should be reasonably stable under changes of reasonably large sample, drawn randomly from a larger population, while giving integrals over intervals that match reasonably well to sums of relative frequency over values within those intervals. Such a density function is then a statistic of your population; and usually more useful than any simple number.
The processes of summing a function, scaled by a relative frequency, or integrating the function, scaled by a density, are both linear maps from the linear space of functions on which they act, producing as output a value of the same kind as the outputs of the function on which they act. We have a set V of possible values of our variate; for each function (:f|V), the linear map gives us either E(f) = sum(: f(v).p(v) ←v |V), with p being the fractional frequency function, or E(f) = integral(: f(v).p(v).dv ←v |V), with p as the density function. The answer, in either case, is a value of the same kind as f's outputs times a scaling; if f is ({reals}:f|V), then E(f) is real; if (W:f|V) for some real-linear space W, then E(f) shall be in W. Thus E is, for each real-linear space W, a linear map (W: |{mappings (W:|V)}); in particular, it maps the linear space {mappings ({reals}:|V)} linearly to {reals}; as such, it acts as a member of dual({mappings ({reals}:|V)}). Furthermore, its action on {mappings (W:|V)} can be inferred (for example, by taking co-ordinates in W, letting it act on each, then recombining) from its action as a member of dual({mappings ({reals}:|V)}).
Using a member of dual({mappings ({scalars}:|V)}) in this way, for some collection V and suitable set of scalars, gives us an operation known as a distribution on V. It generalizes summation and integration, throwing in some weighting by relative frequencies or densities, to turn a function on V into a summary of it; for example, it turns the identity on V into E(: x ←x |V) = integral(: x.p(x) ←x :) or sum(: x.p(x) ←x :), which is the mean of our variate. Our distribution, E, serves to unify the discussion of fractional frequency and density; and is a statistic (i.e. we can compute it from our sample data and it's reasonably stable under change of which sample we use). I'll refer to the fractional frequency function or density function as the density of the distribution, E.
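To make E's rôle concrete, here is a minimal Python sketch of the discrete case, treating the distribution as a higher-order function that takes any function on V and returns its weighted sum; the name discrete_distribution is hypothetical, introduced only for this illustration.

```python
def discrete_distribution(values, weights):
    """Build E for the discrete case: given the possible values and their
    (unnormalised) weights, return the linear functional that maps any
    function f on the values to sum(: f(v).p(v) ←v :), with p the fractional
    frequency function obtained by normalising the weights."""
    total = float(sum(weights))
    p = [w / total for w in weights]
    def E(f):
        return sum(f(v) * q for v, q in zip(values, p))
    return E

# Example: a fair die; applying E to the identity gives the mean, 3.5.
E = discrete_distribution([1, 2, 3, 4, 5, 6], [1] * 6)
mean = E(lambda x: x)
```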
While we're on the subject, a mode of a variate is a value at which its density is maximal. (Note that it can matter whether you chose to describe the variate as discrete or continuous in this case, at least when the discrete values aren't evenly spaced.) It's entirely possible to have more than one mode – for example, on a fair die, each possible outcome is a mode; and the commonest innocent deviation, in dice, from fairness is that the thicknesses aren't quite exactly equal, so the distance from each face to the one opposite isn't the same in all directions; the pair of faces with the smallest separation then shows up more often and the two values those faces show are both modes.
It may happen that a population actually exhibits a mixture of continuous and discrete values for some variate; for example, half of the population's values for the variate take whole number values within some interval, while the rest of the population's values are continuously distributed over that same interval, with arbitrary fractional parts. This is unusual, if it ever happens, and almost certainly a case of a population better studied by splitting it into two sub-populations to be studied separately, but it is important that the theory be able to handle it naturally, when it arises. The distribution describing such a population has E(f) = sum(: w(x).f(x) ←x |D) +integral(: p(x).f(x).dx ←x :C) for some relative frequency function w on some discrete set D and density function p on some continuum C. There may, of course, be some overlap between C and D, e.g. D may be a subset of C.
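A hedged Python sketch of such a mixed distribution, with a finite sum over the discrete set D plus a trapezium-rule integral over an interval [lo, hi] standing in for the continuum C; the function name and the quadrature scheme are illustrative choices only.

```python
import numpy as np

def mixed_distribution(D, w, p, lo, hi, steps=10001):
    """E for a mixed distribution: sum(: w(y).f(y) ←y |D) plus a numerical
    integral of p(x).f(x) over [lo, hi].  The weights w and the integral of
    the density p are assumed, between them, to account for the whole
    population."""
    xs = np.linspace(lo, hi, steps)
    def E(f):
        discrete_part = sum(wy * f(y) for y, wy in zip(D, w))
        continuum_part = np.trapz([p(x) * f(x) for x in xs], xs)
        return discrete_part + continuum_part
    return E
```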
One can still model such a mixed distribution as a continuum distribution by using Dirac's delta function, δ. This is defined as the density of a distribution for which E(f) =
f(0), giving the above variate a continuum density of (: p(x) +sum(:
w(y).δ(x −y) ←y |D) ←x :). The
δ function
isn't really a function at all,
although it can be thought of as a limit of various families of well-behaved
density functions, with density maximal at 0 (hence at x = y in each term of
the sum) and dropping off rapidly to either side. It's actually just a formal
representation for use, within an integral, to indicate that the integral is
really a combination of an integral and a sum; a δ in the integrand
indicates a term that really belongs in the sum. It unifies the descriptions
of the two cases, sum and integral, by letting a sum be represented by an
integral with δs in its density.
If this happens, the fractional frequency function's values at the discrete values of the variate shall be stable under increasing the size of sample over which you collect data, while its spikes for the individual values actually observed within the continuum shall get smaller and smaller, in inverse proportion to the size of your sample; you can use this to discern which values are in each set. Then obtain your density for the continuum values as above. Interpolating it smoothly through the values where the discrete set produces spikes in the relative frequency function, you can estimate how many of your observed values of the discrete set actually arose from members of the continuum happening to hit each discrete value; subtracting this much from the relative frequencies at the discrete values, you're left with the relative frequency contributions due to the discrete part of the distribution.
If your data are obtained with some measurement error, the actually discrete values may show up spread across a range about the discrete value; you'll observe a density that looks like the one using δ, above, but with δ replaced by a function characterizing your measurement errors. In such a case, the apparently continuous part of your observed density may be an illusion caused by the tails of your error function actually being broader than you realize. All the same, you can reasonably treat this case as a simple continuum case during most of the analysis, even if you have reasons to model it in terms of a discrete or mixed distribution. You work with the measured value as your variate, even if you model it in terms of an underlying variate reflected by the measured value, with errors.
If you're able to measure your errors (e.g. by repeatedly measuring the same member of the population and observing how the answers vary; then doing the same for several other members of the population), you can characterize the shape of peaks you would get from actual discrete values; for each actual value y, you'll then get some density e(y, x) for the measured value x. This shall typically depend on x and y only via x−y (cf. the δ form above) or x/y but, in principle, may have any form.
If you think you've got a discrete or mixed distribution, you can then look for the discrete values y in D and relative frequency function (: w |D) for which the result of subtracting (: sum(: w(y).e(y, x) ←y |D) ←x :) from your observed density gives you a suitably smooth residual density to attribute to any continuum part of the population. If what you've got is really just a discrete distribution, the residual density shall be negligible; and the value of each w(y) shall typically be proportional to the value of your observed distribution's peak near y.
Otherwise, or once you've subtracted off the discrete contribution to get
a residual density, you've got the density for your continuum variate; and you
need to look for a density p for which (: integral(: p(y).e(y, x) ←y :)
←x :) is a good match to your observed (or residual) density – the
density also has error bars around each value, so the underlying variate's
density, p, is only seen via your error processes. When e has a suitably
straightforward form, the relation between observed and underlying densities
is a convolution, which can be expressed via
the Fourier transform, which can be exploited to
help infer the underlying distribution from the observed one.
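When the error kernel really does depend only on x−y, the convolution can be undone numerically; the Python sketch below does so with a fast Fourier transform, assuming both densities are sampled on the same evenly-spaced grid and that the kernel's peak sits at the middle of its array. The threshold eps is an ad hoc regularisation of my own; real deconvolution needs rather more care about noise.

```python
import numpy as np

def deconvolve_density(observed, error_kernel, eps=1e-3):
    """Estimate an underlying density from an observed one, given that the
    observed density is the convolution of the underlying density with the
    error kernel: divide their Fourier transforms, dropping any frequency at
    which the kernel's transform is too small to divide by safely."""
    O = np.fft.fft(observed)
    K = np.fft.fft(np.fft.ifftshift(error_kernel))   # move the kernel's peak to index 0
    small = np.abs(K) < eps                          # ill-conditioned frequencies
    ratio = np.where(small, 0.0, O / np.where(small, 1.0, K))
    underlying = np.real(np.fft.ifft(ratio))
    return np.clip(underlying, 0.0, None)            # a density can't be negative
```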
In reality, given a population, there are typically several quantities one can sensibly measure, and sensibly reduce to numeric values, for each member of the population; and much that's interesting about the population can be learned by studying relationships among these variates.
For example, weight shall tend to increase with height, roughly in proportion. Likewise, the girth shall be roughly proportional to the sum of the two widths. These are cases where one can readily see why the relationship arises.

Other relationships may show up without so evident a reason, and it can then be hard to distinguish happenstance factors in the model from causal factors, particularly if the model represents the happenstance in the same way as it represents causal factors. Even if there is an actual causal link, once found it may turn out to work differently than how it's represented by the initial incorporation of the happenstance into the model.
Such relationships are known as correlations among the variates. In order to study them, one must study the diverse variates together; and the way to represent several numeric quantities together is as a list of numbers understood as a vector. We thus have a vector variate, composed of scalar variates.
The range of values of our vector variate is a sub-set of some vector
space, V. Its dual is the space of linear maps from it to scalars; its
members are known as covectors. For each component variate, there's a
covector that selects this component from each vector, ignoring all
others. These component co-vectors span dual(V); any covector is some
weighted sum of the scalar variates we combined into our vector. Any such
weighted sum can be construed as a variate in its own right; while the
variates we actually measure may be construed as (in the sense that we measure
them directly) primitive, such composite variates may prove more
interesting in some cases, either by representing a correlation among our
primitive variates or by representing a controlling property, that some
primitive variates are more strongly correlated with than with one
another. The study of the variance, below, shall let us discover these more
interesting variates.
Just like numbers, we can add vectors up and divide them by how many there were, to obtain the mean. Just as for simple numeric values, the mean is reasonably stable under variation of which reasonably large sample we draw from a larger population. Indeed, each component of the mean is just the mean of the variate that was used for that component; the components of the mean can be computed independently – using a vector representation doesn't gain much if all you're interested in is the mean.
As for a simple scalar variate, we can compute the fractional frequency function for a sample's data and, when suitable, attempt to approximate this by a distribution. Now, in place of intervals about each potential value of the variate, we'll want to use small volumes about each value and we'll want to ensure that the integrals of density over reasonably sensibly-shaped volumes shall match up with the sums of fractional frequencies of values within such volumes. Deciding which volumes are of interest shall be easier once we've studied the geometry of our space a bit, in light of the variance. Once we have a density, or when the fractional frequency function is suitable for use in its own right, we can represent our sample's data by its distribution on V, just as we did when V was some subset of {scalars}. A good graphical visualization of the density's variation in V can most likely give excellent insights into the data under study.
One can likewise determine the median of each component separately and make a vector of the results. However, when working with vectors, one should avoid defining properties of a vector in terms of its components, wherever possible. Vector quantities can't generally be ordered, so it's not immediately clear how we should generalize the median to the case of vector data. The characteristic property of the median is that the population is split evenly between the two sides of it: when comparing how many members of the population have their measured value greater than the median to how many have values less than it, the difference is at most the number that have the median as their value.
Note that this implies, in a population of even size, that any value between the middle two members is a median; it's usual to take the middle of this interval and term it the median in this case, but this is merely convention. For a population with an odd number of members, or one in which the middle members coincide, there is only one median, so it does make sense to speak of the median.
In a vector space, however, a point doesn't split the space in two; what
you need to do that is a plane of codimension 1
; in general this is a
set of form {p+v: w·v = 0} for some p in V and covector w in
dual(V). (If we replace w with a mapping ({lists ({scalars}:|n)}: |V) for
some natural n, and the 0 with the zero list ({0}:|n), the set {p+v: w(v) =
zero} is a plane of codimension n.) Note that the given point p is a member
of this set; any point in the set is described as being on
the plane
and the plane is said to pass through
each point on it. Such a plane
splits V in two
(actually in three; the things on each side and the
things on the plane; but the plane has no volume
in the space, so tends
to be ignored in the description) by using w to collapse V down to {scalars}:
the v in V with w(v) < w(p) lie on one side of the plane while those with
w(v) > w(p) lie on the other side; for any two vectors on the same side,
the straight line from one to the other doesn't pass through the plane; for
any two vectors on different sides, any continuous path starting at one and
ending at the other necessarily passes through the plane. Thus
each half
into which the plane splits V is path-connected and the
plane separates
them.
This, in turn, suggests a definition of median that we can apply to vector quantities: a point p is a median of a sample of data precisely if, for every plane of co-dimension 1 through it, the numbers of members of the sample whose variate vectors lie on each side of the plane differ by, at most, the number of members whose variate vectors lie on the plane itself. As each of our component variates is a member of dual(V), any median of our vector distribution must have, as each component, a median of that component's variate. Note, however, that not every distribution has a median, in this sense.
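One can at least test a candidate point against this condition numerically; the Python sketch below tries a finite sample of random covectors, so it can refute a candidate median but never strictly confirm one. The function name and the number of trials are illustrative choices.

```python
import numpy as np

def looks_like_median(point, data, trials=1000, rng=None):
    """Test the plane-splitting condition for `point` against `trials` randomly
    chosen covectors w: for each, compare the counts of data points strictly on
    either side of the plane {point + v: w.v = 0} with the count on the plane
    itself.  `data` has one row per member of the sample."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    point = np.asarray(point, dtype=float)
    for _ in range(trials):
        w = rng.normal(size=data.shape[1])
        side = data @ w - point @ w
        above, below, on_plane = np.sum(side > 0), np.sum(side < 0), np.sum(side == 0)
        if abs(above - below) > on_plane:
            return False                              # this plane refutes the claim
    return True                                       # no tried plane refuted it
```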
The place where the vector representation really comes into its own is in
describing how the values vary about their mean = E(: x ←x :). For a
simple scalar variate, this is usually summarized by
its variance = E(: power(2, x−mean) ←x :), which
is necessarily non-negative; and is positive if there are two or more distinct
values at which the distribution's density is non-zero. The square root of
the variance is known as the standard deviation
and gives a sense of
typical differences, within the population, from the mean.
For a vector quantity, the natural thing to replace power(2) with is the tensor square of the vector, (V⊗V: v×v ←v |V). Every output of this is a positive semi-definite symmetric quadratic form on dual(V); individual outputs are singular but sums of outputs (from inputs that, collectively, span V) can be invertible. So we get a tensor variance, defined by S = E(: (x−mean)×(x−mean) ←x :).
Now, S is a positive semi-definite quadratic form on dual(V); if the values of v−mean, for v at which E's density is non-zero, span V, then S shall be invertible and positive definite, allowing us to use its inverse as a metric on V, with respect to which displacements between our variate's observed values have comparable lengths regardless of direction.
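As a concrete rendering of that definition, here is a short Python sketch building S as the average of the tensor squares of differences from the mean, one row of data per member; the function name is mine, and the result is just the familiar population covariance matrix (equivalent to numpy's cov(data, rowvar=False, bias=True)).

```python
import numpy as np

def tensor_variance(data):
    """Return (mean, S) with S = E(: (x-mean)×(x-mean) ←x :), computed as the
    average of the outer products of the differences from the mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    diffs = data - mean
    S = diffs.T @ diffs / len(data)    # sum of outer(d, d) over rows d, averaged
    return mean, S
```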
Notice that our original data were whatever we measured, in whatever units
we chose to measure them; we might have measured girth in inches, height in
metres, widths in feet and weight in stone (one stone is 14 pounds, about
6.35 kg). So typical differences in girth would get big numbers while
differences in height would get relatively small ones. So using S's inverse
to determine our notion of how big
differences are is clearly more apt
(better suited to the data we're working with) than simply summing the squares
of differences in the numbers we originally measured.
Now suppose there were some linear dependence among our data; some covector w which maps the variate vector from every member of our population to the same value. Then it'll also map the mean to that same value; hence it'll map every difference from the mean to zero. Consequently, S·w shall be zero (S is a sum of tensor products and each of its right factors is in w's kernel), so S shall be singular.
This would happen, for example, if we included people's widths, side-to-side and back-to-front, as well as the sum of these two among our component variates; w would be the sum of the component covectors for the widths minus the component covector for their sum. This example is perhaps rather silly, given that the correlation is obvious by specification; but one may all too easily include a variate that seems like it's at least not totally dependent on others, only to have it turn out to be dependent on some of the others after all. For example, girth might turn out to be entirely determined by the two widths, in some way.
Fortunately, even when S is singular, it turns out we can still induce a metric on the differences between our population's vector variate values. We can diagonalise S, by a suitable choice of basis (V:b|n) of V, with n the dimension of V, which shall put S into the form sum(: s(i).b(i)×b(i) ←i |n) and, by choice of scaling of the basis vectors b(i), we can make each s(i) be either 1 or 0 (in general, −1 might also be needed; but we know S is positive semi-definite, so this won't arise). The basis (dual(V):p|n) dual to b – defined by p(i)·b(j) = 0 unless i = j, in which case it's 1 – then has S·p(i) = 0 for each i with s(i) = 0; these {p(i): s(i) = 0} are a basis of S's kernel; they represent the linear dependencies among the primitive variates we initially measured. The remaining {p(i): s(i) = 1} are then linearly independent variates of our population's data, derived from the primitive variates.
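In matrix terms, this diagonalisation is an eigendecomposition of S followed by a rescaling of the eigenvectors; the Python sketch below returns the s(i) together with the mutually dual bases b and p (as rows). The tolerance for treating an eigenvalue as zero is a judgment call, as is the rest of the interface.

```python
import numpy as np

def unit_diagonalise(S, tol=1e-10):
    """Diagonalise the tensor variance S so that each s[i] is 1 or 0.  Returns
    (s, b, p): S is (approximately) the sum of s[i] * outer(b[i], b[i]) and the
    bases satisfy dot(p[i], b[j]) = 1 if i == j, else 0."""
    eigvals, eigvecs = np.linalg.eigh(S)        # S is symmetric, so eigh applies
    s = (eigvals > tol).astype(int)
    scale = np.where(s == 1, np.sqrt(np.maximum(eigvals, tol)), 1.0)
    b = (eigvecs * scale).T                     # b[i] is eigenvector i, scaled
    p = (eigvecs / scale).T                     # its dual basis
    return s, b, p
```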
We obtained S as a sum of v×v terms where each v was a difference between mean and a variate vector of some member of our population. For any covector w, the contribution this makes to w·S·w is power(2, w·v), which can't be negative. So if there's any member of our population whose difference from mean isn't mapped to zero by w, w·S·w must be positive, since it gets a positive contribution from this member and no negative contributions to cancel it out. Since S is positive semi-definite, a covector w has w·S·w = 0 precisely if w is in the kernel of S. Consequently, any w in the kernel of S must in fact map to zero every difference between mean and a variate vector from a member of our population; hence also every difference between such actual variate vectors. So let v be any difference between actual variate vectors, or difference between mean and an actual variate vector. The p(i) with s(i) = 0 are a basis of the kernel of S, so each maps v to zero and we can omit this i from the sum(: b(i).p(i, v) ←i |n) = v to express v as a sum of the b(i) with s(i) = 1; so these b(i) span the differences between actual variate vectors.
Now suppose we have two such differences, u and v; to obtain an inner
product of these, in terms appropriate to the actual variability of our data,
we want something that we can think of as u/S\v, u over S under v
or u divided by S, contracted with v, which would be the contraction of
u and v on either side of S's inverse if it had one; but S may be singular, so
we have to work a bit harder to give meaning to u/S\v. Let's start by
considering what u/S would be: it should be a covector w for which u =
w·S and changing w by any member of the kernel of S won't change the
value of w·S, so we have ambiguity in w. However, the ambiguity is by
members of S's kernel, each of which maps v to zero, so contracting w·v
shall get a consistent answer, regardless of the ambiguity in w. This answer
is then what we want to use for u/S\v. So it suffices that there
be some w in dual(V) for which w·S is u; can we find such a
w ? Well, try w = sum(: p(i).p(i, u) ←i |n). The i for which s(i)
is zero all have p(i, u) = 0, so contribute nothing to this sum; and S is the
sum over the remaining i of b(i)×b(i), so contracting it with this w
shall give us sum(: b(i).p(i, u) ←i |n) = u, as we wanted. So yes, we do
get a value for u/S and we can give meaning to u/S\v for all differences u, v
between our actual variate values. Thus we can use S to define a metric on
differences between actual variate values, even when it's singular.
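For vectors in the span of the observed differences, the u/S\v just constructed agrees with contracting through the Moore-Penrose pseudo-inverse of S, which gives a two-line Python sketch (the name s_metric is hypothetical):

```python
import numpy as np

def s_metric(S):
    r"""Return a function computing u/S\v for differences u, v between variate
    values, using the pseudo-inverse of S in place of an inverse; this works
    whether or not S is singular."""
    S_pinv = np.linalg.pinv(S)
    return lambda u, v: float(u @ S_pinv @ v)
```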
Now let's look at those dual bases, b and p, from our diagonalisation of S; these can be thought of as providing co-ordinates in V which simplify your population's variability. You have some choice of diagonalising basis; you can replace the b(i) with s(i) = 1 via any recombination of them that preserves the sum of their tensor squares, just like applying an isometry to the basis vectors of a metric – you can rotate your co-ordinate axes, as long as you apply the rotation consistently. Likewise, you can replace any b(i) with s(i) = 0, applying any non-zero scaling to it and adding arbitrary multiples of the other b(i) with s(i) = 0. After such rearrangements the s(i) won't be changed, but you'll also need to revise your p(i) to match.
You can use such rearrangements to make each b(i) have most of its
primitive components small compared to just a few relatively large ones. You
can also use it to isolate directions in which your variate's variation is
mostly noise
(this may be somewhat subjective: variations which, to the
best of your understanding, are simply random, not actually indicative of
anything interesting; but see the discussion of the distribution, below, for
how to make this less subjective); by aligning some b(i) with the noisy
directions, you can factor the noise out and get clearer signals out of the
other directions.
Each p(i) with s(i) = 0 represents a degeneracy in our actual variate values; it represents an interdependence among the primitive variates; such a p(i) is a variate that's constant on our population. When s(i) is 1, its b(i) looks like a difference between actual variate values; and its p(i) is a variate in which our population's variations are of order 1. In particular, these p(i) with s(i) = 1 are entirely independent of one another, so we can think of them as decomposing the primitive variates we started with into their independent parts.
If one of the b(i) with s(i) = 1 is small
when considered in terms
of our original primitive variates, this means there was little variability
within our population in its direction; the corresponding p(i) will
be big
in the sense that it'll multiply differences in our variate
values by large numbers in order to make its variate vary by of order 1 among
your population. Such a p(i) represents your population almost having an
interdependence among its variate values; all members of your population have
almost the same value for this variate. That can be worth investigating to
see whether it points to some other quantity you could measure, possibly not a
linear function of the variates you started with, that is approximated by this
variate and is actually (or at least more nearly) constant across your
population.
If one of the b(i) with s(i) = 1 is large
when considered in terms
of your original primitives, it describes a direction in which your population
has large variations. This variability may be nothing more than noise, in which case paying attention to the other b(i) eliminates this noise from
the primitive data you originally collected, which can be very helpful. In
other cases, where it has significant components of primitive variates you
know to be meaningful (non-noisy), its p(i) is a variate that's a strong
signal in your population.
Once you've identified which of the b(i) with s(i) = 1 are interesting, discarding the ones that are noisy, you can use the corresponding p(i) as the co-ordinates in which you represent your variate, ignoring its values in other directions. This isolates the information-rich part of your data, comprising the details it'll actually be interesting to study. You then have the option of rescaling these co-ordinates, so that the corresponding s(i) are no longer all 1, to fit your understanding of the relative significance of the various variates represented by the p(i).
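A minimal Python sketch of that last step, assuming you already have the mean and the dual basis p from the diagonalisation above, plus a list of the indices you judge interesting; the name and interface are illustrative only, and any rescaling of the kept co-ordinates is left to taste.

```python
import numpy as np

def project_onto(data, mean, p, keep):
    """Re-express each member's variate vector in the chosen co-ordinates:
    subtract the mean, then apply each kept p[i] to the difference.  `keep`
    lists the indices i judged interesting (s[i] = 1 and not mere noise); the
    result has one row per member and one column per kept co-ordinate."""
    data = np.asarray(data, dtype=float)
    p = np.asarray(p, dtype=float)
    return (data - mean) @ p[keep].T
```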
You can compute the variance, and perform all the analysis above, working
directly with your raw data. Doing so makes it much easier to turn your
discrete measured data into a density function, since the variance provides
the appropriate metric with respect to which to choose spherical
volumes
about points, over which to average the observed density.
In particular, when S is singular, there's a surface on which all your data actually lie; so you want a density on that surface, that you integrate over portions of that surface, rather than a density on the larger vector space, that you integrate over its volumes; volume integrals of scalar functions that are non-zero only on some surface, of lower dimension than the volumes, always give zero answers, so aren't much use to you. Restricting attention to the surface on which your data actually lie (by subtracting the mean and switching to using the p(i) with non-zero s(i) as your variates) will give you the right space in which to determine your distribution's density.
When you change basis while preserving the s(i), e.g. to isolate noisy directions or to make each direction's components mostly small, with respect to your primitive variates, compared to a few relatively large ones, you won't change the density. However, if you rescale your basis (e.g. to give greater prominence to the directions you think deserve it), you'll also need to rescale the density function (but not the fractional frequency) by the product of your scalings of the individual (S-orthogonal) components; this is the Jacobian (determinant) of your co-ordinate transformation. For linear transformations, such a scaling is the same for all positions, so doesn't affect the relative values of the distribution. More generally, if you change to new coordinates that depend non-linearly on the original variates, you'll need to scale the density (but not the fractional frequency function) by the Jacobian of the transformation at each point; but, for such a dramatic upheaval, it may simply be better to recompute the density from the raw data, as re-expressed by your transformation.
Changing to co-ordinates suggested by the variance of your data can make your distribution easier to understand. Using the geometry with respect to which your variance is unit-diagonalised, in particular, can make it easier to see how the distribution varies with direction. When you spot a pattern in your distribution, you learn something about the population under study.
For example, if your data has some noise
in it, this should
typically manifest itself as a direction S-orthogonal to your interesting
signals. Along the noisy direction, the distribution shall vary in some
manner that reflects the kind of noise – uniform on some interval,
gaussian, gamma or whatever – and it'll follow the same pattern all
over, as you move about S-orthogonal to the noisy direction. The parameters
of the pattern (e.g. mean and variance for gaussian noise) may vary; that
doesn't change the fact that it's noise, but the way in which it
varies may contain useful information.
When the dimension of your vector space is larger than you can conveniently visualize, it can help to project it down to a space you can visualize better; this can make better use of your geometric intuitions, although those same intuitions can all too easily leap to false answers, so it's important to actually check what you think you see. Decent modern software systems should surely be able to perform such projections and let you vary the directions that have been contracted out.
Interdependences among your variates aren't necessarily linear; it may be, for example, that your variate values all lie near some curved surface in your variate vector space. In such a case, there's typically some scalar function of position that's constant on the surface; and this scalar function of position is a perfectly good numeric quantity you could use as a component variate, in which your population shall have small variability. It may well also correspond to some quantity you can measure directly from your population, that merely happens to be strongly correlated with the given function of the primitive variates you were previously using. Discovering such interdependences and associated variates can greatly improve your model of the population you're studying.
Once you've identified such a non-linear variate, either as a new quantity
you can measure or as a non-linear function of the existing variates, it makes
sense to select co-ordinates that vary over the surface on which the new
variate was constant, to effectively straighten out
the surface. These
may themselves be non-linear functions of the original variates, but it'll
often suffice to just rotate
your basis, that unit-diagonalises the
variance, to make one of its components roughly line up with the new variate,
then discard this co-ordinate and combine the remainder with the new
variate.
You may, equally, find that your distribution's density varies in some consistent manner, along one direction (cf. discussion of noisy directions, above), independent of variation in the S-perpendicular directions. If the variation along this consistent direction doesn't follow any particularly common pattern (e.g. uniform, gaussian, gamma), it may be constructive to reparameterise your co-ordinate in this direction, to make the variation fit some sensible pattern, if some reasonably sensible reparameterisation suffices to achieve this effect. You can then replace your co-ordinate in this direction with the duly reparameterised version of it, recompute your variance and distribution, then resume analyzing the distribution in the search for further ways to clarify it.
One circumstance that can make a huge difference to one's interpretation of the data arises when the population would be better understood as being comprised of several distinct sub-populations. This is apt to show itself by the data being clustered around two (or more) distinct surfaces within the space of possible values. For example, given vital statistics and data on eating habits, income, wealth and frequency of various classes of illness, one may find the population split into two groups by whether they smoke, by their sex or by what type of job they do. In particular, this can distort the apparent correlations the above variance analysis aims to discover, making it important to actually look at the distribution and not simplistically assume that it's (to take a common assumption) roughly gaussian. In extreme cases, it can cause the population to exhibit a correlation that's actually contrary to what would be found in each of the sub-populations, if one studied them separately.
This last (known as Simpson's paradox, among other names) is caused by differences between the sub-populations. Suppose, for example, we have some medical condition that's been identified by its symptoms but, unbeknownst to us, there are actually two different strains, A and B, of whatever microbe causes it. We conduct a medical trial of two treatments S and T, unwittingly (through some happenstance) administering S more to patients with strain A and T more to patients with strain B. Suppose S is effective some fraction f of the time against strain A and some fraction g of the time against strain B; while T achieves fraction F against A and G against B, with F > f > G > g; strain A is easier to treat and treatment T is more effective against each strain. However, we haven't measured F, f, G and g; we've only measured some averages of F with f and G with g. If n patients with strain A got treatment S, N with strain A got T, m with strain B got S and M with strain B got T, our observed data are: a fraction (f.n +g.m)/(n+m) of those given S were successfully treated, as was a fraction (F.N +G.M)/(N+M) of those given T.
Now, my given happenstance was that S went more to patients with A than T did, so n/m > N/M. A fraction like (f.n +g.m)/(n+m) is a weighted sum of f and g; with n and m positive, it falls somewhere between f and g, closer to f when n/m is large, closer to g when n/m is close to 0. So, when n/m is large, with f > g, (f.n +g.m)/(n+m) is close below f; when N/M is close to 0, with F > G, (F.N +G.M)/(N+M) is close above G; since f > G, this can lead to (f.n +g.m)/(n+m) being greater than (F.N +G.M)/(N+M), so that treatment S appears more effective (because we happened to give it more than T to the more easily-treated strain of the disease) than T, even though F > f and G > g, i.e. T is more effective on each strain considered separately. Unless we discover that there are actually two strains present, we'll thus be misled by our medical trial as to which treatment is more effective.
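For anyone who wants to watch the arithmetic happen, here is a tiny Python example with invented numbers satisfying F > f > G > g and the skewed allocation described above; it is purely illustrative, not data from any real trial.

```python
# Invented success rates and patient counts, chosen to exhibit the trap above.
f, g = 0.80, 0.30    # treatment S against strains A and B
F, G = 0.90, 0.40    # treatment T against strains A and B
n, m = 90, 10        # patients given S: mostly strain A
N, M = 10, 90        # patients given T: mostly strain B

observed_S = (f * n + g * m) / (n + m)   # 0.75, close below f
observed_T = (F * N + G * M) / (N + M)   # 0.45, close above G
assert F > f and G > g                   # T is better on each strain...
assert observed_S > observed_T           # ...yet S looks better overall
```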
If there's really no difference discernible between the two strains of our
disease, other than the ease of treating them, the probability of trying one
treatment disproportionately more on one strain than the other is, of course,
quite modest; and repeat trials are unlikely to repeat the same happenstance,
so the error is apt to come to light. However, all manner of circumstances
may conspire to cause a difference, that does affect which treatment is
applied, to escape notice – especially when, instead of a random trial,
you collect data from the field. If there are two treatments in common
use, one more expensive than the other, but doctors are under the impression
(for whatever reason) that the more expensive one is more effective, but the
cheaper one is good enough in most cases, they're apt to administer the more
expensive one to patients with more severe symptoms. Such patients are
commonly simply those who've let the disease run its course further before
they went to the doctor; until the symptoms got severe, they thought they'd
just get better on their own. If, however, there are two strains of the
disease and the one that's harder to treat produces more severe symptoms early
on, this is apt to lead to doctors preferentially using the more effective
treatment against the strain that's harder to treat, exactly as
above.
Kindred confounding effects can show up in all manner of other situations, where those collecting the data fail to consider a difference that, none the less, is relevant.
If your population is in fact made up of several sub-populations, one of the ways this may show itself is in the distribution having more than one mode, or at least more than one local maximum in its density. This is a hint that each local maximum is in fact (roughly) the mode of one of the sub-populations; it may well be worth looking for something else you can measure, from your population, that separates the sub-populations out.
When one sub-population is significantly bigger than the rest, its variations may drown out their local maxima, so sub-populations don't always show up even as local modes. They may even have roughly the same mode, but spread out to different degrees. Such cases may show up as the distribution almost fitting some pattern (e.g. gaussian, gamma, etc.) that your existing models suggest it might, but deviating from it. While such deviations may indicate other flaws in your model, one easy flaw for them to be revealing is a hidden splitting of the population into sub-populations that would be more robustly modelled separately.
One of the benefits of measuring seemingly irrelevant data about your
population (in so far as measuring the extra details incurs little extra cost
on top of measuring what you knew you were interested in anyway) is that they
can help to expose a split into sub-populations. For example, the variability
of an incidental variate may be higher in some sub-populations than others; if
the incidental variate is largely independent of the variate you were mainly
studying, it's going to show up in your final statistics as a noise
variable, that you'll be discarding, but its variability may show a
correlation with your signal
variables, which would hint at the
presence of sub-populations.
Whether you're trying to see the form of a distribution or communicate it to an audience, it's important to make the information accessible, without letting it get hidden by all the data.
If a picture is worth
a thousand words, then surely it can also be
a means by which to express information that would otherwise have overloaded
the audience with the cost of a thousand words. Just so, depicting
the distribution of a variate (which can include marking where its mean and
median arise and showing its standard deviation in some way) can convey a far
better sense of the data than any simple summary by a small set of
numbers. If you then depict two distributions side-by-side in this way, it
can make the differences between them apparent and intelligible: indeed, you
can include all of the data in the depictions in such a way as to let readers
choose the level of detail they want to look into. Consider the distributions
of sums of, in the one case, three simple dice (cubic, with the values one
through six on the faces, as usual) and, in the other case, the highest three
of four such dice. Toggle the check-box below to flip between images
representing the two distributions (it may glitch the first time, while
loading the second image; but it should be smooth after that):
(Click on either image, when displayed, to get it full-screen on its own.) Now, I actually made these images for another purpose, so didn't quite use the same scheme to depict the details, which could be misleading – the grey parts at the ends represent one twelfth of outcomes in one case but one eighteenth in the other case – but, because what I've depicted is the relative frequencies, one can simply see the difference in shape at first sight.
Those who are so inclined can garner more information, the more they're
willing to look into the details; but stopping short of reading every last
detail won't deceive you significantly. If you do read the writing, once
you've done so and worked out what the parts of the picture mean, you
can step back from the details and view the images, ignoring the writing;
you'll see most of the relevant information quite clearly enough. You can see
the differences at a glance; you get more detail if you look into it; and, once you've done so, you can see how those details change, likewise at a glance.
Providing the ability to toggle between the images makes it particularly easy to see what's changed – and what hasn't changed, which is just as important; I have framed the data the same way in both cases. This is an example of a blink comparator, implemented using some very simple CSS.
It's been traditional to present data in tables and as graphs: tables have the advantage of providing the raw data, graphs provide the data in a form that can be visualized. The scalable vector graphics (SVG) format for images combines these virtues: the image is an XML document which contains data from which the graph is drawn; but, if the author takes a little care, this data can actually be (modulo some trivial processing, such as rescaling, that can be documented in the XML source) the raw data collected.
Using an image is all well and good, but it's also important to use an image that actually makes clear the details that interest you.
It is often said that there are lies, damned lies and statistics
(often enough that it's not clear who coined the phrase); this arises directly
from the fact that most people have a very limited grasp of statistics, which
makes it easy for liars to use statistics in deceiving folk. This is made all
the worse by the fact that statistics, as taught, is mostly about grinding a
large body of data down to a small set of summary
numbers, rather than
about studying distributions.
The problem with reducing a large body of data down to a single number
that's supposed to summarize
it is that it discards nearly all of the
data. This is, of course, the point of the exercise – to avoid data
overload – but it's exactly what opens the process up to abuse. When
you discard most of the data, you are tacitly making value judgments about
which of the data to keep, which can have a profound impact on what
conclusions you draw from the summary you obtain thereby. In particular, if
an argument involves statistics, your choice of which statistics to extract
from the raw data can significantly affect which way the argument goes. Those
who seek to deceive are all too aware of this and more than happy to exploit
it.
While most courses on statistics devote most of their time and attention
to the mechanics of computing statistics from the data and testing whether
differences in statistics, between populations, are significant
(against
some assumed hypothesis), the thing that's actually important about a
population is its distribution, from which such statistics are
computed.
Two wildly different distributions may produce the same value for their statistics; if your model claims a gaussian distribution (as far too many do) and all you test is its mean and variance, you can easily conclude that your data do not significantly disagree with the model, even when the distribution of your data is so flagrantly non-gaussian that it does contradict the model. If all you ever do is compute the numbers, you won't notice this; you should first look at the distribution to see whether its shape plausibly matches the shape anticipated by the model. If it doesn't, there probably is some statistic of it that you can compute, that'll differ significantly from the model's prediction; but you won't notice that unless you look at the distribution itself.
Looking at numbers as summaries can be wildly misleading and, in any case, hides most of the information. Studying the distribution is the path to learning anything useful about the population; and taking the trouble to record seemingly extraneous (but easily measured) details about each member of the population may save you from misinterpreting the information you collect about the subject-matter you thought you were studying.
I should note, in passing, that significance testing is an archaism that should be ditched in favour of using Bayesian inference. See XKCD 1132 for an illustration of the difference.
Written by Eddy.