I was once, in an exam, faced with a question that bluntly told me
to: state and prove Chebychev's theorem

(there is some diversity in how
his name is transcribed to the Latin alphabet). For me, the hard part of this
wasn't the mathematics, it was the remembering of which thing goes with which
arbitrary label. Fortunately, the examiner wasn't cruel and stupid: this was
only the first part of the question and the second quite obviously needed me
to address the problem of a random variate being more than a given number of
standard deviations from its mean. I could indeed remember a result about
that, so I wrote Chebychev's theorem

at the top of a fresh sheet of
paper, left a modest-sized space, then wrote Proof

and began deriving a
bound on such a probability. When I was done, I wrote a tidied-up statement
of what I'd proved at the top (and went on to get full marks on that question
– apparently, I'd derived the right result). Thanks to that examniner,
I can now remember which result is normally attributed to Chebyshev.

As long as a (one-dimensional) distribution *has* a (well-defined)
mean and variance, the mean is a formalisation of where its values are centred
and the standard deviation (the square root of the variance) is a measure of how
far the distribution deviates from that centre. (For distributions on more than
one dimension, the mean and variance can still be defined, but the variance
doesn't have a square root unless the distribution is one-dimensional.) We can
always shift the distribution so that its mean is 0 and rescale its variation so
that its standard deviation is 1; what remains is the *shape* of the
distribution – and Chebyshev's result provides a constraint on how far
that shape can deviate from being simply a lump in the middle

. This only
works for well-defined mean and finite variance, though: indeed, it can equally
be read as saying tha a distribution that doesn't satisfy Chebyshev's bound
won't have finite variance, as illustrated by a family of distributions
with power-law tails.

So consider a random variate X whose probability distribution has mean m and standard deviation s. We can define a function Y that maps each set of X's values to a function that's 1 on members of the set, 0 elsewhere: Y(S, x) is 1 if x is in S, else 0. Then the probability that X's value falls in S is simply the expected value of Y(S), as expected value of a function is just the integral of probability for each value times the function at that value; as Y(S) is zero on values outside S, E(Y(S)) is thus the integral of our distribution (times 1, as this is Y(S)'s value) over S. For any positive real u, consider the probability that X is more than u.s away from m:

- P(m−u.s > X or X > m+u.s)
- = E(Y({x: ((x−m)/s/u)
^{2}> 1}))Now, with S = {x: ((x−m)/s/u)

^{2}> 1}, we know that ((x−m)/s/u)^{2}> 1 for x in S and the same function is ≥ 0 everywhere, so it is ≥ Y(S) everywhere; hence the expected value of it is ≥ E(Y(S)), too. - ≤ E(: ((x−m)/s/u)
^{2}←x :) - = 1/u/u

from the definition of the variance. So the probability of being more than u standard deviations from the mean is at most 1/u/u, no matter how strangely the distribution is shaped. Of course, for u ≤ 1, this doesn't tell us anything of much use, as every probability is ≤ 1 ≤ 1/u/u in this case, but it gives us a definite constraint for u > 1. That constraint is typically weaker than you'd get from specific knowledge of the actual distribution; but it applies regardless of the distribution, so we can use it any time we know the mean and variance, even if we know nothing else about the distribution.

For example, in synthetic aperture radar, the image formed of the terrain
being surveyed is corrupted with a quantum-mechanically-induced class of
severely random noise known as *coherent speckle*; when it looks at a
large area of terrain that's all the same

in any sensible terms, the
image formed represents that terrain by an area whose pixels are drawn from a
negative exponential distribution; that is, the probability that any given
pixel's value is > x for any given x is exp(−x/m) for some value m
(that'll be the same for all pixels in the given area
of same

-ness). [Thus the probability of an answer within a factor u
> 1 of the mean is P(m/u ≤ X ≤ m.u) = exp(−1/u)
−exp(−u) and, in particular, at u = 2, the probability of being
within a factor of two of correct

is exp(−1/2)
−exp(−2) ≈ .4712 < 1/2; the probability of being within a
factor 2.105 of the mean is 1/2.] Differentiating, we obtain the probability
density function (: exp(−x/m)/m ←x :); it takes but a little work
(or a quick look at my page on the gamma
distribution, of which the negative exponential distribution is the case
α = 1) to show that its expected value of power(n), for any natural n,
is m^{n}.n! = power(n,m).Γ(n+1), where Γ is
the gamma function (so Γ(n+1) is just
the factorial, n!). So mean is m and variance is 2.m.m −m.m = m.m,
hence standard deviation is m. The probability that the variate is < 0
(i.e. more than one standard deviation below than the mean) is zero, as the
variate only takes positive values. So, for u ≥ 1, Chebyshev's bound says
that the probability of a value > (1+u).m is ≤ 1/u/u. That's not as
strong as our given knowledge that it's actually exp(−(1+u)), but we'd
be able to infer it only from measurement of the signal's mean and variance,
even if we didn't know the actual distribution.