I was once, in an exam, faced with a question that bluntly told me
to: "state and prove Chebychev's theorem"
(there is some diversity in how
his name is transcribed to the Latin alphabet). For me, the hard part of this
wasn't the mathematics, it was the remembering of which thing goes with which
arbitrary label. Fortunately, the examiner wasn't cruel and stupid: this was
only the first part of the question and the second quite obviously needed me
to address the problem of a random variate being more than a given number of
standard deviations from its mean. I could indeed remember a result about
that, so I wrote "Chebychev's theorem" at the top of a fresh sheet of
paper, left a modest-sized space, then wrote "Proof" and began deriving a
bound on such a probability. When I was done, I wrote a tidied-up statement
of what I'd proved at the top (and went on to get full marks on that question
– apparently, I'd derived the right result). Thanks to that examiner,
I can now remember which result is normally attributed to Chebyshev.
As long as a (one-dimensional) distribution has a (well-defined)
mean and variance, the mean is a formalisation of where its values are centred
and the standard deviation (the square root of the variance) is a measure of how
far the distribution deviates from that centre. (For distributions on more than
one dimension, the mean and variance can still be defined, but the variance
doesn't have a square root unless the distribution is one-dimensional.) We can
always shift the distribution so that its mean is 0 and rescale its variation so
that its standard deviation is 1; what remains is the shape of the
distribution – and Chebyshev's result provides a constraint on how far
that shape can deviate from being simply a lump in the middle. This only
works for a well-defined mean and finite variance, though: indeed, it can
equally be read as saying that a distribution that doesn't satisfy Chebyshev's
bound
won't have finite variance, as illustrated by a family of distributions
with power-law tails.
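To see that contrapositive reading in action, here's a minimal Python sketch;
the Pareto-style power law below is my own stand-in example, not necessarily
the particular family the text has in mind.

```python
# Sketch (my own example): a Pareto-style tail P(X > x) = (1/x)**a for x >= 1,
# with exponent a < 2, so its variance diverges.  Its tail shrinks slower than
# 1/u/u, so no finite standard deviation could make Chebyshev's bound hold at
# every scale.

def pareto_tail(x, a):
    """P(X > x) for a Pareto distribution with minimum 1 and exponent a."""
    return min(1.0, x ** -a)

a = 1.5  # any exponent below 2 gives an infinite variance
for u in (10, 100, 1000, 10000):
    tail = pareto_tail(u, a)
    print(f"u = {u:>5}:  P(X > u) = {tail:.3e},  1/u/u = {1 / u / u:.3e},"
          f"  ratio = {tail * u * u:.1f}")
# The ratio P(X > u).u.u grows without bound: the tail eventually exceeds
# c/u/u for every constant c, which a finite variance would forbid.
```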
So consider a random variate X whose probability distribution has mean m and standard deviation s. We can define a function Y that maps each set of X's values to a function that's 1 on members of the set, 0 elsewhere: Y(S, x) is 1 if x is in S, else 0. Then the probability that X's value falls in S is simply the expected value of Y(S), as the expected value of a function is just the integral, over all values, of the probability of each value times the function at that value; as Y(S) is zero on values outside S, E(Y(S)) is thus the integral of our distribution (times 1, as this is Y(S)'s value there) over S. For any positive real u, consider the probability that X is more than u.s away from m:
P(X < m −u.s or X > m +u.s) = E(Y(S)), with S = {x: ((x−m)/s/u)² > 1}
Now, with S as just given, we know that ((x−m)/s/u)² > 1 for x in S and the same function is ≥ 0 everywhere, so it is ≥ Y(S) everywhere; hence its expected value is ≥ E(Y(S)), too. Thus
P(X < m −u.s or X > m +u.s) = E(Y(S)) ≤ E(((X−m)/s/u)²) = E((X−m)²)/s/s/u/u = 1/u/u
from the definition of the variance. So the probability of being more than u standard deviations from the mean is at most 1/u/u, no matter how strangely the distribution is shaped. Of course, for u ≤ 1, this doesn't tell us anything of much use, as every probability is ≤ 1 ≤ 1/u/u in this case, but it gives us a definite constraint for u > 1. That constraint is typically weaker than you'd get from specific knowledge of the actual distribution; but it applies regardless of the distribution, so we can use it any time we know the mean and variance, even if we know nothing else about the distribution.
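As a quick sanity check of the bound just derived, here's a minimal
Python/NumPy sketch (my own illustration; the exponential distribution is just
an arbitrary stand-in for any distribution with finite mean and variance): it
estimates P(|X −m| > u.s) by sampling and compares it with 1/u/u.

```python
# Sketch: empirical check that P(|X - m| > u.s) <= 1/u/u for a sample
# distribution with finite mean and variance.
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=3.0, size=1_000_000)  # arbitrary test distribution
m, s = x.mean(), x.std()

for u in (1.5, 2.0, 3.0, 5.0):
    p = np.mean(np.abs(x - m) > u * s)  # estimated P(|X - m| > u.s)
    print(f"u = {u}:  estimated tail = {p:.4f}  <=  1/u/u = {1 / u / u:.4f}")
```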
For example, in synthetic aperture radar, the image formed of the terrain
being surveyed is corrupted with a quantum-mechanically-induced class of
severely random noise known as coherent speckle; when it looks at a
large area of terrain that's all "the same" in any sensible terms, the
image formed represents that terrain by an area whose pixels are drawn from a
negative exponential distribution; that is, the probability that any given
pixel's value is > x for any given x is exp(−x/m) for some value m
(that'll be the same for all pixels in the given area of
"same"-ness). [Thus the probability of an answer within a factor u
> 1 of the mean is P(m/u ≤ X ≤ m.u) = exp(−1/u)
−exp(−u) and, in particular, at u = 2, the probability of being
within a factor of two of "correct" is exp(−1/2)
−exp(−2) ≈ .4712 < 1/2; the probability of being within a
factor 2.105 of the mean is 1/2.] Differentiating, we obtain the probability
density function (: exp(−x/m)/m ←x :); it takes but a little work
(or a quick look at my page on the gamma
distribution, of which the negative exponential distribution is the case
α = 1) to show that its expected value of power(n), for any natural n,
is mⁿ.n! = power(n,m).Γ(n+1), where Γ is
the gamma function (so Γ(n+1) is just
the factorial, n!). So the mean is m and the variance is E(X.X) −m.m =
2.m.m −m.m = m.m, hence the standard deviation is m. The probability that
the variate is < 0 (i.e. more than one standard deviation below the mean)
is zero, as the
variate only takes positive values. So, for u ≥ 1, Chebyshev's bound says
that the probability of a value > (1+u).m is ≤ 1/u/u. That's not as
strong as our given knowledge that it's actually exp(−(1+u)), but we could
infer the Chebyshev bound from measurements of the signal's mean and variance
alone, even if we didn't know the actual distribution.
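To put numbers on all of this, here's one more short Python sketch (my own
illustration; the mean m = 1.7 is an arbitrary choice): it checks the moments
E(power(n)) = mⁿ.n! by sampling, evaluates the factor-of-u probabilities from
the bracketed aside above, and tabulates Chebyshev's bound against the exact
tail exp(−(1+u)).

```python
# Sketch: numerical checks for the negative exponential example above.
import math, random

m = 1.7  # arbitrary mean; the factor-of-u probabilities don't depend on it
samples = [random.expovariate(1 / m) for _ in range(1_000_000)]

# E(power(n)) should be m**n * n! for natural n.
for n in (1, 2, 3):
    empirical = sum(x ** n for x in samples) / len(samples)
    print(f"E(X^{n}) ~ {empirical:.3f}  vs  m^{n}.{n}! = {m ** n * math.factorial(n):.3f}")

# P(m/u <= X <= m.u) = exp(-1/u) - exp(-u), independent of m.
def within_factor(u):
    return math.exp(-1 / u) - math.exp(-u)

print(within_factor(2.0))    # about 0.4712: less than an even chance of being
                             # within a factor of two of the mean
print(within_factor(2.105))  # about 0.5: the break-even factor quoted above

# Chebyshev's bound on P(X > (1+u).m) versus the exact tail exp(-(1+u)).
for u in (1, 2, 3, 5, 10):
    print(f"u = {u:>2}:  exact = {math.exp(-(1 + u)):.6f}"
          f"   Chebyshev bound = {1 / u / u:.6f}")
```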