Fourier is a mathematical poem.
Lord Kelvin
A great many physical systems are governed by (up to) second-order linear partial differential equations. One of the neater features of linear partial differential equations is that if you add two solutions to the equation, or multiply a solution by some constant scaling, you get another solution; linear combinations of solutions are themselves solutions. This makes it practical, and often desirable, to identify some family of simple solutions for which every solution is simply a linear combination of members of the chosen family. Sometimes, decomposing a function into such a linear combination can make it easier to see how that function will interact with the differential equation; this can even be helpful when the simple functions in the family aren't solutions – e.g. if their interaction with the differential equation is suitably straightforward. Where rotations are prominent in a theory, the spherical harmonics may be used; where translations are more prominent, however, sinusoids prove more apt.
The Fourier transform decomposes any function ({complex}: f |V), from a linear space to scalars, into a linear combination of sinusoids: the outputs of (| (: exp(i.k·x) ←x :) ←k :dual(V)) provide an easy form of these (with i as a square root of −1); when it is important to only work with real scalars, it may still make sense to use these (and not mind the complex intermediate values) but otherwise one can use the union of {(: sin(k·x) ←x |V): k in dual(V)} and {(: cos(k·x) ←x |V): k in dual(V)}. These functions have the nice property that each one's second derivative is a negative multiple of the function itself; for the sinusoid with wave co-vector k, f'' = −k×k.f. This tends to simplify their relationships with partial differential equations.
In deriving the Fourier transform, I'll work with functions ({complex}: |V) for some vector space V over complex. However, since we can apply that analysis to each component with respect to a basis, and the transform is linear, we can reassemble from the transforms of components a transform of any function (U: |V) with U and V linear over the complex numbers.
We can extend any vector space over the reals to a vector space over the
complex numbers by simply accepting multiplication by complex factors; this
amounts to replacing a real vector space V with V&tensor;{complex}, albeit while
retaining a knowledge of which of its members are real
(i.e. in the
original V), a concept that vector spaces over the complex numbers don't
necessarily entertain. This, in turn, can let us distinguish real
transforms from complex ones and thus obtain sine and cosine transforms of
functions between real vector spaces, implicitly by going via the complex
extensions of the spaces and then showing that we get back to the real spaces in
the end.
For a function ({complex}: f |V) from some real-linear space to the complex scalars, we have ({complex}: Fourier(f) |dual(V)) characterized by

f = (: integral(: exp(i.k·x).Fourier(f, k) ←k :dual(V)) ←x |V)

from which we need to be able to infer Fourier(f) from f. Fourier determined that the correct answer was of form

Fourier(f) = (: integral(: exp(−i.k·x).f(x) ←x |V) ←k :dual(V))/power(dim, 2.π)

(writing dim for the dimension of V).
(It'll become evident, below, that the scaling by 2.π could be split between Fourier and its inverse; the above is just one definition one can use for the Fourier transform, and I don't actually know what Fourier originally did about the scaling.) To check this answer, we need to evaluate, for any x in V (expecting f(x) as answer):

integral(: exp(i.k·x).integral(: exp(−i.k·y).f(y) ←y |V) ←k :dual(V))/power(dim, 2.π) = integral(: f(y).integral(: exp(i.k·(x −y)) ←k :dual(V)) ←y |V)/power(dim, 2.π)
The inner integral, over dual(V), reduces to integral(: 1←k
|dual(V)) when x and y are equal; this is clearly infinite. When x and y
differ, the integrand is a sinusoid in k; strictly, formally, the integral thus
cannot converge. We shall, in due course, see that this
inner integral yields a very strange function
of x−y (known as
Dirac's delta function). In the face of this oddity, it proves constructive to
approach the problem from another direction.
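As a concrete (if informal) illustration of that strangeness, here is a small Python sketch of mine, using numpy: in one dimension, cutting the divergent inner integral off at |k| ≤ K (and keeping the 1/(2.π) that goes with it) gives sin(K.(x−y))/(π.(x−y)), whose central value grows like K/π while its total integral stays near 1 – exactly the behaviour one would want of an approximation to Dirac's delta. The cut-off K and the grid are illustrative choices, not anything defined above.

```python
import numpy as np

def truncated_inner(t, K):
    # integral(exp(i.k.t) dk over -K <= k <= K)/(2.pi) evaluates exactly to
    # sin(K.t)/(pi.t), with value K/pi at t = 0; here t stands for x - y.
    t = np.asarray(t, dtype=float)
    safe = np.where(t == 0.0, 1.0, t)
    return np.where(t == 0.0, K / np.pi, np.sin(K * safe) / (np.pi * safe))

fine = np.linspace(-1.0, 1.0, 20001)
for K in (10.0, 100.0, 1000.0):
    spike = truncated_inner(fine, K)
    # the central value grows like K/pi; the area under the spike stays near 1.
    print(K, spike[len(fine) // 2], np.trapz(spike, fine))
```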
In order to justify the definition of the Fourier transform and, in particular, the scaling it involves, let us first study its inverse; and restrict attention to a particularly well-behaved family of functions.
Since we're doing integration, we need a measure
on our real vector
space; this is the equivalent of a notion of volume in three-dimensional space,
area in two dimensions and length in one. One way to obtain that is from a
metric (it defines a unit of length and notion of orthogonality that we can use
to define integration); this is, formally, a linear iso (dual(V):g|V) which
is symmetric
, meaning g(v, u) = g(u, v) for all v, u in V. It'll usually
also be positive definite
, i.e. g(v, v) >0 for every non-zero v in V.
(Strictly, we don't need to specify a metric; we only need to specify the square
root of its determinant, which maps volume elements in V to their (positive)
volumes; but in practice it is usual to specify a metric, and this simplifies
the discussion.) Our integration over V is tacitly defined in terms of the
metric, g, via √det(g), where det maps linear maps (on finite-dimensional
spaces) to their determinants. If we shift co-ordinates in V, we'll need a
factor of the determinant of a linear map that describes the co-ordinate change
(the Jacobian of the change of co-ordinates).
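As a small aside in code (a hedged sketch of mine, not part of the argument): the role of that Jacobian factor is easy to check numerically. Under a linear change of co-ordinates x = A·u, the volume element picks up a factor of |det(A)|, so integrating in either set of co-ordinates gives the same answer; the matrix A and the Monte-Carlo set-up below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 1.0], [0.5, 3.0]])      # an arbitrary invertible linear map

def f(x, y):                                # something that dies away fast
    return np.exp(-(x * x + y * y) / 2)

N, L = 2_000_000, 8.0                       # Monte-Carlo samples over a box
u = rng.uniform(-L, L, size=(N, 2))
box = (2 * L) ** 2

direct = box * f(u[:, 0], u[:, 1]).mean()   # integrate f over x directly
x = u @ A.T                                 # x = A.u, so integrate over u instead
via_u = box * abs(np.linalg.det(A)) * f(x[:, 0], x[:, 1]).mean()
print(direct, via_u, 2 * np.pi)             # both estimates approach 2.pi
```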
For any positive-definite metric G on our linear space V, we can define the gaussian ({positives}: exp(−G(x,x)/2) ←x |V). We can translate this through any vector in V and multiply it pointwise by an arbitrary sinusoid to get a ({complex}: |V) version not centred on zero (as the given one is). Thus we can define a mapping W for which
for each positive-definite metric G on V, each (mid-point) m in V and each (wave co-vector) h in dual(V). That may seem like a hairy mess to be defining, but let's look at the effect of applying our unscaled transform to one of these with G positive-definite, taking dim to be the dimension of V;
this is insensitive to us re-naming k+h to k and x−m to x, so we can re-write it as
integration over all of V is insensitive to translation of the origin, so (pulling the obvious exp(i.k·m) factor out as a constant in the integration) this is just
Now, G is positive definite and symmetric; there's a theorem of linear algebra which then lets us select a basis (dual(V):q|dim) which simultaneously diagonalises both g and G; i.e. for which g = sum(: N(e).q(e)×q(e) ←e |dim) and G = sum(: q(e)×q(e) ←e |dim) for some ({reals}: N |dim). Dual to this basis will be a basis (V:b|dim) of V for which q(e)·b(j) = 0 unless e is j, in which case q(e)·b(e) = 1. This will express k·x as sum(: k·b(e).q(e)·x ←e |dim) and G(x,x) as sum(: q(e)·x.q(e)·x ←e |dim). Thus we obtain
but integrating x over V allows q(e)·x to take every real value, independently for each e in dim, enabling us to separate integration out among these co-ordinates; and q diagonalises g so g's integration over V is equivalent to integrating over these co-ordinates for x and multiplying by the square root of g's determinant in these co-ordinates — i.e. by √product(N) = √det(g/G) – giving us
using n = k·b(e), we have t.t −2.i.t.n = (t−i.n).(t−i.n) +n.n so can shift each t-origin by i.k·b(e), and then exploit some helpful properties of analytic functions on the complex plane (the shift moved integration from {reals} to {i.n + reals}, but the function integrated has no poles at finite input, and dies away good and fast towards infinity, which lets us shift the integration back), to turn this into
and the t-integral, which is the only remaining integration, is inside the product over e in dim, but independent of e; so we just get power(dim) of it; while the remaining product over e is multiplying together outputs of exp, so could equally sum the inputs to exp over e and feed the result to exp:
meanwhile, the inverse of G is just sum(: b(e)×b(e) ←e |dim); call this H for now and observe that sum(: k·b(e).k·b(e) ←e |dim) is just H(k,k) = k/G\k, so we have
and we can evaluate the t-integral (by exploiting the two-dimensional case); use u = t/√2 to give integral(: exp(−t.t/2) ←t |{reals}) = integral(: exp(−u.u) ←u :).√2 = √(2.π), to get
(phew). Now that we know how W transforms, we can define a re-scaled version of it which splits the dependence on G-or-H, h and m between the side being transformed and the side that comes out as the answer;
(wherein root(4) is just the reverse of power(4), i.e. root(4, t) is just √√t). Note that g's inverse is the metric on dual(V) we would use to define integral(: |dual(V)), so that where w(G) uses G/g, w(H) must use H·g; and det(H·g).det(G/g) = 1, so we obtain
and we can sink the power of 2.π in the integration by scaling x; so define an operator
to obtain
with H and G mutually inverse; we then obtain
so that F&on;F&on;F&on;F acts as the identity on pointwise products of gaussians and sinusoids. Functions which die away to zero reasonably fast outside some finite region of some linear space can always be expressed as integrals of w over the available range of values for metric, centre-point and wave co-vector (i.e. G, m and h); F is linear, so transforms each w-function separately and maps such a superposition to the matching superposition of transformed w-functions; thus F&on;F&on;F&on;F also acts as the identity on such functions. It should also be noted that
but G(x+m,x+m) = G(−x−m,−x−m), so this is
so that F&on;F is actually just (: (: f(−x) ←x :) ←f :), at least on the span of sinusoidal gaussians (which is everything we're justified in supposing we can play with). We can then introduce negate = (: −x ←x :) to write this as F(F(f)) = f&on;negate. This, in turn, means that
Now Fourier's inverse is defined to be the unscaled form of F; and F's inverse is just F&on;F&on;F; so, on any given linear space of dimension dim, Fourier's inverse is F.power(dim, √(2.π)) and Fourier = F&on;F&on;F/power(dim, √(2.π)), which (again sinking the power of 2.π in the integration variable, where it joins up with the one already there) is what I gave earlier.
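Since the whole derivation turned on gaussians, it is reassuring (though no substitute for the analysis) to check the key fact numerically. The sketch below is mine and uses the one-dimensional convention F(f, k) = integral(: f(x).exp(i.k.x) ←x :{reals})/√(2.π), with arbitrary grid sizes: a centred gaussian exp(−a.x.x/2) should transform to exp(−k.k/(2.a))/√a – a gaussian built from the inverse metric – and applying F twice should give back f&on;negate.

```python
import numpy as np

def F(samples, xs):
    """Approximate F(f, k) = integral(f(x).exp(i.k.x) dx)/sqrt(2.pi), with the
    transform evaluated at wave numbers taken from the same grid as x."""
    kernel = np.exp(1j * np.outer(xs, xs))          # exp(i.k.x)
    return np.trapz(kernel * samples, xs, axis=1) / np.sqrt(2 * np.pi)

xs = np.linspace(-15.0, 15.0, 1501)
a = 2.0
f = np.exp(-a * xs * xs / 2)

Ff = F(f, xs)
print(np.max(np.abs(Ff - np.exp(-xs * xs / (2 * a)) / np.sqrt(a))))   # tiny

FFf = F(Ff, xs)                 # F of F should give f composed with negate;
print(np.max(np.abs(FFf - f)))  # this gaussian is even, so that's just f again
```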
Having used the above to justify the definitions, I'll mainly work with F, only sometimes with Fourier and its inverse, hereafter; g and W shall drop out of the discussion, while w will only be relevant for the following digression.
Note that w(G,m,h) is centred on
m but spreads out around it to an
extent characterized by G's unit sphere, {x in V: G(x,x) = 1}, which gets
proportionately smaller as G gets bigger. Likewise, w(H,−h,m) is centred
on −h, but spreads out to an extent characterized by H's unit sphere, {k
in dual(V): H(k,k) = 1}, which gets proportionately bigger as G gets bigger
(because that makes H get smaller). Thus the less blurred a function is, the
more blurred its transform is bound to be.
Specifically, for each basis member b(e), with e in dim, w(G,m,h)'s spread in the b(e) direction is proportional to √N(e) while w(H,−h,m)'s spread in the corresponding q(e) direction is proportional to 1/√N(e). If we combine two w-functions, with spreads a and A in some direction, their combined spread will be (the separation of their centres plus) at least the maximum of a and A; their transforms will have spreads 1/a and 1/A yielding a combined spread of (the separation of the two wave co-vectors plus) the maximum of these; when we multiply these two combined spreads we thus get at least (and generally more than) 1. As for a simple sum, so for an arbitrary linear combination of w's, hence for any function to which we have any business applying the Fourier transform.
This imposes a lower bound on the product of the uncertainties in any two functions which are one another's Fourier transforms, classically the position and wave co-vector of any function of which those are characteristic properties. In quantum mechanics, the momentum of a particle is just Planck's constant times the particle's wave covector (passed through the metric's inverse to turn it into a vector), so the uncertainties in the particle's position and momentum must, when multiplied, yield at least Planck's constant times the lower bound. Thus Heisenberg's famous uncertainty principle is a natural property of the Fourier transform.
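A hedged numerical illustration of this (mine, not the text's): measuring spread by the standard deviation of |f|² – a different measure from the unit-sphere one above, so the constant comes out as one half rather than one – the product of the spreads of a gaussian and of its transform does not depend on how narrow the gaussian is.

```python
import numpy as np

xs = np.linspace(-16.0, 16.0, 1601)
kernel = np.exp(1j * np.outer(xs, xs))          # exp(i.k.x), k on the same grid

def spread(weight, ts):
    w = weight / np.trapz(weight, ts)
    mean = np.trapz(ts * w, ts)
    return np.sqrt(np.trapz((ts - mean) ** 2 * w, ts))

for a in (0.25, 1.0, 4.0):
    f = np.exp(-a * xs * xs / 2)                # gaussian of width 1/sqrt(a)
    Ff = np.trapz(kernel * f, xs, axis=1) / np.sqrt(2 * np.pi)
    dx = spread(np.abs(f) ** 2, xs)             # spread of |f|^2 in x
    dk = spread(np.abs(Ff) ** 2, xs)            # spread of |F(f)|^2 in k
    print(a, dx, dk, dx * dk)                   # the product stays at 1/2
```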
The w-functions also serve as archetype for the notion of a
wave packet
; they're localized, like a particle, but also have
wave-nature. It's thus natural to interpret a photon (for example) as a
w-function at any given moment of time. Suppose we have a wave-packet that's
w(G,m,k) at time t=0. We know w(G,m,k) is, with H as G's inverse,
F(w(H,k,−m)). This expresses our wave-packet as a superposition of
spatial sinusoids at that moment; and sinusoids are solutions to the
electromagnetic field equations. A sinusoid with spatial wave co-vector q is
the state, at t=0, of a solution which is a sinusoid whose space-time co-vector
has q for its spatial co-ordinates, along with a time-like co-ordinate equal to
the magnitude of q. Taking g as the inverse of our 3-space metric, we have
√g(q,q) as the time-like component, so the phase exp(i.q·x) changes
to exp(i.(q·x +c.t.√g(q,q))) at time t. Thus our wave packet
varies with time as integral(: (: exp(i.q·x) ←x
:).w(H,k,−m,q).exp(i.c.t.√g(q,q)) ←q :), which ceases being a
simple wave-packet once t isn't 0. Thus, although very nice as an archetype for
a wave-packet, w's sinusoidal Gaussians aren't quite satisfactory as an actual
description of photons …
Now consider a transformation which maps a given (: f |V) to
This is trivially just (: F(f, Q.k) ←k |dual(V)). Applying it to its own output, we get
which is inescapably just
which is just F(F(f))/power(dim, Q) = f&on;negate/power(dim, Q). Now F is just Fourier's inverse with a scaling of power(dim, √(2.π)), so F&on;F is just Fourier's inverse squared with a scaling of power(dim, 2.π). If we applied a Q-scaling to Fourier's inverse's exp(…)'s input, the result would compose with itself to yield power(dim, (2.π)/Q).f&on;negate ←f, so choosing Q = 2.π gives us an operator

E = (: (: integral(: exp(2.π.i.k·x).f(x) ←x |V) ←k :dual(V)) ←f :)
for which E&on;E = (: f&on;negate ←f :) = F&on;F. Thus E, like F,
is a fourth root of the identity; and the only place it involves a scaling is in
the input to exp. Now, ({phases}: exp(2.π.i.t) ←t |{reals}) is a
periodic function with period 1, and our transformation is using this as its
canonical sinusoid – rather than cos(t) + i.sin(t) = exp(i.t) ←t
– with which to decompose any function as a superposition of sinusoids.
In effect, this amounts to choosing the whole period
(or turn) as unit of
angle, in preference to the radian.
Likewise, the fact that the Gaussian integral(: exp(−x.x) ←x |{reals}) turned out to be √π implies that integral(: exp(−π.x.x) ←x |{reals}) is 1. This is another case where exp wants its input scaled by a factor including a π, rather hinting that exp wants its input to give rational period.
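Both of these facts are easy to check by numerical quadrature; the sketch below (mine, using scipy for the integration) also confirms that, under E's convention, the gaussian exp(−π.x.x) is its own transform.

```python
import numpy as np
from scipy.integrate import quad

print(quad(lambda x: np.exp(-np.pi * x * x), -np.inf, np.inf)[0])    # ~ 1.0

def E(f, k):  # E(f, k) = integral(f(x).exp(2.pi.i.k.x) dx), one dimension
    re = quad(lambda x: f(x) * np.cos(2 * np.pi * k * x), -np.inf, np.inf)[0]
    im = quad(lambda x: f(x) * np.sin(2 * np.pi * k * x), -np.inf, np.inf)[0]
    return complex(re, im)

for k in (0.0, 0.5, 1.0, 2.0):
    # the transform of exp(-pi.x.x) should again be exp(-pi.k.k):
    print(k, E(lambda x: np.exp(-np.pi * x * x), k), np.exp(-np.pi * k * k))
```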
So I hope I've satisfied you that F, which is just Fourier's inverse
judiciously scaled, is simply a fourth root
of the identity on (a broad
class of) functions between linear spaces; hence that Fourier is sensibly
defined (at least on the relevant broad class of functions). There are some
serious analytic questions which deserve to be addressed (as should be evident
from my first, naïve, attack at justifying my definition of Fourier) if
Fourier is to be applied to arbitrary functions between linear spaces; the
integrals are all too prone to turning out divergent when examined in detail.
Rather than go into those, I'm going to flagrantly ignore the problems, derive
some orthodox results (which should perhaps be thought of as slogans
rather than theorems) and then produce a justification for them. Because it
happens to avoid lots of scrappy little powers of 2.π, I'm going to use E
rather than F or Fourier.
I showed earlier that, at least on those inputs where E is well-defined,
E&on;E = (: (: f(−x) ←x :) ←f :) = (: f&on;negate ←f :).
Thus E is a square root of compose after negate
, which (in its turn) is
manifestly a square root of the identity (since negate is self-inverse). So
let's look at E&on;E again: given (: f |V) for some linear space V of dimension
dim,

E(E(f), −x) = integral(: integral(: exp(2.π.i.k·y).f(y) ←y |V).exp(−2.π.i.k·x) ←k :dual(V)) = integral(: f(y).integral(: exp(2.π.i.k·(y −x)) ←k :dual(V)) ←y |V)

so define

δ = (: integral(: exp(2.π.i.k·x) ←k :dual(V)) ←x |V)

polymorphically for each linear space V; we then have

E(E(f)) = (: integral(: f(y).δ(y +x) ←y |V) ←x |V)

whence, selecting a particular input −x,

f(x) = E(E(f), −x) = integral(: f(y).δ(y −x) ←y |V)
so that integrating the pointwise product of δ with some other function selects that other function's value where δ's input is zero. That's quite a clever trick for a function to do. In particular, it says that δ is 0 everywhere but at zero, where I already noted that it is infinite (since it's integral(: 1←k |dual(V)), and dual(V) is a linear space over the reals, so unbounded).
Of course, if we apply either E or Fourier's inverse to δ we trivially obtain (: 1 ←k |dual(V)), so (:δ|V) = Fourier(: 1←k |dual(V)). This also gives us Fourier(:δ|V) = (: 1/power(dim, 2.π) ←k |dual(V)) and F(:δ|V) = (: 1/power(dim, √(2.π)) ←k |dual(V)). This is constant, so applying F to it involves integrating a simple sinusoid over a vector space, which is divergent; but that's OK, we already know E&on;E = F&on;F is just (: f&on;negate ←f :) so F(F(δ)) = δ&on;negate = E(E(δ)). Likewise, E(E(E(δ))) = E(δ)&on;negate, and if you can't distinguish that from E(δ) it's because you'd have had trouble distinguishing δ&on;negate from δ.
Now, reasonable, sane folk might plausibly be upset at these games, so perhaps it's time I told you about distributions. A distribution on some domain V is formally a linear map ({scalars}: h :{map ({scalars}: :V)}). It induces a natural extension to (W: :{map (W: :V)}), for arbitrary linear space W, by applying h to components in W; and h is equally a member of dual({map ({scalars}: :V)}). Now, when the V in question is a finite collection, we can define a particular distribution on it called sum which just adds together all the outputs of the mapping from V that it's given. In such a case, any other distribution h on V can be written as (: sum(: H(v).f(v) ←v :V) ← (:f|V) :) for some ({scalars}: H |V) defined by H(v) = h({1}::{v}), h's output when its input maps v to 1 (and ignores all other members of V, effectively mapping them to 0).
In light of this last, I define a binary operator @ for which k@h is defined whenever h is a distribution on some V and k is a mapping ({scalars}:k:V); in that case, k@h = (: h(: k(v).f(v) ←v :V) ←(:f:V) :). In these terms, the above expressed h as H@sum. It is worth noting that, when k@h is defined, h is a distribution on V, hence a member of dual({map ({scalars}::V)}), and k is a member of {map ({scalars}::V)}; so it's worth bearing in mind that I may eventually work out how to generalize @ to combine k@h whenever k and h are members of mutually dual linear spaces.
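To make the formalism concrete, here is a small Python toy of my own (the names and helpers are purely illustrative): a distribution is modelled as a callable taking test functions to scalars, sum_over plays the part of sum on a finite collection, at(k, h) plays the part of k@h, and any distribution on a finite V is recovered as H@sum.

```python
from typing import Callable, Iterable

TestFn = Callable[[float], float]
Distribution = Callable[[TestFn], float]

def sum_over(V: Iterable[float]) -> Distribution:
    """The distribution 'sum' on a finite collection V."""
    members = list(V)
    return lambda f: sum(f(v) for v in members)

def at(k: Callable[[float], float], h: Distribution) -> Distribution:
    """k@h: feed h the pointwise product of k with its test function."""
    return lambda f: h(lambda v: k(v) * f(v))

# On a finite V, any distribution is H@sum with H(v) = h(indicator of v):
V = [0.0, 1.0, 2.0, 3.0]
h = lambda f: 3 * f(1.0) - 2 * f(3.0)          # some arbitrary distribution
H = {v: h(lambda u, v=v: 1.0 if u == v else 0.0) for v in V}
rebuilt = at(lambda v: H[v], sum_over(V))
test = lambda v: v * v + 1
print(h(test), rebuilt(test))                   # the two agree

# Dirac's delta, as a distribution on the reals, just evaluates at zero:
delta = lambda f: f(0.0)
print(delta(lambda x: (x - 2) ** 2))            # 4.0
```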
So, now that I've told you what distributions are, you might make sense of
the fact that: integral is a distribution; most distributions we deal with on
continua (e.g. linear spaces over {reals}) are of form f@integral for some
scalar function f; δ isn't really a scalar function, but
δ@integral
is a distribution. Thus, when orthodoxy says

integral(: δ(x).f(x) ←x |V) = f(0)

what it really means is to talk about a distribution Δ, to be thought of as δ@integral, for which

Δ = (: f(0) ←(: f :V) :)

which is a perfectly well-defined distribution.
It is manifest from the structure of its definition that F is linear. This implies that F(: f(x) +h×e(x) ←x :) = F(f) +h×F(e) where meaningful. In this section, I'll look into how F interacts with some other transformations one can apply to a function. It'll also turn out to be worth looking into F's action on powers and polynomials, for all that they don't strictly meet the criterion for eligibility as inputs to F.
Given a function (W:f|V), for linear spaces V and W, it should be immediately apparent from the form of F that scaling by a sinusoid or shifting origin will each transform to the other:

F(: exp(i.h·x).f(x) ←x |V) = (: F(f, k +h) ←k :dual(V)) and F(: f(x −m) ←x |V) = (: exp(i.k·m).F(f, k) ←k :dual(V))

for any h in dual(V) and m in V.
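A quick numerical sanity check of those two rules (my own sketch, in one dimension, with F(f, k) = integral(: f(x).exp(i.k.x) ←x :{reals})/√(2.π) and scipy doing the quadrature; the particular h, m and k are arbitrary):

```python
import numpy as np
from scipy.integrate import quad

def F(f, k):
    re = quad(lambda x: (f(x) * np.exp(1j * k * x)).real, -np.inf, np.inf)[0]
    im = quad(lambda x: (f(x) * np.exp(1j * k * x)).imag, -np.inf, np.inf)[0]
    return complex(re, im) / np.sqrt(2 * np.pi)

f = lambda x: np.exp(-x * x / 2)          # a convenient test function
h, m, k = 1.3, 0.7, 0.4

# scaling by a sinusoid <-> shifting the transform's input:
print(F(lambda x: np.exp(1j * h * x) * f(x), k), F(f, k + h))
# shifting the function's input <-> scaling the transform by a sinusoid:
print(F(lambda x: f(x - m), k), np.exp(1j * k * m) * F(f, k))
```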
We can also ask for f's derivative, ({linear (W:|V)}: f' |V), whose transform will give, for any k in dual(V),
to which we can apply integration by parts
; the integrand is one of
the two terms in the x-derivative of exp(i.k·x).f(x), so it's
but, for f to be amenable to F, it must die off fast enough, as x grows, to make the former zero, leaving the latter, which is just
so that differentiation just scales a function's transform by −i times its input (necessarily tensored on the right so as to make a linear (W:|V) as the output),
In particular, with N = power(dim, √(2.π)), N.F(δ) = (: 1←k :) so N.F(δ') = (: −i.k ←k :), N.F(δ'') = (: −k×k ←k :), N.F(δ''') = (: i.k×k×k ←k :), N.F(δ'''') = (: k×k×k×k ←k :) and so on; when V = {reals}, dim = 1 this tells us that F maps any linear combination of derivatives of δ to a corresponding polynomial. Likewise,
In particular, F(: 1←x :) is just N.δ so F(: x←x :) is −i.N.δ', F(: x×x ←x :) is −N.δ'', F(: x×x×x ←x :) is i.N.δ''', F(: x×x×x×x ←x :) is N.δ'''' and so on; with V = {reals}, dim = 1 we thus find that F maps any polynomial to a suitable linear combination of the derivatives of δ. In particular, if we have a function that we believe to be a polynomial, computing its Fourier transform will reveal its coefficients as the multipliers of the various derivatives of δ.
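Both rules are easy to check numerically on a function that certainly is amenable to F; in the sketch below (mine, one dimension, same convention for F as before), the transform of f' matches −i.k times the transform of f, and the transform of (: x.f(x) ←x :) matches −i times the derivative of the transform, estimated here by a central difference.

```python
import numpy as np
from scipy.integrate import quad

def F(f, k):
    re = quad(lambda x: (f(x) * np.exp(1j * k * x)).real, -np.inf, np.inf)[0]
    im = quad(lambda x: (f(x) * np.exp(1j * k * x)).imag, -np.inf, np.inf)[0]
    return complex(re, im) / np.sqrt(2 * np.pi)

f = lambda x: np.exp(-x * x / 2)
df = lambda x: -x * np.exp(-x * x / 2)       # f', worked out by hand
k, eps = 0.9, 1e-4

print(F(df, k), -1j * k * F(f, k))                        # F(f') = -i.k.F(f)
dFf = (F(f, k + eps) - F(f, k - eps)) / (2 * eps)         # numerical F(f)'
print(F(lambda x: x * f(x), k), -1j * dFf)                # F(x.f) = -i.F(f)'
```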
Earlier I noted that it's possible to handle real functions using just sin and cos, rather than the complex phase sinusoid (: exp(i.t) ←t :), so I guess I'd better show how that's done, too. I'll demonstrate this for E, but similar can be done for F or Fourier.
Suppose we have a ({reals}: f |V); then we have

E(f) = (: integral(: f(x).(cos(2.π.k·x) +i.sin(2.π.k·x)) ←x |V) ←k :dual(V))

so introduce

C = (: (: integral(: cos(2.π.k·x).f(x) ←x |V) ←k :dual(V)) ←f :) and S = (: (: integral(: sin(2.π.k·x).f(x) ←x |V) ←k :dual(V)) ←f :)

so that we have E(f) = C(f) +i.S(f). (Note, in passing, that the functions (: cos(2.π.t) ←t :) and (: sin(2.π.t) ←t :) are just (: Cos(t.turn) ←t :) and (: Sin(t.turn) ←t :), in terms of the trigonometric functions of angle, rather than their dimensionless cousins implicitly using radian as unit of angle.) Now, E(E(f)&on;negate) is f, so our inverse transformation will be

f = C(C(f)&on;negate) −S(S(f)&on;negate) +i.(S(C(f)&on;negate) +C(S(f)&on;negate))

and we know f is real, while C, S and negate preserve real-ness; so the two imaginary terms must necessarily cancel one another, leaving us with

f = C(C(f)&on;negate) −S(S(f)&on;negate)

Now, composing S(f) after negate just replaces k with −k in sin(2.π.k·x) in the integral; and sin(−2.π.k·x) = −sin(2.π.k·x) so S(f)&on;negate is simply −S(f). Likewise, composing C(f) after negate gives us cos(−2.π.k·x) in place of cos(2.π.k·x); but (: cos(−t) ←t :) = cos, so this is no change and we have C(f)&on;negate = C(f). Thus the inverse transform is simply:

f = C(C(f)) +S(S(f))
The equivalent for F, in place of E, would define C and S with k·x, without any factor of 2.π, as input to sin and cos; while multiplying the input to the integrand by √(2.π) or (equivalently) dividing the integral by a factor of power(dim, √(2.π)); this would give C&on;C +S&on;S as the identity, again.
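Here is a numerical check of my own (the grids are arbitrary) that C&on;C +S&on;S does act as the identity in E's convention, using a shifted gaussian so that neither C nor S vanishes on its own:

```python
import numpy as np

xs = np.linspace(-6.0, 8.0, 2001)            # covers where f is non-negligible
ks = np.linspace(-5.0, 5.0, 2001)            # covers where C(f) and S(f) are
f = np.exp(-np.pi * (xs - 1.0) ** 2)         # real, but not even, so S matters

cosker = np.cos(2 * np.pi * np.outer(ks, xs))    # cos(2.pi.k.x) on the grids
sinker = np.sin(2 * np.pi * np.outer(ks, xs))
Cf = np.trapz(cosker * f, xs, axis=1)            # C(f) on the k grid
Sf = np.trapz(sinker * f, xs, axis=1)            # S(f) on the k grid

back = np.trapz(cosker.T * Cf, ks, axis=1) + np.trapz(sinker.T * Sf, ks, axis=1)
print(np.max(np.abs(back - f)))                  # C(C(f)) + S(S(f)) = f, so tiny
```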
In discarding the two complex terms, I reasoned that they must cancel; just for the record, aside from their factor of i, they're just
in which the sine of a difference appears, yielding
and integral(: Sin(k·y.turn) ←k |dual(V)) is zero when y is zero; otherwise, it's the integral of a sinusoid, so strictly divergent but, by analogy with δ being zero except at zero, it's persuasively zero (this can alternatively be justified by cutting dual(V) in half, with {k: k·y = 0} as the cut, and expressing one half as {−k: k in H} with H as the other half; then our integral over dual(V) becomes an integral over H of two terms, one of which is Sin(k·y.turn), the other is the same for −k, which cancels it exactly).
So I'd be on slightly dodgy ground claiming the two complex terms cancel, but for the fact that we know in advance that our answer is real. Note, though, that their cancellation tells us that S and C commute: C&on;S = S&on;C, at least when acting on real-valued functions ({reals}: |V). Since they are also linear, that should suffice to ensure they still commute on complex-valued functions.
There are various other cousins of the Fourier transform that preserve real-ness, notably including the Hartley transform.
Now let's consider two functions (U:u|W) and (V:v|W) from one space, W, to two spaces U and V between which we have a multiplication, and ask how E transforms the pointwise product u×v. For any k in dual(W),
in which we can substitute E(E(v)&on;negate, x) for v(x),
in which (admittedly modulo an application of Fubini's theorem, the details of which I'm glossing over) reversing the order of integration gives us
This combines E(u) and E(v) in a way that's known
as convolution
, defined, for any vector space X and functions (:g|X) and
(:f|X), by f*g = (: integral(: f(x −y).g(y) ←y |X) ←x |X),
with f*g being the convolution of f and g. Thus E(u×v) =
E(u)*E(v). Naturally this also implies E(u*v) = E(u).E(v). These results are
known as the convolution theorem
and technically only apply (see the note
on Fubini's theorem) when u and v have finite integrals of magnitudes. Which,
of course, I'm promptly going to ignore.
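For what it's worth, the discrete analogue is easy to verify exactly (a sketch of mine, not the text's): for sampled periodic data the discrete Fourier transform plays E's part, the transform of a circular convolution is the pointwise product of the transforms, and the transform of a pointwise product is the circular convolution of the transforms divided by the sample count.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 256
u, v = rng.standard_normal(n), rng.standard_normal(n)
U, V = np.fft.fft(u), np.fft.fft(v)

# circular convolution: (u*v)[k] = sum_j u[j].v[(k - j) mod n]
conv = np.array([np.sum(u * np.roll(v[::-1], k + 1)) for k in range(n)])
print(np.max(np.abs(np.fft.fft(conv) - U * V)))        # transform of u*v is U.V

UconvV = np.array([np.sum(U * np.roll(V[::-1], k + 1)) for k in range(n)])
print(np.max(np.abs(np.fft.fft(u * v) - UconvV / n)))  # transform of u.v is (U*V)/n
```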
Now, earlier, we saw that E(: x×x ←x :) was (some normalisation constant times) −δ'' and E(: x ←x :) was −i.δ' (similarly scaled), from which we can (modulo such scaling) infer that δ'' = δ'*δ'. Indeed, for general f, E(: f(x)×x ←x :) is (modulo the same scaling) −i.E(f)', which implies E(f)' = E(f)*δ' and, this being true for all f, even f' = f*δ'; so convolution with the derivative of delta implements taking the derivative of any function.
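In discrete terms (another sketch of mine): a two-point stencil [1, −1]/h is a crude stand-in for δ' as h shrinks, and convolving samples of a function with it does indeed approximate the function's derivative.

```python
import numpy as np

h = 0.001
xs = np.arange(-3.0, 3.0, h)
f = np.sin(xs)

ddelta = np.array([1.0, -1.0]) / h              # a crude stand-in for delta'
approx = np.convolve(f, ddelta, mode="same")    # (f * delta')(x) ~ f'(x)
inner = slice(2, -2)                            # skip the ends, where the stencil spills over
print(np.max(np.abs(approx[inner] - np.cos(xs)[inner])))   # ~ h/2
```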
The operation of convolution may seem somewhat strange, but we have in fact met it elsewhere: in the multiplication of polynomials. If (: c :n) and (: e :m) are the coefficients of two polynomials, for some naturals n, m, then the product of the polynomials sum(power.c) and sum(power.e) will be sum(power.p) with p = (: sum(: c(i).e(j) ←[i, j]; i+j = k :) ←k :n+m), which we can rewrite as (: sum(: c(k −j).e(j) ←j :) ←k :), which is just p = c*e. Thus sum(power.c).sum(power.e) = sum(power.(c*e)) gives polynomial = (: sum(power.p) ←p |{lists of scalars}) the same relationship to convolution that E, F and Fourier have.
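In code (a tiny example of mine), numpy.convolve applied to the coefficient lists performs exactly this multiplication of polynomials:

```python
import numpy as np

c = [1, 2, 3]            # 1 + 2.x + 3.x.x
e = [4, 0, 5]            # 4 + 5.x.x
p = np.convolve(c, e)    # coefficients of the product polynomial
print(p)                 # [ 4  8 17 10 15]

for x in (0.5, 1.0, 2.0):   # cross-check by evaluating both sides
    print(np.polyval(c[::-1], x) * np.polyval(e[::-1], x), np.polyval(p[::-1], x))
```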
As noted above: if we have a function that we believe is a polynomial, taking its Fourier transform will get us its coefficients of powers as the transform's coefficients of derivatives of δ; and adding an offset to the input to a function has the effect, on the transform, of scaling by a sinusoid whose wave covector is that offset. This gives us two ways to write the transform of (: x −k ←x |V), as (: exp(i.k.m).δ'(m) ←m :) by applying the offset rule to F(: x ←x :) = δ', or as δ' −k.δ by reading it as a polynomial, implying (give or take some normalisation factors) k.δ = (: 1 −exp(i.k·m) ←m :dual(V)).δ' for every k in V.
3Blue1Brown has some
good material on the Fourier transform and the related operation on functions
called convolution,
including this one
on Borwein integrals
– which, in their own right, are a lovely
example of how a sequence that starts out following a seemingly predictable
pattern can merrily depart from
it after a while; a
warning to beware of the easy conclusion that, having worked for the first several, it probably keeps going.