# Differentiation

Geometrically, if you draw a smooth curve (e.g. the graph of some smoothly-varying function) on a piece of paper, you can draw a straight line tangent to the curve at any point on the curve: this is a line, through the given point, going in the same direction as the curve.

If we consider a general straight line through a given point on a curve, and imagine a pair of mobile points leaving the given point at a given moment, no matter what their speeds of movement – one along the curve, the other along the straight line – the distance between them increases approximately linearly with time. However, using the tangent as the straight line, if the two mobile points move at equal speeds, the distance between them grows slower than any linear function of time to begin with – though it may increase later. On a smooth curve, there is exactly one such exceptional straight line through any given point of the curve: it is called the tangent (derived from a Latin word for touching) to the curve at that point because it touches the curve rather than simply crossing it (though it may cross in the process of touching, if the given point is a point of inflection).

I'll begin by transcribing orthodoxy into my own terms.

## Tangents as linear maps

When the curve is the graph of some function, (V:f:U), with both U (f's inputs) and V (its outputs) one-dimensional – making the curve of y = f(x) denote {[x,f(x)]} in the two-dimensional plane {[x,y]: x in U, y in V} used to model the page on which the curve is drawn – the tangent is a line of form {[x,a.x+b]: x in U} for some b in V and linear (U:a:V). For one-dimensional U and V, {linear (U::V)} is also one-dimensional: so a and b feel like scalars, hence the use of a.x, as if a and x were scalars, for a(x). Note that choice of scale of description (e.g. units of measurement) in U and V will affect the values used to represent a and b: however, the things represented are not affected by the means used to describe them.
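To make the sub-linear deviation concrete, here is a small numeric sketch in Python (my own illustration; f, u, a and b are chosen for the purpose): for f(x) = x.x at u = 1 the tangent has a = 2, b = −1, and the gap between f and the tangent line shrinks quadratically with displacement from u, while for any other line through [u, f(u)] it shrinks only linearly.

```python
# Numeric sketch (not from the text): for f(x) = x*x at u = 1, the tangent
# line is y = a*x + b with a = 2, b = -1.  The deviation f(x) - (a*x + b)
# shrinks quadratically with displacement h from u; for any other slope the
# deviation shrinks only linearly.
def f(x):
    return x * x

u, a, b = 1.0, 2.0, -1.0

def deviation(slope, intercept, h):
    # absolute gap between f and the line y = slope*x + intercept, at u + h
    x = u + h
    return abs(f(x) - (slope * x + intercept))

# Tangent: deviation is exactly h*h, so halving h quarters it.
ratios_tangent = [deviation(a, b, h / 2) / deviation(a, b, h)
                  for h in (0.1, 0.01)]
# Another line through [u, f(u)], with slope 3: deviation ~ |h|, so halving
# h merely (roughly) halves it.
ratios_other = [deviation(3.0, f(u) - 3.0 * u, h / 2)
                / deviation(3.0, f(u) - 3.0 * u, h)
                for h in (0.1, 0.01)]
```

The tangent's ratios sit at 1/4, the other line's near 1/2: exactly the quadratic-versus-linear contrast described above.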

For a mapping (V:f:U) between general linear spaces, for any u in (:f|), analogy with the familiar one-dimensional case, above, calls for us to ask for a linear (V:g:U) and a v in V for which (: g(w)+v ←w :U) and f differ less than linearly in (small enough) displacement from u.

We can express this as asking for the linear (V:g:U) which makes (V: f−g :U) much nearer constant, near u, than any other choice of linear g would. It then suffices to identify the relevant notion of near-constancy, in which output displacements shrink quadratically while input displacements shrink linearly.

In order to characterize near-constancy, I'll now introduce a tool: shrunken images of bounded star-shaped domains. These come with a natural parameterisation, induced from variation in scale of shrinkage, which reflects the linear structure within which they're defined; this parameterisation will let us compare the rates of shrinkage of variations in f−g's inputs, near u, and outputs.

### Shrinkage and star-shaped domains

First introduce:

• shrink = (: (: (: w.t +u.(1−t) ←w :) ←u :) ←t :{positive scalars})

with implicit boundary shrink(0) = constant, ignoring its second input and always yielding its first; cf. shrink(1) = constant(: w←w :) ignores its first and always yields its second. Note that for t>1 the name shrink is arguably inappropriate, but it's for t<1 that I mostly want it. I'll introduce unit = {positive x: x is at most 1} so that (:shrink:unit) deserves the name. Note that this can be re-written without recourse to subtraction as:

• (:shrink:unit) = (: x, t are positive scalars, x+t=1; (: (: w.t+u.x ←w :) ←u :) ←t :)∪(: (: (: w←w :) ←u :) ←1 :)
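In Python, shrink can be sketched as nested closures (my own rendering of the notation, with scalars standing in for the members of a one-dimensional linear space):

```python
# A minimal sketch of
#   shrink = (: (: (: w.t + u.(1-t) <-w :) <-u :) <-t :{positive scalars}):
# a curried map taking scale t, then centre u, then the point w to move.
def shrink(t):
    def towards(u):
        def move(w):
            return w * t + u * (1 - t)
        return move
    return towards
```

So shrink(1/2)(0)(8) moves 8 half-way towards 0, giving 4; and shrink(1) yields the identity whatever its centre, as noted above.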

We can then ask for (| h&on;shrink(t,u) :S) to fall within (| shrink(t.t,h(u)) :T), for all sufficiently small positive t, as a way of specifying that h couldn't be made more nearly constant by adding any linear map to it. It then suffices to discover what properties S and T must have to suitably enclose u and f(u), give-or-take a tweak to allow for pre-scaling of S and T.

If a linear space subsumes an S for which (S:t(u)|S) for all t in (|shrink:unit), for some u, then S is described as star-shaped about u in U. For given w in S, (: t(u,w) ←t :shrink&on;unit) is the straight line from u to w, so the definition can be re-expressed as: S is star-shaped about u precisely if the straight lines from points of S to u fall entirely within S. [If S is star-shaped about every u in S, it is described as convex; every ball and (linear) simplex is convex.]

Theorem: if A and B are star-shaped domains in some linear space U, then so is C = {x+y: x in A, y in B}. Proof: suppose A to be star-shaped about a, B about b; for any x in A, y in B and t in (|shrink:unit) we have t(a,x) in A and t(b,y) in B, so their sum is in C; any member of C is x+y for some such x, y and t(a+b,x+y) = t(a,x) + t(b,y) is then in C; ergo C is star-shaped about a+b. Q.E.D.
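The proof turns on the identity t(a+b, x+y) = t(a,x) + t(b,y) for each t = shrink(s); here is a quick numeric spot-check of that identity (entirely my own sketch, scalars standing in for vectors):

```python
import random

# Spot-check of the identity the proof relies on:
#   shrink(s, a+b, x+y) = shrink(s, a, x) + shrink(s, b, y).
def shrink(s, u, w):
    return w * s + u * (1 - s)

random.seed(0)
checks = []
for _ in range(100):
    a, b, x, y = (random.uniform(-5.0, 5.0) for _ in range(4))
    s = random.uniform(0.0, 1.0)
    checks.append(abs(shrink(s, a + b, x + y)
                      - (shrink(s, a, x) + shrink(s, b, y))) < 1e-9)
```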

Theorem: if S is star-shaped about u in U and (V:w|U) is linear then (|w:S) is star-shaped about w(u) in V. Proof: for any p in U and all positive x, w(shrink(x,u,p)) = w(u +x.(p−u)) = w(u) +x.w(p−u) = shrink(x,w(u),w(p)) as w is linear. Any point in (|w:S) is w(p) for some p in S, which is star-shaped about u, so every t in (|shrink:unit) has t(u,p) in S, hence w(t(u,p)) = t(w(u),w(p)) is in (|w:S), so shrinking any member of (|w:S) towards w(u) by any t in (|shrink:unit) yields a member of (|w:S). Q.E.D.

Star-shaped S about u in U

is bounded
iff intersect({(| shrink(t,u) :S): t in unit}) = {u}

Since (|shrink(t,u):S) subsumes (|shrink(x,u):S) whenever t>x, it suffices to consider only t in unit; anything in the intersection given is equally in (|shrink(x,u):S) for any x>1. Any member, w, of the intersection is necessarily in S, which is star-shaped so subsumes {shrink(t,u,w): t in unit}. For x>1, we have w in (| shrink(1/x,u) :S) whence shrink(x,u,w) in S, so S subsumes {shrink(x,u,w): x is positive}. If w isn't u, this is an infinite ray from u through w.

encloses u

iff, for all v in U, there's some t in (|shrink:) for which t(u,v) is in S (whence the same holds for all smaller t). [Strictly, this is U-enclosure; if U is a subspace of W, W-enclosure implies U-enclosure but not conversely.]

We can introduce a relation, ≥, on star-shaped domains about u by: S≥T iff, for some t in (|shrink:unit), S subsumes (|t(u):T); in which case the same is true for all smaller t. That ≥ is transitive and reflexive is trivial; it has a minimal element, {u}, and a maximal element, U. Among bounded star-shaped domains, those which enclose u are (intuitively) maximal for ≥. The intersection of ≥ with its reverse, (: S≥T, T≥S; S←T :), is manifestly an equivalence, of which the bounded star-shaped domains enclosing u form an equivalence class.

### Comparative constancy

For any u in U and v in V, shrink induces a natural isomorphism between the shrinkage mappings towards u in U and those towards v in V, formally transpose(shrink,v)&on;reverse(transpose(shrink,u)). Restricting this to scalings up to 1 gives

• same = (: (: (: s in (|shrink:unit); s(v)←s(u) :) ←u :) ←v :)

so that same(u,v) relates each small enough shrinkage towards u to the corresponding shrinkage towards v. Note that outputs of transpose(shrink,u) are (: shrink(t,u) ←t :{positives}), i.e. shrinkage mappings about u. Note that repeat(n)&on;same(v,u) has left values which shrink inwards on v faster than the corresponding right values shrink in on u.

I'll describe h as constant up to order n near u, for some natural n, precisely if, for some bounded star-shaped domains S enclosing u and T enclosing h(u), some fixed left value t and right value s of same(h(u),u):

• repeat(n)&on;same(h(u),u) relates a to b implies (| a&on;t :T) subsumes (| h&on;b&on;s :S)

A linear map g can then be a gradient for f at u iff f−g is constant up to order 2 near u. Note that repeat(n)&on;same(h(u),u) relates a to b means that a is shrink(power(n,x),h(u)) and b is shrink(x,u) for some x in unit; so the requirement can be restated as: for every x in (|shrink:unit), (| repeat(n,x(h(u)))&on;t :T) subsumes (| h&on;x(u)&on;s :S). If we replaced S with (|s:S) and T with (|t:T), we could lose the &on;s and &on;t, effectively making s and t both identities.

Now, for T star-shaped about h(u), i in n and x in (| transpose((:shrink:unit),h(u)) :) implies (| repeat(i,x) :T) subsumes (| repeat(n,x) :T), so any h constant up to order n is equally constant up to order i for each i in n. A mapping is constant up to order 0, near u, iff it is continuous at u; linear maps are constant up to order 1; constant mappings are constant up to all orders, as are such deviant functions as (: exp(−1/z/z) ←z :{reals}), extended by 0 at z = 0. I'll show, later, that the mappings constant up to order n at u are closed under addition and scaling, thus forming a linear space.
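A quick numeric illustration (my own, not the text's notation) of why a function such as exp(−1/z²), extended by 0 at z = 0, is constant up to every order at 0: its value shrinks faster than any power of the displacement.

```python
import math

# The "deviant" flat function: zero at 0, exp(-1/z**2) elsewhere.
def h(z):
    return math.exp(-1.0 / (z * z)) if z else 0.0

# h(x)/x**n stays tiny however large n is: at x = 0.1, h(x) = exp(-100),
# far below any fixed power of 0.1.
flat = [h(0.1) / 0.1 ** n < 1e-30 for n in (1, 5, 10)]
```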

If we replaced S and T, the replacements would consider h constant up to order n near u whenever: S and T so considered it, S≥ its replacement and T's replacement ≥T. This suffices to ensure that which mappings are constant up to order n near u doesn't depend on choice of S and T, provided both are bounded with S enclosing u and T enclosing h(u).

If h is constant up to order 1+n at u, h is constant at order n; otherwise, if h is constant up to order n, n is h's order of variation at u. Thus another way to say that g is f's derivative at u is to describe f−g as constant at order 1 near u.

#### Details

Theorem: the only linear map constant at order 1 is zero. Proof:

The zero mapping is linear and constant at all orders, including 1.

Now consider any linear map, (V:h|U), constant at order 1, i.e. up to order 2 (at least), near any u in U: for any star-shaped S enclosing u and T enclosing h(u), as h is constant up to order 2, there's some t in (|transpose(shrink,h(u)):) and s in (|transpose(shrink,u):) for which, for any x in unit, (| repeat(2,shrink(x,h(u)))&on;t :T) subsumes (| h&on;shrink(x,u)&on;s :S).

Now, repeat(2,func) = func&on;func and shrink(x,h(u)) = (: h(u) +x.(w−h(u)) ←w :V), so repeating it gives (: h(u) +x.x.(w−h(u)) ←w :V), which is shrink(x.x,h(u)). Also, h is linear so h&on;shrink(x,u) is (: shrink(x,h(u),h(w)) ←w :) for any positive x.

For given w in (|s:S), constancy at order 2 yields a mapping ((|t:T): v :unit) for which x in unit implies shrink(x.x,h(u),v(x)) = h(shrink(x,u,w)); as h is linear, this last is shrink(x,h(u),h(w)); and the first can be re-arranged as shrink(x,h(u),shrink(x,h(u),v(x))). Now, shrink(x,h(u)) is monic and we see it produces the same output from h(w) as from shrink(x,h(u),v(x)), so these must be equal.

As h(w) = shrink(x,h(u),v(x)) for every x in unit, with v(x) in (|t:T), h(w) is in (| shrink(x,h(u)) :(|t:T)) for every x in unit, hence in the intersection of these, which is {h(u)} as T, hence (|t:T), is bounded; so h(w) = h(u) and h is constant on (|s:S). As h is linear, its value for any displacement within (|s:S), being the difference between h's values at the ends of the displacement, is thus zero.

S encloses u, so there are displacements in all directions within (|s:S); as h maps these to zero, it maps all displacements to zero, so it is the zero linear map.

Thus: if a linear map is constant up to order 2 near any input then it is zero. Q.E.D.

Theorem: if f is constant up to order n near u and g is constant up to order m near f(u), then g&on;f is constant up to order n.m near u. Proof:

Given S star-shaped enclosing u and T star-shaped enclosing g(f(u)); for any R star-shaped enclosing f(u), find a positive scalar G for which, for all positive x up to 1, (| shrink(power(m,x),g(f(u))) :T) subsumes (| g&on;shrink(G.x,f(u)) :R); replace R with (| shrink(G,f(u)) :R). Now find t for which (| shrink(power(n,x),f(u)) :R) subsumes (| f&on;shrink(t.x,u) :S) for all positive x up to 1.

Now, for all positive x up to 1, y = power(n,x) is positive and at most 1 so we know that (| shrink(power(m,y),g(f(u))) :T) subsumes (| g&on;shrink(y,f(u)) :R) which is (| g&on;shrink(power(n,x),f(u)) :R) so subsumes (| g&on;f&on;shrink(t.x,u) :S). In this case, power(m,y) is power(m,power(n,x)) = power(m.n,x) so we find that (| shrink(power(m.n,x),g(f(u))) :T) subsumes (| g&on;f&on;shrink(t.x,u) :S) for each positive x up to 1; i.e. g&on;f is constant up to order m.n near u. Q.E.D.
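A numeric sketch of this order-multiplying composition (the maps are my own choices): f(x) = x.x is constant up to order 2 near 0, g(y) = y.y.y is constant up to order 3 near f(0) = 0, and the composite x⁶ is then constant up to order 2.3 = 6, so halving the input scales the output by 2⁻⁶.

```python
# f constant up to order 2 near 0; g constant up to order 3 near 0;
# composite g(f(x)) = x**6 is constant up to order 6: halving x scales
# the output by 2**-6.
def f(x):
    return x * x

def g(y):
    return y ** 3

def ratio(x):
    return g(f(x / 2)) / g(f(x))

ratios = [ratio(x) for x in (0.4, 0.2, 0.1)]
```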

In particular, as every linear map is constant up to order 1, everywhere, composing a linear map left of one mapping constant up to order n near u yields another; composing a linear map, w, right of a mapping, f, constant up to order n near u will yield f&on;w constant up to order n near any member of ({u}:w|).

### Linear structure induced on the mappings to a linear space

For any linear space V and any collection U, there's an addition defined on {mappings (V::U)} by: for any non-empty mapping (V:h:n) with n finite, our addition on V gives us sum(h); for any list ({mappings (V::U)}: h |n), with n now implicitly natural, for which the addition to be defined must define sum(h), and any u in U, we can form the mapping (V: h(i,u) ←i :n), a.k.a. transpose(h,u), and addition in V supplies us with its sum. So we can define:

• sum(h) = (V: sum(transpose(h,u)) ←u :({non-empty}:transpose(h)|))

where the minor complication of the right-constraint renders the sum undefined except at those u for which some i in n has u in (:h(i)|); the sum of a list of mappings is defined on the union of the domains on which the list's entries are defined. For the simple case ({mappings (V::U)}: h |2) = [f,g], sum([f,g]) = f+g is: (: f(u)+g(u) ←u :U) on the intersection of (:f|) with (:g|); f on the rest of (:f|); and g on the rest of (:g|).
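Using Python dicts as partial mappings (a sketch of mine, not the text's notation), the sum is defined on the union of the domains, adding only where both are defined:

```python
# Sum of partial mappings: defined wherever either is, adding where both are.
# Absent keys contribute the additive identity 0 only when the other mapping
# is defined there (the union of domains, never beyond it).
def map_sum(f, g):
    return {u: f.get(u, 0) + g.get(u, 0)
            for u in set(f) | set(g)}

f = {1: 10, 2: 20}
g = {2: 5, 3: 7}
total = map_sum(f, g)
```

So total is f at 1, f+g at 2 and g at 3, matching the three-part description of f+g above.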

Note that this addition on {mappings (V::U)} doesn't depend on any structure in U itself; and yields (f+g)(u) = f(u)+g(u) always. Each scaling (V:r|V) on V induces one (: r&on;g ←(V:g:U) :{mappings}) on {mappings (V::U)}; taken with the above additive structure, {mappings (V::U)} forms a linear space.

Theorem: for any linear spaces U, V, any u in U and any natural n, {mapping (V:f:U): f is constant up to order n near u} forms a linear sub-space of {mappings (V::U)} with the linear structure just given. Proof:

For any f constant up to order n near u, I must show that adding a second, or scaling its outputs by any scaling in V, yields a result which is constant up to order n near u.

Given f and any scaling (V:r|V) of V,

If r is zero r&on;f is constant, so constant up to every natural order near u.

Otherwise, for any star-shaped S enclosing u and any star-shaped T enclosing f(u), there is some positive t for which, for all positive x up to 1, (| shrink(power(n,x),f(u)) :T) subsumes (| f&on;shrink(x.t,u) :S). As T is star-shaped enclosing f(u), so is (|r:T) enclosing r.f(u).

Now, as r is linear, r&on;shrink(power(n,x),f(u)) = shrink(power(n,x),r(f(u)))&on;r, so applying r throughout the subsumption above gives: (| shrink(power(n,x),(r&on;f)(u)) :(|r:T)) subsumes (| r&on;f&on;shrink(x.t,u) :S); so we can use (|r:T) as star-shaped domain and re-use f's scaling t to confirm that r&on;f, a.k.a. r.f, is also constant up to order n near u. Q.E.D.

Given f and any g constant up to order n near u,

For any star-shaped S enclosing u, A enclosing f(u) and B enclosing g(u), we're given positive scalars a and b for which (| shrink(power(n,x),f(u)) :A) subsumes (| f&on;shrink(x.a,u) :S) and (| shrink(power(n,x),g(u)) :B) subsumes (| g&on;shrink(x.b,u) :S). Let c be the smaller of a and b; then both (| shrink(x.a,u) :S) and (| shrink(x.b,u) :S) subsume (| shrink(x.c,u) :S).

Let C={i+j: i in A, j in B}; then C is star-shaped about (f+g)(u) and (| shrink(power(n,x),(f+g)(u)) :C) subsumes {i+j: i in (| f&on;shrink(x.a,u) :S), j in (| g&on;shrink(x.b,u) :S)}. Now, any member of (| (f+g)&on;shrink(x.c,u) :S) is a sum of a member of (| f&on;shrink(x.a,u) :S) and a member of (| g&on;shrink(x.b,u) :S), making it a member of (| shrink(power(n,x),(f+g)(u)) :C), whence this last subsumes (| (f+g)&on;shrink(x.c,u) :S) and f+g is constant up to order n. Q.E.D.

## Derivatives

I've now introduced the tools I need to describe differentiation. So without further ado, let me define:

given a mapping (V:f:U) between linear spaces, a u in (:f|) and a linear map (V:g:U),

g is a derivative for f at u precisely if:
f−g is constant up to order 2 near u.

For every mapping f, the relation (: g is a derivative for f at u; g←u :) is orthodoxly written f' (that is, f with an apostrophe) and known as the derivative of f. Thus, in the case discussed in the definition, g = f'(u). The constraints of text introduce some difficulties: not only am I typing my pages in a character set which obliges me to conflate right single quote with apostrophe but, …, apostrophe is standardly used in text for (among other purposes) possessives – so that f'(u) is f's derivative at u, but even when I don't try to enclose this in single quotes it uses an apostrophe after f in two different ways. Not that this'll stop me using orthodoxy's notation, but I'll introduce a mapping which will enable me to talk about derivatives without over-working the multi-talented ASCII character 39:

derivative = (:
V, U are linear spaces;
({linear mappings (V::U)}: g is a derivative for f at u; g←u :U)
←(V:f:U) :{mappings})

so that I can write f' as derivative(f) when brevity is less of an issue than ambiguity; derivative is the explicit relation on which the syntactic sugar is founded.

Theorem: {mappings} subsumes (|derivative:{mappings}), i.e. if a mapping has a derivative at some input, the derivative is unique; or, for any mapping (V:f:U) between linear spaces, ({linear (V::U)}:f':U) is a mapping. Proof: suppose f' relates both g and h to u; i.e. g and h are both derivatives of (V:f:U) at u. Then f−g and f−h are both constant up to order 2; whence so is their difference, g−h, which is linear. The only linear map constant up to order 2 is the zero linear map, and g−h = zero means g = h. Q.E.D.

## Tangents as limiting chords: computing derivatives

That gives the specification of what it means for g to be the derivative of f at u; but we also need to be able to compute derivatives, which it is traditional to approach in terms of limiting values of gradients of suitably short chords of the graph of the function to be differentiated.

In effect, for any neighbourhood of u, I'll: take the topological closure of the collection of gradients inferred from chords of f within the neighbourhood; take the intersection, over all neighbourhoods of u, of these closures; and establish that any derivative for f at u must be a member of the result.

[Note: need for topological closure – consider the mapping (: x.x.x ←x :) = cube, whose graph has no chords of zero gradient, though it has chords near x=0 with gradients as close to zero as one may care for – yet cube's gradient at 0 is 0.] However, I'll be using star-shaped domains (probably in fact simplices) enclosing u in place of neighbourhoods and endeavouring to avoid explicit use of orthodox topology's tools, notably closure.

In the one-dimensional case, orthodoxy obtains f's gradient at u by looking at chords of f's graph – that is, straight lines between two points, [x,f(x)] and [w,f(w)], with x and w close to u – and observing that the direction of chords has a limiting value as the two points are squeezed closer and closer together in progressively smaller neighbourhoods of u. The directions of pertinent chords get closer to that of the tangent more rapidly than the points used to specify the chord get squeezed closer to u.

By taking a basis in the output space, we can always reduce to the case of one-dimensional output, by considering each component of the output separately – once we've got derivatives of these several components, we'll re-combine them by tensoring each with its basis vector and summing – thus it suffices to consider one-dimensional output spaces. For one-dimensional input, we must then consider a curve in two dimensions, as before; for higher-dimensional input, the graph of f will be a curve or hyper-surface or whatever in a space of just one higher dimension. We'll then need appropriate higher-dimensional chords of f's graph, for which we can compute gradients by linear interpolation between f's values at the end-points of each chord.

### The base-lines of chords

In two dimensions, where such a curve is being considered to describe a mapping between one-dimensional spaces as before, a chord is just a line between two points on the curve: any two points suffice to specify a straight line. Thus the straight line between [x,f(x)] and [w,f(w)] is simply {[t, f(w).(t−x)/(w−x) +f(x).(t−w)/(x−w)]: t in U} with (x−t)/(x−w) and similar denoting the scalar ratio (available because U is one-dimensional) between the two given displacements in U. [This may be expressed in the form {[t, a.t+b]} by taking a = f(w)/(w−x) +f(x)/(x−w) and b = (x.f(w) −w.f(x))/(x−w), for suitable readings of the ratios and products it entertains between members of U and V.] Thus a mapping from a one-dimensional space requires a chord between two inputs.
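As a quick check of the a and b just read off (the names and the cube map, which reappears below, are my own choices), the chord of f between inputs 1 and 2 does pass through both [1, f(1)] and [2, f(2)]:

```python
# Chord of f between inputs x and w, as the line t -> a*t + b with
#   a = f(w)/(w - x) + f(x)/(x - w),  b = (x*f(w) - w*f(x))/(x - w).
def chord(f, x, w):
    a = f(w) / (w - x) + f(x) / (x - w)
    b = (x * f(w) - w * f(x)) / (x - w)
    return a, b

f = lambda t: t * t * t        # the cube map, used again later in the text
a, b = chord(f, 1.0, 2.0)      # chord between [1, 1] and [2, 8]
```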

In a space of high enough dimension, any two distinct points specify a line; add any third point not on the line to specify a plane; add any fourth point not on the plane to specify a 3-dimensional surface; and so on until the surface we get is the whole space, when the number of points is one more than the space's dimension. We can picture the line by its portion between the given two points, the plane by the triangle in it with the given vertices as corners; the 3-surface by the tetrahedron in it delimited by its four vertices; and so on. Formally, in an m-dimensional space, one requires m points to specify a flat surface of co-dimension 1, i.e. of dimension m−1, which may be described by a simplex with m vertices comprising the convex hull of the given m points. If the vertices are (U:p|m) then the simplex is (| sum(k) = 1; sum(k.p) ←k :{mappings ({positive scalars}::m)}), so it is tidier to specify the vertices than the simplex itself (especially as a chord only looks at f's values at the vertices).

Thus, for a mapping (V:f:U) with U n-dimensional – the graphs of f's components in V are n-dimensional surfaces, images of (:f|) in (1+n)-dimensional spaces – we need 1+n points, i.e. a simplex with 1+n vertices; which, in n dimensions, is as many vertices as a simplex can have without any vertex lying in the surface formed by the rest. Indeed, the edges out of any given vertex of such a simplex form a basis of U, e.g. if the vertices are (U:p:1+n) one such basis is (U: p(i)−p(n) ←i :n).

So a chord for a mapping (V:f:U) will, in general, be specified by a list (U:p|1+n) = [p(0),…,p(n)], and f's action on the entries in that list, (:f:p); the entries are vertices of a maximal simplex in U and we can infer a unique linear map (V:g|U) for which (:f−g:p) is constant; this linear map is the gradient of the chord.

### Computing the gradient of a chord

The gradient is computed by linear interpolation between f's values at the vertices. Given (U:p|1+n) as the vertices of our simplex: obtain b = (U: p(i)−p(n) ←i |n) and use it as a basis for U; we're after a linear map, g, for which each g(b(i)) is f(p(i)) −f(p(n)); this will then yield (by subtraction) f(p(i)) −f(p(j)) = g(p(i)−p(j)) for all i,j in 1+n. Obtain b's dual, ({linear ({scalars}:|U)}: q |n) with q(i,b(i)) = 1 for each i in n and q(j,b(i)) = 0 for distinct i, j in n. Then the identity on U can be expressed as sum(b×q), and g must be sum((g&on;b)×q), and we know g&on;b = (: f(p(i)) −f(p(n)) ←i |n) so we obtain

• g = sum(: (f(p(i)) −f(p(n)))×q(i) ←i |n)

as the gradient of the chord (:f:p). [Note that no presumption is needed about whether (|f:p) is linearly independent in, let alone spans, V.] If this sum is fed b(j) as input, all but the q(j) term will involve a 0 = q(i,b(j)) factor, leaving g(p(j) −p(n)) = g(b(j)) = f(p(j)) −f(p(n)). Thus, for distinct vertices a, c in (|p:), g(a−c) = f(a)−f(c) and f(a)−g(a) = f(c)−g(c), making (:f−g:p) constant.
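A sketch of this formula in Python (my own, for two-dimensional U and one-dimensional output, as the reduction above allows): build the basis b(i) = p(i)−p(n), invert it to get the dual q, and combine.

```python
# Chord gradient g = sum(: (f(p(i)) - f(p(n)))*q(i) <-i |n) in the plane.
def chord_gradient(f, p):
    # p = [p0, p1, p2]: vertices of a maximal simplex in 2 dimensions
    b0 = (p[0][0] - p[2][0], p[0][1] - p[2][1])   # basis edge p0 - p2
    b1 = (p[1][0] - p[2][0], p[1][1] - p[2][1])   # basis edge p1 - p2
    det = b0[0] * b1[1] - b1[0] * b0[1]           # non-zero for a real simplex
    q0 = (b1[1] / det, -b1[0] / det)              # dual: q0(b0)=1, q0(b1)=0
    q1 = (-b0[1] / det, b0[0] / det)              # dual: q1(b1)=1, q1(b0)=0
    d0 = f(p[0]) - f(p[2])
    d1 = f(p[1]) - f(p[2])
    return (d0 * q0[0] + d1 * q1[0], d0 * q0[1] + d1 * q1[1])

# Any chord of an affine map recovers its linear part exactly:
f = lambda v: 3 * v[0] - 2 * v[1] + 5
g = chord_gradient(f, [(1.0, 0.0), (0.0, 1.0), (0.0, 0.0)])
```

As promised by the construction, (:f−g:p) is constant: the chord's gradient here is (3, −2), the linear part of f.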

### Gradients of chords near where a function is constant at order 1

If (V:h:U) is constant to second order at u, what gradients will its chords offer near u ? For given star-shaped S enclosing u and T enclosing h(u), we have some positive t for which, for any positive x up to 1, (| shrink(x.x,h(u)) :T) subsumes (| h&on;shrink(t.x,u) :S). Consider any chord, with vertices (S:p:1+n), and its shrunken images shrink(t.x,u)&on;p for various x; the latter's vertices all lie in (| shrink(t.x,u) :S) so h maps these vertices to within (| shrink(x.x,h(u)) :T); so let

• r = shrink(1/x/x,h(u))&on;h&on;shrink(t.x,u)&on;p,

(exploiting shrink(1/s,v) = reverse(shrink(s,v)) for all positive s) be the list of members of T to whose shrunken images h maps p's shrunken entries. From r, which varies with x, we shall infer the gradient of our shrink(t.x,u)'d chord.

Let b = (U: p(i)−p(n) ←i :n) be the basis obtained from edges into the last vertex of p; let q be its dual. When we replace p with shrink(t.x,u,p) we'll need to replace b with (: b(i).t.x ←i :) = t.x.b and q with q/t/x. The differences in h between shrink(t.x,u)&on;p's entries are now the differences in shrink(x.x,h(u),r), namely (: x.x.(r(i)−r(n)) ←i :n), yielding a sum of these tensored with corresponding replacements for q as our shrunken chord's gradient:

• sum(: x.x.(r(i)−r(n))×q(i)/t/x ←i |n)
• = x.sum(: (r(i)−r(n))×q(i)/t ←i :)

Now, since (T:r:) and T is bounded, while q/t is fixed, the values sum(: (r(i)−r(n))×q(i)/t ←i :n) can take all fall within some bounded star-shaped domain (albeit in {linear (V::U)}) enclosing zero; and scaling any such value by x, for sufficiently small x, will give values arbitrarily close to zero, in the sense of falling within shrink(x,zero)'s image of that domain.

Thus small enough chords of f, near enough to where f is constant at order 1, have gradients arbitrarily close to the zero linear map.
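For a concrete instance (my own): h(x) = x.x is constant at order 1 near 0 (its derivative there is zero), and the gradient of its chord between s.a and s.b is s.(a+b), shrinking linearly to zero with s.

```python
# h(x) = x*x is constant at order 1 near 0; the chord of h between p and q
# has gradient (q*q - p*p)/(q - p) = p + q, so shrinking both end-points by
# s scales the chord's gradient by s, driving it linearly to zero.
def h(x):
    return x * x

def chord_grad(p, q):
    return (h(q) - h(p)) / (q - p)

grads = [chord_grad(s * -1.0, s * 2.0) for s in (1.0, 0.5, 0.25)]
```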

### Computing a derivative

If g is a derivative for f at u then f−g is constant at order 1 near u; its chords near u have gradients near zero, so all sufficiently small chords of f near u will have gradients arbitrarily close to g (i.e. g+zero).

It should suffice to embed the canonical rational simplex, {mappings ({positives}:h:1+n): sum(h) = 1}, linearly in U of dimension n, mapping its centre-point, ({1/(1+n)}:|1+n), to u, and examine the gradients of f across simplices whose vertices are points of that rational simplex. If f has a derivative at u, arbitrarily small chords enclosing u between rational points, within this simplex, should exhibit gradients arbitrarily close to f'(u). Again, I shall use nested shrinkage families of star-shaped domains to make arbitrarily close specific; but this time, the star-shaped domains shall all be simplices.

For any simplex enclosing at least some point (not necessarily u), with vertices (U:p|1+n) – yielding basis b, dual q, as before – and any positive rational x up to 1, we have a chord f&on;shrink(x,u)&on;p yielding a gradient sum(: (f(u+x.(p(i)−u)) −f(u+x.(p(n)−u)))×q(i)/x ←i :n) which we are assured will fall within (| shrink(x,f'(u)) :T) for some star-shaped T enclosing f'(u) in {linear (V:|U)}. It thus suffices to look, for one enclosing simplex, at the limiting gradient, of the chord on shrink(x) of its vertices, as x shrinks towards 0; if f has a derivative at u, this limiting gradient must be f'(u). Naturally, it is often convenient to use a simplex enclosing u itself; otherwise, one formally needs to presume a bounded star-shaped domain enclosing u and expand this (i.e. shrink(t) for t>1) until it subsumes the given simplex; call the result S; each chord obtained by shrinking our simplex then falls within the result of correspondingly shrinking S, enabling us to infer that its gradient falls within a correspondingly shrinking star-shaped T about f'(u).
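The recipe can be sketched numerically (f, u and the simplex here are my own illustrative choices; analytically f'(u) = (cos(1)+2, 1)): shrink a fixed simplex about u by factors x and watch the chord gradients settle on f'(u), with the error shrinking roughly linearly in x.

```python
import math

# Shrinking a simplex about u = (1, 2) and computing chord gradients of
# f(v) = sin(v0) + v0*v1, whose derivative at u is (cos(1) + 2, 1).
def f(v):
    return math.sin(v[0]) + v[0] * v[1]

def chord_gradient(f, p):
    b0 = (p[0][0] - p[2][0], p[0][1] - p[2][1])
    b1 = (p[1][0] - p[2][0], p[1][1] - p[2][1])
    det = b0[0] * b1[1] - b1[0] * b0[1]
    d0, d1 = f(p[0]) - f(p[2]), f(p[1]) - f(p[2])
    return ((d0 * b1[1] - d1 * b0[1]) / det,
            (d1 * b0[0] - d0 * b1[0]) / det)

def shrink(x, u, w):
    return (u[0] + x * (w[0] - u[0]), u[1] + x * (w[1] - u[1]))

u = (1.0, 2.0)
simplex = [(2.0, 2.0), (1.0, 3.0), (0.0, 1.0)]   # encloses u
exact = (math.cos(1.0) + 2.0, 1.0)

def error(x):
    g = chord_gradient(f, [shrink(x, u, p) for p in simplex])
    return math.hypot(g[0] - exact[0], g[1] - exact[1])

errors = [error(x) for x in (1e-1, 1e-2, 1e-3)]
```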

This will enable us to differentiate, where possible; though it is necessary to verify that choosing a different simplex would give the same answer, since different simplices may give different answers if f isn't really differentiable at u.

Example in 2 dimensions: draw a hexagon centred on u, let f map alternate corners of the hexagon to +1 and −1, interpolate linearly along each edge (so f is zero at the mid-point of each edge) and then extrapolate linearly along rays out of u; i.e. for any v in the plane, there's a unique p on the hexagon (where we know f's value) and positive scalar t for which shrink(t,u,p) = v, so make f(v) = t.f(p); this makes it arbitrarily close to zero near u, so let f(u) be zero. This rayward extrapolation guarantees that the gradient of the chord on any simplex about u is unchanged by shrinking towards u; the limiting gradient of arbitrarily shrunk (towards u) versions of a chord is the gradient of the chord itself. We can readily illustrate zero-gradient chords (e.g. on any equilateral triangle with u as centre); but also others, such as two adjacent vertices of the hexagon and the mid-point of the opposite edge, which have non-zero gradients. Thus we see zero as a candidate gradient but that f isn't differentiable at u …
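The hexagon example can be checked numerically (my own sketch): by the construction, f's values are known directly at the corners, (−1)**k at corner k, and at each edge mid-point, where linear interpolation makes it 0, which is all we need to exhibit two chords with different gradients.

```python
import math

# Corners of the hexagon, at angles of 60.k degrees, carrying values (-1)**k.
c = [(math.cos(k * math.pi / 3), math.sin(k * math.pi / 3)) for k in range(6)]

def chord_gradient(p, fp):
    # gradient of the chord with plane vertices p and values fp at them
    b0 = (p[0][0] - p[2][0], p[0][1] - p[2][1])
    b1 = (p[1][0] - p[2][0], p[1][1] - p[2][1])
    det = b0[0] * b1[1] - b1[0] * b0[1]
    d0, d1 = fp[0] - fp[2], fp[1] - fp[2]
    return ((d0 * b1[1] - d1 * b0[1]) / det,
            (d1 * b0[0] - d0 * b1[0]) / det)

# Equilateral triangle on alternate corners: f is +1 at each vertex, so the
# chord's gradient is zero.
g_zero = chord_gradient([c[0], c[2], c[4]], [1.0, 1.0, 1.0])
# Two adjacent corners plus the mid-point of the opposite edge (f = 0 there):
# a chord with a decidedly non-zero gradient.
m = ((c[3][0] + c[4][0]) / 2, (c[3][1] + c[4][1]) / 2)
g_other = chord_gradient([c[0], c[1], m], [1.0, -1.0, 0.0])
```

So zero shows up as one chord's gradient while another chord's gradient has length 2: no single candidate survives, and f isn't differentiable at u.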

Thus, while shrinking any chord provides the means to compute a derivative, one must look to how the answer would have depended on choice of initial simplex before believing the answer is the derivative.

## Derivatives of combinations of functions

We can combine functions by pointwise addition, multiplication and in other ways; when we do so, will the composite be differentiable where the components were ? Clearly not in all cases: at least when we get to division we can expect discontinuity where the denominator takes the value zero, no matter how smoothly it varies near there.

Recall the linear structure in {mappings (V::U)} induced from that of V. Then:

• every mapping differentiable at given u in U is just the sum of a linear map (V:|U) and a mapping constant at order 1 near u
• the linear maps (V:|U) form a linear space
• so do the mappings (V::U), constant at order 1 (or, indeed, up to any given natural order) near any given u in U
• ergo, the mappings (V::U) differentiable at u form a linear space

which comes naturally equipped (because zero is the only linear map which is constant at order 1) with a unique decomposition into linear part and locally constant part. Trivially, then, the derivative of a sum of functions is the sum of their derivatives; and the result of scaling a function is simply to scale its derivative by the same factor.

Analogously, any linear (W:w|V) induces a linear map ({differentiable (W::U)}: w&on;f ←f :{differentiable (V::U)}).

### Multiplication

For additive combination of functions, I've already given a point-wise solution implicitly treating undefined as zero, but not allowing this to cause the sum to be defined and deliver output zero except where at least one of the functions combined was defined. If we apply the same to pointwise multiplication, the product will be zero wherever even one of the factors is undefined; but the zero hasn't been arrived at due to any factor being defined and zero, and can be encoded by letting the product be undefined in such positions, leading to a pointwise product defined only where all the factors are defined. Wherever a multiplication is defined among members of some V, yielding products in V, this equips the bulk action of multiplication, product, with values for input lists (V:h|n); which, in turn, equips {mappings (V::A)} for any A with a multiplication whose contribution to product is:

• product({mappings (V::A)}:h|n) = (V: product(: transpose(h,a) |n) ←a :A)

with the minor complication that (: transpose(h,a) |n), given (:h|n), asserts that a is in (:H|) for all H in (|h:), thus implicitly restricting the product to only accept inputs in the intersection of {(:H|): H in (|h:)}, in contrast to addition's analogous union.
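As with addition, Python dicts can sketch this pointwise product (my own rendering, not the text's notation): it is defined only on the intersection of the factors' domains.

```python
from functools import reduce

# Pointwise product of partial mappings: defined only where every factor is.
def map_product(*maps):
    common = reduce(lambda s, t: s & t, (set(m) for m in maps))
    return {a: reduce(lambda v, m: v * m[a], maps, 1) for a in common}

f = {1: 10, 2: 20}
g = {2: 5, 3: 7}
prod = map_product(f, g)
```

Only input 2 lies in both domains, so the product is defined there alone, in contrast to the union of domains that addition uses.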

### Differentiating polynomials

• polynomial = (: (: compose(: repeat(i,x,h(i)) ←i :n) ←x :{relations}) ←(:h:n) :{mapping ({relations}::n): n is natural})

polynomial(h,x) is compose(: repeat(i,x,h(i)) ←i :n). If each h(i) is a translation in a linear space, U, and x is a scaling of U, repeat(i,x,h(i)) is power(i,x).h(i), for each i. Composing translations of a linear space amounts to adding the displacements they induce, so polynomial(h,x) is effectively sum(U: power(i,x).h(i) ←i :n) with the identity translation (i.e. the zero vector) as polynomial([]).

Algebraically: (:derivative:polynomial) = (: polynomial(: repeat(1+i,h(1+i)) ←i :) ← polynomial(h) :)
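Reading the coefficient list (:h:n) as a plain Python list of scalars (a sketch of mine; note that repeat(1+i, h(1+i)) composes the translation h(1+i) with itself 1+i times, i.e. scales it by 1+i), the rule becomes the familiar shift-and-scale of coefficients:

```python
# Evaluate sum(power(i,x).h(i)) and form the derivative's coefficient list,
# whose entry at i is (1+i)*h[1+i].
def poly_eval(h, x):
    return sum(c * x ** i for i, c in enumerate(h))

def poly_derivative(h):
    return [(1 + i) * h[1 + i] for i in range(len(h) - 1)]

h = [5.0, 0.0, 3.0, 2.0]       # 5 + 3.x^2 + 2.x^3
dh = poly_derivative(h)        # 6.x + 6.x^2
```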

Written by Eddy.