MUM segmentation project report

What follows is an outline of what I did to Rod Cook's MUM segmentation program while I was working for NA software: it's here mostly to illustrate how I go about evolving a software project. Rod's MUM algorithm is a way of decomposing into regions an image which is corrupted with coherent speckle: each region having a suitably uniform visual texture (e.g. a wood, a field, a hedge-row, a lone tree), give-or-take the limitations of how much information speckle has destroyed or the statistics used have lost. I'll not be discussing the image processing aspects of this beyond saying that MUM stands for Merge Using Moments – it starts with lots of tiny regions and repeatedly chooses a region to merge into a neighbouring region with suitably-similar statistics.

The existing definitive expression of the algorithm was a piece of ForTran written by an expert in image processing – my boss, Rod Cook. I got to convert it to C without worrying too much about what the image processing meant, just so long as I mimicked it right. Although the context called for an ultimate goal of doing segmentation much faster than anyone could at the time – in real time, as collected by SAR and deconvoluted by my colleagues' code – I deliberately left speed on one side and set about writing an implementation with straightforward modules interacting with one another via clean APIs.

I knew about (what computer scientists describe as) topology (but mathematicians describe as homotopy) and data structures to model it, so duly wrote a library to describe abstract 2-dimensional space in such terms: this managed all the business of telling me which regions were neighbours and of merging each iteration's pair of regions into one another. Statistical data about the image then constituted the geometric data about each region. That conveniently bundled up all the mundane book-keeping tasks into a library with a straightforward API: what remained was to express Rod's algorithm in C using that tool-kit. This allowed for a good degree of abstraction in the expression of the algorithm (and blessed was gcc, for it did provide extern inline) – which made debugging it a lot easier than debugging the ForTran.

[Separating out the statistical details also proved a boon: it enabled us to explore different varieties of statistical data and how they might be used to measure similarity between adjacent regions. Thus, where Rod's original code had computed the moments (mean, variance and their kin) of the data, we were able to explore various notions of visual texture (which take account of how image intensity varies with position, within each region). These enabled us to distinguish (for instance) between woods and fields, even when their average brightness was the same. Exploring variations on this theme was made practical by judicious separation of the statistical analysis from the topological analysis.]

In the first instance, I implemented the algorithm directly as Rod had coded it in the ForTran: while doing so, I noted a glaring issue in the form of a sorted list of all live regions. All we did with the list was, at each iteration, take the top two items off, merge them (possibly reposition some of their neighbours) and insert the result wherever it belonged in the list. Since we only needed to know which was best, at each iteration, it seemed wasteful to be keeping the whole list strictly sorted (we only need the upper reaches, as it were) especially as the amount of effort in repositioning an item was always of order the length of the list. Still, that's how I did it first time round, because it was how the ForTran did it and it'd be easier to debug a direct mimic.

I'd left speed on one side, so it wasn't fast. But it was robust and made from well-structured components, so some fairly straightforward tweaking got it going about as fast as the ForTran (depending on image size, but only marginally). So then we set to looking at speed issues properly.

Each merge involves some statistical computation for each neighbour of the merged region and we intuited that the typical number of neighbours, over the entire history of merging, would grow about as fast as log(area). Each merge reduces the number of regions by one, so the number of merges is proportional to the area of the image. That told us to expect that that the time to segment an image would grow, with the area of the image, as roughly area.log(area). Note that neither of us knew much about sorting algorithms and speed at the time.

My implementation, like the ForTran, was taking time proportional to the square of the area. After a little inspection of the system I realised that the list sorting we were doing would take a square of area time, so the book-keeping was costing more than the image-processing. Remembering my earlier concerns about the sorted list, I went and read Knuth's chapter on Sorting and Searching. That gave compendious knowledge of how to do the job in hand, and several others besides.

This is when I got pay-back from those clean APIs between modules: I only had to change the limited amount of code concerned with the sorting, which was well-isolated from everything else. Using a variety of sorted binary tree, I could reduce the list management to times proportional to log(area) per insertion (used during initialisation of the data-structures) or removal (of which each merge involves one). Each merge also involves rattling the new merged node from its position at the top of the tree to wherever it now belongs, shuffling others upwards in the process: that again grows as log(area), the height of the tree. The time per merge thus contains an image-processing term proportional to how many neighbours the merged node has and three book-keeping terms proportional to log(area). Experiment found that the result overall was roughly proportional to log(area) per merge: since the statistical computation per neighbour per merge is of similar size to the book-keeping of 1 (or perhaps 3) rattles, the typical number of neighbours must be rising no faster than proportional to log(area) – our intuited rate.

[Time will also be spent rattling some neighbours of the merged node (the others alluded to above): each of these neighbours either saw one of the regions merged as my best neighbour or subsequently see the merged region as such. Since the two regions merged were as similar as we could find, the new region should be likewise similar to them, so typically about as similar to any of their neighbours as was the relevant old region. Consequently, these adjusted nodes will typically only change how good is my best merge by small amounts – so I would not be too surprised if their rattling typically does nothing and almost never does more than one step. In any case, when multiplied by the number of such re-sorted nodes, this rattling-cost would appear to come to at most log(area) – so if the typical rattling-cost does grow significantly, we would need to suppose that the number of re-sorted neighbours grows correspondingly less rapidly than log(area).]

That gave us a fast enough segmentation program that we could realistically look at doing the job in real time. The first step towards that was to port the program to the parallel computers used for the real-time deconvolution of the SAR data. This proved trivial – the only code change needed was to elide the call to a Sun-specific timing routine which I'd been using (in main) to measure run-times: writing portable code comes naturally when building a program from robust well-structured components. None the less, the ported code was unacceptably slow, so we profiled it – and discovered that the culprit was the system free() on the target – the second call (and all subsequent calls) to free() would perform heavy-duty processing designed to tidy up the free-space map.

Again, the clean interfaces saved the day. Each datatype used in the program had its own creation and destruction routines: so there were relatively few places where malloc() and free() were called, making it easy to replace these with calls to a custom memory manager my colleague Ian knocked together for the job. For this we could compute, before allocating any of our data, how many data-structures of what sizes we would be needing (the algorithm is well-behaved in this respect): for each size we could then allocate a single sufficient raft of memory and initialise this as a linked list of free nodes of that size; the allocation and release of items from this list is then easy to code and has no memory overhead. Each datatype's creator and destructor could then call the correct size's allocator or releaser – and the use of clean interfaces had guaranteed that we only had small amounts of code to change. The resulting program was fast enough that we could perform a demonstration of real-time segmentation: we had to use a somewhat coarse grid on the image in the first instance, but could confidently point to Moore's law to give our clients the speed-up needed to use a finer grid within a few years – where, previously, the task of real-time segmentation had seemed unapproachable.

It should be noted that MUM wasn't the best segmentation algorithm available – some neural net and simulated annealing methods were clearly better, and our clients had an algorithm of comparable quality to MUM which ran about as fast as Rod's implementation of MUM. Moore's law and improvements in neural and annealing technique duly ensured that, within five years or so, MUM was swept aside by better algorithms becoming feasible in real-time. Sic transit gloria mundi.

Written by Eddy.
This page is part of my curriculum vitae.