Tracking revisions

When you change a file on a computer, the new version replaces the old one. Your system might keep a back-up of the old version for you; if you're very lucky, it might even keep several. It does that because users sometimes make mistakes and, when they do, having a back-up (as long as they realise their mistake soon enough) saves them the from having to re-construct the earlier version from the later one (or from scratch, if the mistake was removal of the newer one). There are various other reasons why one may want a record of older versions of a file, and some interesting things you can do if you can do that systematically, so people have invented systems for doing version-control or revision management. These may work on individual files or on large collections of them; since a common use-case for managing versions of many files together is the case of software in its source form (human readable, before it's translated into machine code), such systems are also called source code management (SCM) systems.

Backups

I've already mentioned the most primitive version-control system: backups. If you back your computer up to some external persistent medium (archetypically a tape, but just as likely to be a USB drive on a key-fob, these days) on a regular basis, like we all know we should (even if few of us actually do it), your backups constitute a version history of your entire system (or as much of it as you bother to back up, anyway). Many programs that change files (and most programs that describe themselves as editors or word-processors) save backups; while you're working, they may auto-save your work, so that you can recover it after a crash or power-outage; when you save a new version, they save the old. The new version gets the name the old used to have, so the old needs a new name; likewise, auto-save backups get their own names. The names used depend on the application; they may be in an application-specific directory (this is common for auto-saves) or they may be obtained from the primary file's name, e.g. by adding a suffix. In the case of DCL/VMS, every file actually had a ;123 numeric suffix; when you typed the file name without this suffix, tools automatically used the one with highest number; when an application saved a new version, it used the next highest numeric suffix.

These approaches all give you at least some old versions (and, in the case of auto-backup, working drafts) of a document. The whole system backup tracks the state of many files all together; the auto-save and after-edit backups deal with each file independently. Backups only give you the content of the file and, generally, the time and date at which that revision was created; you get no information about what has changed (although you can always compare versions) or why you made the changes. If you abandon some recent changes, go back to an old backup and make some fresh changes based on it, then later decide you do want the earlier changes after all, it's up to you to work out how to merge the two sets of changes to arrive at a unified version. If you make a tiny change, the backup still takes up as much disk-space as the whole file (which may be huge), just to record almost no change.

Basic Desiderata

So, aside from recovering the usable version you had before you made your stupid mistake, what other benefits can we hope for from keeping track of old versions of the files on our computers' hard disks ?

Let's start with the things I just pointed out you don't get with backups:

describe the changes you've just made and the reasons for making them;
combine two sets of changes, starting from a common base-point, to obtain a new version that has the best of both; and
limit disk-space use by exploiting the fact that all the different versions are very similar.

For the first, you're going to need to give at least the reason for your changes, and preferrably a summary of them, when you save them: when doing that, you'll likely find it useful to be able to display a comparison of what you're saving with what it replaces, highlighting the changes. For the second, we're going to need some way to identify the versions involved – the common base-point and the two revisions obtained from it. You're also going to want some way for the system, preferrably as automatically as possible, to merge two sets of changes from a common base-point to obtain a unified set of changes.

Branching

The second desideratum above really needs us to be able to identify two separate histories of change, starting from the common base-point; when we back up to an old version and start making a fresh set of changes based on it, we're not continuing the original history of versions we were working on before – the common base-point now gets to have two lines of forward development based on it, rather than just being a stepping stone between the version before it and a single version after it. Furthermore, if we do merge two such lines of development back together, the resulting version has two lines of development leading to it.

So, where even the most sophisticated backup system produces a simple sequence of versions, we now find outselves dealing with a network of lines of development, splitting and joining. Although most files seldom have such complex histories, our system needs to be able to cope with the ones to which we do make such changes. We aren't going to know which files this'll apply to until (at least) after we've come past the point we later decide to back up to in order to start a fresh line of development; so we actually need to be using a system capable of this for all files, even if few of them are going to exercise it.

Now, as it happens, support for this gets even more desirable when several people are making changes to the same files. Suppose you've been writing some big fancy report on what work we've been doing for some customer recently. You've got a draft ready for internal circulation so you circulate it to your colleagues to get their input. The traditional way to do that is for each of us to send you a description of the changes we think you should make, then you make them, combining our various suggestions in the process. But, actually, the quickest way for me to indicate what I want changed is often to simply edit a local copy of the document. So let's suppose that's what each of your colleagues does. They could each send you back their copy and you can then create a branch for each of them and check their copies in on their several branches; or you could given them access to the system, so that they can check in their own changes on their own branches. Either way, you end up with a branch for each of your colleagues who provided input. Now you can ask your version-control system to show you the changes each has made and to, as far as it can, show you how to merge these.

The same basic idea applies when several programmers are working on a body of software, typically split into many files. Each programmer is working on a separate task – fixing some bugs, implementing a new feature, tidying up the chaos left by earlier development – and we don't want any of them to have to wait for the others, so they all work in parallel. Each works on a separate branch, to avoid collisions, and later they get together and merge their work (or hand off their work to someone else, who merges it).

Merging

One of the harder things to get right is merging. If two branches of development have changed separate files, it's easy: to merge, just make each set of changes to its own set of files and you're probably OK. Even if two branches have changed the same file, as long as they've changed separate parts of the file, it's fairly easy to ensure you combine the changes cleanly. It's possible you'll be left with some problems after such a merge, so you should always review or test (as appropriate; do both if both apply) the results of such an automatic merge; but fairly dumb programs have been developed that can do a pretty good job of such merges. If changes overlap, however, it's generally not possible for a dumb computer program to automate merging them: you'll need someone who actually understands the changes to work out how to combine them, or which of them to keep. For example, if a sentence in your original draft of a report was somewhat garbled, maybe one reviewer worked out what you meant and straightened out the sentence to say that while another reviewer thought about what you should have said, and replaced the whole sentence. Recognising that that's what happened takes intelligence and knowing what to do with the result takes judgement: those are your job, not the program's – that's why you have a salary and the program doesn't.

Now, that distinction was between overlapping changes that you're going to have to sort out for yourself and separate changes that the tools can do for you. So it's important to understand what constitutes overlapping as opposed to separate. Many computer-programming languages either are line-oriented or tend to be written in a line-oriented style (that is, the language may not care about line-breaks, but common habits of how to lay out the code do); line breaks happen where they do because that's the end of one syntactic construct and the start of another. Consequently, version-control systems primarily intended for computer programs tend to use a line-oriented test for separation / overlap: each set of changes is described by the set of lines that has been changed, along with a line or three on either side of each change to give context; if the sets of lines describing two sets of changes don't overlap, the changes are separate; even if there is some overlap, any contiguous blocks of lines-and-context in either change that don't overlap the other change can safely be merged, leaving just the blocks that do overlap, to be sorted out by someone with intelligence and judgement.

That works great for line-oriented files but less well for common document formats designed for the display of paragraphs of text. For example, this web page is a document written using HTML; which does let me break lines anywhere convenient to me, without it affecting how the paragraph of text gets presented by a user agent (a.k.a. browser), so I could format it in a line-oriented manner. However, in practice it's mostly text and I routinely view it in its source form (when editing it), so I use my editor's paragraph-flow features to keep the blocks of text vaguely neatly formatted. That means that, if I add a word I've noticed is missing while proof-reading, it can cause its line to over-flow into the next, and so on; so the new version of the file differs in several subsequent lines, rather than just the line containing the real change. If someone working on another branch has changed a sentence later in the paragraph, a line-oriented merge tool is going to think their change overlaps with mine, even though I didn't change that sentence and all that's needed for a merge is to add my missing word to their version and reflow the paragraph.

Thus merging, to be done properly, should be done by a program that understands the type of file in question. Indeed, since a file may contain diverse kinds of text, a merge tool would ideally understand the diverse different types of text that can appear in it – although that could be difficult; when I discuss a program, the discussion is text (so paragraph-structured, mostly) but includes fragments of program (which are apt to be line-structured) simply presented as preformatted text, with no specific hint (to a program that only understands the form of the content, not its meaning) that this content is a fragment of program.

Consequently, a well-designed version-control system would provide for being configurable to use different merge tools for different types of file – just as a desktop environment knows what program to use to activate a file when you double-click on the file's representation, because it has a configured activator program for the file's type.

Basic version control

After living for several years with VMS's numbered versions, virtual roots and implicit copy-on-write semantics, I switched to Unix, which lacks these things; at best, I had a single backup of my most recent edit. I endured the problems this caused for a time, but soon (about 1991) one of my colleagues introduced RCS, which I rapidly recognised as an improvement on what we'd been doing. Its name stands for Revision Control System, it tracks versions of each file and it had actually been around for about a decade before I met it. (At about the same time, I noticed SCCS – source code control system – which appears to have been an earlier system on some Unix variants; but I've never used it, so can't record how it worked.)

I soon enough learned how to make good use of RCS, including devious games with symbolic links to make directory hierarchies work smoothly and share RCS files between several directory hierarchies of checked out files. It was possible to learn to make it useful, albeit at the expense of some extra work. Its main weakness was that it dealt with single files and one usually wants to deal with a whole directory hierarchy containing source code, build files, documentation and much more. All the same, it wasn't too hard to build something on top of it that was adequate; indeed, up until late 2001, I maintained this web-site and most of the software I develop for my own amusement using RCS to track revisions (in fact, some of my toy software is still, in early 2010, in RCS).

A few jobs later (in 1996) I inherited a pile of scripts that managed source trees using a repository of RCS files to automate checking out working copies and checking in new revisions, dealing with the whole hierarchy of source files. I will not sing the praises of that system, because it wasn't beautiful (I was responsible for maintaining it, so I know this well): but it did what we needed, layering reasonable whole-system structure on top of RCS. The general approach of layering such structure on top of RCS was not new; by this time, a system called CVS – Concurrent Versions System – had been doing that for about a decade and could fairly be described as mature. From late 2001 to 2010, I used CVS to manage this web-site; the transition from RCS to CVS was made easy by the fact that a CVS repository is exactly a directory full of RCS files, so I just moved my old RCS directory hierarchies over to my shiny new CVS repository. From early 2002 to 2010, I also used CVS at work.

RCS: the Revision Control System

Because they're based on it, various other systems inherit RCS's problems; so they're worth describing in some detail. Indeed, since RCS really only set out to deal with individual files, those using RCS directly seldom run into these problems: but when one tries to deal with many files, in a large directory hierarchy, these problems scale up and start to hurt. In order to describe the problems, I must first describe RCS; and I may as well start out with how well it does in terms of being better than simple backups.

RCS addressed the desiderata listed above as follows:

The command ci (check in) saves the current contents of your file as a new revision; you give a check-in comment at the same time, that you can use to record what you've changed and why. This command also removes the file saved ! But you can get it back using …
The command co (check out) retrieves a previously saved revision of a file, ready for you to make fresh changes on it.
The command rcsdiff tells you how your working copy of a file differs from the saved version you last checked out; this makes it possible to review your changes in order to write a sensible check-in comment.
The files in which RCS records your revision history only remember the differences between successive revisions, plus your check-in comments, so small changes to a big file only take up as much space as the big file plus the changes, rather than a fresh copy of the whole file for each revision.
When you run ci, you can tell it to save the new version to a new branch off an old version, rather than as an update on the same line of development as the old version belonged to.
The command rcsmerge incorporates the changes between two revisions of an RCS file into the corresponding working file as its manual page says. This makes it possible to merge two lines of development: check out the latest version on one line, work out its most recent common ancestor with the other line of development, merge the changes between that and the most recent version on the other line.

For each file dir/ect/ory/file-name.ext, RCS maintains a revision-control file called file-name.ext,v in dir/ect/ory/RCS/, if this sub-directory exists, else in dir/ect/ory/ itself. This file begins with a preamble that identifies the branch it uses by default for check-outs and records any symbolic names you may have given to revisions or branches, along with some other administrative information; it then lists all revisions with their administrative details, including check-in comments, date and who checked them in; finally, it lists the changes between versions.

Problems with the revision-tree

RCS considers the set of revisions to be a tree – that is, each revision has one parent revision, although there may be several of which it is the parent. This simplifies the design of RCS and enables it to make some optimisations, but it also introduces problems.

In particular, when you check in a revision, you have to decide which branch to add it to; and you can only chose one. So when you check out on one branch and merge changes from another, you can save the result on either branch (or on some other branch entirely !) but the only way to record that it merges the two branches is to describe the merge in your check-in comment. If you later want to merge subsequent changs between the two branches, you need to find the most recent check-in comment on either branch that describes a merge; it'll tell you which version off the other branch it merged: that's the most recent common ancestor that you need to use for the merge. Suitable discipline can solve that, but it is a weakness of the tool that it makes such discipline necessary.

In the changes section of each file, RCS saves the full contents of the most recent version on its default branch; it saves the change you need to apply to it to retrieve the previous version on that branch; and, for each revision earlier on that branch, the change you need to apply to get back to it from the one that followed it, all the way back to the creation of the file; these are all backwards differences, showing how to obtain an older version from a newer. For each non-default branch, where it diverges from the default branch, RCS records the forward difference from the other branch's most recent common ancestor with the default branch; then, as we progress down that branch, a forward difference at each step to obtain each revision from its predecessor. Thus checking out a recent revision off an old branch involves lots of work, computing and applying changes, first backwards down the default branch, then forward up the old branch.

There is a kindred problem with telling RCS to give a symbolic name to the most recent version on some specified branch: to find that version, RCS has to traverse the file looking at the administrative section for each revision until it finds one on the specified branch; if the branch's details are stored late in the file, especially if the branch is long or appears after some long branches, this forces it to traverse a lot of the file, making sense of what it's reading in order to know what to skip over, before it can find the revision to tag with the new symbolic name. RCS works very well as long as you mostly work on the default branch; but significantly less well on side-branches. The problems are mostly ignorable when dealing with one file at a time, but some of the systems built on top of RCS can end up suffering very badly from them when dealing with large numbers of files with large numbers of revisions.

Centralisation

Another problem that RCS suffers, as do systems based on it, is that the revision-control information, in the *,v files, is all in one place; if several developers want to access it at the same time, the software has to make sure that their changes don't collide (two processes changing the same file at the same time); indeed, the software has to mediate enabling many developers to access the same files, possibly from several machines. The scripts I had to maintain dealt with that by presuming that everyone mounted the disks with the RCS repositories on it via NFS (network file system). CVS takes a more sophisticated approach and supports a network protocol that lets developers on diverse machines access a CVS server running on a machine that houses the repository.

Either way, the repository machine has to run a server (NFS or CVS) and endure concurrent access by potentially many developers. Aside from that server being a single point of failure (so, if it goes up in smoke, you suffer badly) this also makes it a bottle-neck: if too many developers are using it, it's going to slow down and that'll slow down everyone that's using it. In short, while these basic version-control systems work reasonably well for small numbers of developers, they don't scale well to the case of many developers working on a large body of software with a lot of history.

In particular, particularly when combined with the performance problems that arise from the revision-tree and the structure of the revision-control files, creating new branches or tagging by reference to a branch (i.e. telling the system to, for each file, find the latest version on the specified branch and add the specified name as an alias for that version – as distinct from, for each file, specifying a particular version to which to bind that name) can end up being expensive operations (on a CVS repository containing several thousand files, accessed by several hundred developers, mostly working on their own branches – that get integrated to a mainline once the work is complete and tested – I've routinely seen tagging operations take hours). Worst of all, since these operations modify the version-control files, they have to lock the files against access by others while they're modifying them – it isn't even safe for others to read a file while it's being modified – so any other developer accessing the repository, even only to fetch updates without modifying anything, is apt to get stuck behind the locks set by a process that's busy tagging or branching the source tree.

The frequency with which branches get created grows with the number of developers; the probability that each developer's other operations will be delayed waiting for locks grows in proportion to this. It's even worse than that: since the number of branches grows with the number of developers, the amount of each file that CVS must read before it finds where it's meant to be adding a new branch (or tag, if specified in terms of a branch) grows likewise, increasing the delays for each lock proportionately – so each developer's expected delay due to locks grows quadratically and the total time spent by developers collectively waiting for locks may well grow as the cube of the number of developers ! Of course, one of the reasons to hire new developers is to increase the amount of code: and standard advice about limiting the size of individual source files implies that this increase will largely be expressed as increases in the number of files, which further increases all delays.

Another problem with centralising the repository is that developers can only record their changes when they have access to the repository. Increasingly, those who might make useful contributions to a project are using laptop computers while they travel; they don't only do useful work when they're sat in their office with a full-bandwidth connection to the server. Without access to the repository, they can make changes to their checked-out versions of things, but they can't record what they're changing and why they're changing it as they go along – they can only check in the final result, possibly the result of several days' work, when they get back in touch with the server. This is apt to lead them to make one huge check-in with a comment that only crudely describes what they've changed and why. If they'd had some way to record each part of their work as they went along, they'd have left detailed notes about each step in the change, that would later be useful for helping others to understand why they did things the way they did.

Better version control

Various other version control systems from the '80s and '90s worked more or less differently, but I don't know them well enough to comment. By the late '90s, it was clear to at least some that it was time to try something better. Some of the attempts to do that, notably subversion (a.k.a. SVN), took the approach of trying to fix up old ways of doing it. SVN set out to be CVS done Right but reasonable comentators (prominently including Linus Torvalds) pointed out that CVS does the wrong thing: doing that wrong thing Right was still the wrong thing to do. As we've see, the main problems of CVS (and thus SVN) are inherited from the underlying RCS; which was quite a good thing to be doing in the 1980s (if only because it was better than nothing) but the experience gained from using it taught us the limitations of its approach. Since these limitations are largely intrinsic to the basic design features of RCS – line-oriented differences and merging, a simple tree of revisions per file and a single repository – the correct response is to re-think the basic design and work out what we really need.

All the earlier disiderata are, of course, still pertinent: but we've learned lessons from the larger-scale collaborative software development that basic version control systems have enabled us to persue. Thos lessons teach us to want the original desiderata addressed properly – a network of revisions, not just a tree that approximates it; and merging needs to be capable of adapting to different types of file – and they also teach us to want more:

When code moves around within the file system, the version control system should keep track of that; someone inspecting some code in its present location, when they ask how it got that way, should be told about changes before it was moved;
Avoid the need for access to a central server: it should be possible to make changes, recording what's changed and why, whenever one is in a position to spend time doing something useful that it may later be worth sharing with others; and, in prticular,
Creating a branch or tagging a revision should be a quick operation for the user doing it and should present no impediment to other users of the system.

At the same time, some features that basic version control systems have supported are worthy of note; they were good things to have and are desirable in any replacement:

Networked access: it is highly desirable that users be able to share their work with one another across a network, without any need for them to share a disk or even be using the same operating system; ideally, they should need no more than HTTP access (optionally over SSL), at least for reading, since they may be behind a firewall and these usually allow HTTP access – albeit access by other means might be more efficient, flexible or powerful;
Access control: when a repository is accessible to many, it's highly useful to be able to control who has what kind of access to which parts of it; for example, some large group of people may have read-only access to all of a repository while only various sub-groups of them are able to modify various portions of the repository;
Symbolic names: however the version-control system may refer to individual versions or lines of development internally, it's worth letting users give their own names to these things, that are meaningful to users in their own terms;
Dealing with whole sets of files: the very fact that we had to build systems on top of RCS to deal with large collections of files, coupled to the desire to track code as it moves around, argues for having a system that deals with the whole of a set of file, rather than single files individually; furthermore, one of the commonest uses of symbolic names is to give one name to a revision of all of the files, so that checking out that named revision will get back that state of the whole set;
Modularisation: at the same time, it's desirable to be able to sub-divide a set of files being managed, so that some sub-sets may be managed independently of the rest.

Written by Eddy.