Subordinate git repositories

Some of my colleagues looked at git's support for submodules a while back and I paid some attention to the discussion; I've also worked briefly with a project that used the repo system to manage submodules. My general impression is that git submodules aren't really what I want, so I've given some thought to what I do want as subordinate source-tree support in a version control system. This is heavily shaped by my tendency to think of a .git/ directory as conceptually divorced from (albeit, for convenience, usually a sub-directory of) the source tree it describes and manipulates; it is a tool for recording and restoring states of a source tree, hence also for recording and replaying changes of state.

Sub-division

The first thing that strikes me is that a sub-directory isn't always what I want to track; in some cases (e.g. the ~/.sys/ directory whose bin/ sub-directory I include in my PATH) there are conceptually separate sets of files that I want to share with different collaborators or check-out locations – for example

work-related: scripts and data that I'm not going to take with me to my next job because they'll be irrelevant there; I'll want to share these with my colleagues, but not with my machine at home. These may be further sub-divided into project-specific sub-sets that I'll only want to share with limited sub-sets of my colleagues.
idiosyncratic: stuff that configures my local system to work the way I like to have it set up – including the fact that loads of stuff goes in ~/.sys/ that others would have in ~/ itself. I want this both at work and at home, but my colleagues probably don't share my peculiar tastes in how to do things.

– that all live in the same directory tree, intermingled with one another. I could separate these out, but the only reason to do so is that version-control systems don't play well with such an arrangement. So when I talk about a module I mean: a more or less arbitrary sub-set of my source tree. It might be specified in terms of a set of sub-directories (including everything under each) and files; but there may be some better approach, given that I might want to rename parts of my module. In a given source tree, there are potentially many interleaved modules; and some may have sub-modules within them. There may even be overlapping modules (not very modular, but there may be reasons to allow it, anyway). It should be possible to mediate all such via each module's relation to the whole source tree.
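To make that notion of a module concrete, here is a minimal sketch of one possible specification: a set of directories (each including everything under it) plus a set of individual files. The names ModuleSpec, contains and modules_of are hypothetical, invented for illustration; nothing like them exists in git.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ModuleSpec:
    """Hypothetical description of a module as a sub-set of a source tree."""
    name: str
    directories: frozenset = frozenset()  # each includes everything beneath it
    files: frozenset = frozenset()        # individually listed files

    def contains(self, path):
        """True if path (relative to the tree's root) belongs to the module."""
        p = PurePosixPath(path)
        if str(p) in self.files:
            return True
        # A path is inside a listed directory if that directory is an ancestor.
        return any(PurePosixPath(d) in p.parents for d in self.directories)


def modules_of(path, specs):
    """Names of all modules claiming path; more than one means they overlap."""
    return sorted(s.name for s in specs if s.contains(path))
```

With a work module covering bin/work and a personal one covering all of bin, modules_of("bin/work/deploy.sh", ...) would name both, which is the overlapping case mentioned above. A specification by path like this does not yet answer the renaming question; it only shows that interleaved and overlapping modules are easy to describe relative to the whole tree.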

In an existing repository containing the larger source tree, I can create a branch to describe my module and use git filter-branch to convert this branch to one describing the history of my module, ignoring all other files. I'd need to take care to keep branches related to the module separate from those related to the larger source-tree, but the branch name-space in git is flexible enough to make that easy. I'd push the module's branches to repositories specific to the module, and pull changes from such for use in my module, just as branches related to the full source tree would push and pull to and from repositories specific to it. I can replay changes (using cherry-pick or rebase) from the module's branch into the main source tree; these should replay cleanly. Replaying changes from the main source tree onto a module branch would get conflicts relating to files omitted from the module, easily resolved by removal; it would also add all new files to the module, for which git filter-branch on the replayed changes can take out unwanted additions. Even renamings in the main source tree should work as gracefully as one can reasonably hope for with this; git's usual mechanisms for replaying do what you'll need for these uses.
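As a rough sketch of the filtering step, the function below builds (but does not run) a git filter-branch invocation that rewrites one branch so that only the module's paths survive. The function name and the example branch and path names are my own assumptions; the index-filter idiom (empty the index, then restore just the wanted paths from the commit being rewritten, which filter-branch exposes as $GIT_COMMIT) is a commonly used one, but I have not tested it against every corner case.

```python
import shlex


def module_filter_command(module_paths, branch):
    """Build (not run) a git filter-branch call keeping only module_paths.

    module_paths: files and directories, relative to the tree's root.
    branch: the (already created) branch to rewrite into the module's history.
    """
    if not module_paths:
        raise ValueError("a module needs at least one path")
    keep = " ".join(shlex.quote(p) for p in module_paths)
    # Empty the index, then restore only the module's paths from the
    # commit being rewritten.
    index_filter = (
        "git rm --cached -q -r --ignore-unmatch -- . && "
        "git reset -q $GIT_COMMIT -- " + keep
    )
    return [
        "git", "filter-branch",
        "--prune-empty",              # drop commits that no longer change anything
        "--index-filter", index_filter,
        "--", branch,
    ]
```

For example, module_filter_command(["bin/work", "etc/proxy.conf"], "module/work") yields an argument list that could be handed to subprocess.run. Since filter-branch rewrites history, it would be prudent to try it in a scratch clone first.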

Nothing in branches of the larger source tree need ever be conscious, in any way that git push and peers would let other repositories know about, of the presence of the module. It should be possible to write scripts to package the needed replaying and git filter-branch so as to make it easy to map change-sets between branches of the full source tree and branches of the module. Those scripts could sensibly form a git module command namespace and use configuration related to each module (mostly, how its branches' names map to and from corresponding branches of the main repository; the configuration of remotes used for the module would also want to take that into account) organised in a similar manner to the git remote commands and their configuration.
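One possible shape for the branch-name part of that configuration, sketched here under my own assumptions, is a pair of patterns per module in the style of a refspec: one for the module's branches as named in the combined repository, one for the same branches as named in the module's own repositories. The table MODULE_BRANCHES and the function names are hypothetical; this is not an existing git feature.

```python
def map_ref(name, src_pattern, dst_pattern):
    """Map name from src_pattern to dst_pattern, refspec-style.

    Each pattern holds at most one "*"; returns None when name does not match.
    """
    if "*" not in src_pattern:
        return dst_pattern if name == src_pattern else None
    prefix, suffix = src_pattern.split("*", 1)
    if len(name) < len(prefix) + len(suffix):
        return None
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    stem = name[len(prefix):len(name) - len(suffix)]
    return dst_pattern.replace("*", stem, 1)


# Hypothetical per-module configuration:
# (pattern in the combined repository, pattern in the module's repositories)
MODULE_BRANCHES = {
    "work": ("module/work/*", "*"),
}


def to_module_branch(module, name):
    """Name used in the module's own repositories for a combined-repo branch."""
    local, published = MODULE_BRANCHES[module]
    return map_ref(name, local, published)


def to_main_branch(module, name):
    """Name used in the combined repository for one of the module's branches."""
    local, published = MODULE_BRANCHES[module]
    return map_ref(name, published, local)
```

Here to_module_branch("work", "module/work/master") gives "master", while a branch such as "master", which belongs to the full source tree, maps to None and so is left alone by anything acting on the module.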

So git already contains everything it needs for the kind of modules I actually think are worth dealing with in a source tree; and the relationship between the main source tree and these modules need only be mediated by repositories in which developers map changes from each to the other. Other developers, interested only in the module or only in the whole, can be entirely unaware of the module decomposition and unaffected by it. The only real complication is that the module-as-branch sees the module's files relative to the root of the main source tree, where some modules shall want to describe (to those working only with the module) their contents relative to some sub-directory of that root.

Sub-directories as modules

A common type of module is a sub-directory whose contents relate to a relatively autonomous part of the whole system, that might also be shared with other systems. It's natural for a module's repository to think of the module as the whole source tree, even though it may be incorporated into other source trees as a (potentially deep) sub-directory. Different client source trees may indeed want to put the module in quite different locations relative to their root, which only reinforces the module's need to believe in itself relative to its own root. This complicates the replay operations required, fortunately only slightly; now replaying module changes in the larger repository shall also require a git filter-branch to rename contents into the sub-directory; and the existing filter-branch for the reverse direction shall need to rename away the sub-directory. The only time this gets complicated is when the larger source tree moves the module around – to which I'll return, after considering other aspects of this common use-case.
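The renaming in each direction amounts to adding or stripping a path prefix. The sketch below shows only that mapping, with hypothetical function names and example paths; for the stripping direction, git filter-branch already offers a --subdirectory-filter option that makes a given sub-directory the root of the rewritten history.

```python
from pathlib import PurePosixPath


def to_module_path(tree_path, module_root):
    """Module-relative form of tree_path, or None if it lies outside the module."""
    try:
        rel = PurePosixPath(tree_path).relative_to(PurePosixPath(module_root))
    except ValueError:
        return None  # not under the module's sub-directory
    return str(rel)


def to_tree_path(module_path, module_root):
    """Where a module-relative path lands in a tree that hosts the module."""
    if PurePosixPath(module_path).is_absolute():
        raise ValueError("module paths must be relative to the module's root")
    return str(PurePosixPath(module_root) / module_path)
```

The same module file src/lex.c would then live at third_party/parser/src/lex.c in one client tree and at vendor/parser/src/lex.c in another, while the module itself only ever sees src/lex.c.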

When one has a larger source tree containing sub-directories as modules, it'd be nice to be able to automate simultaneous development of the whole and of the modules within it – that is:

have modules and the main repository synchronised with one another (so that their opinions on what HEAD should contain are consistent);
have commands such as git add, git rm, that manipulate the staging area, do so for relevant modules as well as the main source tree, in so far as the changes staged appear to belong to a module's sub-directory;
provide git module {add,rm,…} commands for correcting the usually correct guesses made by git {add,rm,…};
have git status report the statuses of modules within the source tree as well as that of the tree as a whole; and git commit commit each module with staged changes, using the same commit message and authorship, at the same time as the main repository;
have git checkout -b create corresponding branches in all modules at the same time as in the main source tree;

and so on. This would save the need for later replay operations, mapping changes between modules and main source tree. There would be complications to git checkout when requesting a ref for which the main source repository can't directly infer a matching ref in each module, or where the ref that it expects to match doesn't; in such a case, the main source tree should notice the problem, report it and suppress tracking the module's changes; git status would repeat this report; and git module checkout should provide the means to bring a module into sync with the main source tree, if needed.

It would also be nice to separate out the administrative files (i.e. contents of .git/ subdirectories) of modules from those of the main source tree; for example, storing their remote configuration, refs and the transient files related to replaying changes between them and the main source tree separately from the latter's primary .git/. Automated simultaneous development, in particular, would want a separate staging area for each module. One obvious way to do this would be as .git/ sub-directories in the individual modules, just as in a normal checkout of the module. Normally, git traverses the directory tree upwards in search of a .git/ on which to act, which would lead to it treating commands executed within a module as acting only on that module, unaware that it should propagate upwards to the main source-tree. That can be avoided either by telling each module's .git/ that it's a module within a larger tree, so that git can know to continue up the tree when it finds one of these; or by naming such a sub-repository differently, e.g. .sub-git/. There may also be value in doing both, with the .git/ directory being its normal self (aside from the subordinate flag) and the .sub-git/ handling administration of the interaction with the larger source tree. I'll assume a simple .sub-git/ approach for the purposes of further discussion, but it doesn't really make much difference.

The primary .git/ and the .sub-git/ of a module shall have objects relating to the module in common; so it'll be natural for them to share (e.g. by hard links; or in the fashion of git clone --shared) objects. Each .sub-git/ should be a fully functional repository that can serve as a remote for other repositories of the same module. The presence of .sub-git/ would facilitate git module commands' ability to work out what to act on. The only real complication would come – once more – when a renaming moves the module around within the source tree.

Actually doing the rename is easy: I'll move the module directory to a different place in the source tree, maybe make some changes in it and other places, add my changes. The move will naturally have also moved the .sub-git/ with the module. The main repository shall see the bulk rename; if it tells the module to add those changes, the module shall silently ignore them, as it won't see any change; but the main repository probably knows they're renames anyway, so won't bother to tell it. For files that haven't simply been renamed, the module will see a change when told to add (or (re)move) them, and the main repository shall know they've not just been renamed, so will tell it. All is well and we can commit as usual.

The complication comes when we use git checkout to switch between revisions before and after such a rename. Before the command, the module directory contains a .sub-git/, which is necessarily not treated as part of the source tree (else git push and friends would need to know about it, and I'm quite sure I don't want that). The command causes git to change the source tree, removing content from the old directory and adding it in the new, no part of which explicitly says a rename is happening, so the .sub-git/ directory is apt to be left behind. If the change also includes large amounts of change within the module, it may be very non-obvious to git that it was renamed, leaving it with little clue that it should also move the .sub-git/ as part of the checkout.

It will have one clue, however; when it does a checkout, it naturally needs to synchronise all modules' .sub-git/ states. As noted above, this isn't always possible and git would, in any case, need to check for it. So it'll notice a stranded .sub-git/ as a matter of course and mark it as out of sync. If it suspects it's done a rename, it can check whether the stranded .sub-git/'s expected state matches the target of the suspected rename; if it does, it can automatically move it to where it's needed, rather than marking it as out of sync. So the rename problem can just be handled as a special case of the out-of-sync problem, with a robust heuristic capable of resolving its more common cases.
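A sketch of that heuristic, under the assumption that states are compared by tree id: git names each directory's tree object by its content, so a directory in the main tree (whose id git rev-parse HEAD:some/dir reports) has the same id as the root tree of a module commit with identical contents. The function name and the mapping it takes are hypothetical.

```python
def find_new_home(expected_tree, directory_trees):
    """Return the one directory whose tree id equals expected_tree, else None.

    expected_tree: id of the tree the stranded .sub-git/ expects its module
        to have in the newly checked-out revision.
    directory_trees: maps each candidate directory in that revision to the
        id of its tree object.
    """
    matches = [d for d, tree in directory_trees.items() if tree == expected_tree]
    # Only act on an unambiguous match; anything else stays out of sync.
    return matches[0] if len(matches) == 1 else None
```

Returning None for both "no match" and "several matches" keeps the fallback the same in either case: the module is marked as out of sync and left for the user to sort out.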

By default, when a .sub-git/'s HEAD doesn't match the top-level .git/'s HEAD's account of what should be in the module's directory, git would necessarily perform no operations on the module. It would report it as out of sync in git status, but it couldn't sensibly reflect changes into it. The user can fix that up by moving a stranded .sub-git/ to its new home or by filtering a branch from the main source tree to make a matching one in the module; a git module checkout should then suffice to get it in sync.

A module may be added or removed at some point in the history of the main source tree. The correct state for its .sub-git/, in a checkout of the main source tree to a revision without the module, is the initial empty state of a git repository when there's nothing in it, just after running git init. This state isn't actually at a commit (and I find I can't actually commit it, much less tag it), so can't be referenced by any tag or branch; but it should surely be possible for the module infrastructure to set the .sub-git/ into a state equivalent to this, to indicate that its proper state in the given commit of the source tree is absence and it is in sync as long as the main source tree's HEAD has its directory empty.
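Putting the sync rule together with this absent state gives a small classification, sketched here with hypothetical names. It assumes both sides are summarised as tree ids, with None standing for "the main tree's HEAD has nothing at the module's directory" on one side and "the .sub-git/ is in its just-initialised state, with no commit" on the other.

```python
def sync_state(main_view, module_head):
    """Classify a module against the main source tree's HEAD.

    main_view: tree id the main HEAD records for the module's directory,
        or None if that directory is absent from the revision.
    module_head: tree id of the module's HEAD commit, or None if the
        .sub-git/ is in its initial empty state.
    """
    if main_view is None and module_head is None:
        return "in sync (absent)"
    if main_view == module_head:
        return "in sync"
    return "out of sync"
```

Only the two "in sync" results would let git reflect staged changes into the module; "out of sync" is the case that git status would report and git module checkout would be used to repair.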

Written by Eddy.