Tuesday, August 07, 2012

Modular git with "git subtree"

One thing I have always disliked about the way we organize our Horde repository is that all library modules and applications are lumped together in a single git repository. Of course there are some good reasons for that type of monolithic repo. But for someone interested only in our (really powerful) IMAP library this is a drawback: the library is hidden somewhere among the other libraries, and if you want to work on it you still have to clone the whole repository. There are other situations in which small, module-specific repositories would make sense. Until now I wasn't aware of a solution that would allow for a reasonable compromise.

Originally I only knew that git submodule would allow including additional git repositories into a master repository. This approach has some drawbacks though. We could construct the current horde repository out of a bunch of submodules, but the workflow within this master repository would be significantly more cumbersome, because git submodule interferes with the default way of working with git.

git subtree to the rescue!

git subtree, however, seems to allow for the perfect solution: separate subrepositories can co-exist with the monolithic master repository, and commits to either side can be exchanged between them. The stream of commits to the monolithic master can even be transmitted automatically to the split repositories. None of these steps seems to introduce any additional overhead to any of these repositories.

Installing "git subtree"

The subtree command was added in the git-1.7.11 release. But as many distributions do not yet offer this version, you can install the tool in a more hackish way if desired:

cd /usr/lib/git-core/
wget https://raw.github.com/apenwarr/git-subtree/master/git-subtree.sh
mv git-subtree.sh git-subtree
chmod 755 git-subtree
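
The directory /usr/lib/git-core/ is where many Linux distributions keep git's helper programs, but the location can differ between systems. The following is a minimal sketch, assuming a POSIX shell, that asks git for its helper directory via git --exec-path and checks whether a subtree helper is already present. The function name has_git_helper is made up for this illustration.

```shell
# Return 0 if an executable named git-<name> exists in the given directory.
has_git_helper() {
    dir=$1
    name=$2
    [ -x "$dir/git-$name" ]
}

# Look in git's own helper directory instead of relying on a hard-coded path.
if command -v git >/dev/null 2>&1; then
    exec_path=$(git --exec-path)
    if has_git_helper "$exec_path" subtree; then
        echo "git subtree is already installed in $exec_path"
    else
        echo "git subtree is missing; the script would go to $exec_path/git-subtree"
    fi
fi
```

If the helper is reported as already installed, the manual download above is not needed.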

Replacing a "submodule" with a "subtree"

A while ago I pulled the Jenkins installation procedures into our horde-support repository using git submodule. To give git subtree a first test run I replaced the Jenkins submodule with a subtree. The first step had to be the removal of the old submodule:

git rm .gitmodules ci/jenkins
git commit -m "Remove the jenkins installation procedures as a submodule. Prepares for replacement by 'git subtree'"

Now I imported the repository previously registered via git submodule using git subtree:

git subtree add --prefix=ci/jenkins --squash https://github.com/wrobel/jenkins-install.git master

This pulled the external repository into the current horde-support repository at prefix "ci/jenkins" and squashed all commits of the imported repository into a single commit. The imported code is now an equal citizen next to the rest of the code in the repository - none of the standard git workflows are affected in any way.

Of course the interesting question is whether updates to this imported code can be merged back into the original repository. I committed a small change within the imported code:

git commit -m "Update to jenkins-1.475" ci/jenkins/jenkins.mk

This change can indeed now be split off into the subtree again and exported to the original repository:

git subtree split --prefix=ci/jenkins --annotate="(horde-support) " d73edc4878c8.. --branch ci-jenkins

What happens here is that git subtree splits the history of the path specified with the prefix option into a separate branch named ci-jenkins. It prefixes the message of every commit transported into this branch with (horde-support) to indicate the origin of the commit. Usually the commit range given here (d73edc4878c8..) is unnecessary for the operation. But the code within ci/jenkins had been included as a submodule before commit d73edc4878c8, and that part of the history should not be imported into the split branch.

After the splitting operation created the new ci-jenkins branch in my repository it should be equivalent to our original, imported repository. Thus I was able to push back to it:

git push git@github.com:wrobel/jenkins-install.git ci-jenkins:master

Using "git subtree" for the horde repository

Can the subtree approach be used to have both a monolithic horde repository and the small modular repositories at the same time? This would be the best of both worlds: while we develop in the monolithic horde repository, other developers can watch and patch single modules. If the commits from the monolithic repo can be transferred to the modular repos on a regular basis, and we can also import patches the other way around without blowing up any of the associated git repos, I'd be really happy.

I admit that I haven't tested the subtree approach at large scale yet - but everything I have tested so far indicates that the situation detailed above can indeed be achieved and automated.

In order to automate the splitting of the monolithic repository into different modules I would use an intermediate git repository that handles the splitting within a post-receive hook. Any push to a branch of this repository would then update the same branch in the various split repositories. The core of the splitting procedure in the post-receive hook I have set up looks like this so far:

git config --bool core.bare false
git checkout $short_refname
git reset --hard HEAD
if [ -z "`git branch | grep subtrees/$short_refname`" ]; then
    git branch subtrees/$short_refname
fi
git checkout subtrees/$short_refname
git merge $short_refname
git subtree split --prefix=$subtree --annotate="(horde) " --branch subtrees/$module/$short_refname --rejoin
git push git@github.com:horde/$module.git subtrees/$module/$short_refname:$short_refname
git config --bool core.bare true

The procedure runs in a loop that walks through the different modules of the horde repository. $module refers to the current module, $subtree to the path corresponding to this module, and $short_refname indicates the branch that was pushed to.
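
The surrounding loop is not shown in the excerpt, so here is a minimal sketch of how it might look. It relies on the fact that git feeds a post-receive hook one "<old-sha> <new-sha> <refname>" line per updated ref on standard input. The module table and the function names short_refname_of and process_pushes are assumptions made for this illustration, and the echo stands in for the actual splitting steps.

```shell
# Strip the refs/heads/ prefix from a full ref name; fail for tags and other refs.
short_refname_of() {
    case "$1" in
        refs/heads/*) printf '%s\n' "${1#refs/heads/}" ;;
        *) return 1 ;;
    esac
}

# Hypothetical module table: "<module> <path inside the monolithic repo>" per line.
modules='Imap_Client framework/Imap_Client
imp imp'

# Read the pushed refs from stdin and visit every module for each pushed branch.
process_pushes() {
    while read -r oldrev newrev refname; do
        # Skip anything that is not a branch update.
        short_refname=$(short_refname_of "$refname") || continue
        printf '%s\n' "$modules" | while read -r module subtree; do
            # The splitting steps from the hook would run here.
            echo "split $subtree -> $module ($short_refname)"
        done
    done
}
```

Inside the hook itself this would be invoked simply as process_pushes, since the hook's standard input already carries the list of updated refs.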

I'll walk you through the different steps...

The remote repository needs to be in a state where branch updates can be pushed to it: it needs to be "bare". Otherwise updates to the branch currently checked out in the remote repository would be refused. git subtree, however, requires a real checkout to work on. So before initiating the splitting process the repository is marked as non-"bare":

git config --bool core.bare false

Subsequently the branch that was pushed to is checked out and reset to HEAD - this is the basis for git subtree to work its magic.

git checkout $short_refname
git reset --hard HEAD

The splitting procedure benefits from using a separate branch that remembers the previous splitting operations using merge commits. It would also work without such a branch, but the subtree operation would then have to walk through each and every commit of the repository again - a waste of time. In case this subtree-specific branch does not exist yet, it is created in the next step.

if [ -z "`git branch | grep subtrees/$short_refname`" ]; then
    git branch subtrees/$short_refname
fi
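
One caveat with the check above: grep matches substrings, so an existing branch such as subtrees/master-old would also count as a match for subtrees/master. A possible alternative, sketched here with a made-up function name ensure_branch, is to ask git for the exact ref with git show-ref --verify:

```shell
# Create the branch only if refs/heads/<name> does not exist yet.
# show-ref --verify compares the full ref name exactly, unlike a substring grep.
ensure_branch() {
    git show-ref --verify --quiet "refs/heads/$1" || git branch "$1"
}
```

In the hook this would be called as ensure_branch "subtrees/$short_refname" in place of the if block.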

All updates that were just pushed into the remote repository are now being merged into the branch specifically created for the subtree operation. Here the original line of development and the subtree marker merges (which do not affect the code itself) live together in one branch.

git checkout subtrees/$short_refname
git merge $short_refname

This sets the stage for the splitting operation, which can now analyze the incoming commits for any changes to the module currently being handled.

git subtree split --prefix=$subtree --annotate="(horde) " --branch subtrees/$module/$short_refname --rejoin

If there were changes that affected the current module they will be pushed to the corresponding git repository on github using the following command:

git push git@github.com:horde/$module.git subtrees/$module/$short_refname:$short_refname
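
As written, the push runs for every module, and git simply reports that everything is up to date when nothing changed. If that turns out to be too slow, one option is to push only when the split branch actually moved. The following sketch assumes that the split branch tip stays the same when no new commits touched the module; the function names branch_tip and tip_changed are made up for this illustration.

```shell
# Print the commit a local branch points to, or nothing if the branch does not exist.
branch_tip() {
    git rev-parse --verify --quiet "refs/heads/$1" || true
}

# Return 0 when the branch tip differs from a previously recorded commit id.
tip_changed() {
    [ "$(branch_tip "$1")" != "$2" ]
}

# Intended use inside the hook (left as comments, since it needs the hook's variables):
#   before=$(branch_tip "subtrees/$module/$short_refname")
#   ... run git subtree split as shown above ...
#   tip_changed "subtrees/$module/$short_refname" "$before" && git push ...
```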

And finally the repository will be marked as bare again to prepare it for the next commit:

git config --bool core.bare true
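
If one of the steps in between fails, the repository would stay marked as non-bare and later pushes to the checked-out branch could be refused. A minimal sketch of a guard, assuming the repository path is passed in explicitly (inside a real hook git already sets GIT_DIR); the function name with_worktree is hypothetical:

```shell
# Run a command with core.bare temporarily switched off for the given repository.
# The subshell plus EXIT trap restores core.bare=true even if the command fails.
with_worktree() {
    repo=$1
    shift
    (
        trap 'git --git-dir="$repo" config --bool core.bare true' EXIT
        git --git-dir="$repo" config --bool core.bare false
        "$@"
        exit $?
    )
}
```

The splitting steps could then be wrapped in a function of their own and run as with_worktree /path/to/repo.git that_function, where both names are placeholders.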

Of course all of this is still somewhat untested. It still has to be shown to work at large scale - with about one hundred different Horde modules at the same time. But at least it looks very promising.

7 comments:

  1. Did you end up having problems with this? I see that https://github.com/horde/ has only one monolithic repo :)

  2. The main problem was the lack of time to further pursue it. One blocker was the amount of time it took. I think it should be possible to make the approach more efficient but so far I just wasn't able to return to the issue yet. Still on my ToDo list though, ... sigh

  3. That's exactly what I've been trying to do for a long time with my PHP library. Thanks for giving such a detailed example on git-subtree. Unfortunately even git-subtree does not seem to solve my biggest problem with this. Imagine that I have a library design like this (pseudo example):

    e.g: package Controller

    lib/php/Controller.php
    lib/php/Controller/Exception.php

    e.g: package View

    lib/php/View.php
    lib/php/View/Exception.php

    Files from both sub packages should be merged (potentially) into the same lib folder of the 'master' repo.

    Excited by your post, I immediately tried a simple example, but I got:

    'prefix 'lib/php' already exists.'

    Just a short question: do you think what I'm trying to do is possible?

  4. Sorry for the rather late reply...

    I don't think it is possible to achieve what you want to do using git directly. But with Horde the installed file hierarchy looks similar to what you want. We use PEAR in between to install the file like that though.

    You could also give symlinks another thought.

  5. Hello Gunnar,

    thanks for your reply. Since I've posted my question I've had the same insight. I haven't looked at horde yet but I plan to do. Currently I'm with PEAR like you. Also composer looks interesting for that purpose.

    Greets,
    Thorsten

  6. Thanks a lot for the detailed procedure for using git subtree. I was able to split up code (on my relatively small codebase for now) and create subtrees within the main repo.
    I was wondering if you also looked into automatic pull from the subtree repos when you were pulling on the main repo by creating some sort of a hook. I plan to try something like that and was wondering if there was any experience/knowledge/information that you had come across for me to get any ideas.
    Thanks!

  7. Great article, very encouraging, I'm about to try following it.

    It has been a year since you made the transition, is this still working for you? Or is there a different approach that you are using now?

    Thanks!

    Doug
