fanf

Some thoughts about git

23rd Apr 2009 | 20:55

I was originally planning to witter about distributed version control vs. centralized version control, especially the oft-neglected problem of breaking up a large cvs / svn / p4 repository. This was partly triggered by Linus's talk about git at Google, in which he didn't really address a couple of questions about how to migrate a corporate source repository to distributed version control. But in the end I don't think I have any point other than the fairly well-known one that distributed version control systems work best when your systems are split into reasonably small, self-contained modules, one per repository. Most systems are modular, even if all the modules are in one huge central repository, but the build and system integration parts can often get tightly coupled to the repository layout, making it much harder to decentralize.

Instead I'm going to wave my hands a bit about the ways in which git has unusual approaches to distributed version control, and how bzr in particular seems to take diametrically opposing attitudes. I'm not saying one is objectively better than the other, because most of these issues are fairly philosophical and for practical purposes they are dominated by things like quality of implementation and documentation and support.

Bottom-up

Git's design is very bottom-up. Linus started by designing a repository structure that he thought would support his goals of performance, semantics, and features, and worked upwards from there. The upper levels, especially the user interface, were thought to be of secondary importance and something that could be worked on and improved further down the line. As a result it has a reputation for being very unfriendly to use, but that problem is pretty much gone now.

Other VCSs take a similar approach, for example hg is based on its revlog data structure, and darcs has its patch algebra. However bzr seems to be designed from the top down, starting with a user interface and a set of supported workflows, and viewing its repository format and performance characteristics as of secondary importance and something that can be improved further down the line. As a result it has a reputation for being very slow.

Amortization

Most VCSs have a fairly intricate repository format, and every operation that writes to the repository eagerly keeps it in the canonical efficient form. Git is unusual because its write operations add data to the repository in an unpacked form, which makes writing cheaper but makes reading from the repository gradually less and less efficient - until you repack the repo in a separate heavy-weight operation to make reads faster again. (Git will do this automatically for you every so often.) The advantage of this is that the packed repository format isn't constrained by any need for incremental updates, so it can optimise for read performance at the expense of more complex pack writing, because this won't slow down common write operations. Bzr, being the opposite of git, seems to do a lot more up-front work when writing to its repository than other VCSs, e.g. to make annotation faster.

Thus git has two parallel repository formats, loose and packed. Other VCSs may have multiple repository formats, but only one at a time, and new formats are introduced to satisfy feature or performance requirements. Repository format changes are a pain and happily git's stabilized very early on - unlike bzr's.
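To give a feel for how cheap a loose write is, here is a minimal sketch in Python of storing and reading back one loose object: hash a short header plus the content, zlib-compress it, and drop it in a file named after the hash. The function names and the bare `objects/` directory are my own for illustration, and the sketch ignores packing entirely.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(repo_dir, data, obj_type="blob"):
    """Store data as one zlib-compressed file named after its SHA-1."""
    # The hashed and stored bytes are "<type> <size>\0" followed by the content.
    store = f"{obj_type} {len(data)}".encode() + b"\0" + data
    object_id = hashlib.sha1(store).hexdigest()
    # Fan out on the first two hex digits to keep directories small.
    directory = os.path.join(repo_dir, "objects", object_id[:2])
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, object_id[2:])
    if not os.path.exists(path):  # same content, same name: nothing to do
        with open(path, "wb") as handle:
            handle.write(zlib.compress(store))
    return object_id


def read_loose_object(repo_dir, object_id):
    """Return (type, content) for an object written by write_loose_object."""
    path = os.path.join(repo_dir, "objects", object_id[:2], object_id[2:])
    with open(path, "rb") as handle:
        store = zlib.decompress(handle.read())
    header, _, data = store.partition(b"\0")
    obj_type, _size = header.decode().split(" ")
    return obj_type, data


if __name__ == "__main__":
    repo = tempfile.mkdtemp()  # throwaway directory standing in for .git
    oid = write_loose_object(repo, b"some file content\n")
    print(oid, read_loose_object(repo, oid))
```

Each write touches exactly one new file and never rewrites existing data, which is the property that lets the packed format be designed without worrying about incremental updates.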

Laziness

As well as being slack about how it writes to its repository, git is also slack about what it writes. There has been an inclination in recent VCSs towards richer kinds of changeset, with support for file copies and renames or even things like token renames in darcs. The bzr developers think this is vital. Git, on the other hand, doesn't bother storing that kind of information at all, and instead lazily calculates it when necessary. There are some good reasons for this, in particular that developers will often not bother to be explicit about rich change information, or the information might be lost when transmitting a patch, or the change might have come from a different VCS that doesn't encode the information. This implies that even VCSs that can represent renames still need to be able to infer them in some situations.

Git's data structure helps to make this efficient: it identifies files and directories by a hash of their contents, so if the hash is the same it doesn't need to look any closer to find differences because there aren't any - and this implies a copy or rename. This means that you should not rename or copy a file and modify it in the same commit, because that makes git's rename inference harder. Similarly if you rename a directory, don't modify any of its contents (including renames and permissions changes) in the same commit.
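The exact-match case can be sketched in a few lines of Python. This is only an illustration of the idea, not git's code: `exact_renames` and the dict-of-paths "trees" are made up for the example, and only the blob hashing follows git's object naming.

```python
import hashlib


def blob_id(data):
    """Content hash in the style git uses for file contents."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()


def exact_renames(old_tree, new_tree):
    """Pair each path that vanished with a new path carrying the same content id.

    Both trees are plain dicts mapping path -> bytes (a stand-in for real trees).
    """
    old_ids = {path: blob_id(data) for path, data in old_tree.items()}
    new_ids = {path: blob_id(data) for path, data in new_tree.items()}

    # Index the removed paths by content id so each lookup is a dict hit.
    removed = {}
    for path in sorted(old_ids.keys() - new_ids.keys()):
        removed.setdefault(old_ids[path], []).append(path)

    renames = []
    for path in sorted(new_ids.keys() - old_ids.keys()):
        candidates = removed.get(new_ids[path])
        if candidates:
            renames.append((candidates.pop(0), path))
    return renames


if __name__ == "__main__":
    before = {"main.c": b"int main(void) { return 0; }\n", "README": b"hi\n"}
    after = {"src/main.c": b"int main(void) { return 0; }\n", "README": b"hi\n"}
    print(exact_renames(before, after))
```

No file contents are compared here, only ids, which is why an unmodified rename is the cheapest case to spot.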

Mercurial also uses hashes to identify things, but they aren't pure content hashes: they include historical information, so they can't be used to identify files with the same contents but different histories. Thus efficiency forces hg to represent copies explicitly.

Any more?

I should say that I know very little about bzr, and nothing about tla, mtn, or bk, so if any of the above is off the mark or over-states git's weirdness, then please correct me in a comment!

Comments {6}

Renames

from: anonymous
date: 24th Apr 2009 04:24 (UTC)

"This means that you should not rename or copy a file and modify it in the same commit, because that makes git's rename inference harder. Similarly if you rename a directory, don't modify any of its contents (including renames and permissions changes) in the same commit."

That information is misleading or plain wrong. With the default settings Git detects renames and copies with simultaneous content changes if 50% of the content remains untouched. Git's rename/copy detection is not about matching SHA1 hashes; Git calculates a "similarity index" instead and uses the 50% threshold (by default) to judge if it's the same file. The threshold can be configured, though.

Misleading statements like this probably originate from small "toy examples" where a DVCS-tester commits example files with only a couple of bytes of content. Then the files' similarity can't be detected reliably, as a change of only one byte may push the file past the default similarity threshold. In the real world, with actual content, the detection works nicely and the user does not have to worry about committing changes and renames/copies within the same "git commit".
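A rough Python sketch of threshold-based detection makes the toy-example effect visible. It is not Git's algorithm: `difflib.SequenceMatcher` over lines stands in for Git's own similarity computation, and the function names are invented for the illustration. Only the 50% default comes from the description above.

```python
from difflib import SequenceMatcher

DEFAULT_THRESHOLD = 0.5  # the 50% default described above


def similarity(old_text, new_text):
    """Line-based similarity in [0, 1]; a stand-in, not Git's real measure."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    return SequenceMatcher(None, old_lines, new_lines, autojunk=False).ratio()


def looks_like_rename(old_text, new_text, threshold=DEFAULT_THRESHOLD):
    """Treat a deleted/added pair as a rename if enough content survived."""
    return similarity(old_text, new_text) >= threshold


if __name__ == "__main__":
    # A ten-line file with two lines edited still clears the threshold.
    realistic_old = "".join(f"line {i}\n" for i in range(10))
    realistic_new = realistic_old.replace("line 3\n", "changed 3\n")
    realistic_new = realistic_new.replace("line 7\n", "changed 7\n")
    print(looks_like_rename(realistic_old, realistic_new))

    # A one-line toy file with one edit shares nothing at line granularity.
    print(looks_like_rename("x = 1\n", "x = 2\n"))
```

In this sketch the ten-line file survives two edited lines comfortably, while the one-line file drops to zero similarity after a single change, which is the small-file effect described above.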

Tony Finch

Re: Renames

from: fanf
date: 24th Apr 2009 11:54 (UTC)

Thanks for the correction. I think that for efficiency and paranoia I'd probably still prefer to avoid it :-)

Simon Tatham

from: simont
date: 24th Apr 2009 08:38 (UTC)

This means that you should not rename or copy a file and modify it in the same commit, because that makes git's rename inference harder. Similarly if you rename a directory, don't modify any of its contents (including renames and permissions changes) in the same commit.

This seems particularly perverse given the fact that renames often want to be accompanied by changes in the renamed files. For instance, move source files about, update all the internal pathnames by which they refer to each other. Rename a program, change the name in its help and error messages. And so on.

Of course you could do those in a pair of consecutive commits, but isn't it obviously more useful to be able to come back later and find a single commit in which The Great Rename took place atomically? It makes version-control archaeology easier, since you don't have to augment your mental model of the actual software development with a mental model of things that had to be done a certain way due to VCS limitations; and it also makes it much easier to extract a diff or changeset description to send to someone else.

from: mooism
date: 24th Apr 2009 11:13 (UTC)

(Disclaimer: I haven't used git, merely absorbed lots of propaganda from its users :-)

Git can detect that new.file is mostly the same as an old.file that doesn't exist any more, and deduce that old.file has been renamed and modified. Obviously this is not as easy for git as detecting that old.file and new.file are identical, but I have never previously come across the contention that it makes git's rename inference harder to the point of it being slow or inconvenient to use.

fanf, how much of a slowdown have you noticed when renaming and modifying a file in the same git commit?

ewx

from: ewx
date: 24th Apr 2009 10:11 (UTC)

Linus's point about tracking more detailed rearrangements than just file renames is a good one, but if you can't safely rename a file and change its contents at the same time it doesn't sound like Git's actually doing what he says, but something much stupider and less useful. I'll have explicit renames instead, thanks l-)

from: anonymous
date: 24th Apr 2009 10:39 (UTC)

Please stop this FUD about git rename detection. You can rename and change the content of a file in git. With default settings, you can change about half of the content and git still detects it as a rename. You can also ask for other similarity thresholds used with rename detection.
