Tony Finch (fanf) wrote,

Uplift from SCCS to git

My current project is to replace Cambridge University's DNS servers. The first stage of this project is to transfer the code from SCCS to Git so that it is easier to work with.

Ironically, to do this I have ended up spending lots of time working with SCCS and RCS, rather than Git. This was mainly developing analysis and conversion tools to get things into a fit state for Git.

If you find yourself in a similar situation, you might find these tools helpful.


Cambridge was allocated three Class B networks in the 1980s: first the Computer Lab got one in 1987; then the Department of Engineering got one in 1988; and eventually the Computing Service got one in 1989 for the University (and related institutions) as a whole.

The oldest records I have found date from September 1990 and list about 300 registrations. The next two departments to get connected were the Statistical Laboratory and Molecular Biology (I can't say in which order). The Statslab was allocated a subnet, which it has kept for 24 years! Things pick up in 1991, when the JANET IP Service was started and rapidly took over from X.25. (Last month I blogged about connectivity for Astronomy in Cambridge in 1991.)

I have found these historical nuggets in our ip-register directory tree. This contains the infrastructure and history of IP address and DNS registration in Cambridge going back a quarter century. But it isn't just an archive: it is a working system which has been in production that long. Because of this, converting the directory tree to Git presents certain challenges.


The ip-register directory tree contains a mixture of:

  • Source code, mostly with SCCS history
  • Production scripts, mostly with SCCS history
  • Configuration files, mostly with SCCS history
  • The occasional executable
  • A few upstream perl libraries
  • Output files and other working files used by the production scripts
  • Secrets, such as private keys and passwords
  • Mail archives
  • Historical artifacts, such as old preserved copies of parts of the directory tree
  • Miscellaneous files without SCCS history
  • Editor backup files with ~ suffixes

My aim was to preserve this all as faithfully as I could, while converting it to Git in a way that represents the history in a useful manner.


The rough strategy was:

  1. Take a copy of the ip-register directory tree, preserving modification times. (There is no need to preserve owners because any useful ownership information was lost when the directory tree moved off the Central Unix Service before that shut down in 2008.)
  2. Convert from SCCS to RCS file-by-file. Converting between these formats is a simple one-to-one mapping.
  3. Files without SCCS history will have very short artificial RCS histories created from their modification times and editor backup files.
  4. Convert the RCS tree to CVS. This is basically just moving files around, because a CVS repository is little more than a directory tree of RCS files.
  5. Convert the CVS repository to Git using git cvsimport. This is the only phase that needs to do cross-file history analysis, and other people have already produced a satisfactory solution.
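Step 4 really is just shuffling files around, since a CVS repository is essentially a directory tree of RCS `,v` files. A minimal sketch of that step (a hypothetical helper, not one of my real scripts, which are shell/perl):

```python
import os
import shutil

def rcs_tree_to_cvs(rcs_root, cvs_module):
    """Copy a tree of RCS ",v" files into a CVS module directory.

    A CVS repository is little more than a directory tree of RCS
    files, so "conversion" is a recursive copy preserving layout.
    """
    for dirpath, _dirnames, filenames in os.walk(rcs_root):
        rel = os.path.relpath(dirpath, rcs_root)
        target = os.path.join(cvs_module, rel)
        os.makedirs(target, exist_ok=True)
        for name in filenames:
            if name.endswith(",v"):  # only RCS history files belong in CVS
                shutil.copy2(os.path.join(dirpath, name),
                             os.path.join(target, name))
```

(The real conversion creates the RCS files directly in the target repository, precisely to avoid even this much shuffling.)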

Simples! ... Not.

sccs2rcs proves inadequate

I first tried ESR's sccs2rcs Python script. Unfortunately I rapidly ran into a number of showstoppers.

  • It didn't work with Solaris SCCS, which is what was available on the ip-register server.
  • It destructively updates the SCCS tree, losing information about the relationship between the working files and the SCCS files.
  • It works on a whole directory tree, so it doesn't give you file-by-file control.

I fixed a bug or two but very soon concluded the program was entirely the wrong shape.

(In the end, the Solaris incompatibility became moot when I installed GNU CSSC on my FreeBSD workstation to do the conversion. But the other problems with sccs2rcs remained.)


So I wrote a small script called sccs2rcs1 which just converts one SCCS file to one RCS file, and gives you control over where the RCS and temporary files are placed. This meant that I would not have to shuffle RCS files around: I could just create them directly in the target CVS repository. Also, sccs2rcs1 uses RCS options to avoid the need to fiddle with checkout locks, which is a significant simplification.

The main regression compared to sccs2rcs is that sccs2rcs1 does not support branches, because I didn't have any files with branches.
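The one-to-one nature of the SCCS-to-RCS mapping comes down to walking the SCCS delta table, which lists every revision with its date and committer. A toy parser illustrating the idea (the real sccs2rcs1 shells out to the SCCS and RCS tools rather than parsing the format itself):

```python
SOH = "\x01"  # SCCS control lines start with an ASCII SOH byte

def sccs_deltas(text):
    """Extract (sid, date, time, user) tuples from an SCCS file's
    delta table. SCCS stores the newest delta first, so the list
    comes out newest-to-oldest. This ignores the rest of the delta
    block (comments, included/excluded deltas, etc.)."""
    deltas = []
    for line in text.splitlines():
        if line.startswith(SOH + "d D "):
            _d, _D, sid, date, time, user = line[1:].split()[:6]
            deltas.append((sid, date, time, user))
    return deltas
```

Each such delta becomes one RCS revision, with its date and author carried across.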


At this point I needed to work out how I was going to co-ordinate the invocations of sccs2rcs1 to convert the whole tree. What was in there?!

I wrote a fairly quick-and-dirty script called sccscheck which analyses a directory tree and prints out notes on various features and anomalies. A significant proportion of the code exists to work out the relationship between working files, backup files, and SCCS files.
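The heart of that relationship-working-out is grouping each working file with its `SCCS/s.*` history file and any `~` backup. A rough sketch of the classification (hypothetical; the real sccscheck does rather more error checking):

```python
import os
from collections import defaultdict

def classify(tree):
    """Group each file's variants under its working-file path:
    SCCS history (SCCS/s.foo), editor backup (foo~), and the plain
    working file (foo). Lock files and other SCCS metadata are not
    handled in this sketch."""
    groups = defaultdict(dict)
    for dirpath, _dirs, files in os.walk(tree):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.basename(dirpath) == "SCCS" and name.startswith("s."):
                working = os.path.join(os.path.dirname(dirpath), name[2:])
                groups[working]["sccs"] = path
            elif name.endswith("~"):
                groups[os.path.join(dirpath, name[:-1])]["backup"] = path
            else:
                groups[os.path.join(dirpath, name)]["working"] = path
    return groups
```

Anomalies then fall out of the grouping: an SCCS file with no working file, a backup with no history, and so on.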

I could then start work on determining what fix-ups were necessary before the SCCS-to-CVS conversion.


One notable part of the ip-register directory tree was the archive subdirectory, which contained lots of gzipped SCCS files with date stamps. What relationship did they have to each other? My first guess was that they might be successive snapshots of a growing history, and that the corresponding SCCS files in the working part of the tree would contain the whole history.

I wrote sccsprefix to verify if one SCCS file is a prefix of another, i.e. that it records the same history up to a certain point.

This proved that the files were NOT snapshots! In fact, the working SCCS files had been periodically moved to the archive, and new working SCCS files started from scratch. I guess this was to cope with the files getting uncomfortably large and slow for 1990s hardware.
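The prefix test itself is simple once you have each file's delta table: the older file's revisions, oldest first, must be an initial segment of the newer file's. A sketch of the comparison (the real sccsprefix works from the SCCS files directly):

```python
def is_prefix(older_deltas, newer_deltas):
    """Check whether one history is a prefix of another, given
    each file's delta list newest-first (as SCCS stores them).
    Compares revision metadata tuples from the oldest upward."""
    old = list(reversed(older_deltas))  # oldest first
    new = list(reversed(newer_deltas))
    return old == new[:len(old)]
```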


So to represent the history properly in Git, I needed to combine a series of SCCS files into a linear history. It turns out to be easier to construct commits with artificial metadata (usernames, dates) with RCS than with SCCS, so I wrote rcsappend to add the commits from a newer RCS file as successors of commits in an older file.

Converting the archived SCCS files was then a combination of sccs2rcs1 and rcsappend. Unfortunately this was VERY slow, because RCS takes a long time to check out old revisions. This is because an RCS file contains a verbatim copy of the latest revision and a series of diffs going back one revision at a time. The SCCS format is more clever and so takes about the same time to check out any revision.

So I changed sccs2rcs1 to incorporate an append mode, and used that to convert and combine the archived SCCS files, as you can see in the ipreg-archive-uplift script. This still takes ages to convert and linearize nearly 20,000 revisions in the history of the hosts.131.111 file - an RCS checkin rewrites the entire RCS file so they get slower as the number of revisions grows. Fortunately I don't need to run it many times.
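The linearization boils down to renumbering: the newer file's trunk revisions 1.1, 1.2, ... must become successors of the older file's head. A sketch of the renumbering map, assuming trunk-only histories (which is all I had to deal with):

```python
def renumber(older_head, newer_sids):
    """Map a newer file's trunk revision IDs onto continuations of
    an older file's trunk: with older_head "1.42", the newer file's
    1.1, 1.2, ... become 1.43, 1.44, ...  Branches unsupported."""
    branch, n = older_head.rsplit(".", 1)
    base = int(n)
    return {sid: "%s.%d" % (branch, base + int(sid.rsplit(".", 1)[1]))
            for sid in newer_sids}
```

The expensive part is not the renumbering but re-checking-in every revision, which is where RCS's rewrite-the-whole-file behaviour bites.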


There are a lot of files in the ip-register tree without SCCS histories, which I wanted to preserve. Many of them have old editor backup ~ files, which could be used to construct a wee bit of history (in the absence of anything better). So I wrote files2rcs to build an RCS file from this kind of miscellanea.
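The artificial history is just the available copies of the file, ordered by modification time. A sketch of that ordering (hypothetical helper; the real files2rcs then checks each copy in as an RCS revision dated by its mtime):

```python
import os

def pseudo_history(working):
    """Order a working file and any editor backup (~) file by
    modification time, oldest first, to fabricate a short RCS
    history in the absence of anything better."""
    candidates = [p for p in (working + "~", working) if os.path.exists(p)]
    return sorted(candidates, key=lambda p: os.path.getmtime(p))
```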

An aside on file name restrictions

At this point I need to moan a bit.

Why does RCS object to file names that start with a comma. Why.

I tried running these scripts on my Mac at home. It mostly worked, except for the directories which contained files like (source file) and (generated file). I added a bit of support in the scripts to cope with case-insensitive filesystems, so I can use my Macs for testing. But the bulk conversion runs very slowly, I think because it generates too much churn in the Spotlight indexes.


One significant problem is dealing with SCCS files whose working files have been deleted. In some SCCS workflows this is a normal state of affairs - see for instance the SCCS support in the POSIX Make XSI extensions. However, in the ip-register directory tree this corresponds to files that are no longer needed. Unfortunately the SCCS history generally does not record when the file was deleted. It might be possible to make a plausible guess from manual analysis, but perhaps it is more truthful to record an artificial revision saying the file was not present at the time of conversion.

Like SCCS, RCS does not have a way to represent a deleted file. CVS uses a convention on top of RCS: when a file is deleted it puts the RCS file in an "Attic" subdirectory and adds a revision with a "dead" status. The rcsdeadify script applies this convention to an RCS file.
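The Attic half of the convention is a simple file move. A sketch (the real rcsdeadify also checks in a final revision with state "dead", which this does not attempt):

```python
import os
import shutil

def deadify(rcs_file):
    """Apply the CVS deletion convention to an RCS file:
    move foo,v into an Attic/ subdirectory beside it.
    Returns the new path."""
    d, name = os.path.split(rcs_file)
    attic = os.path.join(d, "Attic")
    os.makedirs(attic, exist_ok=True)
    dest = os.path.join(attic, name)
    shutil.move(rcs_file, dest)
    return dest
```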


There are situations where it is possible to identify a meaningful committer and deletion time. Where a .tar.gz archive exists, it records the original file owners. The tar2usermap script records the file owners from the tar files. The contents can then be unpacked and converted as if they were part of the main directory, using the usermap file to provide the correct committer IDs. After that the files can be marked as deleted at the time the tarfile was created.
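Extracting the owners is straightforward because tar headers record the user name at archive time. A sketch of what tar2usermap does, using Python's tarfile module (the real script is not Python):

```python
import tarfile

def usermap(tar_path):
    """Map each member path in a tar file to its recorded owner:
    tar headers preserve the uname (or numeric uid) from when the
    archive was made, which makes a usable committer ID."""
    owners = {}
    with tarfile.open(tar_path) as tf:
        for member in tf.getmembers():
            owners[member.name] = member.uname or str(member.uid)
    return owners
```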


The main conversion script is sccs2cvs, which evacuates an SCCS working tree into a CVS repository, leaving behind a tree of (mostly) empty directories. It is based on a simplified version of the analysis done by sccscheck, with more careful error checking of the commands it invokes. It uses sccs2rcs1, files2rcs, and rcsappend to handle each file.

The rcsappend case occurs when there is an editor backup ~ file which is older than the oldest SCCS revision, in which case sccs2cvs uses rcsappend to combine the output of sccs2rcs1 and files2rcs. This could be done more efficiently with sccs2rcs1's append mode, but for the ip-register tree it doesn't cause a big slowdown.

To cope with the varying semantics of missing working files, sccs2cvs leaves behind a tombstone where it expected to find a working file. This takes the form of a symlink pointing to 'Attic'. Another script can then deal with these tombstones as appropriate.
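The tombstones are trivial to create and to recognise later. A sketch of the convention (hypothetical helpers, assuming a POSIX filesystem):

```python
import os

def leave_tombstone(working_path):
    """Mark a missing working file with a dangling symlink to
    'Attic', so a later pass can decide how to record the
    deletion (rcsdeadify, manual fixup, or leave alone)."""
    os.symlink("Attic", working_path)

def is_tombstone(path):
    """A tombstone is exactly a symlink whose target is 'Attic'."""
    return os.path.islink(path) and os.readlink(path) == "Attic"
```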

pre-uplift, mid-uplift, post-uplift

Before sccs2cvs can run, the SCCS working tree should be reasonably clean. So the overall uplift process goes through several phases:

  1. Fetch and unpack copy of SCCS working tree;
  2. pre-uplift fixups;
    (These should be the minimum changes that are required before conversion to CVS, such as moving secrets out of the working tree.)
  3. sccs2cvs;
  4. mid-uplift fixups;
    (This should include any adjustments to the earlier history such as marking when files were deleted in the past.)
  5. git cvsimport or cvs-fast-export | git fast-import;
  6. post-uplift fixups;
    (This is when to delete cruft which is now preserved in the git history.)

For the ip-register directory tree, the pre-uplift phase also includes ipreg-archive-uplift which I described earlier. Then in the mid-uplift phase the combined histories are moved into the proper place in the CVS repository so that their history is recorded in the right place.

Similarly, for the tarballs, the pre-uplift phase unpacks them in place, and moves the tar files aside. Then the mid-uplift phase rcsdeadifies the tree that was inside the tarball.

I have not stuck to my guidelines very strictly: my scripts delete quite a lot of cruft in the pre-uplift phase. In particular, they delete duplicated SCCS history files from the archives, and working files which are generated by scripts.


SCCS/RCS/CVS all record committers by simple user IDs, whereas git uses names and email addresses. So git-cvsimport and cvs-fast-export can be given an authors file containing the translation. The sccscommitters script produces a list of user IDs as a starting point for an authors file.
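The authors file format is one `userid=Full Name <email>` line per committer. A sketch of turning sccscommitters-style output into a starting-point authors file (the uids and addresses here are made up for illustration):

```python
def authors_file(userids, known):
    """Render an authors-file mapping for git cvsimport or
    cvs-fast-export: one 'uid=Full Name <email>' line per user.
    `known` maps uids to (name, email); unknown uids get an
    obvious placeholder to be filled in by hand."""
    lines = []
    for uid in sorted(userids):
        name, email = known.get(uid, (uid, uid + "@FIXME"))
        lines.append("%s=%s <%s>" % (uid, name, email))
    return "\n".join(lines)
```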

Uplifting cvs to git

At first I tried git cvsimport, since I have successfully used it before. In this case it turned out not to be the path to swift enlightenment - it was taking about 3s per commit. This is mainly because it checks out files from oldest to newest, so it falls foul of the same performance problem that my rcsappend program did, as I described above.

So I compiled cvs-fast-export and fairly soon I had a populated repository: nearly 30,000 commits at 35 commits per second, so about 100 times faster. The fast-import/export format allows you to provide file contents in any order, independent of the order they appear in commits. The fastest way to get the contents of each revision out of an RCS file is from newest to oldest, so that is what cvs-fast-export does.

There are a couple of niggles with cvs-fast-export, so I have a patch which fixes them in a fairly dumb manner (without adding command-line switches to control the behaviour):

  • In RCS and CVS style, cvs-fast-export replaces empty commit messages with "*** empty log message ***", whereas I want it to leave them empty.
  • cvs-fast-export makes a special effort to translate CVS's ignored file behaviour into git by synthesizing a .gitignore file into every commit. This is wrong for the ip-register tree.
  • Exporting the hosts.131.111 file takes a long time, during which cvs-fast-export appears to stall. I added a really bad progress meter to indicate that work was being performed.

Wrapping up

Overall this has taken more programming than I expected, and more time, very much following the pattern that the last 10% takes the same time as the first 90%. And I think the initial investigations - before I got stuck in to the conversion work - probably took the same time again.

There is one area where the conversion could perhaps be improved: the archived dumps of various subdirectories have been converted in the location that the tar files were stored. I have not tried to incorporate them as part of the history of the directories from which the tar files were made. On the whole I think combining them, coping with renames and so on, would take too much time for too little benefit. The multiple copies of various ancient scripts are a bit weird, but it is fairly clear from the git history what was going on.

So, let us declare the job DONE, and move on to building new DNS servers!

The first stage of this project is to transfer the code from SCCS to Git

Good grief. I am having horrible flashbacks to SCCS and the rather entertaining bug in SCCS that resulted in SCCS check-ins over NFS randomly duplicating lines ...

(That was on SCO's SVR3.2 v4.2, so hopefully Solaris is unaffected. Nevertheless.)

Yes, SCCS is complete shit at handling files that are checked in regularly over long periods of time -- RCS's reverse-delta model is vastly more efficient. The comma restriction in RCS filenames: this was due to RCS appending a comma suffix to a filename as the name of its archive, and the filename parser they used back in the day (late 1980s, I think) being badly designed (i.e. it started looking for the comma suffix by scanning the string from the start, rather than scanning in reverse from the end of the filename).
So many times when I mention SCCS people say, "SCCS?! Really?!" :-)

Yes, just operating on the head of RCS files is quite quick. Going further back it rapidly gets worse than SCCS. Based on the performance I saw when I was running my import scripts, I can see why CVS branches caused so much pain - I think it had to delta back to the branch point then forward to the branch head, which would be really slow. Fortunately I did not have to deal with branches!
Oh, cool. That sounds to have been a bit of a tar pit but I'm glad you managed to pull the useful stuff out of it. It's lovely that a 25-year-old history CAN be preserved usefully in production like that!

I was recently impressed with how far cvs-fast-export looked to have come. Although I almost needed the opposite -- at work I would like to drag our source code out of cvs, but (i) I think we can live with it if the history isn't preserved perfectly, as long as the current state is, as there wasn't much interesting history other than sequential commits, and (ii) there's no will to move everything at once, so we'd have to use git in parallel, repeatedly importing from cvs as the canonical version until git was shown to be better, and I'm not sure if cvs-fast-export can do that yet. Or even if it's worth it if we can't use a lot of git features until we switch over permanently (I think there would be benefits, but not that many if people aren't eager to work in a git-like fashion).
cvs-fast-export has a -i option for incremental updates, which might do the trick for you.

There is the option of using git for preparing changes which eventually get committed to CVS, which is a good way of getting people using it without too much up-front effort. And it can be optional, so the people who don't want to change can stick with what they like. Hopefully their colleagues will eventually convince them to make the leap :-)

Also, git-cvsserver exists. I have not tried it!
Huh, I was sure incremental dump wasn't there when I looked, but I must have just missed it, it seems to have been there since last year (I read a tutorial more than the original man page). Yes, that's obviously perfect, thank you for remembering it!

There is the option of using git for preparing changes which eventually get committed to CVS

Yeah, I already have some simple scripts which manage MY changes in .git, and let me check them in to CVS instead of pushing upstream (they're a little fragile, but let me track MY history). But one of the things I'd like is to have git's better history viewing stuff work on recent CVS commits, as that seems like something that would be valuable (and right now my script just flattens a day's changes, not try to reconstruct the individual commits).
On your Mac, could you not have created a "Mac OS Extended (Case-Sensitive, Journaled)" partition and worked on that partition, to get around your issue with the case insensitivity of the default Mac filesystem?
Probably, but a git push from the Mac and a pull on my workstation to run it there was easier :-) I didn't do much of the work on a Mac, only a couple of evenings. (Mostly writing the above, in fact!)
Hello! Did you solve that?
No, I didn't bother - I don't do enough work with case-sensitive files on a Mac to make it worth setting up a case-sensitive partition.
Possibly a silly question, but did you actually need the whole commit history?

Surely the latest version is sufficient?

(And an archive of the rest in sccs so that you can get to it if someone needs something from the dim and distant past)
That's a fair question :-)

I think the main use I will have for the history is understanding how the database code works and evolved, since that is where my understanding is weakest. That part of it goes back to about 2001. I really don't want to have to use SCCS to work with that code! I anticipate that being able to see how changes were made in the past will give me a template for future changes.

The most voluminous part of the history is before then, the pre-database record of registrations. To be honest, that part of it is a bit of a liability: it is full of personal data so it means I can't publish the repository as it is. But now that I have everything in git, it will be much easier to carve out the relevant parts which can be published.

And having it all in git means I can keep it all in one place rather than having to worry about preserving old archives somewhere else. There is a backup/archive on our staff timesharing server which will remain in some form, but my aim is to be able to forget about it :-)
> The Statslab was allocated, which it has kept for 24 years!

Does the history show that the third quarter of that was stolen (for the eScience Centre, IIRC) for a number of years - perhaps 2003-9 ?

The existing maths DNS history seems to start on 1997/10/15.
IIRC DAMTP would have moved its DNS to a SPARCstation5 about then and I subcontracted the DPMMS DNS to the DAMTP system.
Something as recent as 2003-9 will be recorded in the IP-register database, though some of the network-level allocations were still recorded in SCCS after the database was set up, including the e-science centre:

commit 41cdddb9f754c1dd5ec27737c3a46a6c943d7622
Author: Joe Gluza <>
Date:   2002-10-15 18:50:23 +0000

    131.111.20.[128-191] moved from Stats Lab to eScience Centre

diff --git a/adm/hosts.131.111 b/adm/hosts.131.111
index 2a7889f..ce43e6f 100644
--- a/adm/hosts.131.111
+++ b/adm/hosts.131.111
@@ -898,8 +898,8 @@ The canonical nameservers for the DNS zone are at

-131.111.20.x   Statistical Laboratory
+131.111.20.[0-127]    Statistical Laboratory (1)

   contact: CO - Dr Andrew C Aitchison A.C.Aitchison@pmms
                Eva Myers E.R.Myers@statslab
@@ -923,6 +923,20 @@ machines if they are performing satisfactorily.)

+131.111.20.[128-191]   eScience Centre
+contact: Bruce Beckles,
+131.111.20.[192-255]   Statistical Laboratory (2)
+see 131.111.20.[0-127] for contacts
 131.111.21.[0-127]    Centre for Applied Research in Educational Technologies
commit cb835e9f420ff7117a7b5e40b6a52b283b79711c
Author: Tony Stoneley <>
Date:   1990-09-17 10:22:35 +0000

diff --git a/old-group-Internet/MAC.addresses/Other b/old-group-Internet/MAC.addresses/Other
new file mode 100644
index 0000000..f346bf5
--- /dev/null
+++ b/old-group-Internet/MAC.addresses/Other
@@ -0,0 +1,63 @@
+Mill Lane Hosts
+Type            Hostname        Internet address        Hardware address
+----            --------        ----------------        ----------------
+               statslab.gateway             08-00-09-02-19-6B
+               owl                08-00-20-00-A9-20
+               owlet 
+               sharvy             08-00-20-06-16-D5
+               canary             08-00-20-00-68-4A
+               twilight             08-00-20-06-4D-12
+               atm                08-00-20-01-FA-C8
+               space              08-00-20-00-9C-CD
+               vortex             08-00-20-00-61-94
+               hope               08-00-20-00-82-7A
+               star               08-00-20-00-98-34
+               magnet             08-00-20-00-83-32
+               stokes             08-00-20-00-65-60
+               chaos              08-00-20-06-08-37
+               john  
+               camnum            08-00-20-01-DD-3E
+               jane              08-00-20-00-98-61
+               oak               08-00-20-00-79-2B
+               bio               08-00-20-00-5C-FA
+               mhd               08-00-20-00-7E-00
+               pub               08-00-20-00-7B-3E
+               dummy              fictional
+               gfd               08-00-20-00-95-92
+               chas              08-00-20-00-4D-6F
+               joe               08-00-20-01-C2-64
+               alan              08-00-20-00-36-18
+               robo              08-00-2B-0C-C4-97
+               server1            08-00-09-03-6A-ED
+               swan               fictional
+               tom                fictional
+               cs1               08-00-20-06-39-AF
+               isis              08-00-20-00-21-69
+               lynx              }
+               jaguar            }
+               cheetah            }
+               panther            }
+               leopard            } Gatewayed via
+               cougar            } statslab.gateway
+               lion             }
+               tiger            }
+               puma             }
+               lynx.teach           }
+MBCRF Hosts
+Type            Hostname        Internet address        Hardware address
+----            --------        ----------------        ----------------
+               mbua  
+               mbub              08-00-20-00-63-69
+               mbuc              08-00-20-06-1A-CF
+               mbud              08-00-20-06-1A-C1
+               mbue              08-00-20-06-1B-11
+               mbuf              08-00-20-06-19-E1
+               mbug  
+               mb-fs1           08-00-20-06-EB-87 is allocated to HP. I have no idea what it is doing here!