
fanf

Uplift from SCCS to git

27th Nov 2014 | 15:52

My current project is to replace Cambridge University's DNS servers. The first stage of this project is to transfer the code from SCCS to Git so that it is easier to work with.

Ironically, to do this I ended up spending lots of time working with SCCS and RCS rather than Git, mainly developing analysis and conversion tools to get things into a fit state for Git.

If you find yourself in a similar situation, you might find these tools helpful.

Background

Cambridge was allocated three Class B networks in the 1980s: first the Computer Lab got 128.232.0.0/16 in 1987; then the Department of Engineering got 129.169.0.0/16 in 1988; and eventually the Computing Service got 131.111.0.0/16 in 1989 for the University (and related institutions) as a whole.

The oldest records I have found date from September 1990, which list about 300 registrations. The next two departments to get connected were the Statistical Laboratory and Molecular Biology (I can't say in which order). The Statslab was allocated 131.111.20.0/24, which it has kept for 24 years! Things pick up in 1991, when the JANET IP Service was started and rapidly took over from X.25. (Last month I blogged about connectivity for Astronomy in Cambridge in 1991.)

I have found these historical nuggets in our ip-register directory tree. This contains the infrastructure and history of IP address and DNS registration in Cambridge going back a quarter century. But it isn't just an archive: it is a working system which has been in production that long. Because of this, converting the directory tree to Git presents certain challenges.

Developmestuction

The ip-register directory tree contains a mixture of:

  • Source code, mostly with SCCS history
  • Production scripts, mostly with SCCS history
  • Configuration files, mostly with SCCS history
  • The occasional executable
  • A few upstream perl libraries
  • Output files and other working files used by the production scripts
  • Secrets, such as private keys and passwords
  • Mail archives
  • Historical artifacts, such as old preserved copies of parts of the directory tree
  • Miscellaneous files without SCCS history
  • Editor backup files with ~ suffixes

My aim was to preserve this all as faithfully as I could, while converting it to Git in a way that represents the history in a useful manner.

PLN

The rough strategy was:

  1. Take a copy of the ip-register directory tree, preserving modification times. (There is no need to preserve owners because any useful ownership information was lost when the directory tree moved off the Central Unix Service before that shut down in 2008.)
  2. Convert from SCCS to RCS file-by-file. Converting between these formats is a simple one-to-one mapping.
  3. Create very short artificial RCS histories for files without SCCS history, based on their modification times and editor backup files.
  4. Convert the RCS tree to CVS. This is basically just moving files around, because a CVS repository is little more than a directory tree of RCS files.
  5. Convert the CVS repository to Git using git cvsimport. This is the only phase that needs to do cross-file history analysis, and other people have already produced a satisfactory solution.

Simples! ... Not.

sccs2rcs proves inadequate

I first tried ESR's sccs2rcs Python script. Unfortunately I rapidly ran into a number of showstoppers.

  • It didn't work with Solaris SCCS, which is what was available on the ip-register server.
  • It destructively updates the SCCS tree, losing information about the relationship between the working files and the SCCS files.
  • It works on a whole directory tree, so it doesn't give you file-by-file control.

I fixed a bug or two but very soon concluded the program was entirely the wrong shape.

(In the end, the Solaris incompatibility became moot when I installed GNU CSSC on my FreeBSD workstation to do the conversion. But the other problems with sccs2rcs remained.)

sccs2rcs1

So I wrote a small script called sccs2rcs1 which just converts one SCCS file to one RCS file, and gives you control over where the RCS and temporary files are placed. This meant that I would not have to shuffle RCS files around: I could just create them directly in the target CVS repository. Also, sccs2rcs1 uses RCS options to avoid the need to fiddle with checkout locks, which is a significant simplification.

The main regression compared to sccs2rcs is that sccs2rcs1 does not support branches, because I didn't have any files with branches.

sccscheck

At this point I needed to work out how I was going to co-ordinate the invocations of sccs2rcs1 to convert the whole tree. What was in there?!

I wrote a fairly quick-and-dirty script called sccscheck which analyses a directory tree and prints out notes on various features and anomalies. A significant proportion of the code exists to work out the relationship between working files, backup files, and SCCS files.

I could then start work on determining what fix-ups were necessary before the SCCS-to-CVS conversion.

sccsprefix

One notable part of the ip-register directory tree was the archive subdirectory, which contained lots of gzipped SCCS files with date stamps. What relationship did they have to each other? My first guess was that they might be successive snapshots of a growing history, and that the corresponding SCCS files in the working part of the tree would contain the whole history.

I wrote sccsprefix to verify if one SCCS file is a prefix of another, i.e. that it records the same history up to a certain point.

This proved that the files were NOT snapshots! In fact, the working SCCS files had been periodically moved to the archive, and new working SCCS files started from scratch. I guess this was to cope with the files getting uncomfortably large and slow for 1990s hardware.
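The check sccsprefix performs can be sketched like this (hypothetical code, not the actual sccsprefix): an SCCS history file keeps its delta table newest-first, so if one file continues the history of another, the older file's delta headers should appear as the tail of the newer file's delta table. This simplified version compares only the `^Ad D` header lines and ignores the body weave.

```python
# Simplified sketch of the sccsprefix idea (hypothetical, names are mine).
# SCCS control lines start with ASCII SOH (^A); delta headers look like
# "^Ad D 1.2 90/09/17 10:22:35 ajms 2 1" and are stored newest-first.

SOH = b"\x01"

def delta_headers(data: bytes) -> list[bytes]:
    """Extract the '^Ad D rev date time user serial parent' lines, newest first."""
    return [line for line in data.splitlines()
            if line.startswith(SOH + b"d D ")]

def is_history_prefix(older: bytes, newer: bytes) -> bool:
    """True if `older`'s deltas are exactly the oldest deltas of `newer`."""
    old, new = delta_headers(older), delta_headers(newer)
    return len(old) <= len(new) and new[len(new) - len(old):] == old
```

Because new deltas are prepended, a plain byte-wise prefix test would not work; matching the tail of the delta table is the closest simple equivalent.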

rcsappend

So to represent the history properly in Git, I needed to combine a series of SCCS files into a linear history. It turns out to be easier to construct commits with artificial metadata (usernames, dates) with RCS than with SCCS, so I wrote rcsappend to add the commits from a newer RCS file as successors of commits in an older file.

Converting the archived SCCS files was then a combination of sccs2rcs1 and rcsappend. Unfortunately this was VERY slow, because RCS takes a long time to check out old revisions. This is because an RCS file contains a verbatim copy of the latest revision and a series of diffs going back one revision at a time. The SCCS format is more clever and so takes about the same time to check out any revision.

So I changed sccs2rcs1 to incorporate an append mode, and used that to convert and combine the archived SCCS files, as you can see in the ipreg-archive-uplift script. This still takes ages to convert and linearize nearly 20,000 revisions in the history of the hosts.131.111 file - an RCS checkin rewrites the entire RCS file so they get slower as the number of revisions grows. Fortunately I don't need to run it many times.
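A back-of-envelope model (my own illustration, not measured data) shows why this slows down: if each checkin rewrites the whole RCS file, and the file grows by roughly one delta per revision, the total work for importing n revisions grows quadratically.

```python
# Toy cost model: checkin number i has to rewrite the i deltas already
# stored, so importing n revisions costs about n^2 / 2 delta-writes.
def total_rewrite_cost(n: int) -> int:
    return sum(range(n))
```

Under this model, doubling the length of a history roughly quadruples the time to import it, which matches the observation that the nearly-20,000-revision hosts.131.111 file dominates the conversion time.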

files2rcs

There are a lot of files in the ip-register tree without SCCS histories, which I wanted to preserve. Many of them have old editor backup ~ files, which could be used to construct a wee bit of history (in the absence of anything better). So I wrote files2rcs to build an RCS file from this kind of miscellanea.
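The ordering step of such a tool might look like this sketch (hypothetical code, not the actual files2rcs; the real tool would then check each copy in with RCS commands, which are omitted here):

```python
# Hypothetical sketch: gather a working file and any editor backup ("~")
# copy, and sort them by modification time so they can be checked in
# oldest-first as an artificial RCS history.
import os

def revision_order(path: str) -> list[tuple[float, str]]:
    """Return (mtime, path) pairs for path and path~, oldest first."""
    candidates = [p for p in (path + "~", path) if os.path.exists(p)]
    return sorted((os.path.getmtime(p), p) for p in candidates)
```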

An aside on file name restrictions

At this point I need to moan a bit.

Why does RCS object to file names that start with a comma. Why.

I tried running these scripts on my Mac at home. It mostly worked, except for the directories which contained files like DB.cam (source file) and db.cam (generated file). I added a bit of support in the scripts to cope with case-insensitive filesystems, so I can use my Macs for testing. But the bulk conversion runs very slowly, I think because it generates too much churn in the Spotlight indexes.

rcsdeadify

One significant problem is dealing with SCCS files whose working files have been deleted. In some SCCS workflows this is a normal state of affairs - see for instance the SCCS support in the POSIX Make XSI extensions. However, in the ip-register directory tree this corresponds to files that are no longer needed. Unfortunately the SCCS history generally does not record when the file was deleted. It might be possible to make a plausible guess from manual analysis, but perhaps it is more truthful to record an artificial revision saying the file was not present at the time of conversion.

Like SCCS, RCS does not have a way to represent a deleted file. CVS uses a convention on top of RCS: when a file is deleted it puts the RCS file in an "Attic" subdirectory and adds a revision with a "dead" status. The rcsdeadify script applies this convention to an RCS file.

tar2usermap

There are situations where it is possible to identify a meaningful committer and deletion time. Where a .tar.gz archive exists, it records the original file owners. The tar2usermap script records the file owners from the tar files. The contents can then be unpacked and converted as if they were part of the main directory, using the usermap file to provide the correct committer IDs. After that the files can be marked as deleted at the time the tarfile was created.
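Python's tarfile module makes the owner-recording step easy to sketch (hypothetical code, not the actual tar2usermap):

```python
# Hypothetical sketch of the tar2usermap idea: a tar archive records the
# owner of each member, so we can recover a file -> user-ID map from it
# and later use it to assign committers during conversion.
import tarfile

def usermap(tar_path: str) -> dict[str, str]:
    """Map each member name to its recorded owner (uname, or uid as text)."""
    with tarfile.open(tar_path) as tar:
        return {m.name: (m.uname or str(m.uid)) for m in tar.getmembers()}
```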

sccs2cvs

The main conversion script is sccs2cvs, which evacuates an SCCS working tree into a CVS repository, leaving behind a tree of (mostly) empty directories. It is based on a simplified version of the analysis done by sccscheck, with more careful error checking of the commands it invokes. It uses sccs2rcs1, files2rcs, and rcsappend to handle each file.

The rcsappend case occurs when there is an editor backup ~ file which is older than the oldest SCCS revision, in which case sccs2cvs uses rcsappend to combine the output of sccs2rcs1 and files2rcs. This could be done more efficiently with sccs2rcs1's append mode, but for the ip-register tree it doesn't cause a big slowdown.

To cope with the varying semantics of missing working files, sccs2cvs leaves behind a tombstone where it expected to find a working file. This takes the form of a symlink pointing to 'Attic'. Another script can then deal with these tombstones as appropriate.
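The tombstone convention is simple enough to illustrate (hypothetical code; the function names are mine, not sccs2cvs's):

```python
# Illustration of the tombstone convention: leave a symlink pointing to
# 'Attic' where a working file was expected, and let a later pass find
# and handle the tombstones.
import os

def leave_tombstone(workfile: str) -> None:
    os.symlink("Attic", workfile)

def find_tombstones(tree: str) -> list[str]:
    """Walk the tree and return every symlink whose target is 'Attic'."""
    hits = []
    for dirpath, dirs, files in os.walk(tree):
        # a tombstone may resolve to a real Attic directory or dangle,
        # so check both the dir and non-dir entries
        for name in dirs + files:
            p = os.path.join(dirpath, name)
            if os.path.islink(p) and os.readlink(p) == "Attic":
                hits.append(p)
    return hits
```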

pre-uplift, mid-uplift, post-uplift

Before sccs2cvs can run, the SCCS working tree should be reasonably clean. So the overall uplift process goes through several phases:

  1. Fetch and unpack copy of SCCS working tree;
  2. pre-uplift fixups;
    (These should be the minimum changes that are required before conversion to CVS, such as moving secrets out of the working tree.)
  3. sccs2cvs;
  4. mid-uplift fixups;
    (This should include any adjustments to the earlier history such as marking when files were deleted in the past.)
  5. git cvsimport or cvs-fast-export | git fast-import;
  6. post-uplift fixups;
    (This is when to delete cruft which is now preserved in the git history.)

For the ip-register directory tree, the pre-uplift phase also includes ipreg-archive-uplift which I described earlier. Then in the mid-uplift phase the combined histories are moved into the proper place in the CVS repository so that their history is recorded in the right place.

Similarly, for the tarballs, the pre-uplift phase unpacks them in place, and moves the tar files aside. Then the mid-uplift phase rcsdeadifies the tree that was inside the tarball.

I have not stuck to my guidelines very strictly: my scripts delete quite a lot of cruft in the pre-uplift phase. In particular, they delete duplicated SCCS history files from the archives, and working files which are generated by scripts.

sccscommitters

SCCS/RCS/CVS all record committers by simple user IDs, whereas git uses names and email addresses. So git-cvsimport and cvs-fast-export can be given an authors file containing the translation. The sccscommitters script produces a list of user IDs as a starting point for an authors file.
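A sketch of that kind of pass (hypothetical code; the FIXME placeholders and the @cam.ac.uk domain are assumptions to be filled in by hand): the user IDs live in the `^Ad D` delta lines of each SCCS file, and both git cvsimport and cvs-fast-export accept an authors file of `uid=Full Name <email>` lines.

```python
# Hypothetical sketch of an sccscommitters-style pass: pull the user IDs
# out of SCCS "^Ad D" delta lines and emit stub authors-file entries.
SOH = "\x01"

def committers(sccs_text: str) -> list[str]:
    """Unique committer IDs, in order of first appearance."""
    ids = []
    for line in sccs_text.splitlines():
        if line.startswith(SOH + "d D "):
            # fields: d D rev date time user serial parent
            user = line.split()[5]
            if user not in ids:
                ids.append(user)
    return ids

def authors_stub(ids: list[str], domain: str = "cam.ac.uk") -> str:
    """Starting point for an authors file; names need a human to fill in."""
    return "\n".join(f"{u}=FIXME <{u}@{domain}>" for u in ids)
```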

Uplifting cvs to git

At first I tried git cvsimport, since I have successfully used it before. In this case it turned out not to be the path to swift enlightenment - it was taking about 3s per commit. This is mainly because it checks out files from oldest to newest, so it falls foul of the same performance problem that my rcsappend program did, as I described above.

So I compiled cvs-fast-export and fairly soon I had a populated repository: nearly 30,000 commits at 35 commits per second, so about 100 times faster. The fast-import/export format allows you to provide file contents in any order, independent of the order they appear in commits. The fastest way to get the contents of each revision out of an RCS file is from newest to oldest, so that is what cvs-fast-export does.

There are a couple of niggles with cvs-fast-export, so I have a patch which fixes them in a fairly dumb manner (without adding command-line switches to control the behaviour):

  • In RCS and CVS style, cvs-fast-export replaces empty commit messages with "*** empty log message ***", whereas I want it to leave them empty.
  • cvs-fast-export makes a special effort to translate CVS's ignored file behaviour into git by synthesizing a .gitignore file into every commit. This is wrong for the ip-register tree.
  • Exporting the hosts.131.111 file takes a long time, during which cvs-fast-export appears to stall. I added a really bad progress meter to indicate that work was being performed.

Wrapping up

Overall this has taken more programming than I expected, and more time, very much following the pattern that the last 10% takes the same time as the first 90%. And I think the initial investigations - before I got stuck in to the conversion work - probably took the same time again.

There is one area where the conversion could perhaps be improved: the archived dumps of various subdirectories have been converted in the location that the tar files were stored. I have not tried to incorporate them as part of the history of the directories from which the tar files were made. On the whole I think combining them, coping with renames and so on, would take too much time for too little benefit. The multiple copies of various ancient scripts are a bit weird, but it is fairly clear from the git history what was going on.

So, let us declare the job DONE, and move on to building new DNS servers!


Comments (15)

Autopope

from: autopope
date: 27th Nov 2014 16:53 (UTC)

The first stage of this project is to transfer the code from SCCS to Git

Good grief. I am having horrible flashbacks to SCCS and the rather entertaining bug in SCCS that resulted in SCCS check-ins over NFS randomly duplicating lines ...

(That was on SCO's SVR3.2 v4.2, so hopefully Solaris is unaffected. Nevertheless.)

Yes, SCCS is complete shit at handling files that are checked in regularly over long periods of time -- RCS's reverse-delta model is vastly more efficient. The comma restriction in RCS filenames: this was due to RCS appending a comma suffix to a filename as the name of its archive, and the parser they used for filenames back in the day (late 1980s, I think) being badly designed (i.e. it started looking for the comma suffix by scanning the string from the start, rather than scanning in reverse from the end of the filename).


Tony Finch

from: fanf
date: 27th Nov 2014 17:38 (UTC)

So many times when I mention SCCS people say, "SCCS?! Really?!" :-)

Yes, just operating on the head of RCS files is quite quick. Going further back it rapidly gets worse than SCCS. Based on the performance I saw when I was running my import scripts, I can see why CVS branches caused so much pain - I think it had to delta back to the branch point then forward to the branch head, which would be really slow. Fortunately I did not have to deal with branches!


cartesiandaemon

from: cartesiandaemon
date: 27th Nov 2014 17:05 (UTC)

Oh, cool. That sounds like it was a bit of a tar pit, but I'm glad you managed to pull the useful stuff out of it. It's lovely that a 25-year-old history CAN be preserved usefully in production like that!

I was recently impressed with how far cvs-fast-export looked to have come. Although I almost needed the opposite -- at work I would like to drag our source code out of cvs, but (i) I think we can live with it if the history isn't preserved perfectly, as long as the current state is, as there wasn't much interesting history other than sequential commits, and (ii) there's no will to move everything at once, so we'd have to use git in parallel, repeatedly importing from cvs as the canonical version until git was shown to be better, and I'm not sure if cvs-fast-export can do that yet. Or even if it's worth it if we can't use a lot of git features until we switch over permanently (I think there would be benefits, but not that many if people aren't eager to work in a git-like fashion).


Tony Finch

from: fanf
date: 27th Nov 2014 17:42 (UTC)

cvs-fast-export has a -i option for incremental updates, which might do the trick for you.

There is the option of using git for preparing changes which eventually get committed to CVS, which is a good way of getting people using it without too much up-front effort. And it can be optional, so the people who don't want to change can stick with what they like. Hopefully their colleagues will eventually convince them to make the leap :-)

Also, git-cvsserver exists. I have not tried it!


cartesiandaemon

from: cartesiandaemon
date: 27th Nov 2014 18:19 (UTC)

Huh, I was sure incremental dump wasn't there when I looked, but I must have just missed it -- it seems to have been there since last year (I read a tutorial rather than the original man page). Yes, that's obviously perfect, thank you for remembering it!

There is the option of using git for preparing changes which eventually get committed to CVS

Yeah, I already have some simple scripts which manage MY changes in .git, and let me check them in to CVS instead of pushing upstream (they're a little fragile, but let me track MY history). But one of the things I'd like is to have git's better history viewing stuff work on recent CVS commits, as that seems like something that would be valuable (and right now my script just flattens a day's changes, not try to reconstruct the individual commits).


Dr Plokta

from: drplokta
date: 27th Nov 2014 17:21 (UTC)

On your Mac, could you not have created a "Mac OS Extended (Case-Sensitive, Journaled)" partition and worked on that partition, to get around your issue with the case insensitivity of the default Mac filesystem?


Tony Finch

from: fanf
date: 27th Nov 2014 17:31 (UTC)

Probably, but a git push from the Mac and a pull on my workstation to run it there was easier :-) I didn't do much of the work on a Mac, only a couple of evenings. (Mostly writing the above, in fact!)


Антон Сергеев

Did you solve this?

from: sdelatpravilno
date: 27th Mar 2017 15:54 (UTC)

Hello! Did you solve that?


Tony Finch

Re: Did you solve this?

from: fanf
date: 27th Mar 2017 16:04 (UTC)

No, I didn't bother - I don't do enough work with case-sensitive files on a Mac to make it worth setting up a case-sensitive partition.


Andrew Ducker

from: andrewducker
date: 27th Nov 2014 20:40 (UTC)

Possibly a silly question, but did you actually need the whole commit history?

Surely the latest version is sufficient?

(And an archive of the rest in sccs so that you can get to it if someone needs something from the dim and distant past)


Tony Finch

from: fanf
date: 28th Nov 2014 13:07 (UTC)

That's a fair question :-)

I think the main use I will have for the history is understanding how the database code works and evolved, since that is where my understanding is weakest. That part of it goes back to about 2001. I really don't want to have to use SCCS to work with that code! I anticipate that being able to see how changes were made in the past will give me a template for future changes.

The most voluminous part of the history is before then, the pre-database record of registrations. To be honest, that part of it is a bit of a liability: it is full of personal data so it means I can't publish the repository as it is. But now that I have everything in git, it will be much easier to carve out the relevant parts which can be published.

And having it all in git means I can keep it all in one place rather than having to worry about preserving old archives somewhere else. There is a backup/archive on our staff timesharing server which will remain in some form, but my aim is to be able to forget about it :-)


Andrew

from: nonameyet
date: 28th Nov 2014 07:40 (UTC)

> The Statslab was allocated 131.111.20.0/24, which it has kept for 24 years!

Does the history show that the third quarter of that was stolen (for e-science.cam.ac.uk IIRC) for a number of years - perhaps 2003-9 ?

The existing maths DNS history seems to start on 1997/10/15.
IIRC DAMTP would have moved its DNS to a SPARCstation5 about then and I subcontracted the DPMMS DNS to the DAMTP system.


Tony Finch

from: fanf
date: 28th Nov 2014 11:07 (UTC)

Something as recent as 2003-9 will be recorded in the IP-register database, though some of the network-level allocations were still recorded in SCCS after the database was set up, including the e-science centre:

commit 41cdddb9f754c1dd5ec27737c3a46a6c943d7622
Author: Joe Gluza <jlg1@cam.ac.uk>
Date:   2002-10-15 18:50:23 +0000

    131.111.20.[128-191] moved from Stats Lab to eScience Centre

diff --git a/adm/hosts.131.111 b/adm/hosts.131.111
index 2a7889f..ce43e6f 100644
--- a/adm/hosts.131.111
+++ b/adm/hosts.131.111
@@ -898,8 +898,8 @@ The canonical nameservers for the cam.ac.uk DNS zone are at



-131.111.20.x   Statistical Laboratory
---------------------------------------
+131.111.20.[0-127]    Statistical Laboratory (1)
+------------------------------------------------

   contact: CO - Dr Andrew C Aitchison A.C.Aitchison@pmms
                Eva Myers E.R.Myers@statslab
@@ -923,6 +923,20 @@ machines if they are performing satisfactorily.)



+131.111.20.[128-191]   eScience Centre
+---------------------------------------
+
+contact: Bruce Beckles, mbb10@cam.ac.uk
+
+
+
+131.111.20.[192-255]   Statistical Laboratory (2)
+--------------------------------------------------
+
+see 131.111.20.[0-127] for contacts
+
+
+
 131.111.21.[0-127]    Centre for Applied Research in Educational Technologies
 -----------------------------------------------------------------------------


Tony Finch

First commit

from: fanf
date: 28th Nov 2014 11:09 (UTC)

commit cb835e9f420ff7117a7b5e40b6a52b283b79711c
Author: Tony Stoneley <ajms@cam.ac.uk>
Date:   1990-09-17 10:22:35 +0000

diff --git a/old-group-Internet/MAC.addresses/Other b/old-group-Internet/MAC.addresses/Other
new file mode 100644
index 0000000..f346bf5
--- /dev/null
+++ b/old-group-Internet/MAC.addresses/Other
@@ -0,0 +1,63 @@
+Mill Lane Hosts
+---------------
+
+Type            Hostname        Internet address        Hardware address
+----            --------        ----------------        ----------------
+
+               statslab.gateway 128.88.9.4             08-00-09-02-19-6B
+               owl             128.88.9.25             08-00-20-00-A9-20
+               owlet           128.88.9.26
+               sharvy          128.88.9.27             08-00-20-06-16-D5
+               canary          128.88.9.28             08-00-20-00-68-4A
+               twilight        128.88.9.29             08-00-20-06-4D-12
+               atm             128.88.9.50             08-00-20-01-FA-C8
+               space           128.88.9.51             08-00-20-00-9C-CD
+               vortex          128.88.9.52             08-00-20-00-61-94
+               hope            128.88.9.53             08-00-20-00-82-7A
+               star            128.88.9.75             08-00-20-00-98-34
+               magnet          128.88.9.76             08-00-20-00-83-32
+               stokes          128.88.9.77             08-00-20-00-65-60
+               chaos           128.88.9.78             08-00-20-06-08-37
+               john            128.88.9.90
+               camnum          128.88.9.100            08-00-20-01-DD-3E
+               jane            128.88.9.125            08-00-20-00-98-61
+               oak             128.88.9.126            08-00-20-00-79-2B
+               bio             128.88.9.127            08-00-20-00-5C-FA
+               mhd             128.88.9.150            08-00-20-00-7E-00
+               pub             128.88.9.151            08-00-20-00-7B-3E
+               dummy           128.88.9.152             fictional
+               gfd             128.88.9.175            08-00-20-00-95-92
+               chas            128.88.9.200            08-00-20-00-4D-6F
+               joe             128.88.9.201            08-00-20-01-C2-64
+               alan            128.88.9.202            08-00-20-00-36-18
+               robo            128.88.9.210            08-00-2B-0C-C4-97
+               server1         128.88.9.250            08-00-09-03-6A-ED
+               swan            128.88.9.251             fictional
+               tom             128.88.9.252             fictional
+               cs1             128.88.9.253            08-00-20-06-39-AF
+               isis            128.88.9.254            08-00-20-00-21-69
+               lynx            131.111.20.1            }
+               jaguar          131.111.20.2            }
+               cheetah         131.111.20.3            }
+               panther         131.111.20.4            }
+               leopard         131.111.20.5            } Gatewayed via
+               cougar          131.111.20.6            } statslab.gateway
+               lion            131.111.20.59           }
+               tiger           131.111.20.60           }
+               puma            131.111.20.61           }
+               lynx.teach      131.111.20.62           }
+
+MBCRF Hosts
+-----------
+
+Type            Hostname        Internet address        Hardware address
+----            --------        ----------------        ----------------
+
+               mbua            131.111.12.1
+               mbub            131.111.12.2            08-00-20-00-63-69
+               mbuc            131.111.12.3            08-00-20-06-1A-CF
+               mbud            131.111.12.4            08-00-20-06-1A-C1
+               mbue            131.111.12.5            08-00-20-06-1B-11
+               mbuf            131.111.12.6            08-00-20-06-19-E1
+               mbug            131.111.12.7
+               mb-fs1          131.111.12.19           08-00-20-06-EB-87


Tony Finch

Re: First commit

from: fanf
date: 28th Nov 2014 11:10 (UTC)

128.88.0.0 is allocated to HP. I have no idea what it is doing here!
