[EXPERIMENTAL] Code Forking Tool Kit Proposal

This document describes an experimental patch management system which is designed to track "our" patches against "their" source code tree.

What It's For

"Code forking" is the maintenance of multiple versions of a source code tree that share a common ancestor. It usually occurs as the result of two distinct groups with different agendas working on the same project. It's generally considered a bad thing over the long term due to the duplication of effort and division of resources; however, over the short term code forks are useful in that they allow small, focused groups of people to work on code before contributing it back to the main source code tree. This reduces the amount of work that the main source maintainer has to do.

The development model embodied in this software (although not the software itself) is used by kernel hackers the world over for the Linux kernel. The development model goes like this:

  1. Linus Torvalds releases a new revision of the "official" kernel source code tree.
  2. Independent developers and groups of developers modify this code and create patches.
  3. These patches are distributed unofficially to other developers, interested end-users, and Linus Torvalds. Some of these patches are full release-quality products in themselves, which just happen to be maintained outside of the "mainstream" kernel sources. Examples of this kind of patch are device drivers and architecture ports. Other patches are simply junk: one-offs, incomplete solutions, stabs in the dark, or just plain ugly. Statistically, if you see a Linux kernel patch written by a randomly chosen member of the patch-writing public, it's probably junk. On the other hand, if you see a kernel patch by Alan Cox or Linus Torvalds, it's probably worthwhile.
  4. All of these groups evaluate the patch in a number of ways ranging from functionality to coding style. Improved versions of the patch appear.
  5. The patch is either accepted into the "official" kernel, or it is rejected (often with constructive criticisms). Reasons for rejection range from poor design or implementation to the simple lack of a proven track record (the "don't bother us with it until after you've tested it with more than 100 users for a few months" response).
  6. Go to step 1.
CFTK is intended to support all but the first and last step of this development process: collecting patches, collecting information about them, providing convenient methods for obtaining patched sources, and retiring patches that are no longer useful, whether it be because they become accepted as part of the "official" code or because they are of poor quality.

Some of the implementation details are designed with today's widespread open-source development community in mind. These features are:

Caveats

Some of the policy stuff, especially dependency relationships between patches, is unimplemented and therefore untested. For example, when dependent or replacement patches change state, what should happen to the patches they depend on or replace? Is it necessarily true that if a dependency patch is rejected, then all patches that depend on it should also be rejected?

As such, one must assume that when this is really implemented, some details will be refined, and the behavior of the final product may differ.
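
To make the dependency question above concrete, here is a minimal sketch of the strictest possible answer: rejecting a patch also rejects everything that depends on it, directly or indirectly. This is only one candidate policy, not something CFTK implements; the function name `cascade_rejection` and the `depends_on` mapping are hypothetical.

```python
def cascade_rejection(depends_on, rejected):
    """Return every patch that ends up rejected under a strict cascade policy.

    depends_on: hypothetical mapping of patch id -> set of patch ids it depends on.
    rejected:   patch ids rejected directly by reviewers.
    """
    result = set(rejected)
    changed = True
    while changed:  # repeat until no further patch is pulled in
        changed = False
        for patch_id, deps in depends_on.items():
            if patch_id not in result and deps & result:
                result.add(patch_id)
                changed = True
    return result
```

A gentler policy might only flag the dependent patches for review instead of rejecting them outright.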

Progress

The Original Proposal (slightly modified)

Currently we spend an inordinate amount of time doing code merges between our private CVS server and the public CVS server at [open source project name here].

What if we solved the code merge problem by eliminating our own CVS repository entirely?

Assume that we have the following things automated and reliable:

  1. Creating patches
  2. Storing, maintaining, and serving a set of patches
  3. Applying patches
  4. Identifying patches that can or cannot be applied
  5. Identifying patches that are already applied
  6. (maybe) proving patches can be applied in any order
  7. (maybe) proving patches generate correct output
Note that none of these is currently entirely reliable, or supported by what automation we do have around CVS:
  1. CVS is unable to build well-formed diffs in many instances, particularly in older and buggier versions.

    dpkg-source from Debian does it really well, even going so far as to detect "unrepresentable" binary differences and non-file differences.

  2. CVS does not record enough information to recreate the change made at a commit. It is possible to extend CVS to do this, but this requires software we don't have.

    It is of course trivial to implement a simple repository system to store the actual patches themselves, e.g. based on RCS.

  3. CVS does not support applying a patch well, particularly one received in email (as the vast majority of our revisions are received). In particular, file deletions and additions must be handled manually.

    It is possible to build support for this around CVS, and unnecessary if CVS is not used at all. [ed: indeed, it would be trivial to generate an 'add/delete' list from a patch with the existing code.]

  4. We have no utility which can automatically apply an arbitrary patch against a source code tree, or back out of such application if it blows up. [ed: we do now! patch -sf on patch text cooked with the CFTK's secret recipe of 11 regular expressions!] [ed^2: ok, it's not really 11, it's more like five dozen, although they're all assembled together at the end...]
  5. If we had 4, 5 is really easy to build out of it. [ed: yep, just add -R].
  6. Proving patches can be applied in any order is interesting but possibly not very useful. Thankfully, in most cases patches do not actually overlap (they would be useless if that was the normal case), and the few cases where ordering does matter can be resolved manually, or by analyzing the patched source tree when building a patch to discover which patches have already been applied.

    It's probably possible to do this by checking which line numbers are modified in the patches. The only problem is that patches are not guaranteed to be derived from the same input files, and without re-implementing patch from scratch it is difficult to know how a patch will behave--i.e. where it will wander off to with its fuzz lines.

  7. It is completely unknown how correct the output of applying these patches will be. This area requires further investigation; however, a useful system can be built even without this knowledge.

    Some sort of sanity check may detect this; e.g. attempt to apply a patch in both directions, and if it is successful both ways, it should be questionable. Another trick would be to try to regenerate the patch given the original and patched files, compare that patch to the attempted patch text, and warn about differences. If someone can prove that the unreliability problem is equivalent to diff3's, then I'll be happy.
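
The second sanity check in item 7 (regenerate the patch from the original and patched files, then compare) could be sketched roughly as follows. This is an illustration, not CFTK's code: it assumes unified diffs, compares only the added and removed lines, and uses Python's difflib in place of a real diff program. All function names are hypothetical.

```python
import difflib
import re

# Matches a unified-diff hunk header; a missing count means 1.
HUNK_RE = re.compile(r"^@@ -\d+(?:,(\d+))? \+\d+(?:,(\d+))? @@")

def changed_lines(patch_text):
    """Return the '+' and '-' lines inside hunks, skipping headers and context."""
    changes = []
    old_left = new_left = 0  # lines still expected in the current hunk
    for line in patch_text.splitlines():
        if old_left > 0 or new_left > 0:
            if line.startswith("-"):
                old_left -= 1
                changes.append(line)
            elif line.startswith("+"):
                new_left -= 1
                changes.append(line)
            elif not line.startswith("\\"):  # skip "\ No newline at end of file"
                old_left -= 1  # a context line counts on both sides
                new_left -= 1
            continue
        m = HUNK_RE.match(line)
        if m:
            old_left = int(m.group(1)) if m.group(1) is not None else 1
            new_left = int(m.group(2)) if m.group(2) is not None else 1
    return changes

def regenerate_patch(original_text, patched_text, name="file"):
    """Build a fresh unified diff from the file contents before and after patching."""
    diff = difflib.unified_diff(
        original_text.splitlines(), patched_text.splitlines(),
        fromfile="a/" + name, tofile="b/" + name, lineterm="")
    return "\n".join(diff)

def patch_matches_result(submitted_patch, original_text, patched_text):
    """True if the regenerated diff changes the same lines as the submitted one."""
    regenerated = regenerate_patch(original_text, patched_text)
    return changed_lines(submitted_patch) == changed_lines(regenerated)
```

A mismatch is only grounds for a warning: two diff programs can describe the same change with different hunks, so a difference does not prove the patch was misapplied.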

Instead of two branches of code, we could simply maintain a set of patches to the external project's code, and completely eliminate our own copy of the code in our CVS repository (although of course we wouldn't actually do that, we'd just pretend it was a read-only CVS repository so that we can use cvsweb, cvs diff, cvs log, cvs annotate, etc...).
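
Once the working unit is a set of patches, the ordering question from item 6 above matters. The line-number check suggested there might look something like this rough sketch, which reads the original-file ranges out of unified-diff hunk headers and reports whether two patches touch overlapping lines of the same file. The names are hypothetical, and the caveat from item 6 still applies: the check is only meaningful when both patches were made against the same input files, and it says nothing about where patch's fuzz matching will actually land.

```python
import re
from collections import defaultdict

FILE_RE = re.compile(r"^\+\+\+ (\S+)")
HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")

def touched_ranges(patch_text):
    """Map each target file to the (first, last) original line ranges of its hunks.

    Simplification: any line starting with '+++ ' is treated as a file header,
    and ranges include the context lines around each change.
    """
    ranges = defaultdict(list)
    current = None
    for line in patch_text.splitlines():
        m = FILE_RE.match(line)
        if m:
            current = m.group(1)
            continue
        m = HUNK_RE.match(line)
        if m and current is not None:
            start = int(m.group(1))
            length = int(m.group(2)) if m.group(2) is not None else 1
            ranges[current].append((start, start + max(length, 1) - 1))
    return ranges

def may_conflict(patch_a, patch_b):
    """True if any hunk of patch_a overlaps a hunk of patch_b in the same file."""
    ranges_a, ranges_b = touched_ranges(patch_a), touched_ranges(patch_b)
    for name in ranges_a.keys() & ranges_b.keys():
        for a_lo, a_hi in ranges_a[name]:
            for b_lo, b_hi in ranges_b[name]:
                if a_lo <= b_hi and b_lo <= a_hi:
                    return True
    return False
```

Patches for which `may_conflict` returns False are candidates for being applied in either order; the rest would still need manual ordering.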

Instead of CVS commit, developers would submit the output of 'cvs diff' (or a reasonable, and hopefully more correct, facsimile) [ed: actually it's no longer necessary for it to be correct thanks to the patch text rewriting logic] with all the right options to the patch management server. Ideally such submissions can be made by email and formatted in such a way that they contain or can be converted into the format used by WineHQ, Linux, GNU, and many other projects (which includes an accreditation and a description). GPG signed email can be used for authentication and also for compression, with an optional fallback to passwords or mail-from "authentication."
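
Because submissions arrive as diff text, the 'add/delete' list mentioned under item 3 above can in principle be read straight out of the file headers. The sketch below is an assumption-laden illustration, not the existing CFTK code: it assumes unified diffs in which a file that does not exist on one side is named /dev/null, which some tools do and others do not. The function name is hypothetical.

```python
def added_and_deleted(patch_text):
    """Return (added, deleted) file name lists from unified-diff headers.

    Assumes a missing side is written as /dev/null; diffs that mark new or
    removed files some other way will simply yield empty lists.
    """
    added, deleted = [], []
    lines = patch_text.splitlines()
    # A file header is a '--- ' line immediately followed by a '+++ ' line.
    for old, new in zip(lines, lines[1:]):
        if not (old.startswith("--- ") and new.startswith("+++ ")):
            continue
        # Drop any tab-separated timestamp after the file name.
        old_name = old[4:].split("\t")[0].strip()
        new_name = new[4:].split("\t")[0].strip()
        if old_name == "/dev/null":
            added.append(new_name)
        elif new_name == "/dev/null":
            deleted.append(old_name)
    return added, deleted
```

The two lists are what would otherwise be fed by hand to the add and remove commands of the primary revision control system.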

Instead of CVS checkout, a developer could fetch the latest and greatest WineHQ sources (or some approved snapshot of those) via an extra-CFTK mechanism (OK, so that mechanism could be 'cvs co' if desired), then ask CFTK "apply all approved patches except X, Y, and Z, and also the non-approved patches A, B, and C".
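
The request "all approved patches except X, Y, and Z, and also the non-approved patches A, B, and C" is just set arithmetic. A minimal sketch, assuming a hypothetical mapping from patch ID to approval status (the function name and data layout are made up for illustration):

```python
def select_patches(patches, exclude=(), include=()):
    """Pick the patch ids to apply.

    patches: hypothetical mapping of patch id -> True if approved.
    exclude: approved patches the developer does not want.
    include: non-approved patches the developer wants anyway.
    """
    exclude, include = set(exclude), set(include)
    unknown = (exclude | include) - set(patches)
    if unknown:
        raise KeyError("unknown patch ids: " + ", ".join(sorted(unknown)))
    approved = {pid for pid, ok in patches.items() if ok}
    # Sorted only to give a stable result; this is not the application order.
    return sorted((approved - exclude) | include)
```

For example, `select_patches({"X": True, "P": True, "A": False}, exclude=["X"], include=["A"])` returns `["A", "P"]`. Deciding the order in which to apply the selected patches is a separate problem.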

A server somewhere could provide a CVS repository or tarballs or both, containing some popular versions of the code, e.g. the code with all approved patches and the code with all patch thresholds set (i.e. all patches with N, N-1, and N-2 approval votes could be made available).
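
One way to read the threshold idea is that the server publishes one patch set per vote threshold. The sketch below assumes that reading, and assumes "N votes" means "at least N votes"; the vote counts, the function name, and the `levels` parameter are all hypothetical.

```python
def snapshot_sets(votes, n, levels=3):
    """Return {threshold: sorted patch ids with at least that many votes}.

    votes:  hypothetical mapping of patch id -> number of approval votes.
    n:      the highest threshold to publish (N in the text above).
    levels: how many thresholds to publish, counting down from n.
    """
    return {
        threshold: sorted(pid for pid, count in votes.items() if count >= threshold)
        for threshold in range(n, n - levels, -1)
    }
```

Each value in the result is the list of patches that would be applied to build the corresponding tarball or CVS snapshot.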

This has four important differences (not to say "advantages") relative to CVS:

  1. Changes retain their distinctiveness and have identity. This means they're ready to ship to the external project's code maintainers.
  2. No live network feed is required to make commits. A mailing list for incoming patches could mean no feed is required for checkouts either. [ed: indeed, we can support a public mailing list for patch submissions directly into CFTK now.]
  3. It is not necessary to make two patches against two source trees and merge them all the time, as is the case with two CVS repositories. CFTK merges on-demand with your local sources; it can be somewhat more predictable than CVS. An automated test server that talks to CFTK could report when something is broken early, when it's easier to fix.
  4. When conflicting changes happen on the external source code tree, we will probably lose distinct features, not random changes, because patches can be atomically applied or unapplied.
The major disadvantages relative to CVS:
  1. Changes are not guaranteed to be correct.

    It is always possible to generate a current version of the source code under the following conditions. Start from the version of the unpatched source tree that was current when the number of viable patches was zero, and apply the viable patches to that version in the correct order. This works if all viable patches are derived from that version of the unpatched source tree (or from that version with patches known to CFTK applied) and if they all have correct ordering information. In many cases it is possible to use other versions of the unpatched source tree as input as well, since patch is a master of approximate matching.

    In any other case, there may be errors introduced into the code by the patching process. This is an unavoidable part of merging two code forks. We are not trying to eliminate this problem; we are simply trying to automate those parts of the problem that can be easily automated, and provide useful information to manage the problem in cases where it can't be easily automated.

  2. This system is not a primary revision control system. In fact, it explicitly requires another revision control system, although that revision control system could be as simple as regularly published snapshot tarballs. CFTK is something different, a secondary revision control system, one that is used in tandem with another.

    CFTK is also useful to persons and organizations that already have a primary source code repository, as a mechanism for evaluating contributed code prior to inclusion in the code base. CFTK is not designed to be as strict as Aegis (which includes mandatory testing and strictly defined user roles), but it is intended to support peer review better than CVS.

  3. CFTK will not scale up all the way from an empty project to a large project containing millions of lines. It is intended to maintain a small set of revisions outside of the primary revision control system, and it is designed to efficiently forward these revisions to the primary revision control system.

    It is assumed in the design of this system that patches will not be maintained indefinitely; after a period of time, they will either become part of the unpatched source, or they will be destroyed.

Patches would need to be assigned identities (numbers work, although I prefer 8-character random IDs because they don't imply an order which may not exist, and for general applicability there should be an implied or actual hostname appended).
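
As a small illustration of such identities, the sketch below generates an 8-character random ID and appends a hostname. The alphabet, the "@" separator, and the function name are assumptions made for the example; the proposal itself does not specify them.

```python
import secrets
import socket
import string

# Assumed alphabet: lowercase letters and digits.
ALPHABET = string.ascii_lowercase + string.digits

def new_patch_id(hostname=None):
    """Return an id like 'k3x9q0ab@host': 8 random characters plus a hostname."""
    if hostname is None:
        hostname = socket.getfqdn()  # fall back to this machine's name
    token = "".join(secrets.choice(ALPHABET) for _ in range(8))
    return f"{token}@{hostname}"
```

Because the token is random, two IDs carry no implied ordering, and the hostname keeps IDs from different patch servers from being confused with one another.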

Each patch would have the following control fields in addition to the patch text itself: