[LinuxMaster Home]
[CVS Agent]
[The Articles Page]
[EXPERIMENTAL] Code Forking Tool Kit Proposal
This document describes an experimental patch management system which
is designed to track "our" patches against "their" source code tree.
What It's For
"Code forking" is the maintenance of multiple versions of a source code
tree that share a common ancestor. It usually occurs as the result of
two distinct groups with different agendas working on the same project.
It's generally considered a bad thing over the long term due to the
duplication of effort and division of resources; however, over the short
term code forks are useful in that they allow small, focused groups of
people to work on code before contributing it back to the main source
code tree. This reduces the amount of work that the main source
maintainer has to do.
The development model embodied in this software (although not the
software itself) is used by kernel hackers the world over for the
Linux kernel. The development model goes like this:
- Linus Torvalds releases a new revision of the "official"
kernel source code tree.
- Independent developers and groups of developers modify this
code and create patches.
- These patches are distributed unofficially to other developers,
interested end-users, and Linus Torvalds. Some of these patches
are full release-quality products in themselves, which just
happen to be maintained outside of the "mainstream" kernel
sources. Examples of this kind of patch are device drivers and
architecture ports. Other patches are simply junk: one-offs,
incomplete solutions, stabs in the dark, or just plain ugly.
Statistically, if you see a Linux kernel patch written by a
randomly chosen member of the patch-writing public, it's
probably junk. On the other hand, if you see a kernel
patch by Alan Cox or Linus Torvals, it's probably
worthwhile.
- All of these groups evaluate the patch in a number of
ways ranging from functionality to coding style. Improved
versions of the patch appear.
- The patch is either accepted into the "offical" kernel, or
it is rejected (often with constructive criticisms). Reasons
for rejection range from poor design or implementation to lack
of simple proven track record (the "don't bother us with it
until after you've tested it with more than 100 users for a few
months" response).
- Go to step 1.
CFTK is intended to support all but the first and last step of this
development process: collecting patches, collecting information
about them, providing convenient methods for obtaining patched
sources, and retiring patches that are no longer useful, whether
it be because they become accepted as part of the "official" code
or because they are of poor quality.
Some of the implementation details are designed with today's widespread
open-source development community in mind. These features are:
- Strong authentication (via GPG signed control messages).
Plaintext password and MAIL-FROM authentication will be
supported as well (but gpg --batch --no-options
--no-default-keyring --always-trust --load-extension rsaref
--keyring=$pgpkeyring --keyring=$gpgkeyring is so
friendly...).
- Automatic patch application by a scoring algorithm which
can align patches to a source directory even in the case of
repeated base names. A separate pass is made over an entire patch
using assumptions opposite to those used by the patch
program (patch assumes that each hunk of the patch is
independent, while CFTK assumes that they all apply to the same
source code tree and simply tries to figure out where).
patch -p and -d options are automatically
selected to match the largest number of files and directories
mentioned in the patch. This has been tested using the
wine-patches mailing list as input.
- Obnoxious features such as "../" in filenames are controlled,
so that it is (reasonably) safe to feed an arbitrarily evil patch
into CFTK without compromising the host system. The same code
which determines where in the source tree a patch can be applied
can also rewrite the patch text so that constructs like "../" are
not required. The software can also work around some bugs in
old versions of CVS with regard to file naming in diffs.
- Peer review for patches. A weighted voting system allows
project leaders to have the power to push patches through to
meet a deadline while also allowing for development on a
consensus-driven feature-by-feature basis for other patches.
Dissent is a good thing, so negative votes are allowed as well.
Once a patch has passed through peer review, the system can
automatically submit the patch to an email address, including
a full changelog, or merge the patch into a source tree.
- Email-based interface. Support for extracting MIME-encoded
(base64, quoted-printable, [78]bit, and the previous three
gzipped) patches from email messages is already implemented.
Full control of the system will be available via email,
if desired.
- CGI-based interface (because some people only speak HTTP,
and HTTPS has pretty decent authentication too). Full control
of the system will be available via the web, if desired.
- Command-line interface which hides the email and CGI-based
protocols on the client side and implements them on the server
side. This will include a utility that allows patches to be
quickly and easily selected by various search criteria and
applied against a source directory.
- CVS import-compatible source code export of useful patch sets
applied to a known unpatched source code tree. This can be
used to generate tarballs and web sites too. It's most useful
for generating a CVS repository with tags that you can use to
retrieve and apply each patch.
- Server-side state storage should be efficiently transportable
via rsync.
- Preservation of patches as separate from the entity that
they patch. Unlike other revision control systems, a patch is
never really "integrated" by CFTK--instead, CFTK expects
this integration is done by someone else, and merely detects
it when it happens. This is a design goal.
- Tracking of the source code, including automatic testing
of patches, detection of applied patches, and reporting of
broken (conflicting) patches.
Caveats
Some of the policy stuff, especially dependency relationships between
patches, is unimplmented and therefore untested. For example, when
dependent or replacement patches change state, what should happen to
the patches they depend on or replace? Is it necessarily true that if a
dependency patch is rejected, then all patches that depend on it should
also be rejected?
As such, one must assume that when this is really implemented, some
details will be refined, and the behavior of the final product may differ.
Progress
- A parser for MIME-encoded patches has been done, which can identify the
output of:
- cvs diff -u
- diff -c
- diff -u
encoded in MIME types 8bit, quoted-printable,
and application/* (actually, all non-text types are assumed to
be patch text decompressible via gzip). The test material was obtained
from the wine-patches mailing list.
- The patch recognition engine can now intuit patch -p and -d options
given a copy of the sources to be patched. It can do this in many cases
even with buggy CVS diff output and with multiple files that have the
same base name.
The Original Proposal (slightly modified)
Currently we spend an inordinate amount of time doing code merges
between our private CVS server and the public CVS server at [open source
project name here].
What if we solved the code merge problem by eliminating our own CVS
repository entirely?
Assume that we have the following things automated and reliable:
- Creating patches
- Storing, maintaining, and serving a set of patches
- Applying patches
- Identifying patches that can or cannot be applied
- Identifying patches that are already applied
- (maybe) proving patches can be applied in any order
- (maybe) proving patches generate correct output
Note that currently none of these are entirely reliable or supported with
what automation we do have with CVS:
- CVS is unable to build well-formed diffs in many instances,
particularly older and buggier versions.
dpkg-source from Debian does it really well, even
going so far as to detect "unrepresentable" binary differences
and non-file differences.
- CVS does not record enough information to recreate the change
made at a commit. It is possible to extend CVS to do this, but
this requires software we don't have.
It is of course trivial to implement a simple repository system
to store the actual patches themselves, e.g. based on RCS.
- CVS does not well support applying a patch, particularly
one received in email (as the vast majority of our revisions are
received). In particular deletions and additions must be
handled manually.
It is possible to build support for this around CVS, and unnecessary
if CVS is not used at all. [ed: indeed, it would
be trivial to generate an 'add/delete' list from a patch with
the existing code.]
- We have no utility which can automatically apply an arbitrary
patch against a source code tree, or back out of such application
if it blows up. [ed: we do now! patch -sf
on patch text cooked with the CFTK's secret recipe of 11
regular expressions!] [ed^2: ok, it's not really
11, it's more like five dozen, although they're all assembled
together at the end...]
- If we had 4, 5 is really easy to build out of it. [ed: yep, just add -R].
- Proving patches can be applied in any order is interesting but
possibly not very useful. Thankfully, in most cases patches do
not actually overlap (they would be useless if that was the normal
case), and the few cases where ordering does matter can be resolved
manually, or by analyzing the patched source tree when building a
patch to discover which patches have already been applied.
It's probably possible to do this by checking which line
numbers are modified in the patches. The only problem is
that patches are not guaranteed to be derived from the same
input files, and without re-implementing patch from
scratch it is difficult to know how a patch will behave--i.e.
where it will wander off to with its fuzz lines.
- It is completely unknown how correct the output of applying
these patches will be. This area requires further investigation;
however, a useful system can be built even without this knowledge.
Some sort of sanity check may detect this; e.g. attempt to
apply a patch in both directions, and if it is successful
both ways, it should be questionable. Another trick would
be to try to regenerate the patch given the original and
patched files, compare that patch to the attempted patch text,
and warn about differences. If someone can prove that the
unreliability problem is equivalent to diff3's, then I'll
be happy.
Instead of two branches of code, we could simply maintain a set of patches
to the external project's code, and completely eliminate our own copy of
the code in our CVS repository (although of course we wouldn't actually
do that, we'd just pretend it was a read-only CVS repository so that
we can use cvsweb, cvs diff, cvs log, cvs annotate, etc...).
Instead of CVS commit, developers would submit the output of 'cvs diff'
(or reasonable, and hopefully more correct, facsimile) [ed: actually
it's no longer necessary for it to be correct thanks to the patch text
rewriting logic] with all the right options to the patch management
server. Ideally such submissions can be made by email and formatted
in such a way that they contain or can be converted into the format
used by WineHQ, Linux, GNU, and many other projects (which includes
an accreditation and a description). GPG signed email can be used for
authentication and also for compression, with an optional fallback to
passwords or mail-from "authentication."
Instead of CVS checkout, a developer could fetch the latest and
greatest WineHQ sources (or some approved snapshot of those) via an
extra-CFTK mechanism (OK, so that mechanism could be 'cvs co'
if desired), then ask CFTK "apply all approved patches except X, Y,
and Z, and also the non-approved patches A, B, and C".
A server somewhere could provide a CVS repository or tarballs or both
which contains some popular versions of the code, e.g. the code with all
approved patches and the code with all patch thresholds set (i.e. all
patches with N, N-1, and N-2 approval votes could be made available).
This has four important differences (not to say "advantages") relative
to CVS:
- Changes retain their distinctiveness and have identity.
This means they're ready to ship to the external project's
code maintainers.
- No live network feed is required to make commits. A mailing
list for incoming patches could mean no feed is required for
checkouts either. [ed: indeed, we can support a public mailing
list for patch submissions directly into CFTK now.]
- It is not necessary to make two patches against two source
trees and merge them all the time, as is the case with two CVS
repositories. CFTK merges on-demand with your local sources;
it can be somewhat more predictable than CVS. An automated test
server that talks to CFTK could report when something is broken
early, when it's easier to fix.
- When conflicting changes happen on the external source code
tree, we will probably lose distinct features, not random changes,
because patches can be atomically applied or unapplied.
The major disadvantages relative to CVS:
- Changes are not guaranteed to be correct.
It is always possible to generate a current version of the source
code by using the version of the unpatched source tree that was
current at the time when the number of viable patches is zero, and
applying viable patches to that version in the correct order, if
all viable patches are derived from that version of the unpatched
source tree (or the same with patches known to CFTK applied) and
if they all have correct ordering information. In many cases it
is possible to use other versions of the unpatched source tree
as input as well--patch is a master of approximate matching.
In any other case, there may be errors introduced into the code
by the patching process. This is an unavoidable part of merging
two code forks. We are not trying to eliminate this problem; we
are simply trying to automate those parts of the problem that can
be easily automated, and provide useful information to manage
the problem in cases where it can't be easily automated.
- This system is not a primary revision control system. In
fact, it explicitly requires another revision control system,
although that revision control system could be a simple as
regularly published snapshot tarballs. CFTK is something different,
a secondary revision control system, one that is used
in tandem with another.
CFTK is also useful by persons and organizations that already have
a primary source code repository as a mechanism for evaluating
contributed code prior to inclusion in the code base. CFTK is
not designed to be as strict as Aegis (which includes mandatory
testing and strictly defined user roles), but it is intended to
support peer review better than CVS.
- CFTK will not scale up all the way from an empty project to
a large project containing millions of lines. It is intended to
maintain a small set of revisions outside of the primary revision
control system, and it is designed to efficiently forward these
revisions to the primary revision control system.
It is assumed in the design of this system that patches will not
be maintained indefinitely; after a period of time, they will
either become part of the unpatched source, or they will be
destroyed.
Patches would need to be assigned identities (numbers work, although I
prefer 8-character random ID's because they don't imply an order which
may not exist, and for general applicability there should be an implied
or actual hostname appended).
Each patch would have the following control fields in addition to the
patch text itself:
- Domain: see below (probably needs a better name)
- Synopsis: single-line description of the patch
(for subject line of WineHQ postings)
- Description: complete text description of the patch
(for body text of WineHQ postings)
- Submit-To: where this patch goes if approved (email address)
- Changelog: revision history of the patch
- Patch-Id: identity of the patch (xyjakwer@patches.wine.lnx or
some such thing). Unique identifier for the patch.
- Patch-Name: mnemonic name of the patch. Non-unique
identifier for the patch.
- Patch-Group: Another mnemonic name of the patch. Used to
identify groups of related patches.
- Authors: Email addresses/names of people who contributed to
the patch
- Notify: list of email addresses that should be notified
whenever the patch state or approvals are modified
- Approvals: list of voters for the patch
- Disapprovals: list of voters against the patch
- Replaces: patch ID's that this patch replaces
- Depends: patch ID's that must be applied before this patch
- Viable: true or false - is this patch under review, or is it
dead, obsolete, or applied?
- Status: see below
- Origin: source of what this patch is applied against (see below)
- Condemnations: if the patch was made non-viable by human
intervention: which human, and why?
"Changelog" is maintained by the system and includes an entry for every
change to the patch, including approvals, disapprovals, condemnations,
and brokenness detection.
"Patch-Id" is a unique identifier for the patch, which is automatically
generated. All patch ID's assigned by the same server are guaranteed to
be unique; for multiple servers, append the probably-unique host name
after an "@".
"Patch-Group" is used to identify sets of related patches. The field can
have many values so that patches can be part of many groups. When given
to a user interface, selects the entire group of patches.
"Patch-Name" is a non-unique identifier which can be used in user
interfaces, where it means the patch which is not replaced or depended
on by any patch with the same Patch-Name if and only if there is only
one such patch. So, given:
-
Patch-Id: A
Patch-Name: foo
Patch-Id: B
Patch-Name: foo
Replaces: A
'foo' refers to patch B.
-
Patch-Id: C
Patch-Name: quux
Patch-Id: D
Patch-Name: bar
Replaces: C
'quux' refers to patch C. While patch D is listed as replacing
patch C, patch C is not named 'quux'. 'bar' in this case refers to
patch D.
-
Patch-Id: E
Patch-Name: bletch
Patch-Id: F
Patch-Name: bletch
'bletch' is ambiguous and attempting to use it results in an error.
-
Patch-Id: G
Patch-Name: foobar
Patch-Id: H
Patch-Name: foobar
Depends: G
'foobar' refers to patches G and H applied in that order.
"Domain" is used to figure out who is allowed to approve or disapprove
of a patch, and also used when trying to figure out which patches should
be attempted. This allows a user to say "I want all the patches from
the Draw team and Macadamian but not Quattro or Presentations," if we
had domains set up just the right way.
"Origin" (suggested by Albert den Haan ) indicates what
this patch is applied to. It's advisory, for humans to read, and not
used by CFTK itself; CFTK only cares about the "current" revision of the
"other" source tree and whatever sources happen to be present when it
is asked to apply patches to them. This field never refers to patches
that are known to CFTK itself--those are handled via "Depends".
Examples of "origin" references:
- Linux kernel version 2.2.12
- WineHQ CVS tag Wine-990923
- foobar snapshot 1999/10/25 17:00
The status and viability of a patch changes automatically when approvals
or disapprovals are received for the patch, and when various changes in
state are automatically detected. A patch is 'viable' if it is not rejected,
applied, or obsolete.
A working set of statuses for viable patches (subject to change
during implementation):
- approved (at least N positive votes and can be applied),
- review (less than N positive votes and can be applied),
- broken (cannot be applied due to conflict, number of votes
irrelevant),
For non-viable patches:
- applied (the patch can't be applied because it is already
applied to a new version of the "unpatched" source)
- obsolete (someone has reported that the patch is superceded
by something non-identical in the "unpatched" source)
- rejected (at least N negative votes received)
- replaced (a patch that replaces this patch has been applied
or obsoleted)
Note we currently assume N = 2, developers have up to 1 vote, project
leaders/managers up to 2 votes. The vote server will know how many
votes each user can apply to patches written by other people and patches
written by themselves. A user can vote in any direction with up to
their maximum number of votes plus the number of votes they have already
submitted for the patch in the opposite direction. Each negative vote
cancels one positive vote and vice versa.
Voting proposal #1: Voting restarts at zero every time patch text
is modified.
Voting proposal #2: It is impossible to modify a patch; instead, you can
only create a new one that replaces an old one. All new patches start with
zero votes, of course. This has the advantage that it more elegantly
allows patches to be broken up into smaller pieces, and when someone says
"patch #72", you always know what you're talking about.
The patching management software must detect the broken and applied
states automatically.
Upon entering approved state, the patch is automatically submitted
to the Submit-To email address (or a default email address supplied by
the domain, e.g. the external project maintainer's address or a patches
mailing list).
When a patch is broken, the patch is still tested from time to time
(to see if it's still broken) but when a developer does a checkout it is
not included by default in their patch set.
It should be possible to ask for a source tree with all viable patches
applied if they have vote counts above a certain threshold.
The software should support both automatically reverting partially
applied (conflicting) packages, and allowing such conflicts to persist
(so that partial conflicts can be resolved by hand).
Patches cannot have higher effective vote counts or viability than
the patches they depend on. This means if patch A depends on patch B
(i.e. B must be applied before A), then if patch B is not approved,
then patch A cannot be approved, no matter how many votes it has.
If patch B is rejected, patch A is rejected too.
Further Work
A command line definition, with directory structure of local state
files, would be nice. Write a good one, and I'll implement it.
$Id: cftk.html,v 1.8 1999/10/26 19:48:58 zygob Exp $