\documentclass[notitlepage]{report}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\var}[1]{{\sl #1}}
\newcommand{\FIXME}[1]{{\large \sc FIXME: #1}}
\newcommand{\fakefs}{\code{fakefs}}
\newcommand{\fakedmd}{\code{fakedmd}}
\newcommand{\ldpreload}{\code{LD\_{}PRELOAD}}
\newcommand{\inoepo}{$(Inode, Epoch)$}
\title{\fakefs{} Architecture \& Development Plan}
\author{Zygo Blaxell}
\begin{document}
\maketitle{}
\begin{verbatim}
$Id: architecture.tex,v 1.4 2003/03/05 03:01:12 cvs Exp $
\end{verbatim}
\tableofcontents{}

\section{Status Of This Document}

This document is undergoing \emph{major} revisions almost daily, possibly
while you're reading it.

There may be problems navigating the HTML version due to nodes being
rearranged as new versions of the document are uploaded to the web server.

Currently it's best to take most of this document with a grain of salt
until there is some concrete implementation to worry about.

\textbf{You have been warned.}

\chapter{Overview}

\fakefs{} is a database application that implements a virtual POSIX
compressed file system.  

\section{Disorganized List of Features}

\begin{itemize}

\item Optimized for high space efficiency (as opposed to speed).  It
is highly likely that \fakefs{} will be very slow for writes and not
particularly fast for reads either.  Perhaps ``Astonishingly Slow
File System'' would be a better name.

\item Interface compatible with POSIX, with the usual Unix features and
extensions (users, groups, set-id bits, symlinks, devices, pipes).

\item Designed for storing highly redundant collections of files.
Capable of performing global searches for common data blocks to reduce
the size of the text stored.  Can also store completely identical copies
of entire files by copy-on-write references.

\item Identical copies of a file appear only once in physical storage, but
transparently appear as distinct files through the filesystem interface.
This is different from normal Unix hard links, which are not distinct.

\item Implemented entirely in user-space---ordinary users can create
virtual \code{/dev} structures, change ownership of files, and act
as if they were \code{root} within the virtual filesystem, without
having \code{root} privileges outside of the virtual filesystem.
This becomes interesting when combined with\ldots{}

\item ...the optional \ldpreload{} interface for legacy application
support.

\item Journaling and concurrent access implemented by Berkeley DB
Transactional Data Store.

\item Maybe in the future, efficient storage of previous views of the
filesystem contents (`snapshots').

\end{itemize}

\section{Where to get it}
\label{where}

\fakefs{} is available from
\code{http://www.hungrycats.org/\~{}zblaxell/projects/fakefs/fakefs.html}
via HTTP and from \code{rsync://rsync.hungrycats.org/zblaxell/fakefs}
via \code{rsync}.  It is written by Zygo Blaxell \code{<zblaxell@hungrycats.org>}.

\section{Support}

If there is sufficient demand, a mailing list and anonymous CVS repository
can be created.  

No other support is provided.  At this early stage of development,
you will have to write code in order to make \fakefs{} do anything at all.

\section{Contributing}

Patches are welcome.  Please generate your patches in `unified diff'
format (\code{diff -uNr} \var{my-copy your-modified-copy}) against a
copy of the source tree (see section \ref{where}) and send them to me
via email.

If you wish to work on implementing the archtecture in this document,
please drop me a line so that I can arrange who works on what to avoid
duplication of effort.  There are lots of modules that you can create
in isolation.

\chapter{Goals}

\section{First Phase Goals}

Things we want to be able to do right away.

\subsection{Backup Server}

\fakefs{} can be used to implement a backup system for a workgroup of
machines in geographically diverse locations.  The process works in
stages:

\begin{enumerate}

\item At regular intervals, each machine which is to be backed up
(backup clients) will contact the backup server, which runs \fakefs{}.
Each machine will upload itself into a subdirectory of the \fakefs{}
filesystem using a modified or \ldpreload{}-hacked \code{rsync}, or some
similar utility.

\item Backup clients can access the backup server, either through a
\fakefs{} NFS server, or through an \ldpreload{}-hacked Samba server, or
through a \fakefs{}-aware web server, to retrieve individual files should
they be lost, or browse through previous versions of the filesystem.

\item The entire contents of the backup server are dumped to tape at
regular intervals.  Note that \fakefs{} uses friendly filenames and
directory structure, so you don't get such data restoration disasters
as:

\begin{itemize}

\item missing hard links,

\item entirely missing files due to confusion caused by hard links,

\item case-mangling backup file formats (e.g. ISO-9660 or anything
touched by Microsoft),

\item poor performance on slow files (e.g. millions of 4K files 
transferred over NFSv2, or anything touched by Microsoft),

\item loss of Unix-specific file attributes, and

\item entirely missing directory hierarchies due to insufficient
credentials for the backup process.

\end{itemize}

\end{enumerate}

As long as the Berkeley DB guidelines for database backup are followed,
it should be possible to back up a live \fakefs{} filesystem.  The
main concern is the non-Berkeley data (external files).  Fortunately
there isn't much of this data.

\section{Second Phase Goals}

Things that are nifty, that we want to be able to do later on.

\subsection{`Transparent' or `Overlay' Filesystem}

By extending \ldpreload{} hacking a little further, one could implement
a copy-on-write filesystem on top of a read-only filesystem.  Attempts
to get read access to objects not in the \fakefs{} filesystem will be
redirected to access a read-only `backing' filesystem.  Attempts to get
write access to objects, or to rename or delete existing objects (which
is really just another way to speak of write access to directory objects),
or to create new objects, will be redirected to objects in a \fakefs{}
filesystem.

Note that such filesystem must be able to store `rename' and 'delete'
objects as well.  This is why they are notoriously hard to implement!

\fakefs{} should be able to implement this using an asymmetrical set of
storage and retrieval plugins, especially if there is an `import'
stage which transfers metadata from the real filesystem to \fakefs{}
before the overlay filesystem can be used.

\subsection{Quota Support}

The big issue about quota support on a globally compressing filesystem
is:  who owns the one copy of a block of zeros in the filesystem?

In normal filesystems, a block is a block---each user either owns a
block, or they don't.  In \fakefs{}, a users do not own blocks, the
filesystem does.  At best, users rent blocks.

Suppose that we have a block B that is shared between user A and user B.
Since there are two users, we put half of a block against the quota
for each user.  Now suppose B deletes their file...B's quota is then
transferred to A.  If A was at quota before, they are now over quota,
even though they did not modify the filesystem!

That might not be as much of a problem as it looks, but it does seem
to imply that popular blocks will touch many users' quota records
simultaneously.  Hmmm.

\subsection{Variable-length Hashing}
\label{hash-fubar}

Some method other than ``copy the entire \fakefs{} filesystem into
another \fakefs{} filesystem'' for changing the lengths of hashes,
or even variable-length hashes.

\section{Non-Goals}
\label{not-goals}

Things we just don't care about.

\subsection{Source Code Repository}

Although not really designed for this purpose, \fakefs{} can be used as
a slow and limited CVS/RCS/PRCS/XDFS replacement, assuming that you
write a \code{ChangeLog} file somewhere, and you completely disable
garbage collection.

It is probably more (if at all) correct to build systems like
CVS/RCS/PRCS/XDFS on top of \fakefs{}.

\chapter{Terminology}

Some words have specific meanings when used in this document.  In
addition to the terms defined here, see section \ref{db-types} (names
of data types in the database).

\section{Files}

A `file' is used in this document to refer to a contiguous stream
of data.  In POSIX terms, a `file' refers to the data content of an inode,
as opposed to the other attributes or names linked to the inode.

The term `file' is distinct from other strings of binary data in that
a `file' in \fakefs{} may be very large, possibly much larger than 2
gigabytes (the traditional file-size limit on filesystems implemented
on 32-bit host machines).

\section{Host Filesystem or Physical Filesystem}

The data stored inside a \fakefs{} virtual filesystem physically resides
inside another filesystem implemented by a POSIX-compliant kernel.
In this document, this containing filesystem is interchangeably called a
`host' or `physical' filesystem.

\chapter{System Architecture}

The \fakefs{} system consists of:

\begin{itemize}

\item A database that physically contains the filesystem, discussed in
detail in chapter \ref{Database}.

\item A daemon (or daemons?) which takes care of database administrative
functions, environment setup, deferred file cache operations, and
garbage collection.

\item A C library which provides POSIX-style file manipulation primitives
(e.g. \code{fakefs\_{}open}, see chapter \ref{primitives}).  

\item A \code{.h} file which can be \code{\#include}d into C programs,
which re\code{\#define}s standard POSIX calls to use \fakefs{} functions.

\item A \ldpreload{} shared library which replaces the standard POSIX
\code{open()}, \code{stat}, \code{chmod} library functions with
\fakefs{} equivalents.  This library is discussed in chapter \ref{Preload}.

\item A collection of dynamically loadable extensions (`plugins') which
implement various compression and decompression schemes.  These are
discussed in detail in chapter \ref{Extensions}.

\item An NFS server (because no filesystem without an NFS service is worth
discussion).

\end{itemize}

\chapter{Database Operations}
\label{primitives}

\fakefs{} primitives fall into two groups---POSIX standard and
\fakefs{}-specific operations---which are each further subdivided into
two groups---those that operate on the contents of files, and those that
operate on the data in the filesystem surrounding files.  Another way to
describe the latter distinction is in terms of `data' versus `metadata'.

\section{Metadata Operations}

Metadata operations deal with the filesystem namespace, Inodes, and Stat
data associated with Inodes.

Certain POSIX operations are related by the kind of operation they perform.
The difference between these operations is the kind of object being operated
on.  For example, \code{stat} operates on a file, while \code{lstat}
operates on a symlink rather than the symlink's target, and \code{fstat}
operates on a file descriptor.  In this document, such a related group
of operations such as \code{stat}, \code{fstat}, and \code{lstat} is
referred to as the `\code{stat} family'.

\subsection{POSIX Metadata Operations}

These functions are supported by \fakefs{} and have exactly the same meanings
as in POSIX or Unix:

\begin{itemize}

\item Inode information functions:  \code{stat}/\code{chown}/\code{chmod}
families, \code{utime}

\item Name-to-inode mapping manipulation:  \code{link}, \code{unlink},
\code{rename}

\item Directory structure:  \code{opendir}, \code{readdir},
\code{closedir}, \code{seekdir}, \code{telldir}, the \code{chdir} family

\end{itemize}

Note that not all of the semantics of access control are necessarily
implemented in \fakefs{} the same way they are in the host OS.  A user's
real \code{uid}/\code{gid} identity is irrelevant to a filesystem that
exists entirely in user-space and is fully accessible to that user in
the host OS.  In many cases, this can be considered a feature.

\subsection{\fakefs{} Metadata Operations}

\fakefs{} can perform the following operations in addition to the standard
POSIX/Unix operations:

\begin{itemize}

\item Direct read or write an existing Inode's Stat info.

\item Fast mapping from Inode number to names of all hard-links
to that Inode (linear in the number of path components and links).

\end{itemize}

\section{Data Operations}

Data operations deal with data stored in files in the filesystem.
\fakefs{} does not perform data operations on files; instead, \fakefs{}
copies a physical file from the virtual filesystem to a physical
filesystem cache (see section \ref{database-cache}), where it can be
accessed using the host OS facilities for manipulating data in files.
After the file is closed, or when explicitly synchronized, \fakefs{}
re-inserts the physical file back into the virtual filesystem.

\subsection{POSIX Data Operations}

Note:  POSIX data operations are defined in terms of \fakefs{} data
operations, and many are actually implemented by the host OS.  They are
listed here only because they are explicitly intended to work as
described.

\begin{itemize}

\item \code{open}, when used with \code{O\_{}CREAT} to create a new file,
inserts an Inode containing a file of zero length into the \fakefs{}
filesystem, then behaves like \code{open} when used on an existing file.

\item \code{open}, when used on an existing file, returns a real file
descriptor to a real inode stored on the host file system; however,
all of the \fakefs{} functions such as \code{stat()} return information
about the virtual inode in the \fakefs{} filesystem, not the physical
inode.  When \ldpreload{} (see chapter \ref{Preload}) is used, the libc
\code{stat} function is replaced with the corresponding \fakefs{}
function.

\item \code{read}, \code{write}, \code{readv}, \code{writev},
\code{pread}, \code{pwrite}, \code{mmap}, \code{lseek}, \code{fcntl},
\code{flock}, and \code{ftruncate} all behave in \fakefs{} as they do in a
real Unix filesystem.  \fakefs{} does not provide these functions---since
they all work on file descriptors, and \fakefs{} \code{open()} returns
a file descriptor, these functions do not need to be modified to work
with \fakefs{}.

\item \code{dup}, \code{dup2}, and \code{fcntl} with the \code{F\_{}DUPFD}
flag have to be handled through \fakefs{} when \ldpreload{} is in use,
in order to be able to keep track of which file descriptors correspond to
files in \fakefs{} and which do not.

\item \code{close} causes the cached physical copy of a file (which may
have been modified by the \fakefs{} user) to be reinserted into \fakefs{}.
Note that this reinsertion may be deferred for performance reasons.

\end{itemize}

File descriptor sharing works as much as possible in \fakefs{} as it
does in POSIX, even though this may be severely painful to implement.

\subsection{\fakefs{} Data Operations}

The data operations create and destroy physical copies of virtual files
in a cache in a physical filesystem on the host OS.  These operations
are known as `checking out' and `checking in' files, respectively.

When \fakefs{} `checks out' a file, it creates a physical file in
the cache area (see section \ref{database-cache}) whose contents are
identical to the corresponding virtual file inside \fakefs{}.  This is
further subdivided into two modes for read-write and read-only access
to the physical file:

\begin{itemize}

\item For read-only access, the physical file may be shared with other
Inodes or even \fakefs{} permanent storage.  Modification of the physical
file is prohibited, and may damage the \fakefs{} virtual filesystem if
it occurs.

\item For read-write access, exactly one physical file exists for each
distinct open Inode.  The normal Unix semantics of a file hard-linked
to multiple different names apply---if the file is opened under two
names, the two file descriptors correspond to the same physical file
as well as the same virtual one.  Note that the physical file will have
only one hard link.

\end{itemize}

When \fakefs{} `checks in' a file, it analyzes a physical file in the
cache and computes an efficient representation of that file (see
chapter \ref{file-insert}) to store in the database.

\begin{itemize}

\item Fast comparison of files by the Hash of their contents.

\item Retrieval of Inodes by Hash (returns list of Inodes that are
identical) or by \inoepo{} pair (returns a Hash).

\item Retrieve a file by Inode.  In read-write mode, files with distinct
Inode numbers get distinct physical files, while files with identical
Inode numbers get references to a single physical file.  Physical files
retrieved from Inodes with non-current Epoch values are always read-only.

\item Retrieve a file by Hash.  The physical files are read-only.
Only one physical file exists no matter how many times the same Hash
is retrieved by clients.  This physical file may be shared with files
accessed read-only by Inode.

\item Insert a physical file into \fakefs{} by Inode or by filename
(which creates a new Inode or looks up an existing one).  Insert-by-Inode
can only use existing Inodes.

\end{itemize}

\chapter{Berkeley DB:  The Least of the Evil}

Berkeley DB introduces a number of caveats of its own to \fakefs{},
especially when transactions and recovery are involved.

\section{DB:  Database Babysitting}

Berkeley DB databases require a single thread to perform database recovery
at startup, and three threads to perform the checkpointing, log archiving,
and deadlock-detection functions throughout the life of processes that
use the database.  This is the function of \fakedmd{}, described
in section \ref{dmd}.

\section{Running out of Space}

It is imperative that \fakefs{} never allow itself to run completely
out of disk space.

Berkeley DB requires disk space for logging \emph{all} modifications to
the database---\emph{including deletions}.  In particular, deleting an
object requires transaction log space proportional to the size of the
object being deleted, and doesn't release any space from the database
file to the host OS.  

If we want to delete a 10MB object we need at least 10MB of free space to
log the transaction.  Ouch!  

If we later insert an 11MB object, we require 12MB of free space:  11MB
for the transaction log, and 1MB of new free space for the object (the
other 10MB uses previously freed space).  Fortunately Berkeley DB gets
this right.

Simple transaction-protected reads do not seem to consume log space,
at least in single-reader cases.

We don't want to store large objects inside Berkeley DB lest we be
unable to find sufficient space to remove them afterwards.  Also,
Berkeley DB refuses to \code{mmap()} large databases by default (and
can't \code{mmap()} a database larger than the host CPU address space
in any case due to API limitations), which means that for all practical
purposes, any data retrieved from the database is copied into anonymous
virtual memory (i.e. RAM + swap).  Assuming:

\begin{itemize}

\item we don't want to devote more than 50\% of the system RAM to
\fakefs{}, and

\item \fakefs{} may require two data items in memory at any given time
for each process,

\end{itemize}

it follows that we want a maximum value data size of 25\% of system RAM,
probably much smaller.  For a machine with 768MB of RAM running 5 \fakefs{}
processes, this works out to about 38MB.  For a machine with 96MB of RAM,
this works out to 4MB.

Note that we can tell Berkeley DB to write data directly to an
\code{mmap()}ed file, which saves us the RAM of one copy of the data.
This is usually what we want to do with the data anyway.

\section{If programmers built cars\ldots}

Berkeley DB must never experience an application crash, especially if a
transaction is in progress.  If an application using \fakefs{} should
crash, all processes using the same \fakefs{} data store will have to be
aborted and the \fakedmd{} process restarted.

On the other hand, Berkeley DB provides a robust transaction-oriented
database back-end, which provides the benefits of a physical journaling
file system.

\section{Interactions between Transaction and non-Transaction Environment}

Some files in the \fakefs{} database are stored outside of Berkeley DB and
therefore are not under transaction protection (see section \ref{db-tact}).
For these files, a minimalist logging system must be implemented to complete
I/O operations that manipulate external files (see section
\ref{ext-file-log}).

\section{\ldpreload{} Hacking}

\code{fakeroot} provides a functional example of an \ldpreload{} hack
that is actually useful; however, \fakefs{} must do a little more to
virtualize all of the filesystem operations.  Also note the semi-elegant
way that \code{fakeroot} handles a very inelegant situation:  executing
a setuid program.

Database access should not be implemented inside the processes running
under \ldpreload{} conditions, due to the heroic efforts required to keep
the Berkeley DB implementation hidden from the process that it is sharing
symbol table, address space and file descriptors with.  This is hard
enough with the plain Berkeley DB version 3 \code{.so} library and
\code{libnss\_{}db.so} without any \ldpreload{} hacking at all.  Importing
all those symbols into arbitrary programs, some of which may then turn
around and attempt to use Berkeley DB, it just \emph{asking} for trouble.

This is the motivation behind \code{fakefsd}---to put all of
that goo in its own process, and simply communicate with it from
the \ldpreload{}-hacked binary using a message-passing interface.
See section \ref{fakefsd}.

\chapter{Database Contents and Organization}
\label{Database}

\section{Tactics}
\label{db-tact}

The database uses Berkeley DB Transactional Data Store for storing the
\fakefs{} virtual filesystem metadata.  The BD-TDS supports concurrent
access, transactions, and the ACID database properties.

The text of small files can be stored in a Berkeley DB table as simple
key values.  Larger files are stored in a directory structure outside
of Berkeley DB.

\section{Physical Filesystem Requirements and Limitations}

Filenames must consist of lowercase letters, numbers,
periods (\code{.}), dashes (\code{-}), and underscores
(\code{\_{}}).  

No file's name should ever begin with a dash, begin or end with a period,
or contain multiple consecutive periods.

Individual components of a filename should be kept shorter than 50
characters.  No filename should be longer than 100 characters.  No
directory should contain more than 100 entries.

Hard links and symlinks are prohibited unless they can be either removed
or replaced with copies without ill effect.

These restrictions are intended to prevent file naming on the physical
filesystem from causing problems with Unix shells, broken backup software,
braindead physical filesystems, and untrained human operators.

\section{Non-Limitations}

\fakefs{} should be LFS-compliant.  It may be asked to store more than
$2^{31}$ bytes of data.

Note that Berkeley DB will not be used to store objects larger than
$2^{31}$, probably nothing much larger than $2^{20}$.  After this point,
Berkeley DB becomes no more space-efficient than the underlying
filesystem, and accessing data items begins to involve RAM-hungry large
memory copies.

\section{Directory Structure}

The top level of a \fakefs{} filesystem's physical directory structure
contains the configuration file \code{fakefs.conf} (see section
\ref{fakefs-conf}), the lock file \code{fakefs.lock} (see section
\ref{fakefs-lock}) and the following subdirectories:

\subsection{\code{db}}

The Berkeley DB `home directory.'  This contains the Berkeley DB
\code{\_{}\_{}db.???} shared memory regions, and the following
subdirectories:

\subsection{\code{log}}

The Berkeley DB log directory.

\subsection{\code{data}}

The Berkeley DB data directory.  This is further subdivided:

\subsubsection{\code{data/main}}

The data tables for the built-in components of \fakefs{}.

\subsubsection{\code{data/plugins}}

The Berkeley DB database files for the plugin components of \fakefs{}.
If a plugin component requires only one file, it can place this file
directly under the `plugins' directory, with the same name as the plugin.
If a plugin component requires more than one database file, it must place
all of its files under a directory with the same name as the plugin.

\subsection{\code{blobs}}
\label{blob-naming}

\FIXME{Should really be \code{data/plugins/file/blobs}}.

The \code{blobs} directory is used for the storage of data objects
that are too large to be stored in Berkeley DB.  Files in this directory
have names generated by \fakefs{}, usually consisting of Hash values
(see section \ref{hashtype}) encoded as pairs of hexadecimal digits,
split into some number of subdirectories.

See section \ref{dirhash} for details on directory hashing.

\section{Types}
\label{db-types}

The following data types appear in the database.  

Each type has a one-byte identifier associated with it, listed in brackets
after the type name.

\subsection{Inode (\code{I})}

An Inode is an integer-like type (but not necessarily an integer---it
has constant size and is sortable, but might be a tuple of \code{int}s
or a \code{long long}) which represents an inode (lowercase I) in
the virtual POSIX filesystem.  All virtual filenames represent a path
through directories to an Inode, just like POSIX.

Inodes are often used as keys in key-value pairs in databases.  In this
document, the term `Inode' refers either to an Inode itself, or to some
value of another type in a database indexed with an Inode key.
Which meaning is actually used should be clear from context.

\subsection{Epoch (\code{E})}

An Epoch is an integer-like type which represents an instant in time.
There exists a value of Epoch called \code{EPOCH\_{}MAX}, which is
`later' than any other possible Epoch---past, present, or future (think
`\code{MAXINT}' here).

\subsection{\inoepo{}}

Epochs are commonly associated with Inodes, so often that they deserve
special mention as a distinct type:  an \inoepo{} pair.

In an \inoepo{} pair, the Epoch is the latest instant at which
the data associated with the Inode had a particular value.  In other
words, it is based on the time when the \emph{next} version of the
data associated with the Inode was created.  This allows for retrieval
of Inode data consistent with a given time $T$ using the Berkeley DB
\code{DB\_{}SET\_{}RANGE} flag with a key of `(Inode, $T$)'.

An \inoepo{} is distinguished from a plain Inode by its length.  Inodes
and Epochs are both constant-length strings, and they have non-zero
length, so this is not ambiguous.

An Inode (without Epoch) evaluated in \inoepo{} context is considered
equivalent to $(Inode, \code{EPOCH\_{}MAX})$.

\subsection{Hash (\code{H})}
\label{hashtype}

A Hash is a binary string type, typically 128-160 bits (16-20 bytes) in
length, which is the idempotent result of a cryptographically-strong
hash function applied to a file stored in \fakefs{}.  

A Hash must have the property that \fakefs{} will never contain any two
files with different contents but with the same value for Hash.

The function should have the property that any two files with identical
contents will have the same value for Hash, but this is not strictly
necessary.  If the Hash is non-idempotent, storage efficiency will
decrease, and paranoid checking \emph{must} be turned off (see section
\ref{paranoid-readback}), but there will be no other ill effects.
See section \ref{hashes}.

A Hash should never be more than 127 bytes long, because \fakefs{}
often assumes that it can store the length of certain strings, including
Hashes, in a \code{signed char}.  A Hash can only be a multiple of 8
bits long---if your favorite hashing algorithm produces non-multiples
of 8 bits of output, it will have to pad its output appropriately.

\subsection{Stat (\code{S})}

A Stat is a collection of attributes associated with an Inode---or,
from the user's point of view, a \emph{virtual} file.  

The following have their usual POSIX or Unix meanings:

\begin{itemize}
\item gid (numeric)
\item link count (this may not be physically stored, but implied by other data)
\item mtime, atime, ctime
\item uid (numeric)
\item mode (only the portable bits:  $07777$)
\item size (in bytes, 64 bits)
\item type (file, directory, pipe, socket, character, block, or symlink)
\item major, minor (for device special files only)
\item target name (for symlink special files only)
\end{itemize}

In addition, a Stat contains the following attribute:

\begin{itemize}
\item hash (Hash of contents of the file, if it is non-zero size)
\end{itemize}

\subsection{Reference (\code{R})}

A Reference is an identifier for some other object.  It is mostly used
by the garbage collector (see section \ref{garbage}) to figure out which
parts of the database depend on which other parts.  

It is also used to prevent circular Hash references (where the contents
of a file $A$ are generated by applying a transform to file $B$, but
the contents of file $B$ are in turn generated by applying a transform
to file $A$, so neither $A$ nor $B$ can be generated at all).

A Reference consists of the following catenated together:

\begin{itemize}

\item One byte $T$ indicating the type of object referenced,

\item One byte $L$ indicating the length of the name (key) of the object,

\item $L$ bytes for the name of the object.

\end{itemize}

In cases where an object references multiple objects, the individual
Reference objects as described above are catenated together.  As a
convenience, the object type 0 (\code{NUL}) is not allowed, and can be
used as a delimiter between a list of References and other data.

All object types between 65 and 90 (ASCII `A' through `Z') are reserved
for use by \fakefs{}.  Wherever possible, object types should be allocated
dynamically by plugins.  A registry is provided for this purpose.

\section{Tables}

The database consists of the following tables:

\subsection {Ownership Tables}

\subsubsection{\code{masters}}

\begin{tabular}{r l}
Type: & btree \\
Key: & Reference \\
Value: & Reference \\
\end{tabular}

This table maintains ownership information in the forward direction.
A key-value pair in this database indicates that the object in the key
owns the object(s) in the value.

\subsubsection{\code{slaves}}

\begin{tabular}{r l}
Type: & btree \\
Key: & Reference \\
Value: & Reference \\
\end{tabular}

This table maintains ownership information in the reverse direction.
A key-value pair in this database indicates that the object(s) in the value
are owned by the object in the key.

\subsubsection{\code{clients}}

\begin{tabular}{r l}
Type: & btree \\
Key: & PID \\
Value: & Reference \\
\end{tabular}

This table maintains ownership information for objects that are owned
by clients (\fakefs{} user processes).  \FIXME{Why is this useful?}

\subsubsection{\code{cache-closed}}

\begin{tabular}{r l}
Type: & btree \\
Key: & Epoch \\
Value: & List of Reference to \inoepo{} or Hash \\
\end{tabular}

Lists files in the cache by their last usage Epoch.  If a file is
currently open it is removed from this table.  This table is used
to delete unused items in the physical file cache.

\subsubsection{\code{cache-open}}

\begin{tabular}{r l}
Type: & btree \\
Key: & Reference to Inode or Hash \\
Value: & Epoch \\
\end{tabular}

Lists files in the cache by their Reference.  Files appear in this
table whether they are open or not.

\subsection{Namespace Tables}

\subsubsection{\code{roots}}

\begin{tabular}{r l}
Type: & btree \\
Key: & filesystem name (text string) \\
Value: & Inode \\
\end{tabular}

This (very small) database maps from filesystem names (which can be
arbitrary text strings, such as `\code{/mnt/virtualfs}' or `\code{My
Data Store}') to Inode numbers.  These are the roots of the virtual
filesystem(s) in a \fakefs{} database.

\subsubsection{\code{pie-in}}

\begin{tabular}{r l}
Type: & btree \\
Key: & \inoepo{} \\
Value: & List of (child-Inode, Name) \\
\end{tabular}

This table maps names within an \inoepo{} to Inodes.  Think `POSIX path
search function.'

Note:  \fakefs{} allows the same Inode to be both a directory (a
namespace, or interior node in a tree structure) and a file (a leaf node
in a tree structure).  POSIX does not.  This property might be useful
for something later, and it would have to be prevented explicitly if it
was not desired.

\subsubsection{\code{ie-pin}}

\begin{tabular}{r l}
Type: & btree \\
Key: &\inoepo{}\\
Value: & List of (parent-Inode, Name) \\
\end{tabular}

This table maps \inoepo{} pairs to the set of all their POSIX paths
and file names.  It is the inverse of the POSIX path search function,
and can be used to efficiently derive the names of all hard-links
to a given Inode.  

`Name' corresponds to the name in the directory `parent-Inode' which
points to the Inode in the key of the database key-value pair.

Thus the full paths and file names of all links to any given Inode are the
list obtained by appending the Name in this record to all of the paths
obtained by determining the names of the corresponding parent-Inodes.
The name of a directory `parent-Inode' can be derived by determining
the name associated with the record in this table corresponding to
parent-Inode, recursively, until the root of the virtual filesystem
is found.

\subsection{Other Tables}

\subsubsection{\code{freei}} 

\begin{tabular}{r l}
Type: & btree \\
Key: & pair of Inodes \\
Value: & character (see below) \\
\end{tabular}

Each key in this database is the beginning and end of a range of free
Inode numbers.  The corresponding value is a single character as follows:

\paragraph{F (Free)}

Free inode numbers are those that are not used by any version of any
Inode in the virtual filesystem.  Note that there must always be at
least one Free record in this database, or no new Inodes can be allocated.

\paragraph{R (Reserved)}

As far as we know, the only Reserved inode number is 0.

\subsubsection{\code{epochs}}

\begin{tabular}{r l}
Type: & btree \\
Key: & Epoch \\
Value: & None \\
\end{tabular}

Each key in this database represents an Epoch that exists in the
filesystem.  The list of keys can be used as a hint when selecting an
Epoch for viewing old versions of the filesystem.

The values are zero-length strings.

\subsubsection{\code{misc}} 

\begin{tabular}{r l}
Type: & btree \\
Key: & one of the constant strings enumerated below \\
Value: & see below \\
\end{tabular}

This table contains data that must be transaction-protected but does
not belong anywhere else.  The keys are all strings as follows:

\paragraph{Key:  Last-Inode}

Value:  The last Inode number allocated in the virtual filesystem.
This number is always incremented until it is higher than any "Free"
record range in the filesystem's \code{freei} table.  

Note that `increment' is applied as if the Inode was an
arbitrary-precision integer in MSB byte order, regardless of what the
underlying data type of Inode is.  If any invalid values of Inode are
generated by this algorithm, they are skipped.

\paragraph{Key:  FirstGC}

Value:  The oldest \inoepo{} tuple in the database.  This Inode is
deleted when garbage-collection occurs, and the value of this record is
updated to the NextGC from the deleted inode. 

\FIXME{This might be redundant, depending on how \code{masters} and 
\code{slaves} are structured.}

\paragraph{Key:  LastGC}

Value:  The newest \inoepo{} tuple in the database.  Whenever a 
new \inoepo{} tuple is entered in the database, the NextGC attribute
associated with the Inode as well as the value of LastGC itself are both
updated to point to the newly created \inoepo{}.

\section{Cache}

The physical file cache area (see section \ref{cache-config})).  This
is further subdivided into two directories:

\subsection{\code{inode}}

Cache files with names based on their Inode, opened for read-write access.
See section \ref{dirhash} for naming details.

\subsection{\code{hash}}

Cache files with names based on their Hash, opened for read-only access.
Also used for non-current Inodes, which can only be opened read-only.
See section \ref{dirhash} for naming details.

\chapter{File Insertion}
\label{file-insert}

\FIXME{This chapter is broken.}

Inserting a file into the database is conceptually a search
of a `representation space' for the file.  

For any given file there are many reasonable representations, such as:

\begin{itemize}

\item The file, verbatim

\item The output of the \code{zlib} algorithm on the file

\item The output of the \code{bzlib} algorithm on the file

\item The output of \code{xdelta delta} combining the previous file with
the same filename and a delta\ldots{}

\item \ldots{} further processed with \code{zlib} or \code{bzip2}

\item The output of \code{xdelta delta} combining the previous file with
the same inode, or a recently accessed file in the same directory.

\item Any one of the above filtered through GnuPG.

\end{itemize}

There are also many unreasonable representations, such as:

\begin{itemize}

\item The output of the \code{zlib} algorithm applied to the output of
the \code{bzlib} algorithm on the file

\item The output of the \code{bzlib} algorithm applied to the output
of GnuPG applied to the file.

\end{itemize}

We want to get the reasonable representations and avoid the unreasonable
ones.  We also want to explore possibilities such as replacing an existing
file with a reverse delta.

Since all of the representations are transformations of other
representations, it makes sense to arrange them in a tree.  Thus:

\begin{itemize}

\item cat reads encryption

\item encryption reads compression algorithms

\item encryption reads deltas

\item compression algorithms read deltas

\item compression algorithms read input files

\item deltas read input files

\end{itemize}

or

\begin{itemize}

\item input files feed deltas

\item input files feed compression algorithms

\item deltas feed compression algorithms

\item compression algorithms feed encryption

\item encryption feeds cat

\end{itemize}

The configuration file for this looks like:

\begin{verbatim}
# transform INPUT-TYPE OUTPUT-TYPE NAMES...
# endpoints INPUT-TYPE OUTPUT-TYPE 

# Note NAMES are tried in the order listed.
# If no NAMES are given, the transform is an identity transform.
# Identity transforms are handy (necessary!) if more than one
# output type can be an endpoint.

# Where we start and where we want to go:
endpoints input storage 

# Use this for encryption:
endpoints input encrypted-storage 

# Identity transforms
transform input storage         
transform delta storage         
transform compressed storage    

# Use these for encryption
transform compressed encrypted          gpg
transform delta encrypted               gpg
transform input encrypted               gpg
transform encrypted encrypted-storage   

# Compression algorithms (in desired search order)
transform input compressed      bzlib zlib bzip2 gzip
transform delta compressed      bzlib zlib bzip2 gzip

# Delta compression
transform input delta           xdelta

# Storage and retrieval methods
# (try them all in order until one succeeds)
storage berkeley
storage file
storage some-plugin-or-other
\end{verbatim}

\subsection{Score Objects}

Scores are sortable objects that support an addition operator.
Scores contain, and are ordered by:

\begin{enumerate}

\item total size of new files (increasing order), 

\item the number of files (increasing order), 

\item then computation time (decreasing order).

\end{enumerate}

There is a value for a score object which is ordered after all legitimate
values of score (i.e. all possible score values that could be derived
from real data are better than this score).

Score values can be summed to generate the score of an aggregation
of files.

\subsection{Plan Objects}

Each plan object is:  a list of file handle objects, a score object, a
commit method, a destructor, and supporting transform-algorithm-specific
data.

\subsection{Representation Search}

In general, the search follows a directed acyclic graph (DAG),
where multiple forward paths from a given node indicate a choice of
equivalent representations.  The DAG can be logically traversed from
input to output or output to input; however, regardless of logical
traversal order, the transformations must be applied in order from input
to output for storage and output to input for retrieval.  No searching
is required for retrieval, as the sequence of operations for retrieval 
will be determined in advance by the result of the search for storage.

\begin{enumerate}

\item \fakefs{} searches from input types to output types, forming a
list of transforms corresponding to paths through the DAG.  Each path
from input to output is traversed, and in the process a series of nested
plan objects is generated by each transform.

\begin{enumerate}

\item If no plan object is generated, recursion terminates at this
transform function.  This is not an error unless no plan object is
generated after traversing all paths through the DAG.

\item If one or more plan objects are generated, their files will be
fed into the next step in the path.

\end{enumerate}

\item Any candidate which is the result of a transform at the
leaves of the tree has its score computed, and that score aggregates
up the branches of the tree.  The candidate with the best final score
is kept.  All others are destroyed.

\end{enumerate}

Short-cuts are not possible; a delta compression output may be larger
than a compressed file output but a compressed delta compression output
may be smaller.

Each transform is given the score of the current `best' representation at
the same recursion depth (if any), the Inode (if any) of the file being
inserted, and the file input of the transform.  Transforms may inspect
\fakefs{} contents during transformation, e.g. to try to find likely
co-inputs for delta computation.  The return value of the transform is
a candidate (a list of file outputs) which is then recursively fed to
higher-depth transforms as input.

Note that recursion does not occur within the transformation
functions---it happens around them, within the \fakefs{} core.

When \fakefs{} chooses a representation and stores its files, it also
notes dependencies between representations in its reference-counting
tables.

If the list of candidates generated by a transform is empty, the
transform is considered to have failed, recursion is terminated, and
a sibling transform is tried.  This is the normal case if all possible
representations generated by a transform have poorer scores than were
known when the transform was attempted.

In case of success, each file is looked up in \fakefs{} and any files
already known to \fakefs{} are skipped.  

The existence of an `identity storage' transform which accepts the
input endpoint as \var{INPUT-TYPE} and produces the output endpoint as
\var{OUTPUT-TYPE} assures that there is always at least one storage
candidate.  

It is an error if no storage candidate emerges after all transforms
are tried.

\chapter{Configuration}
\label{Extensions}

\section{\code{fakefs.conf} configuration file}
\label{fakefs-conf}

The configuration file resides in the top level of the \fakefs{}
directory structure.  

The file consists of a list of lines, which are broken down
into lists of words on whitespace boundaries.  The first word
is treated as the name of a configuration object.  The second
word is treated as the name of a method to be called on that object.
Remaining words are treated as arguments to the method call.

Blank lines are ignored.  Leading and trailing whitespace is ignored.

Lines whose first word begins with `\#' are ignored.

\section{Non-Extensions}

Before discussing extensions, it helps to discuss the built-in commands:

\subsection{\code{source} \var{CONFIG}}

Reads the file \var{CONFIG} as if its text appeared at this point in
the configuration file.

\subsection{\code{load} \var{PLUGIN}}

Load the named plugin into memory.

\subsection{\code{plugin} \var{PLUGIN} \code{command\ldots{}}}

Executes \textsl{command\ldots{}} defined by \code{PLUGIN}.  This is 
the general way to configure options defined by plugins.

\subsection{\code{umask} \var{MASK}}

Sets the process `umask' to \var{MASK} for all file creation related to
the underlying physical filesystem.  Note this does not apply to files
either inside the \fakefs{} virtual filesystem, nor outside the \fakefs{}
physical filesystem.

\subsection{\code{dirhash} \var{DEPTH SIZE}}
\label{dirhash}

Specifies the depth of directory hashing.  This is used for physical
filesystems that cannot handle huge numbers of files in a single
directory.

When generating a filename from a binary string such as a Hash, 
for a \var{DEPTH} of $N$, the first $N$ groups of \var{SIZE} hexadecimal
digits are used as nested directory names.

For example, at \var{DEPTH} $3$ and \var{SIZE} $1$, the
empty file, using an MD5 hash, would have a filename like
`\code{d/4/1/d41d8cd98f00b204e9800998ecf8427e}'.

\subsection{\code{paranoid} \var{WHAT ON/OFF}}
\label{paranoid-readback}

Specify three paranoia checks:

\subsubsection{\code{paranoid storage} \var{ON/OFF}}

Immediately attempt to retrieve any file after storing it, and compare
it byte for byte with the input file before deleting the input file.

\subsubsection{\code{paranoid retrieval} \var{ON/OFF}}

Verify the hash against the expected value of all files as they are
retrieved.

\subsubsection{\code{paranoid duplicates} \var{ON/OFF}}

When a file is to be added to the database, and its hash already exists,
retrieve the file in the database and do a byte-for-byte comparison with
the incoming file.  

It's not clear what should happen if a hash collision does occur.
Here are some probably causes and remedies:

\begin{itemize}

\item The most probable cause of hash collision is some kind of data corruption,
usually due to hardware or software problems.  

\item Hash collision can also occur if hashes are truncated to too
short a length, or if a poor hash algorithm is used (MD5, SHA1, or a
similarly strong hash function capable of being used for cryptographic
authentication are recommended).  In this case the hash length should be
increased, or the failing Hash function replaced.  The entire database
will have to be rebuilt as a side-effect---see section \ref{hash-fubar}.

\item A hash collision between two full-length supposedly
`cryptographically strong' hashes is almost certainly a fatal failure
of the hashing algorithm, on which the entire \fakefs{} database is
based (not to mention a number of cryptosystems, assuming well-known
cryptographic hash algorithms are used).

\end{itemize}

\section{Garbage Collection Policy}
\label{garbage}

Without a garbage collection policy, the \fakefs{} database will grow
in physical size without bound until all disk space is consumed, even
if files in the virtual filesystem are apparently removed (e.g. by
\code{unlink} or \code{rename}).

Different garbage collection plugins implement different garbage
collection policies.  See also the garbage collection daemon
\ref{fakegcd}.

There are two parts to garbage collection policy:  identification of
garbage in the database, and scheduling garbage removal.

\subsection{Garbage Collection Scheduling Policies}

These policies specify when garbage in a \fakefs{} database is 
collected automatically.  

\subsubsection{Resource-Demand Garbage Collection Scheduling}
\label{resource-demand}

This policy performs garbage collection when the amount of some resource
falls below (or grows above) some limit.  Some example policies:

\begin{itemize}

\item Collect garbage when the amount of free space on the physical
filesystem where \fakefs{} resides falls below some threshold.

\item Collect garbage when the amount of space occupied by \fakefs{}
on the physical filesystem where \fakefs{} resides grows above some
threshold.

\end{itemize}

This policy is supported by invoking the garbage collector whenever
\fakefs{} has to consume disk space.

\subsubsection{Interval Garbage Collection Scheduling}

This policy performs garbage collection at arbitrary time intervals.
This policy is supported by the \code{fakegcd} (see section
\ref{fakegcd}).

\subsubsection{Manually Initiated Garbage Collection Scheduling}

This policy performs a single pass of the garbage collector.  It is
initiated explicitly by the user.

\subsection{Garbage Collection Coverage Policies}

These policies specify which data in a \fakefs{} database is `garbage',
subject to automated removal, and which data should not be removed
automatically.

\subsubsection{Nothing is Garbage}

The simplest garbage collection coverage policy is not to remove any data
from the \fakefs{} database at all (or at least not \emph{automatically}).
This is desirable when a \fakefs{} is used as a permanent data repository,
or when the entire \fakefs{} will be destroyed before it grows larger than
its physical filesystem.

\subsubsection{Everything is Garbage}

At the other extreme, all data in \fakefs{} may be considered garbage.
This is useful when \fakefs{} is used to implement some kind of cache,
where data may disappear arbitrarily with no ill effect.

\subsubsection{Only Superceded Inodes are Garbage}

Under this policy, superceded Inodes are garbage, and current Inodes
are not.  Garbage is removed in LRU order.

This set of Inodes may be further subdivided by minimum and maximum age
parameters.  Anything older than the maximum age is removed regardless
of the state of free disk space.  Anything between the minimum and
maximum age is garbage if and only if there is demand for free space
(see section \ref{resource-demand}).  Nothing younger than the minimum
age is considered garbage.

If the maximum age is zero, the entire Inode versioning system may
be disabled.

\section{Hash Functions}
\label{hashes}

\subsection{\code{hash algorithm} \var{FUNCTION}}

Select \var{FUNCTION}, which is the name of a hash function previously
registered by some plugin, as the hash function that will be used by
\fakefs{}.

\subsection{\code{hash length} \var{LENGTH}}

Set the maximum length of the hash algorithm output to \var{LENGTH}
\emph{bytes}.  If the algorithm produces a longer hash, it will be
truncated.  This can be used to play risky games with size/reliability
trade-offs.

Note:  \emph{The output of the hash function is not necessarily constant
length.}

\section{File Transformation, Storage, and Retrieval}

See chapter \ref{file-insert} for details.

\subsection{\code{transform} \var{ALGORITHM INPUT-TYPE OUTPUT-TYPE}}

Defines a transform function using the \var{ALGORITHM} algorithm which
converts a file from type \var{INPUT-TYPE} to type \var{OUTPUT-TYPE}.
Algorithm names are registered by plugins.  

\var{INPUT-TYPE} and \var{OUTPUT-TYPE} are arbitrary strings, subject
to the restriction that it must be possible to traverse a complete DAG
between the endpoints (see next section) without encountering loops.

\subsection{\code{endpoints} \var{INPUT-TYPE OUTPUT-TYPE}}

Specifies the beginning and the end of search paths across the file
transformation DAG.

\subsection{\code{storage} \var{METHOD}}

Adds storage method \var{METHOD} to the list of algorithms that are tried
when storing (or retrieving) a file from the database.  Storage method
names are registered by plugins.

Storage methods are tried in the order they are specified in the 
configuration file.

\section{Cache Configuration}
\label{cache-config}

When a file is opened in \fakefs{}, a physical file is created outside of
the \fakefs{} filesystem, in a real physical filesystem under the host OS.
This real physical filesystem has a number of special properties:

\begin{itemize}

\item It must be large enough to store all files that will be accessed
simultaneously.

\item The larger the cache filesystem is, the more files can be accessed
quickly.

\item It must be accessible by processes accessing the \fakefs{}.  It is
possible that \fakefs{} itself may be less accessible than this cache,
e.g. by using \code{fakefsd} processes with privileges different than
their clients.

\end{itemize}

\subsection{\code{cache directory} \var{DIRECTORY}}

Locate the physical files in the cache at \var{DIRECTORY}.  Each file
will be named according to the hexadecimal ASCII encoding of its filename,
with some amount of directory hashing (see section \ref{dirhash}).

\subsection{\code{cache size} \var{SIZE}}

Set the minimum maximum size of the cache to \var{SIZE}.  If files are
closed and more than \var{SIZE} files are present in the cache directory,
the oldest closed file will be removed from the cache.

Note that the minimum size of the cache is the total size of all files
currently open, so the size limit given by this directive may be exceeded.

\chapter{Storage and Retrieval Methods}

A retrieval method translates a Hash into a file by retrieving the
contents of the file from some kind of stable storage.  A storage method
moves a file to stable storage and arranges for it to be retrieved later
by its Hash.

These are the storage methods provided with \fakefs{}.

\section{\code{berkeley}}

The \code{berkeley} storage method implements simple binary data storage for 
short ($< 2^{20}$ byte) files.  

This storage method uses the \code{data/plugins/berkeley} database,
storing files as key-value pairs using the Hash of the file as a key.

\section{\code{file}}

The \code{file} storage method implements simple binary data storage for 
long ($> 2^{20}$ byte) files.

This storage method uses the \code{data/plugins/file} database to
keep track of partially committed files.  

The file name is generated from the Hash as described in section
\ref{blob-naming} and appended to the \code{blobs} directory.
The physical file's contents are identical to the virtual file's contents.

\section{External Reference}

A \fakefs{} filesystem may contain data that is substantially similar
to data stored in another database.

An example of such a database is a \fakefs{} filesystem that contains
data stored in a revision control system such as CVS, or one that mirrors
a filesystem stored on an FTP or HTTP server.

Such storage methods may be asymmetrical---e.g. they may be able to
retrieve old data but not store new data.

There are no plans to create such a plugin for early versions of
\fakefs{}, but it would be interesting.  

Even more interesting would be to implement plugins for namespaces---this
would allow directories to be generated on the fly from external data.

\chapter{Transform Algorithms}

\section{\code{zlib} and \code{bzlib}}

The \code{zlib} transform implements in-memory text compression using the
\code{zlib} library.  It is most useful for short files, ranging in size
from 64 to 128K bytes.  Its overhead for storing data without compression
is relatively small.

The \code{bzlib} transform implements in-memory text compression using
the \code{libbz2} library, which outperforms the algorithm implemented
by \code{zlib} on some inputs.  \code{bzlib} is most useful for files
at least 256K in size; however, when \code{bzlib} \emph{doesn't} agree
with its input, it will add noticeable overhead---in extreme cases,
\code{bzlib} is much worse than verbatim storage.

For files between 128K and 256K bytes in length, \code{zlib} and
\code{bzlib} are evenly matched---they both succeed where the other
fails about 50\% of the time.  Depending on the user's needs, one can
optimize for speed by always using \code{zlib} up to 256K, or for size
by using both transforms and choosing the one with better compression
performance.

While these plugins theoretically work for larger files, the cost of
holding such objects in RAM usually outweighs the benefit, so they are
generally not used for objects larger than $2^{20}$ bytes.  For larger
files, use the \code{filter} transform with the external \code{gzip}
or \code{bzip2} programs.

In both cases the transform produces output which consists of a four-byte
integer, MSB first, followed by a binary data stream of arbitrary length.

The integer is the length of the decompressed data stream, while the
binary string is the compressed data stream.

\section{\code{filter} Meta-Transform}

The \code{filter} transform is not really a transform---it is a plugin
which dynamically generates transforms out of Unix commands.  This makes
it extremely versatile since it allows users to add any Unix filter as
a transform function.  By using the \code{plugin} command, filters can
be defined in the configuration file.

Note that two Unix commands (where `command' really means `executable
file name and argument list') are required to define a \code{filter}
transform:  one for forward transformation and one for backward!

\subsection{Compression Filter Transforms}

The most obvious filter transforms are \code{compress}, \code{gzip}
and \code{bzip2}, which use the respective Unix programs to implement
compression.

\subsection{Encryption Filter Transforms}

Another obvious filter transform is \code{gpg}, from the GnuPG package.
This is somewhat limited as \fakefs{} lacks direct support for the kind
of extra information that is required for encryption (e.g. passphrases
or security domains); however, it is useful for straightforward 
single-key encryption of all files in a \fakefs{} database.

Note that if this filter is used, every file stored using the filter will
in fact be split into two file entries:

\begin{enumerate}

\item The output of GnuPG, and

\item A key in the database \code{db/plugins/filter/gpg}
whose value is a reference to the Hash of the aforementioned GnuPG output.

\end{enumerate}

\section{Delta Transforms}

\subsection{\code{delta} Transform}

The \code{delta} transform is not really a transform---it is a plugin
which generates transforms.  This makes it extremely versatile since it
allows users to add any Unix program as a transform function.  By using
the \code{plugin} command, delta filters can be defined in the
configuration file.

Note that two Unix commands (where `command' really means `executable
file name and argument list') are required to define a \code{delta}
transform:  one to turn two files $A$ and $B$ into a delta $D$, and
the other to turn $A + D$ or $B + D$ into $B$ or $A$ respectively.

\begin{itemize}

\item \code{diff -a} and \code{patch -efs}

\item \code{xdelta delta -0 --pristine} and \code{xdelta patch --noverify}

\end{itemize}

Notes on \code{xdelta}:  Verification is not required within \code{xdelta}
itself since \fakefs{} will verify inputs and outputs against their
respective Hashes anyway.  \fakefs{} compression is possibly stronger than
\code{xdelta}'s, but not necessarily (it depends on whether \code{xdelta}
uses copied data for compression).

Note that it is highly unlikely that \code{diff}-based patches will ever
really be used, as \code{xdelta} generally produces smaller output without
the text/binary headaches (\code{diff -a}, eh?).

\section{Mux/Demux Transforms}

\subsection{Split Transform}

The `split' transform simply converts a file into an ordered list of
non-overlapping smaller files.  To reconstruct the file, the smaller
files are catenated together in the order they appear in the list.

This can be used to get around such annoyances as $2^{31}-1$-byte file
size limits in the physical filesystem implementation under \fakefs{},
or to compactly express a delta between two files where one file is
identical to the other except for some appended data.

\chapter{\fakedmd{}:  Database Manager Daemon}
\label{dmd}

\fakedmd{} takes care of various database administrative procedures,
and recovery of state after a crash.

\section{Startup Procedure}
\label{fakefs-lock}

\subsection{\code{libfakefs} launches \fakedmd{}}

Whenever \code{libfakefs} starts up, it attempts to get a \code{fcntl}
exclusive lock in non-blocking mode on the file \code{fakefs.lock} in
the \fakefs{} database directory.  In non-blocking mode, the lock will
either fail immediately (indicating that \fakedmd{} is already running),
or it will succeed.  In the latter case, \code{libfakefs} releases the
lock, \code{fork}s a child process, and that child becomes the \fakedmd{}
daemon.

\code{libfakefs} will try this up to $N$ times, with a delay of $T$ 
seconds between tries, before it gives up.  Probably $N = 3$ and $T = 1$.

\subsection{\fakedmd{} acquires \fakefs{} master exclusive lock}

At startup, \fakedmd{} attempts to acquire a lock \code{fakefs.lock}
the same way that \code{libfakefs} did.  If this now fails, it indicates
that either another \fakedmd{} was started at the same time, or that
another \code{libfakefs} is currently executing the startup sequence
and will start a \fakedmd{} in the near future.

\subsection{Berkeley DB Startup}

Once \fakedmd{} has acquired an exclusive lock on \code{fakefs.lock},
it forgets about this file descriptor until it exits.

Next, \fakedmd{} opens the Berkeley DB environment in \code{RECOVERY}
mode.

Assuming this succeeds, \fakedmd{} creates some DB threads:

\begin{enumerate}

\item deadlock detector

\item checkpoint generator

\item \code{fakegcd} (see section \ref{fakegcd})

\end{enumerate}

\subsection{Cache Startup}

\fakedmd{} checks all files in the cache directory and all database
entries in the \code{cache-*} tables, and verifies:

\begin{itemize}

\item Filenames are valid (invalid files are deleted).

\item Files in the \code{hash} cache have the correct contents.

\item Files in the \code{hash} cache exist in the database (unknown files
are added to the \code{cache-*} tables, but not added to the permanent
\fakefs{} database).

\item Files in the \code{inode} database are either re-inserted into
the permanent \fakefs{} database or deleted, depending on policy
configured in \code{fakefs.conf}.

\item File entries in the \code{cache-*} tables that do not exist in
the phyisical cache are removed.

\end{itemize}

Note that the term `deleted' in this section simply means that the file
is no longer present in any `official' \fakefs{} directory.  The file
may in fact be simply renamed or moved to another location for user
inspection and manual recovery.

\subsection{Plugin Startup}

Plugins are loaded and given an opportunity to perform recovery procedures.
Most plugins use only the existing Berkeley database environment, so their
recovery is handled automagically by the Berkeley DB startup section.

\subsubsection{\code{file} Storage Method Cleanup:  Added Files}
\label{ext-file-log}

The \code{file} storage method checks the database
\code{data/plugins/file} for recently modified (added or removed) files.

The normal sequence of events for adding a file to the database is:

\begin{enumerate}

\item DB transaction begins.

\begin{enumerate}

\item Create the file---the details are outside of the scope of this
section.

\item Determine the Hash of the incoming file.

\item \code{fsync()} the incoming file (if not done already).

\item If not already so named, rename the file to its Hash-determined
name in the \code{hash} cache.

\item Determine that the file does not already exist in \code{blobs},
nor in \code{data/plugins/file}.  Obtain a write lock in the process
of verifying the latter.

\item Add the file to \code{data/plugins/file}, key = Hash, value =
`\code{A}' (for Add).

\item Commit the DB transaction.  

\end{enumerate}

If the transaction fails or is aborted, the file is not inserted into
the database at all.  Note that this DB transaction often involves
\emph{many} related changes to the database.

\item DB transaction begins.

\begin{enumerate}

\item Get a write lock on the recently added file in
\code{data/plugins/file}.

\item Copy or \code{link} the file to the \code{blobs} directory.

\item Delete the file's entry in \code{data/plugins/file}.

\item Commit the DB transaction.

\end{enumerate}

If the transaction fails or is aborted, the file is still not inserted
into the database, but another insertion attempt will be made when
recovery is run in the future.

\end{enumerate}

The recovery procedure consists of repeating the second part of this
sequence.

Note that in normal mode, the file's Hash has been `verified' (actually
it was empirically determined recently) and in recovery mode, the cache
recovery procedure verifies that all files in the cache have correct
contents and removes (either deletes or moves into a holding area)
those that don't.  Thus it is not necessary to do an extra validation
step in the second part of this sequence.

\subsubsection{\code{file} Storage Method Cleanup:  Deleted Files}

The normal sequence of events for deleting a file from the database 
is:

\begin{enumerate}

\item DB transaction begins.

\begin{enumerate}

\item Decide to delete the file---the details are outside of the scope
of this section, but include a checking reference counts and possibly
re-inserting files that depend on the file to be removed.

\item Determine the Hash of the outgoing file.

\item Determine that the file does already exist in \code{blobs},
and does not exist in \code{data/plugins/file}.  Obtain a write lock on
both in the process of verifying these.

\item Add the file to \code{data/plugins/file}, key = Hash, value =
`\code{D}' (for Delete).

\item Commit the DB transaction.  

\end{enumerate}

If the transaction fails or is aborted, the file is not deleted at all.
Note that this DB transaction often involves \emph{many} related changes
to the database.

\item DB transaction begins.

\begin{enumerate}

\item Get a write lock on the recently deleted file in
\code{data/plugins/file}.

\item Delete the physical file from \code{blobs}.

\item \code{sync();}

\item Delete the file's entry in \code{data/plugins/file}.

\item Commit the DB transaction.

\end{enumerate}

If the transaction fails or is aborted, the file is still not deleted,
but another deletiion attempt will be made when recovery is run in
the future.

\end{enumerate}

The recovery procedure consists of repeating the second part of this
sequence.

\chapter{\code{libfakefs}:  Application Interface}

\FIXME{The only chapter not written yet.}

\chapter{\ldpreload{}}
\label{Preload}

The \ldpreload{} interface to \fakefs{} allows a dynamically-linked C
executable to use the virtual filesystem through the normal POSIX
file-handling function calls.

In this chapter the process which has the \ldpreload{} interface installed
is called the `victim' process, because `victim' is a much more
\emph{accurate} designation than a mere `client' or even `slave.'

\section{File-Descriptor Juggling}

The first problem to solve is the mechanism by which communication
between \fakefs{} and the \ldpreload{} victim will be done.  Possible
candidates are:

\begin{itemize}

\item System V IPC (\code{sem*, shm*, msg*})

\item Unix domain sockets

\end{itemize}

If Unix domain sockets are employed, they will occupy file descriptors.
It will be necessary for the \ldpreload{} module to ensure that these
file descriptors are `juggled'---that is, some sleight-of-hand will
be employed so that the victim does not accidentally close or otherwise
damage these file descriptors.

It is still necessary to communicate an address to the victim; however,
environment variables are sufficient for this.

\section{\code{fakefsd}:  Database Proxy Server}
\label{fakefsd}

\code{fakefsd} is a proxy server used by the \ldpreload{} hack.  It is a
wrapper around the POSIX interface to \code{libfakefs}.  It implements
a multi-threaded (actually multi-process) server which retrieves and
stores files as requested by the \ldpreload{} client.  This is done
by a separate process in \fakefs{} in order to preserve some sanity
in the run-time environment.

\code{fakefsd} starts up a number of server processes, which then
compete for incoming requests.

\section{\code{fork}, \code{vfork}, \code{clone},
\code{pthread\_{}create}, and Friends}

The \ldpreload{} subsystem will track the process PID in order to 
detect unanticipated \code{fork}s.  If the PID changes, the 
communications system will be shut down and restarted.

Note that \code{fakefsd} will generally \code{fork} along with the
\ldpreload{} victim; however, there does not need to be a one-to-one
mapping between victim processes and \code{fakefsd}s.

\section{\code{rename} Optimization}
\label{posix-rename}

It would be nice to recognize the following sequence of POSIX primitives
as a modification of an existing file $B$:

\begin{enumerate}

\item Open file $A$ for writing, creating a new file in the process.

\item Write file $A$.

\item Close file $A$.

\item Rename file $A$ to existing file $B$.

\end{enumerate}

This is especially true if file $B$ was opened for reading just before
or during this sequence of events.

It is probably sufficient to assume that there will be no intervening
\code{open} or \code{chdir} events interleaved among these operations,
but there may be \code{stat}, \code{chmod}, \code{chown}, etc.

Note that what we really want to do is defer insertion of the written
file until the \code{rename} call, then replace $B$ itself with a 
reverse delta from $A$.

\section{What to do about \code{statfs}?}

\code{statfs} (and \code{fstatfs}) is a system call that returns
information about the size of and free space within a filesystem.
This is used by programs such as \code{df} to report some meaningless
statistics about a filesystem.

The question is:  what should \fakefs{} do about calls to \code{statfs}
on a \fakefs{} filesystem?

\subsection{Complications of \fakefs{}}

The answer is further complicated by these differences in behavior of
\fakefs{} compared to other filesystems:

\begin{itemize}

\item \fakefs{} can be configured to use many physical filesystems for
different purposes.  Space available for one purpose may not be available
for another.

\item \fakefs{} file data is usually compressed.  Compressing filesystems
are not new; however, a surprising amount of software is broken\footnote{
As a rule of thumb, application software that makes decisions based on
information returned by \code{statfs} without detailed knowledge of the
filesystem's structure and the behavior of other users of the filesystem
is \emph{broken}.

\code{statfs} is only useful in an advisory capacity---for example,
an interactive file download or software installation program may
\emph{advise} the user that there may \emph{seem} to be insufficient space
for some operation, but software should never \emph{refuse} to perform an
operation \emph{solely} because \code{statfs} reports insufficient disk
space.  This is true even on uncompressed filesystems---another process
may occupy or free space while the main application runs, which makes
any estimate of free space made at the beginning of the operation moot.

Note this sword cuts both ways:  \code{statfs} may report sufficient
space, but \code{quota} or other mechanisms may disagree.

Network file servers, especially those hosted on non-Unix operating
systems, may implement their own obscure behavior.  There is no way any
application can be made aware of the details of all such systems.
This only emphasizes the uselessness of \code{statfs} even further.

The only way to know in general whether you can write some data on a
filesystem is to try to write it.  If you were successful, then you
could write the data on the filesystem; otherwise, you couldn't.

Only the filesystem can authoritatively allocate free space in the
filesystem.  Applications can only make guesses based on incomplete and
obsolete data.  

Original Unix systems did not implement the \code{statfs} system call,
nor any equivalent.  On these systems, \code{df} is a setuid or setgid
program that reads filesystem superblocks and extracts the amount of
free space from them.  \code{df} was a tool to be used \emph{sparingly},
preferably not at all.

} by
\code{df} reporting pessimistic estimates for the amount of free space
available on a filesystem.  Reporting optimistic estimates can be just
as bad, as this could lead to an application overcommitting disk space.

\item The space occupied by a file in \fakefs{} is not a function of
the amount of data in the file.  If two or more files in \fakefs{} are
identical, all but the first will take up a small, constant amount of
space, regardless of the size of the file.

\item The copy-on-write nature of \fakefs{} means that if a large file
is modified, even without changing the size of the file, it may suddenly
occupy large amounts of previously free space if its data was previously
shared between files.

\item Even in the absence of sharing, \fakefs{} cannot modify part of a 
file.  Every change to a file requires inserting the entire new file.
Even if this results in a very small representation of the new file or
the old file it replaces, there must be sufficient temporary storage
space available for intermediate versions of the file.

\item Operations that traditionally do not consume space in a
filesystem, such as \code{open}ing and \code{read}ing files, do consume
space in \fakefs{}, in some cases permanently.

\end{itemize}

\subsection{Proposals}

\begin{enumerate}

\item We can be fairly sure how much space is \emph{occupied} in a
\code{fakefs}, if we merely track the total size of all the virtual files.
It may be surprisingly difficult to implement this using Berkeley DB, as
the total size variable will be a serious lock bottleneck.

\item Choose a physical filesystem (e.g. the filesystem where \code{blobs}
or \code{db} reside) and report the statistics from that filesystem,
possibly filtered through two `fudge factors'---a constant offset and
a scale factor.

\item Report constant values from the configuration file (``no matter
how much data you write, there's always two gigs free!'').

\item Report the total amount of garbage controlled by the garbage
collectior (see section \ref{garbage}) as free space.

\end{enumerate}

\chapter{\code{fakefs-nfsd}:  NFS Server}

This is a no-brainer.  

Note that the NFS protocol does not have explicit \code{open} or
\code{close} messages.  These will have to be guessed using timeouts.

\chapter{Garbage Collection}

\section{\code{fakegcd}:  Garbage Collector Daemon}
\label{fakegcd}

Asynchronous garbage collection is performed by the \fakefs{} garbage
collector daemon, \code{fakegcd}.  

\section{Ownership of Data}

All data in \fakefs{} is owned by one of these entities:

\begin{enumerate}

\item Root inodes (current version only) in the filesystem are owned.

\item Directory inodes (current version only) in the filesystem own files
and sub directory inodes, recursively, such that the entire filesystem
is ultimately owned by its root inode.

\item The garbage collector owns non-current versions of inodes.  
These inodes are stored in a forward-linked-list, oldest first.

\item Inodes (current and non-current) own Hashes.  

\item Hashes own other Hashes that they depend on.

\item Various processes own all Inode and Hash objects that are
currently open.  These are recorded by process-id in \code{data/clients}.

\end{enumerate}

Circular ownership is an error.

\chapter{FIXMEs}

The following are known problems with this document.  Anything
written here supercedes the stuff above.  Also grep the \LaTeX{}
source for \code{FIXME}.

Need chapters on plugin internals and some data types.

Virtual filesystem mountpoints (mostly for \ldpreload) and mount options
(atime, ro etc).

\end{document}

