Some notes about Mercurial's design
The fundamental storage type in Mercurial is a "revlog". A revlog is
the set of all revisions to a file. Each revision is either stored
compressed in its entirety or as a compressed binary delta against the
previous version. The decision of when to store a full version is made
based on how much data would be needed to reconstruct the file. This
lets us ensure that we never need to read huge amounts of data to
reconstruct a file, regardless of how many revisions of it we store.
In fact, we should always be able to do it with a single read,
provided we know when and where to read. This is where the index comes
in. Each revlog has an index containing a special hash (nodeid) of the
text, hashes for its parents, and where and how much of the revlog
data we need to read to reconstruct it. Thus, with one read of the
index and one read of the data, we can reconstruct any version in time
proportional to the file size.
Similarly, revlogs and their indices are append-only. This means that
adding a new version is also O(1) seeks.
Generally revlogs are used to represent revisions of files, but they
also are used to represent manifests and changesets.
A manifest is simply a list of all files in a given revision of a
project along with the nodeids of the corresponding file revisions. So
grabbing a given version of the project means simply looking up its
manifest and reconstruction all the file revisions pointed to by it.
A changeset is a list of all files changed in a check-in along with a
change description and some metadata like user and date. It also
contains a nodeid to the relevent revision of the manifest. Changesets
and manifests are one-to-one, but contain different data for
Nodeids are unique ids that are used to represent the contents of a
file AND its position in the project history. That is, if you change a
file and then change it back, the result will have a different nodeid
because it has different history. This is accomplished by including
the parents of a given revision's nodeids with the revision's text
when calculating the hash.
Nodeids are implemented as they are to simplify merging. Merging a
pair of directed acyclic graphs (aka "the family tree" of the file
history) requires some method of determining if nodes in different
graphs correspond. Simply comparing the contents of the node (by
comparing text of given revisions or their hashes) can get confused by
identical revisions in the tree.
The nodeid approach makes it trivial - the hash uniquely describes a
revision's contents and its graph position relative to the root, so
merge is simply checking whether each nodeid in graph A is in the hash
table of graph B. If not, we pull them in, adding them sequentially to
Branching and merging:
Everything in Mercurial is potentially a branch and every user
effectively works in their own branch. When you do a checkout,
Mercurial remembers what the parent changeset was and uses it for the
next check in.
To do a merge of branches in Mercurial, you check out the heads of the
two branches into the same working directory which causes a merge to
be performed, and then check in the result once you're happy with it.
The resulting checkin will have two parents.
It decides when a merge is necessary by first determining if there are
any uncommitted changes in the working directory. This effectively
makes the working directory a branch off the checked in version it's
based on. Then it also determines if the working directory is a direct
ancestor or descendent of the second version we're attempting to
checkout. If neither is true, we simply replace the working directory
version with the new version. Otherwise we perform a merge between the
Merging files and manifests:
We begin by comparing two versions manifests and deciding which files
need to be added, deleted, and merged.
Then for each file, we perform a graph merge and resolve as above.
It's important to merge files using per-file DAGs rather than just
changeset level DAGs as this diagram illustrates:
M M1 M2
|`-------v M2 clones M
aB AB file A is change in mainline
|`---v AB' file B is changed in M2
| aB / | M1 clones M
| ab/ | M1 changes B
| ab' | M1 merges from M2, changes to B conflict
| | A'B' M2 changes A
| a'B' M2 merges from mainline, changes to A conflict
??? depending on which ancestor we choose, we will have
to redo A hand-merge, B hand-merge, or both
but if we look at the files independently, everything
The result is a merged version in the working directory, waiting for
When performing a commit or a merge, we order things so that the
changeset entry gets added last. We keep a transaction log of the name
of each file touched and its length prior to the transaction. On
abort, we simply truncate each file to its prior length. This is one
of the nice properties of the append-only structure of the revlogs.
We can also reuse this journal for "rollback".
Merging between repositories:
One of the key features of Mercurial is the ability to merge between
independent repositories in a decentralized fashion. Each repository
can act as a read-only server or a client. Clients operating by
pulling all branches that it hasn't seen from the server and adding
them into its graph. This is done in two steps: searching for new
"roots" and pulling a "changegroup"
Searching for new "roots" begins by finding all new heads and
searching backwards from those heads to the first unknown nodes in
their respective branches. These nodes are the 'roots' that are used
to calculate the 'changegroup': the set of all changesets starting at
those roots. Mercurial takes pains to make this search efficient in
both bandwidth and round-trips.
Once the roots are found, the changegroup can be transferred as a
single streaming transfer. This is organized as an ordered set of
deltas for changesets, manifests, and files. Large chunks of deltas
can be directly added to the repository without unpacking so it's