Commits

Bryan O'Sullivan  committed 75c076c

More concepts stuff.

  • Participants
  • Parent commits 1b67dc9

Comments (0)

Files changed (3)

 	kdiff3.png \
 	metadata.svg \
 	mq-stack.svg \
+	snapshot.svg \
 	tour-history.svg \
 	tour-merge-conflict.svg \
 	tour-merge-merge.svg \

File en/concepts.tex

   \label{fig:concepts:metadata}
 \end{figure}
 
-Note that there is not a ``one to one'' relationship between revisions
-in these different metadata files.  If the manifest hasn't changed
-between two changesets, their changelog entries will point to the same
-revision of the manifest.  If a file that Mercurial tracks hasn't
-changed between two changesets, the entry for that file in the two
-revisions of the manifest will point to the same revision of its
-filelog.
+As the illustration shows, there is \emph{not} a ``one to one''
+relationship between revisions in the changelog, manifest, or filelog.
+If the manifest hasn't changed between two changesets, the changelog
+entries for those changesets will point to the same revision of the
+manifest.  If a file that Mercurial tracks hasn't changed between two
+changesets, the entry for that file in the two revisions of the
+manifest will point to the same revision of its filelog.
 
-\section{An efficient, unified, safe storage mechanism}
+\section{Safe, efficient storage}
 
 The underpinnings of changelogs, manifests, and filelogs are provided
 by a single structure called the \emph{revlog}.
 revisions you must read, hence the longer it takes to reconstruct a
 particular revision.
 
+\begin{figure}[ht]
+  \centering
+  \grafix{snapshot}
+  \caption{Snapshot of a revlog, with incremental deltas}
+  \label{fig:concepts:snapshot}
+\end{figure}
+
 The innovation that Mercurial applies to this problem is simple but
 effective.  Once the cumulative amount of delta information stored
 since the last snapshot exceeds a fixed threshold, it stores a new
 snapshot (compressed, of course), instead of another delta.  This
 makes it possible to reconstruct \emph{any} revision of a file
-quickly.  This approach works so well that it has subsequently been
-copied by several other revision control systems.
+quickly.  This approach works so well that it has since been copied by
+several other revision control systems.
+
+Figure~\ref{fig:concepts:snapshot} illustrates the idea.  In an entry
+in a revlog's index file, Mercurial stores the range of entries from
+the data file that it must read to reconstruct a particular revision.
 
 \subsubsection{Aside: the influence of video compression}
 
 the next key frame is received.  Also, the accumulation of encoding
 errors restarts anew with each key frame.
 
+\subsection{Strong integrity}
+
+Along with delta or snapshot information, a revlog entry contains a
+cryptographic hash of the data that it represents.  This makes it
+difficult to forge the contents of a revision, and easy to detect
+accidental corruption.
+
+Mercurial checks these hashes when retrieving file revisions and when
+pulling changes from a repository.  If it encounters an integrity
+problem, it will complain and stop whatever it's doing.
+
+In addition to the effect it has on retrieval efficiency, Mercurial's
+use of periodic snapshots makes it more robust against partial data
+corruption.  If a revlog becomes partly corrupted due to a hardware
+error or system bug, it's often possible to reconstruct some or most
+revisions from the uncorrupted sections of the revlog, both before and
+after the corrupted section.  This would not be possible with a
+delta-only storage model.
+
+\section{The working directory}
+
+Mercurial's good ideas are not confined to the repository; it also
+needs to manage the working directory.  The \emph{dirstate} contains
+Mercurial's knowledge of the working directory.  This details which
+revision(s) the working directory is updated to, and all files that
+Mercurial is tracking in the working directory.
+
+Because Mercurial doesn't force you to tell it when you're modifying a
+file, it uses the dirstate to store some extra information so it can
+determine efficiently whether you have modified a file.  For each file
+in the working directory, it stores the time that it last modified the
+file itself, and the size of the file at that time.  
+
+When Mercurial is checking the states of files in the working
+directory, it first checks a file's modification time.  If that has
+not changed, the file must not have been modified.  If the file's size
+has changed, the file must have been modified.  If the modification
+time has changed, but the size has not, only then does Mercurial need
+to read the actual contents of the file to see if they've changed.
+Storing these few extra pieces of information dramatically reduces the
+amount of data that Mercurial needs to read, which yields large
+performance improvements compared to other revision control systems.
+
+\section{Other interesting design features}
+
+In the sections above, I've tried to highlight some of the most
+important aspects of Mercurial's design, to illustrate that it pays
+careful attention to reliability and performance.  However, the
+attention to detail doesn't stop there.  There are a number of other
+aspects of Mercurial's construction that I personally find
+interesting.  I'll detail a few of them here, separate from the ``big
+ticket'' items above, so that if you're interested, you can gain a
+better idea of the amount of thinking that goes into a well-designed
+system.
+
 \subsection{Clever compression}
 
 When appropriate, Mercurial will store both snapshots and deltas in
 the snapshot, again saving space compared to a naive delta-only
 approach.
 
-\subsection{Strong integrity}
+\subsubsection{Network recompression}
 
-Along with delta or snapshot information, a revlog entry contains a
-cryptographic hash of the data that it represents.  This makes it
-difficult to forge the contents of a revision, and easy to detect
-accidental corruption.
+When storing revisions on disk, Mercurial uses the ``deflate''
+compression algorithm (the same one used by the popular \texttt{zip}
+archive format), which balances good speed with a respectable
+compression ratio.  However, when transmitting revision data over a
+network connection, Mercurial uncompresses the compressed revision
+data.
 
-Mercurial checks these hashes when retrieving file revisions and when
-pulling changes from a repository.  If it encounters an integrity
-problem, it will complain and stop whatever it's doing.
+If the connection is over HTTP, Mercurial recompresses the entire
+stream of data using a compression algorithm that gives a etter
+compression ratio (the Burrows-Wheeler algorithm from the widely used
+\texttt{bzip2} compression package).  This combination of algorithm
+and compression of the entire stream (instead of a revision at a time)
+substantially reduces the number of bytes to be transferred, yielding
+better network performance over almost all kinds of network.
 
-In addition to the effect it has on retrieval efficiency, Mercurial's
-use of periodic snapshots makes it more robust against partial data
-corruption.  If a revlog becomes partly corrupted due to a hardware
-error or system bug, it's often possible to reconstruct some or most
-revisions from the uncorrupted sections of the revlog, both before and
-after the corrupted section.  This would not be possible with a
-delta-only storage model.
+(If the connection is over \command{ssh}, Mercurial \emph{doesn't}
+recompress the stream, because \command{ssh} can already do this
+itself.)
 
 \subsection{Read/write ordering and atomicity}
 
 makes for all kinds of nasty and annoying security and administrative
 problems.)
 
-Mercurial uses a locking mechanism to ensure that only one process can
-write to a repository at a time.  This locking mechanism is safe even
-over filesystems that are notoriously unsafe for locking, such as NFS.
-If a repository is locked, a writer will wait for a while to retry if
-the repository becomes unlocked, but if the repository remains locked
-for too long, the process attempting to write will time out after a
-while.  This means that your daily automated scripts won't get stuck
-forever and pile up if a system crashes unnoticed, for example.  (Yes,
-the timeout is configurable, from zero to infinity.)
+Mercurial uses locks to ensure that only one process can write to a
+repository at a time (the locking mechanism is safe even over
+filesystems that are notoriously hostile to locking, such as NFS).  If
+a repository is locked, a writer will wait for a while to retry if the
+repository becomes unlocked, but if the repository remains locked for
+too long, the process attempting to write will time out after a while.
+This means that your daily automated scripts won't get stuck forever
+and pile up if a system crashes unnoticed, for example.  (Yes, the
+timeout is configurable, from zero to infinity.)
+
+\subsubsection{Safe dirstate access}
+
+As with revision data, Mercurial doesn't take a lock to read the
+dirstate file; it does acquire a lock to write it.  To avoid the
+possibility of reading a partially written copy of the dirstate file,
+Mercurial writes to a file with a unique name in the same directory as
+the dirstate file, then renames the temporary file atomically to
+\filename{dirstate}.  The file named \filename{dirstate} is thus
+guaranteed to be complete, not partially written.
 
 
 

File en/metadata.svg

Old
Old image
New
New image