What Can We Learn from the Code in Git’s Initial Commit?

The vast majority of Git resources discuss how to use Git. Very few describe how Git actually works and even fewer look under the hood at Git’s code. In this article, I’m going to examine the initial commit of Git’s code to help you understand Git from the code perspective. If you are unfamiliar with what an initial commit is, I recommend you check out my article detailing the concept of an initial commit in Git.

Background

There is a reason I decided to examine the first version of Git’s code instead of the current version. Git was created in 2005 by Linus Torvalds, who is also the creator of Linux, and it has been under active development for about 15 years. As a result, the current codebase is large and complicated. It includes hundreds of code files written in more than five programming languages, more than 58,000 commits by over 1,300 developers, and hundreds of thousands of lines of code. This makes sense, since new features, enhancements, and optimizations are continuously being added by the open source community.

However, the code in Git’s initial commit is contained in only 11 files, has fewer than 1,000 total lines of code, and is written entirely in the C programming language. Most importantly, the code actually runs. The fact that Git’s original version is so small and simple to understand makes it a great resource for learning how Git fundamentally works. Before we dive into the code, we’ll describe how to obtain a copy of Git’s original codebase.

Obtaining a local copy of Git’s initial commit

I created a project called Baby Git, in which I checked out Git’s initial commit and fully documented it with inline code comments describing how each piece of code works. The only (minor) modifications made to the code are to facilitate compiling on modern operating systems. The codebase is hosted on Bitbucket and you can view it in your web browser using the following link:

https://bitbucket.org/jacobstopak/baby-git/src/master/

You can also clone it down to your local machine using the command:

`git clone https://bitbucket.org/jacobstopak/baby-git.git`

Alternatively, if you’d like a copy of Git’s initial commit in its purest, unadulterated form, you can get it by following these steps:

1) Clone the Git repository to your local machine using the command `git clone https://github.com/git/git.git`

2) Browse into the `git` directory

3) Run the command `git log --reverse` to see Git’s commit log in reverse order. The initial commit has an ID of e83c5163316f89bfbde7d9ab23ca2e25604af290

4) Run the command `git checkout e83c5163316f89bfbde7d9ab23ca2e25604af290` to check out Git’s initial commit into the working directory

Now, we’ll explore what Git’s original revision actually contains and what it can do.

What is in Git’s initial commit?

If you list the contents of Git’s initial commit, you’ll see the following eight C code files with a `.c` extension:

- init-db.c
- update-cache.c
- write-tree.c
- commit-tree.c
- read-tree.c
- cat-file.c
- show-diff.c
- read-cache.c

The first seven of these `.c` files each directly correspond to one of the original seven Git commands. When the codebase is compiled, each of these seven `.c` files is compiled into its own executable. For clarity, we can relate each of these to the modern name of that command, as follows:

| Git's initial commit | Current Git version | Purpose |
| --- | --- | --- |
| `init-db` | `git init` | Initialize a Git repository |
| `update-cache` | `git add` | Add a file to the staging area |
| `write-tree` | `git write-tree` | Write a new tree object to the Git repository using the content in the staging index |
| `commit-tree` | `git commit` | Create a new commit object in the Git repository based on the specified tree |
| `read-tree` | `git read-tree` | Display the contents of a tree object from the Git repository |
| `show-diff` | `git diff` | Show the differences between staged Git files and their corresponding working directory versions |
| `cat-file` | `git cat-file` | Display the contents of objects stored in the Git repository |

The eighth `.c` file is `read-cache.c`. It defines various functions that are used program-wide in Git’s original revision. It also contains definitions of external (or global) variables used by the program.

In addition to these eight `.c` files, Git’s original codebase contains one `.h` header file called `cache.h`. This file is included in all the source files via the `#include` preprocessing directive.  This file contains other `#include` directives of library header files, token definitions, declarations of external variables and structure templates, and function prototypes.

Finally, Git’s first commit contains a `Makefile` which is used to compile the code and a `README` with a humorous, yet very informative, description of Git’s core concepts.

Next, we’ll describe the different parts of a Git repository.

Components of a Git repository

There are four main components in an original Git repository:

  1. Objects (blobs, trees, commits)
  2. Object database
  3. Current directory cache
  4. Working directory

Objects

In order to store and track our code file changes, Git needs a way to understand file content. Git does this by creating objects that identify file content and keep track of it over time. These objects come in three types: blobs, trees, and commits.

Blobs

The term “blob” stands for Binary Large Object. A blob is simply a file containing data in binary format. Since all computer files are ultimately stored as binary, any file stored on our computer can be treated as a blob. Git creates a blob for each file that we add to the repository. Furthermore, Git needs a way to identify each blob and distinguish blobs from each other. This is done by naming and referencing each blob using the SHA-1 hash of its deflated (compressed) content. Uniqueness is not strictly guaranteed, but two different pieces of content are extremely unlikely to share the same SHA-1 hash, so in practice it is safe for Git to reference objects this way.

A blob object has the following structure:

```
'blob'         (blob object tag)
' '            (single space)
size of blob   (in bytes)
'\0'           (null character)
blob data      (file content)
```
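To make the layout above concrete, here is a minimal C sketch that assembles the uncompressed bytes of a blob object in memory. The function name `make_blob_object` is hypothetical and is not taken from Git's source; the sketch only follows the field order listed above and assumes the size is written as a decimal number.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not from Git's source): build the uncompressed bytes
 * of a blob object, "blob <size>\0<data>", in a malloc'd buffer. The total
 * length is stored in *out_len. Returns NULL if the allocation fails.
 */
unsigned char *make_blob_object(const void *data, size_t len, size_t *out_len)
{
    char header[32];
    int header_len;
    unsigned char *obj;

    assert(data != NULL && out_len != NULL);

    /* Tag, single space, size in decimal; the +1 keeps the trailing '\0'. */
    header_len = snprintf(header, sizeof(header), "blob %zu", len) + 1;

    obj = malloc((size_t)header_len + len);
    if (obj == NULL)
        return NULL;

    memcpy(obj, header, (size_t)header_len);   /* 'blob', ' ', size, '\0' */
    memcpy(obj + header_len, data, len);       /* raw file content */
    *out_len = (size_t)header_len + len;
    return obj;
}
```

Calling `make_blob_object("hello", 5, &n)` yields the 12 bytes `blob 5`, a null character, and `hello`. The caller is responsible for calling `free()` on the returned buffer.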

Trees

In addition to storing the content of each tracked file in a blob, Git needs to keep track of other information like file names, paths where the files are stored, and file permissions and properties. This is done using `tree` objects, which can be used to represent single files or entire directory structures. Trees link blob objects (which contain pure, unlabelled binary content) with identification information like file names, paths, and file properties. Trees can also contain references to other trees, which in turn reference other trees or blobs. This forms a branching structure representing entire directories populated with subdirectories and files. Like blobs, trees are also named and referenced by the SHA-1 hash of their deflated content.

A tree object has the following structure:

```
'tree'              (tree object tag)
' '                 (single space)
size of tree        (in bytes)
'\0'                (null character)
file 1 mode         (octal number)
' '
file 1 name
'\0'
file 1 SHA-1 hash   (hash of file's deflated contents)
file 2 mode
' '
file 2 name
'\0'
file 2 SHA-1 hash
...
file N mode
' '
file N name
'\0'
file N SHA-1 hash
```
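As a rough illustration of how one entry in the layout above could be serialized, the following C sketch appends a single `mode name\0hash` record to a buffer. The helper name `append_tree_entry` is hypothetical, and the sketch assumes the hash is stored as 20 raw bytes rather than as 40 hexadecimal characters.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SHA1_RAW_LEN 20  /* a SHA-1 digest is 20 raw bytes */

/*
 * Hypothetical helper (not from Git's source): append one tree entry,
 * "<octal mode> <name>\0<20-byte sha1>", to buf starting at offset.
 * Returns the new offset, or 0 if the entry does not fit in cap bytes.
 */
size_t append_tree_entry(unsigned char *buf, size_t cap, size_t offset,
                         unsigned int mode, const char *name,
                         const unsigned char sha1[SHA1_RAW_LEN])
{
    int n;

    assert(buf != NULL && name != NULL && sha1 != NULL);
    if (offset >= cap)
        return 0;

    /* Mode in octal, a single space, then the file name. */
    n = snprintf((char *)buf + offset, cap - offset, "%o %s", mode, name);
    if (n < 0 || (size_t)n + 1 + SHA1_RAW_LEN > cap - offset)
        return 0;

    offset += (size_t)n + 1;                   /* keep the '\0' separator */
    memcpy(buf + offset, sha1, SHA1_RAW_LEN);  /* raw hash bytes */
    return offset + SHA1_RAW_LEN;
}
```

Calling the helper once per file, feeding each returned offset into the next call, builds up the body of a tree object. The `tree`, size, and null character header would then be placed in front of it.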

Commits

The last type of object in Git’s original version is the commit. A commit can be thought of as a representation of the state of a directory tree (a set of files and folders) at a particular moment in time. Commit objects also include the name and email of the author who made the commit, a descriptive commit message entered by the committer, and a reference to the commit’s parent(s). By storing a reference to the previous commit in the chain (the parent commit), Git can create and track the full development history for a codebase. Like blobs and trees, commits are also named and referenced by the SHA-1 hash of their deflated content.

A commit object has the following structure:

```
'commit'                     (commit object tag)
' '                          (single space)
size of data                 (in bytes)
'\0'                         (null character)
'tree' SHA-1 hash            (hash value of committed tree)
'parent' SHA-1 hash          (hash value of first parent commit)
'parent' SHA-1 hash          (hash value of second parent commit)
'author' ID email date
'committer' ID email date
                             (empty line)
comment
```
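The text portion of this layout (everything after the null character) can be sketched in C as follows. The names `append_text` and `format_commit_body` are hypothetical, and the author and committer are passed as preformatted strings for brevity. This illustrates the field order only and is not Git's own implementation. Passing zero parents produces a root commit with no `parent` lines.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append formatted text at *off. Returns -1 if it does not fit. */
static int append_text(char *out, size_t cap, size_t *off, const char *fmt, ...)
{
    va_list ap;
    int n;

    va_start(ap, fmt);
    n = vsnprintf(out + *off, cap - *off, fmt, ap);
    va_end(ap);
    if (n < 0 || (size_t)n >= cap - *off)
        return -1;
    *off += (size_t)n;
    return 0;
}

/*
 * Hypothetical helper: write the text of a commit into out. Returns the
 * number of characters written, or -1 if out is too small.
 */
int format_commit_body(char *out, size_t cap, const char *tree_hex,
                       const char *const *parent_hex, int parents,
                       const char *author, const char *committer,
                       const char *message)
{
    size_t off = 0;
    int i;

    assert(out != NULL && cap > 0 && tree_hex != NULL);
    assert(author != NULL && committer != NULL && message != NULL);

    if (append_text(out, cap, &off, "tree %s\n", tree_hex) < 0)
        return -1;
    /* One 'parent' line per parent commit, in the order given. */
    for (i = 0; i < parents; i++)
        if (append_text(out, cap, &off, "parent %s\n", parent_hex[i]) < 0)
            return -1;
    /* The empty line separates the metadata from the comment. */
    if (append_text(out, cap, &off, "author %s\ncommitter %s\n\n%s",
                    author, committer, message) < 0)
        return -1;
    return (int)off;
}
```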

Now that we know what types of objects Git creates and works with, let’s discuss how Git stores those objects.

Object database

Git’s object database is nothing more than a local folder used to organize and store Git objects – blobs, trees, and commits. As mentioned in the previous section, objects contain the content and revision history that enables Git to track changes for a codebase. The object database is created when the user runs the `init-db` command from the command-line. This command creates a hidden folder path in the current directory called `.dircache/objects`. Inside the `.dircache/objects` directory, 256 subdirectories are created named from `00` to `ff`, corresponding to the 256 possible values of a two-digit hexadecimal number. The folder structure inside the `objects` directory looks like this:

```
.dircache/objects/00
.dircache/objects/01
.dircache/objects/02
...
.dircache/objects/fd
.dircache/objects/fe
.dircache/objects/ff
```

When we tell Git to track a file by using the `update-cache <filename.ext>` command, create a tree using the `write-tree` command, or make a commit using the `commit-tree` command, the resulting blobs, trees, and commits are stored in the folder corresponding to the first two characters of the object’s SHA-1 hash. In this way, each object that Git creates is stored in one of the 256 folders in the object database.

The remaining digits of the object’s SHA-1 hash are used as the filename of that object. For example, an object with the SHA-1 hash:

`edb213e660d408619d894b20806b1d5086aab03b`

will be stored in the following path in the object database:

`.dircache/objects/ed/b213e660d408619d894b20806b1d5086aab03b`

A database like this, in which each object is stored based on a hash value of its own content, is called a Content Addressable Database.
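The mapping from hash to storage path described above can be sketched in a few lines of C. The function name `object_path` is hypothetical, and the sketch hard-codes the default `.dircache/objects` location. It is meant to illustrate the naming scheme rather than reproduce Git's own code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not from Git's source): map a 40-character hex SHA-1
 * to its location in the object database. Returns 0 on success, or -1 if
 * the hash has the wrong length or out is too small.
 */
int object_path(const char *hex, char *out, size_t cap)
{
    int n;

    assert(hex != NULL && out != NULL);
    if (strlen(hex) != 40)
        return -1;

    /* The first two hex digits pick the folder; the other 38 name the file. */
    n = snprintf(out, cap, ".dircache/objects/%.2s/%s", hex, hex + 2);
    return (n < 0 || (size_t)n >= cap) ? -1 : 0;
}
```

For the example hash `edb213e660d408619d894b20806b1d5086aab03b`, this produces `.dircache/objects/ed/b213e660d408619d894b20806b1d5086aab03b`.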

Before a commit is made to the object database, file changes are built up in a staging area known as the current directory cache.

Current directory cache

The current directory cache is the ancestor to the staging area in the current version of Git. It is stored in a file called `.dircache/index`. You can think of the current directory cache as a temporary tree object that contains the changes added to the staging area using the `update-cache <filename.ext>` command. When the `update-cache` command is executed, it creates blobs containing file content in the object database, and adds the tree information (file names, paths, properties) into the `.dircache/index` file.

When the developer runs the `write-tree` command, the information in the `.dircache/index` file is written to the object database as a new tree object. Then the `commit-tree` command is used to create a new commit in the database using a specified tree, author information, timestamp, and descriptive message.

Working directory

The working directory in the original version of Git is the same concept as the working directory in the current version of Git. It is simply the state of the codebase that resides on the file system at a particular time. A developer works locally in the working directory to add, edit, and delete code files in the project. As noted in the previous sections, both the object database `.dircache/objects` and current directory cache `.dircache/index` are stored in the working directory, but they typically aren’t modified directly by developers.

Now that we understand the components of a Git repository, let’s take a look at some code examples that highlight how Git’s core functionality works.

Some Important Code Snippets from Git’s Initial Commit

Let’s start with a code snippet from the `init-db.c` file, which is used to initialize a Git repository when the user runs the `init-db` command. This command creates the object database at the path `.dircache/objects`. Here is the part of the code that creates the 256 subdirectories of the object database under the default `.dircache/objects` directory:

    /*
     * Execute this loop 256 times to create the 256 subdirectories inside the
     * `.dircache/objects/` directory. The subdirectories will be named `00`
     * to `ff`, which are the hexadecimal representations of the numbers 0 to
     * 255. Each subdirectory will be used to hold the objects whose SHA1 hash
     * values in hexadecimal representation start with those two digits.
     */
    for (i = 0; i < 256; i++) {
        /*
         * Convert `i` to a two-digit hexadecimal number and append it to the
         * path variable after the `.dircache/objects/` part. That way, each
         * time through the loop we build up one of the following paths:
         * `.dircache/objects/00`, `.dircache/objects/01`, ...,
         * `.dircache/objects/fe`, `.dircache/objects/ff`.
         */
        sprintf(path+len, "/%02x", i);

        /*
         * Attempt to create the current subdirectory. If that fails for any
         * reason other than the directory already existing (EEXIST), print
         * an error message and exit.
         */
        if (MKDIR(path) < 0) {
            if (errno != EEXIST) {
                perror(path);
                exit(1);
            }
        }
    }

Next, we’ll cover two code snippets from the `read-cache.c` file. These snippets are run any time Git deflates an object (blob, tree, or commit), hashes it, and stores it in the object database. As we mentioned previously, all objects are deflated before Git writes them to the object database. Here is a portion of the code in the `write_sha1_file` function that calls `zlib` functions to deflate an object:

    /* Initialize the zlib stream to contain null characters. */
    memset(&stream, 0, sizeof(stream));

    /*
     * Initialize compression stream for optimized compression
     * (as opposed to speed).
     */
    deflateInit(&stream, Z_BEST_COMPRESSION);
    /* Determine upper bound on compressed size. */
    size = deflateBound(&stream, len);
    /* Allocate `size` bytes of space to store the next compressed output. */
    compressed = malloc(size);

    /* Specify buf as location of the next input to the compression stream. */
    stream.next_in = buf;
    /* Number of bytes available as input for next compression. */
    stream.avail_in = len;
    /* Specify compressed as location to write the next compressed output. */
    stream.next_out = compressed;
    /* Number of bytes available for storing the next compressed output. */
    stream.avail_out = size;

    /* Compress the content of buf, i.e., compress the object. */
    while (deflate(&stream, Z_FINISH) == Z_OK)
        /* Linus Torvalds: nothing */;

    /*
     * Free memory structures that were dynamically allocated for the
     * compression.
     */
    deflateEnd(&stream);
    /* Get size of total compressed output. */
    size = stream.total_out;

After an object is deflated, its SHA-1 hash value is always calculated.  The 40-character hexadecimal representation of this hash value is used to index and reference the object in the object database. Here is the portion of code in `write_sha1_file` that calculates the SHA-1 hash value of an object:

    /* Initialize the SHA context structure. */
    SHA1_Init(&c);
    /* Calculate hash of the compressed output. */
    SHA1_Update(&c, compressed, size);
    /* Store the SHA1 hash of the compressed output in `sha1`. */
    SHA1_Final(sha1, &c);
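The hash computed above is 20 raw bytes, while object names are written as 40 hexadecimal characters. Git’s code has its own `sha1_to_hex` function for this conversion (it appears in the `commit-tree.c` snippet in this article). The sketch below is a hypothetical stand-in named `digest_to_hex`, written from scratch to show the idea rather than copied from Git.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not Git's `sha1_to_hex`): convert a 20-byte SHA-1
 * digest into a null-terminated string of 40 lowercase hex characters.
 */
void digest_to_hex(const unsigned char digest[20], char hex[41])
{
    static const char digits[] = "0123456789abcdef";
    size_t i;

    assert(digest != NULL && hex != NULL);

    for (i = 0; i < 20; i++) {
        hex[2 * i]     = digits[digest[i] >> 4];    /* high four bits */
        hex[2 * i + 1] = digits[digest[i] & 0x0f];  /* low four bits */
    }
    hex[40] = '\0';
}
```

Each byte becomes exactly two hex digits, which is why the first byte of the digest alone determines which of the 256 subdirectories an object lands in.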

Finally, we’ll examine a code snippet from the `commit-tree.c` file. This snippet is executed any time the user runs the command `commit-tree <tree hash> [(-p <parent hash>)…] < changelog`. It populates the commit object with the identifying information, including parent commit IDs, the author’s name and email, the current date, and a user-supplied message.

    /* Add the string 'tree ' and the tree SHA1 hash to the buffer. */
    add_buffer(&buffer, &size, "tree %s\n", sha1_to_hex(tree_sha1));

    /*
     * For each parent commit SHA1 hash, add the string 'parent ' and the
     * hash to the buffer.
     *
     * Linus Torvalds: NOTE! This ordering means that the same exact tree
     * merged with a different order of parents will be a _different_
     * changeset even if everything else stays the same.
     */
    for (i = 0; i < parents; i++)
        add_buffer(&buffer, &size, "parent %s\n",
                   sha1_to_hex(parent_sha1[i]));

    /*
     * Add the author and committer name, email, and commit time to the
     * buffer.
     */
    add_buffer(&buffer, &size, "author %s <%s> %s\n", gecos, email, date);
    add_buffer(&buffer, &size, "committer %s <%s> %s\n\n",
               realgecos, realemail, realdate);

    /*
     * Add the commit message to the buffer. This is what requires the user to
     * type CTRL-D to finish the `commit-tree` command.
     */
    while (fgets(comment, sizeof(comment), stdin) != NULL)
        add_buffer(&buffer, &size, "%s", comment);

By taking a look through the four short snippets above, we can learn a surprising amount about how Git works. We saw how the object database structure is created, how Git objects are deflated and hashed before storage in the object database, and how commit objects are populated.

Keep in mind that Git has changed immensely over the years, so much, if not all, of this code has probably changed. However, looking back at the roots of this essential tool reveals that even the most popular and feature-rich tools have humble beginnings.

Conclusion

In this article, we explored Git’s original source code. We described how to get a copy of the initial commit of Git, explained what it contains, and reviewed some essential code extracts to highlight how it works. For more information, check out some related posts about how Git's code works.

About the Author

This article was written by Jacob Stopak, a software developer and consultant with a passion for helping others improve their lives through code. Jacob is the creator of the Initial Commit, a site dedicated to providing resources, services, and tools to coders of all skill levels.