Serialization Format
There are a few benefits to understanding the serialization format used by the Bitsy graph database:
- Inspect the text files to see if the vertex and edge properties set by the application are serialized correctly
- Use text editors and text processing tools (like sed, awk, perl) to investigate the database contents
- Fix corrupt data files: Ideally, you shouldn't have to do this. The Backup and Recovery section discusses how you can make online/offline backups and recover the database in case of failures.
All Bitsy files are text files encoded with the UTF-8 charset. Each file consists of zero or more records, separated by a Unix line separator (\n). The format for a record is as follows:
<record type>=<record contents>#<checksum>
The record type is a single character and the checksum is a six-digit hexadecimal number. The rest of this page discusses the various record formats.
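As a rough illustration of the layout above, the following Python sketch splits one line into its three parts. The names `Record` and `split_record` are made up for this example and are not part of Bitsy, and the checksum values in the usage lines are placeholders rather than real checksums.

```python
from typing import NamedTuple


class Record(NamedTuple):
    record_type: str  # single character, e.g. "H", "V", "E", "T" or "L"
    contents: str     # everything between the first "=" and the last "#"
    checksum: str     # trailing checksum text, kept as a string


def split_record(line: str) -> Record:
    """Split one line of a Bitsy file into type, contents and checksum."""
    line = line.rstrip("\n")
    record_type, equals, rest = line.partition("=")
    # The contents may contain "#" themselves (for example inside a JSON
    # string), so the checksum is taken from the LAST "#" on the line.
    contents, hash_sign, checksum = rest.rpartition("#")
    if len(record_type) != 1 or not equals or not hash_sign:
        raise ValueError(f"not a well-formed record: {line!r}")
    return Record(record_type, contents, checksum)


# Placeholder checksums, for illustration only.
header = split_record("H=1#abcdef\n")
vertex = split_record('V={"id":"a#b","v":1,"s":"M","p":{}}#123456')
```

The sketch only separates the fields. It does not verify the checksum, because this page does not describe how the checksum is computed.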
Header
The first record of every file is a header record of the form:
H=<Log number>#<Checksum>
Header records are not used in any line other than the first line of a data file. The purpose of the header record is to identify the order in which the logs should be loaded into memory. It also helps identify partially re-organized vertex/edge log files which may occur if the system crashes in the middle of a reorganization process performed by the VEReorg thread.
Note: The log number is important to the database's consistency. You should not delete files that only have the header defined.
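When inspecting a file by hand, the header rule can be checked with a few lines of Python. This is only a sketch: `read_log_number` is a hypothetical helper, the log number is returned as text because its exact notation is not specified here, and the checksums in the usage lines are placeholders.

```python
def read_log_number(lines):
    """Return the log number from the header record of a Bitsy file.

    Raises ValueError if the first record is not a header, or if a
    header record shows up on any later line.
    """
    if not lines or not lines[0].startswith("H="):
        raise ValueError("first record is not a header")
    for line_number, line in enumerate(lines[1:], start=2):
        if line.startswith("H="):
            raise ValueError(f"unexpected header record on line {line_number}")
    # Drop the "H=" prefix, then cut off the "#<checksum>" suffix.
    log_number, hash_sign, _checksum = lines[0].rstrip("\n")[2:].rpartition("#")
    if not hash_sign:
        raise ValueError("header record has no checksum")
    return log_number


# Placeholder checksums, for illustration only.
sample = ["H=12#00aaff", 'V={"id":"a","v":1,"s":"M","p":{}}#000000']
log_number = read_log_number(sample)
```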
Vertex
The vertex record captures a vertex that is inserted, modified or deleted. The record has the following format:
V={"id":"<ID>","v":<version>,"s":<state>,"p":<JSON-encoded map of properties>}#<checksum>
The version is an integer and the state is either M or D, for modified and deleted vertices respectively. Any vertex record whose version number doesn't match the version number of the in-memory copy is obsolete and is removed during a re-organization process.
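The obsolete-record rule can be approximated offline with a short Python sketch. It makes two assumptions that this page does not state: that the state is written as a JSON string ("M" or "D"), and that the in-memory version of a vertex is the highest version found in the file. The helper names and the sample records (including their checksums) are invented for the example.

```python
import json


def parse_vertex(line):
    """Decode a V record into a dict with the keys id, v, s and p."""
    if not line.startswith("V="):
        raise ValueError("not a vertex record")
    body, hash_sign, _checksum = line.rstrip("\n")[2:].rpartition("#")
    if not hash_sign:
        raise ValueError("vertex record has no checksum")
    return json.loads(body)


def latest_vertices(lines):
    """Keep, per vertex ID, only the record with the highest version.

    Assumption: the highest version stands in for the in-memory version,
    so every other record for that ID would count as obsolete.
    """
    latest = {}
    for line in lines:
        vertex = parse_vertex(line)
        current = latest.get(vertex["id"])
        if current is None or vertex["v"] > current["v"]:
            latest[vertex["id"]] = vertex
    return latest


# Made-up records with placeholder checksums.
sample = [
    'V={"id":"a","v":1,"s":"M","p":{"name":"x"}}#000000',
    'V={"id":"a","v":2,"s":"D","p":{}}#000000',
    'V={"id":"b","v":1,"s":"M","p":{}}#000000',
]
current = latest_vertices(sample)
```

In this sample, version 1 of vertex "a" is superseded by the version 2 record that marks it as deleted.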
Edge
The edge record captures an edge and is similar to the vertex record. It has the following format (in a single line):
E={"id":"<edge ID>","v":<version>,"s":<M/D>,\
"o":"<out vertex ID>","l":"<edge label>","i":"<in vertex ID>",\
"p":<JSON-encoded map of properties>}#<checksum>
The properties "o", "l" and "i" refer to the outgoing vertex, edge label and incoming vertex (respectively).
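For scripts that inspect edge files, it can help to map the single-letter keys to descriptive names. The sketch below does that in Python; the `Edge` class and `parse_edge` function are hypothetical, the state is assumed to be a JSON string, and the sample record and its checksum are made up.

```python
import json
from dataclasses import dataclass


@dataclass
class Edge:
    id: str
    version: int
    state: str         # "M" (modified) or "D" (deleted)
    out_vertex: str    # "o": ID of the outgoing vertex
    label: str         # "l": edge label
    in_vertex: str     # "i": ID of the incoming vertex
    properties: dict   # "p": decoded property map


def parse_edge(line):
    """Decode an E record into an Edge with descriptive field names."""
    if not line.startswith("E="):
        raise ValueError("not an edge record")
    body, hash_sign, _checksum = line.rstrip("\n")[2:].rpartition("#")
    if not hash_sign:
        raise ValueError("edge record has no checksum")
    raw = json.loads(body)
    # Tolerating a missing or null "p" is a choice made for this sketch,
    # not something the format description promises.
    return Edge(raw["id"], raw["v"], raw["s"],
                raw["o"], raw["l"], raw["i"], raw.get("p") or {})


# Made-up record with a placeholder checksum.
edge = parse_edge(
    'E={"id":"e1","v":1,"s":"M","o":"a","l":"knows","i":"b",'
    '"p":{"since":2010}}#000000'
)
```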
Transaction
A transaction record marks the end of a transaction flush to the log, and so captures a successful transaction commit. It is only present in the transaction log files, viz. txA.txt and txB.txt. The format of the record looks like this:
T=<long ID>#<checksum>
The purpose of this record is to recover from crashes where a batch of transactions is only partially written to the transaction log by the MemToTxLogWriter thread. The checksum facilitates the detection of a corrupt state caused by a partial flush. To recover the database to a valid state, Bitsy removes all records after the last valid T record, and doesn't load these records into the in-memory database during startup.
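The recovery rule can be pictured with a small Python sketch that works on the lines of a transaction log. It is not Bitsy's implementation: `committed_records` is a hypothetical name, keeping the header when no T record exists is an assumption, and `is_valid` is a stand-in hook for the checksum test, which this page does not specify.

```python
def committed_records(lines, is_valid=lambda line: True):
    """Return the lines up to and including the last valid T record.

    `is_valid` is a placeholder for a checksum check; by default every
    record is treated as valid.
    """
    # Assumption: the header on the first line is always kept.
    keep = 1 if lines and lines[0].startswith("H=") else 0
    for index, line in enumerate(lines):
        if line.startswith("T=") and is_valid(line):
            keep = index + 1
    return lines[:keep]


# Made-up transaction log with placeholder checksums. The last V record
# has no T record after it, so it belongs to an unfinished flush.
tx_log = [
    "H=3#aaaaaa",
    'V={"id":"a","v":1,"s":"M","p":{}}#000000',
    "T=1#bbbbbb",
    'V={"id":"b","v":1,"s":"M","p":{}}#000000',
]
recovered = committed_records(tx_log)
```

Here `recovered` holds the first three lines, and the trailing vertex record is dropped.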
Log
A log record captures the end of a flush from the transaction log to the vertex/edge log. It is only present in vertex and edge log files, viz. vA.txt, vB.txt, eA.txt and eB.txt. The format of this record looks like this:
L=<log counter>#<checksum>
The log counter used here is the log counter of the next transaction log to be flushed into this V/E log. The purpose of this record is to recover from crashes that occur when a transaction log is only partially flushed to a V/E log. Bitsy truncates all records that follow an L record, if its log counter matches that of the header record in txA.txt or txB.txt.
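A minimal Python sketch of this truncation rule follows. It assumes the log numbers from the headers of txA.txt and txB.txt have already been read and are passed in as text, that the L record itself is kept, and that the first matching L record is the one that counts. The function name, sample lines and checksums are all invented for the example.

```python
def truncate_after_matching_log_record(ve_lines, tx_log_numbers):
    """Drop every record that follows an L record whose log counter
    matches a transaction log header number.

    `tx_log_numbers` is a set of strings. Counters are compared as text
    because their exact notation is not specified on this page.
    """
    for index, line in enumerate(ve_lines):
        if line.startswith("L="):
            counter = line.rstrip("\n")[2:].rpartition("#")[0]
            if counter in tx_log_numbers:
                return ve_lines[: index + 1]
    # No matching L record: nothing to truncate.
    return ve_lines


# Made-up vertex log with placeholder checksums.
v_log = [
    "H=7#aaaaaa",
    'V={"id":"a","v":1,"s":"M","p":{}}#000000',
    "L=8#bbbbbb",
    'V={"id":"b","v":1,"s":"M","p":{}}#000000',
    "L=9#cccccc",
    'V={"id":"c","v":1,"s":"M","p":{}}#000000',
]
# Suppose the transaction log headers carry the numbers 9 and 10.
trimmed = truncate_after_matching_log_record(v_log, {"9", "10"})
```

In this sample the record for vertex "c" follows the `L=9` record, so it is treated as part of a partial flush and removed.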