memoria / String Data Notation

Linked Data

In Memoria containers, data is physically represented as arrays of primitive types grouped into a tree-like PackedAllocator. This structure works well for small hierarchies, but it is not optimized for traversal-style access and does not scale well to arbitrary trees and graphs, such as XML or JSON documents.

Core Features

The Linked Data (LinkedData) document format was designed specifically to represent trees of arbitrary objects in a contiguous segment of memory. It provides O(1) pointer-style access, fast O(1) arena-style allocation and zero-cost serialization. The format is architecture-dependent, but the only dependency is byte order, which is little-endian on most platforms today. The same is true for Memoria containers.

LinkedData documents are immediately queryable. If a document fits into a container block as a whole, it can be queried in place, without any additional IO. This is the common case when most documents are much smaller than a typical b-tree leaf data block. A document scattered across several blocks must first be read into a contiguous block of memory. This is a pure memory copy without any processing, so it is fast, but it is still O(N), where N is the document size. Point queries are therefore relatively slow for large documents. Sequential scans are affected much less, because they read the entire document anyway.
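The combination of a contiguous segment, pointer-style access and arena-style allocation can be illustrated with a minimal C++ sketch. It is not Memoria's implementation: `Arena`, `Offset`, `kNull`, `ListNode`, `build_list` and `sum_list` are hypothetical names, and the sketch assumes that objects refer to each other by byte offsets from the start of the segment rather than by raw pointers.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical names throughout; this is not Memoria's actual API.
using Offset = std::uint32_t;          // position inside the segment, used instead of a raw pointer
constexpr Offset kNull = 0xFFFFFFFFu;  // "no object" marker

class Arena {
public:
    Arena() = default;
    // "Deserialization" is just adopting a copy of the bytes.
    explicit Arena(std::vector<std::uint8_t> bytes) : data_(std::move(bytes)) {}

    // Bump allocation: the object is appended at the end of the segment
    // (amortized O(1) with std::vector).
    template <typename T>
    Offset put(const T& value) {
        static_assert(std::is_trivially_copyable<T>::value, "objects must be plain bytes");
        const Offset at = static_cast<Offset>(data_.size());
        data_.resize(data_.size() + sizeof(T));
        std::memcpy(data_.data() + at, &value, sizeof(T));
        return at;
    }

    // O(1) access: an offset resolves to an object with one addition and a small copy.
    template <typename T>
    T get(Offset at) const {
        T value;
        std::memcpy(&value, data_.data() + at, sizeof(T));
        return value;
    }

    // "Serialization" is just exposing the bytes as they are.
    const std::vector<std::uint8_t>& bytes() const { return data_; }

private:
    std::vector<std::uint8_t> data_;
};

// A node of a singly linked list stored inside the arena.
struct ListNode {
    std::int64_t value;
    Offset next;  // offset of the next node, or kNull
};

// Builds the list back to front and returns the offset of its head.
Offset build_list(Arena& arena, const std::vector<std::int64_t>& values) {
    Offset head = kNull;
    for (auto it = values.rbegin(); it != values.rend(); ++it) {
        head = arena.put(ListNode{*it, head});
    }
    return head;
}

// Follows the offsets from the head and adds up the stored values.
std::int64_t sum_list(const Arena& arena, Offset head) {
    std::int64_t total = 0;
    for (Offset at = head; at != kNull;) {
        const ListNode node = arena.get<ListNode>(at);
        total += node.value;
        at = node.next;
    }
    return total;
}
```

Copying `bytes()` into a second `Arena` and calling `sum_list` with the same head offset gives the same result. Offsets, unlike raw pointers, stay valid wherever the segment is placed, which is why a plain memory copy is enough in this sketch.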

LinkedData documents are designed to be relatively small. The best size is below 2KB, but this value generally depends on the container block size (typically 4-16KB for OLTP scenarios and up to 1MB for analytics). Large documents should be split into smaller ones. The goal is to fit several documents into one block.

LinkedData documents are garbage-collected. When a new object is created, it's allocated at the end of the segment.
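The text above only states that documents are garbage-collected and that new objects are placed at the end of the segment. One way such a scheme could reclaim space, assumed here purely for illustration, is to copy the objects that are still reachable into a fresh segment. The sketch below does this for a linked list of fixed-size nodes. All names (`Segment`, `Node`, `append`, `load`, `store`, `compact`, `values_of`) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical names; an illustration of copying compaction, not Memoria's collector.
using Offset = std::uint32_t;
constexpr Offset kNull = 0xFFFFFFFFu;
using Segment = std::vector<std::uint8_t>;

struct Node {
    std::int64_t value;
    Offset next;  // offset of the next node, or kNull
};

// New objects always go to the end of the segment.
Offset append(Segment& seg, const Node& n) {
    const Offset at = static_cast<Offset>(seg.size());
    seg.resize(seg.size() + sizeof(Node));
    std::memcpy(seg.data() + at, &n, sizeof(Node));
    return at;
}

Node load(const Segment& seg, Offset at) {
    Node n;
    std::memcpy(&n, seg.data() + at, sizeof(Node));
    return n;
}

void store(Segment& seg, Offset at, const Node& n) {
    std::memcpy(seg.data() + at, &n, sizeof(Node));
}

// Copies only the nodes reachable from `head` into `to`, keeping their order.
// Returns the offset of the head in the new segment.
Offset compact(const Segment& from, Offset head, Segment& to) {
    Offset new_head = kNull;
    Offset prev = kNull;
    for (Offset at = head; at != kNull;) {
        Node n = load(from, at);
        const Offset next_old = n.next;
        n.next = kNull;  // patched once the successor has been copied
        const Offset copy = append(to, n);
        if (prev == kNull) {
            new_head = copy;
        } else {
            Node p = load(to, prev);
            p.next = copy;
            store(to, prev, p);
        }
        prev = copy;
        at = next_old;
    }
    return new_head;
}

// Collects the values of the list starting at `head`.
std::vector<std::int64_t> values_of(const Segment& seg, Offset head) {
    std::vector<std::int64_t> out;
    for (Offset at = head; at != kNull; at = load(seg, at).next) {
        out.push_back(load(seg, at).value);
    }
    return out;
}
```

In this sketch, unlinking a node leaves its bytes behind as garbage in the old segment, and `compact` produces a smaller segment that contains only the live nodes.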

String Data Notation (SDN)

LinkedData depends on byte order, but within the same byte order it is cross-platform. We can create a document in C++ and read it from pure Java or Python, provided that the platform can interpret the custom data types stored in the document. That means, for example, that there must be equivalent Java object wrappers for the generic and custom LinkedData objects originally created in C++. Most cross-platform data interchange frameworks have such libraries of data readers and writers. For cross-platform LinkedData, that means a lot of code duplication.
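Since byte order is the one architectural dependency mentioned above, a reader may want to check it before interpreting a segment in place. The following is a small sketch of such a check. The `ByteOrder` tag and the functions `host_byte_order`, `can_read_in_place` and `byte_swap32` are hypothetical and are not part of the LinkedData format as described here.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical tag that a writer could store next to a document.
enum class ByteOrder : std::uint8_t { Little = 0, Big = 1 };

// Detects the byte order of the machine running this code
// by looking at the first byte of a 16-bit value.
ByteOrder host_byte_order() {
    const std::uint16_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// A document can be used without conversion only if its byte order matches the host.
bool can_read_in_place(ByteOrder document_order) {
    return document_order == host_byte_order();
}

// Reverses the bytes of a 32-bit value, the basic step a converter would need
// when the byte orders differ.
std::uint32_t byte_swap32(std::uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8) |
           ((v & 0x00FF0000u) >> 8) |
           ((v & 0xFF000000u) >> 24);
}
```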

String Data Notation, or SDN, is a JSON-like textual encoding for LinkedData. SDN has a set of basic data types similar to JSON's: string, integer, double, boolean, array and map. Unlike JSON, SDN objects may have associated type descriptions, and a type description is itself a data type. The actual Boost Spirit grammar for SDN can be found here.
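The data model described above can be sketched in C++17 as a variant over the basic types, with an optional type description attached to each value. This is only an illustration of the idea: `Value`, `Entry`, `TypeDecl`, `kind_of` and the `Decimal` example type are hypothetical and do not reflect Memoria's actual classes or the SDN grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// Hypothetical model of the SDN data types; not Memoria's actual classes.
struct TypeDecl {
    std::string name;                 // name of the described type
    std::vector<std::string> params;  // optional type parameters
};

struct Value;
struct Entry;
using Array = std::vector<Value>;
using Map = std::vector<Entry>;  // key/value pairs in insertion order

struct Value {
    // The basic kinds, plus TypeDecl, because a type description is itself data.
    std::variant<std::string, std::int64_t, double, bool, Array, Map, TypeDecl> data;
    // Unlike JSON, any value may carry an associated type description.
    std::optional<TypeDecl> type;
};

struct Entry {
    std::string key;
    Value value;
};

// Returns the name of the basic kind held by a value.
std::string kind_of(const Value& v) {
    static const char* const names[] = {
        "string", "integer", "double", "boolean", "array", "map", "type"};
    return names[v.data.index()];
}

// Returns the number of direct children of an array or map, and 0 otherwise.
std::size_t child_count(const Value& v) {
    if (const Array* a = std::get_if<Array>(&v.data)) return a->size();
    if (const Map* m = std::get_if<Map>(&v.data)) return m->size();
    return 0;
}
```

For example, a string value tagged with a hypothetical `Decimal` type can be built as `Value{std::string("12.50"), TypeDecl{"Decimal", {"10", "2"}}}`. A foreign runtime that does not know `Decimal` can still treat the value as a plain string.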

LinkedData documents may contain objects of many different data types. Objects have state and behaviour, and this behaviour is implemented in code. Accessing LinkedData from different runtimes (like Java or Python) is a very important practical use case, but duplicating this behaviour natively in those runtimes is highly impractical. Keeping all those versions in sync would be hard, and that is the least of the problems. Generating bridging code and using FFI is also a process with many complex manual steps. SDN is designed exactly for this purpose: interoperability between C++-centric LinkedData structures and foreign runtimes. Working with textual documents is easy in any high-level programming language.

SDN was not designed for performance. Memoria's philosophy is to push computations down to the data, not to lift data up into foreign runtimes for processing. If we need, for example, a high-performance query language, we can create a highly optimized JIT for it, pushing the DSL right down to the data. Most runtimes other than C++ were not designed for high performance anyway.

SDN was designed for flexibility. The LinkedData container, the SDN parser and serializer, and all related data types are implemented in C++ and are accessible through the JMESPath-like LinkedDataQuery (LDQuery) language. Foreign runtimes will use a native library with bindings for it. An ANTLR4 grammar for SDN will also be provided, so that SDN documents can be parsed without native libraries where appropriate. LDQuery is expected to be backed eventually by some JIT technology for high performance.

The following is a by-example explanation of how LinkedData and SDN work, both on the surface and under the hood.

SDN Type Descriptions

TBD

Memoria Datatypes and Type Registry

TBD

SDN Objects

TBD

SDN Parsing

TBD

Datum<T>

TBD

LDDocument

TBD
