Overview

HTTPS SSH

Introduction

μ-son, or muson (meaning 'micro-JSON') is a novel encoding scheme for compressing JSON data.

Why another one?

  • It is highly efficient: μ-son documents are significantly smaller than any existing binary JSON encoding.
  • It is highly performant: the included C++ library is comparable or faster than the fastest JSON parser libraries.
  • It is human-readable.
  • It comes with schema support out of the box.

μ-son is drop-in compatible with JSON: when decoded it is exactly equivalent to JSON as long as the JSON has a sane type policy.

μ-son works by infering the static type of a JSON document, after which metadata (object key names) can be omitted as long as the type signature is transmitted along with the data.

μ-son also automatically inserts backreferences to previously-used data, this further compresses the result.

For type inference to work, the input JSON must follow a "sane" type policy:

  • Lists can only contain elements of a single type.
  • Any given field of an object must hold values of a single type.
  • However, null values or missing fields are allowed via an 'optionally null' type.

In practice, JSON documents with "insane" types are very rare.

μ-son comes with a C++11 parser library that is comparable in performance to RapidJSON. (The fastest JSON parser in the world.)

(Also a fully functional reference encoder/decoder written in Python.)

A formal spec of the μ-son encoding follows later in this document.

C++ library

A high-performance statically-typed, compile-time C++11 library is provided.

Here is a complete example:

#include <iostream>
#include "muson/muson.h"

int main(int argc, char** argv) {

    using People = muson::Signature<LIT("[{name:String,isAlive:?Bool,age:Int}]")>;

    static const std::string sig = People::print_signature();

    struct accessor {
        struct val_t {
            std::string name;
            struct {
                bool null;
                bool data;
            } isAlive;
            int64_t age;
        };
        std::vector<val_t> data;
    };

    std::string input{ argv[1] };

    People parsed;
    muson::decode(sig, input, parsed);

    accessor& x = muson::cast<accessor>(parsed);

    for (const auto& i : x.data) {
        std::cout << i.name << "\t"
                  << (i.isAlive.null ? "Maybe" : i.isAlive.data ? "Yes" : "No") << "\t"
                  << i.age << std::endl;
    }

    return 0;
}

See also a slightly more complete example and benchmark, and API documentation

Performance

Benchmark results compared to RapidJSON:

$ make test
μ-son encoded size in bytes: 3496

real    0m4.765s
user    0m4.697s
sys     0m0.000s

JSON encoded size in bytes: 12007

real    0m5.249s
user    0m5.213s
sys     0m0.007s

Document size compared to popular binary JSON encoding schemes:

Encoding Size, bytes
JSON 12008
BSON 12360
Smile 8194
UBJSON 8927
CBOR 7513
μ-son 3497
gzipped JSON 1890

Compared to normal JSON, μ-son data is almost 4 times smaller without sacrificing round-trip performance or human readability.

Formal spec

Grammar

muson := signature '\n'? data

signature     := atomic_type | optional_type | list_type | object_type 
atomic_type   := 'String' | 'Int' | 'Real' | 'Bool' | 'Null'
optional_type := '?' signature
list_type     := '[' signature ']'
object_type   := '{' field_types? '}'
field_types   := field_type (',' field_type)+
field_type    := [^:]+ ':' signature

data   := token+
token  := string | int | real | bool | null | list
string := backreference | '"' char* '"'
char   := '\\"' | [^"]
int    := backreference | '#' [^"#TF~[\]]+
real   := backreference | '#' [^"#TF~[\]]+
bool   := 'T' | 'F'
null   := '~'
list   := '[' token* ']'
backreference := '*' [0-9]

An example μ-son document:

{root:[{cost:Real,id:Int,private:?Bool,tags:?[{name:String,visible:?Bool}]}]}
[#0.25#1~["blue"T]#0.34#2~[*0~]*1#3T~]

It was generated from this JSON:

{ "root" : [
    { "id": 1, "cost": 0.25, "tags": [ {"name": "blue", "visible":true} ] },
    { "id": 2, "cost": 0.34, "tags": [ {"name": "blue", "visible":null} ] },
    { "id": 3, "cost": 0.25, "private": true }
]}

Encoding and decoding JSON to/from μ-son

Encoding a JSON node as a stream of μ-son tokens works by applying a μ-son type signature to a JSON document:

  • String: output "str", where str is the string value, with the double quote symbol (") escaped: \". No other symbols can be escaped, they must be printed as-is.
  • Int and Real: output #num, where num is the numeric value.
  • Bool: output T or F.
  • Null: output ~ if the JSON value is null; if it is not null then generate an error.
  • Optional[T]: output ~ if the JSON value is null; if it is not null then output whatever corresponds to type T.
  • List[T]: output [, then output each list element, then output ].
  • Object[(key,T),(key,T),...]: for each key attempt to get the corresponding value from the JSON object; if the key does not exist, take a default value of null; output the value.

Decoding works in reverse. Note that each output token starts with a unique ASCII symbol that corresponds to its type; this makes it easy to sanity-check input data while decoding.

Note: objects are output as tuples, without keys, delimiters or nested structure. All of that can be inferred from the type signature.

Backreferences

String, integer and floating-point values can be replaced by backreferences. A backreference token has the form *N, where N is a digit from 0 to 9.

A μ-son decoder must keep three caches of the last 10 string, integer and floating-point values. (I.e., each of the three types has a separate cache of ten elements.) Values in the cache are stored in order of insertion.

Here are the rules for tracking backreferences through a cache:

  • When encoding, put() each value into the cache:
    • If the value is in the cache, remove it from the cache and produce a backreference token. *0 if the value equals the most-recently-inserted item in the cache, *1 if was the next-most-recent, etc.
    • Regardless of whether a backreference token was produced, append the value into the case as most-recently-inserted.
  • When decoding, remove the corresponding value from the cache if a backreference token is encountered. Regardless of what kind of token was encountered, append the value into the cache as most-recently-inserted.

Inference rules

Type inference works by traversing the input JSON recursively and generating a type tag for each JSON node. When this algorithm is applied to a JSON document a μ-son type signature is the result.

  • Strings -> String.
  • Integers -> Int.
  • Floating-point numbers -> Real.
  • Booleans -> Bool.
  • The value null -> Null.
  • Lists: recursively infer a type for each list element, then unify the resulting types into a single type. Generate the type List[T].
  • Objects: recursively infer a type for each object value, and for each generate a pair (key,T). Generate the type Object[(key,T),(key,T),...].

Unification is the process of figuring out if two types can coexit in one list. Given two types, it returns one common-denominator type for both. Here are the rules for unifying two types:

  • If the two types are the same type, then return it.
  • If one of the types is Null:
    • If the other type is Optional, return it.
    • Otherwise, return Optional[T].
  • If one or both of the types is Optional, remove Optional where applicable, unify the stripped types, and return the result of the unification wrapped in an Optional.
  • If both types are List, then remove List from both types, unify them, and return the result wrapped in a List.
  • If both types are Object:
    • Combine all keys from both types into a single set.
    • If a key is missing from one of the types, replace it with a (key,Null).
    • For each key, unify the two types corresponding to this key, and put the resuting (key,T) into the resulting Object.
  • Otherwise, the JSON is ill-typed and cannot have a valid type signature, and μ-son encoding aborts.