Source

sulu-ocaml-core / base / bin-prot / doc / README.tex

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
\documentclass[12pt]{article}

\usepackage{hevea}

%BEGIN LATEX
\usepackage{natbib}
%END LATEX

\newcommand{\mail}{\mailto{opensource@janestreet.com}}
\newcommand{\homeurl}{http://www.janestreet.com}
\newcommand{\athome}[2]{\ahref{\homeurl/#1}{#2}}
\newcommand{\www}{\athome{}{Markus Mottl}}

\newcommand{\ocaml}{\ahref{http://www.ocaml.org}{OCaml}}

\newcommand{\myhref}[1]{\ahref{#1}{#1}}
\newcommand{\ocsrc}[2]{\athome{ocaml/#1}{#2}}
\newcommand{\myocsrc}[1]{\athome{ocaml/#1}{#1}}

\newcommand{\janeshort}{\ahref{http://www.janestreet.com}{Jane Street Holding, LLC}}

\newcommand{\trow}[2]{\quad #1 \quad&\quad #2 \quad\\}
\newcommand{\trowl}[2]{\trow{#1}{#2}\hline}

%BEGIN LATEX
\newcommand{\theyear}{\number\year}
%END LATEX

\title{README: library ``Bin\_prot''}
\author{
  Copyright (C) 2007-\theyear \quad \janeshort\\
  Author: Markus Mottl
}
\date{New York, 2011-07-29}

% DOCUMENT
\begin{document}
\maketitle
\section{Directory contents}
\begin{center}
\begin{tabular}{|c|c|}
\hline
\trowl{Changelog}{History of code changes}
\trowl{COPYRIGHT}{Notes on copyright}
\trow{INSTALL}{Short notes on compiling and}
\trowl{}{installing the library}
\trowl{LICENSE}{License of this library}
\trowl{LICENSE.Tywith}{License of Tywith}
\trowl{Makefile}{Top Makefile}
\trowl{README.txt}{This text in ASCII format}
\trowl{doc/}{LaTeX sources for this text}
\trowl{lib/}{Library sources}
\trowl{lib\_test/}{Test applications for the Bin\_prot-library}
\trowl{syntax/}{Preprocessor for type definitions}
\end{tabular}
\end{center}

\section{What is ``Bin\_prot''}

This library contains functionality for reading and writing OCaml-values in a
type-safe binary protocol.  It is extremely efficient, typically supporting
type-safe marshalling and unmarshalling of even highly structured values at
speeds sufficient to saturate a gigabit connection.  The protocol is also
heavily optimized for size, making it ideal for long-term storage of large
amounts of data.\\

The library is highly dependable and safe to use: a rigorous test suite has
to date guaranteed that this library has never exhibited a bug in production
systems in several years of use.  ``Bin\_prot'' has been successfully
employed in mission-critical financial applications, storing many terabytes
of structured data derived from thousands of type definitions, and typically
processing millions of messages a day in realtime for low-latency applications
that must not crash.\\

Since version two this library should work with all CPU architectures
currently supported by OCaml, no matter the word size (32 or 64 bit),
endianness\footnote{Endianness defines the byte order in which machine
representations of integers are stored in main memory.}, or alignment
requirements.  It provides users with a convenient and safe way of performing
I/O on any extensionally defined OCaml type (see later sections for details).
Functions, objects, first-class modules, as well as values whose type is
bound through a polymorphic record field are hence not supported.  This is
hardly ever a limitation in practice.\\

As of now, there is no support for cyclic or shared values.  Cyclic values
will lead to non-termination whereas shared values, besides requiring more
space when encoded, may lead to a substantial increase in memory footprint
when they are read back.  It would not be trivial to support these kinds
of values in a type-safe way without noticably sacrificing performance.
If these kinds of values are needed, the user may want to use the as of
today still unsafe marshalling functions provided by OCaml.\\

This library uses the machine stack for efficiency reasons.  This can
potentially lead to a crash if the stack limit is reached.  Note that this
is also a limitation of the (unsafe) standard marshalling functions shipped
with OCaml.  This problem can happen for large amounts of data if recursion in
the type definition of the datastructure is not limited to the last element.
Only in the latter case will tail-recursion allow for (practically) unlimited
amounts of data.  If this exceedingly rare limitation ever turned out to
be an issue in a user application, it can often be solved by redefining the
datatype to allow for tail-recursion.  The limitation cannot be eliminated in
this library without significant performance impact and increased complexity.

\section{How can you use it?}

The API (.mli-files) in the library directory is fully documented.  Module
\verb=Common= defines some globally used types, functions, exceptions,
and values.  \verb=Nat0= implements natural numbers including zero.\\

Modules \verb=Read_ml= and \verb=Write_ml= contain read and write functions
respectively for all basic types and are implemented in OCaml as far as
reasonable.  If you only want to read or write single, basic, unstructured
values, this module is probably the most efficient and convenient for
doing this.\\

Otherwise you should annotate your type definitions to generate type
converters automatically (see later sections for details).  The preprocessor
in \verb=pa_bin_prot.ml= will then generate highly optimized functions
for converting your OCaml-values to and from the binary representation.
This automatically generated code will use functions in modules
\verb=Unsafe_common=, \verb=Unsafe_read_c= and \verb=Unsafe_write_c=,
which employ unsafe internal representations to achieve this performance.
The auto-generated code is extremely well-tested and should use these unsafe
representations correctly.  Developers who want to make manual use of these
unsafe calling conventions for efficiency are strongly encouraged to test
their code carefully.\\

The module \verb=Size= allows you to compute the size of basic OCaml-values
in the binary representations before writing them to a buffer.  The code
generator will also provide you with functions for your user-defined types.\\

Module \verb=Std= predefines converters for most standard OCaml types.
If you use the preprocessor macros to generate code from type definitions,
make sure that the contents of this module is visible by e.g.\ adding the
following at the top of files using this library:\\

\begin{verbatim}
  open Bin_prot.Std
\end{verbatim}

Note that you can shadow the definitions in the above module in the unlikely
event that the predefined ways of converting data are unsatisfactory to you.\\

The modules \verb=Read_c= and \verb=Write_c= wrap the unsafe low-level
converters for basic values to ones accessible safely from within OCaml and
vice versa.  They also export functions for wrapping user-defined converters.
This should help developers make their converters available in the respective
other representation (low- or high-level).  The test applications in the
distribution use these wrappers to verify the correctness of implementations
for low-level (C) and high-level (OCaml) representations.\\

The module \verb=Type_class= contains some extra definitions for type
classes of basic values.  These definitions can be passed to the function
\verb=bin_dump= in module \verb=Utils= to marshal values into buffers of
exact size using the binary protocol.  However, if bounds on the size of
values are known, it is usually more efficient to write them directly into
preallocated buffers and just catch exceptions if the buffer limits are
unexpectedly violated.  Doing so should never cause a crash.  That way one
does not have to compute the size of the value, which can sometimes be almost
as expensive as writing the value in the first place.\\

In module \verb=Utils= the function \verb=bin_read_stream= can be used
to efficiently read size-prefixed values as written by \verb=bin_dump=
there with the \verb=header= flag set to \verb=true=.  This module also
offers several useful functors.  The ones for \verb=binable= types help
users create readers and writers if a type needs to be converted to or from
some intermediate representation before marshalling or after unmarshalling
respectively.  The functors for \verb=iterable= types are helpful if some
(usually abstract) datatype offers iteration over elements and if the series of
iterated-over elements alone is sufficient to reconstruct the original value.
This allows for a more compact protocol and for robustness against changes
to the internal representation of the datatype (e.g.\ sets, maps, etc.).

\subsection{Examples}

E.g. given the following type definition:\\

\begin{verbatim}
  type t = A | B
  with bin_io
\end{verbatim}

This will generate functions \verb=bin_size_t=, \verb=bin_write_t=, and
\verb=bin_read_t=.  Furthermore the type class values \verb=bin_writer_t=,
\verb=bin_reader_t= and \verb=bin_t=.  If you use the annotation
\verb=bin_write= instead of \verb=bin_io=, then only the write and size
functions and their type class will be generated.  Specifying \verb=bin_read=
will generate the read functions and associated type class only.
The annotation \verb=bin_type_class= will generate the combined type class
only, thus allowing the user to easily define their own reader and writer
type classes.  The code generator may also generate low-level entry points
used for efficiency or backtracking.\\ \\ The preprocessor can also generate
signatures for conversion functions.  Just add the wanted annotation to the
type in a module signature for that purpose.

\section{Specification of the Binary Protocol}

The binary protocol does not contain any data other than the minimum needed
to decode values.  This means that the user is responsible for e.g.\
writing out the size of messages themselves if they want to be able to
preallocate sufficiently sized buffers before reading.  The \verb=Utils=
module provides some simple functions for that matter, though users may
obtain optimum efficiency by managing buffers themselves.\\

The basic OCaml-values are written out character-wise as described below
using hex codes for the character encoding.  Some of these values require
size/length information to be written out before the value (e.g.\ for lists,
hash tables, strings, etc.).  Size information is always encoded as natural
numbers (\verb=Nat0.t=).  The little-endian format is used in the protocol
for the contents of integers on all platforms.\\

\noindent The following definitions will be used in the encoding specifications
below:

\begin{itemize}
\item \verb=CODE_NEG_INT8= $\rightarrow$ \verb=0xff=
\item \verb=CODE_INT16= $\rightarrow$ \verb=0xfe=
\item \verb=CODE_INT32= $\rightarrow$ \verb=0xfd=
\item \verb=CODE_INT64= $\rightarrow$ \verb=0xfc=
\end{itemize}

\subsection{Nat0.t}

If the value is:

\begin{itemize}
\item $<$ \verb=0x000000080= $\rightarrow$ lower 8 bit of the integer.
\item $<$ \verb=0x000010000= $\rightarrow$ \verb=CODE_INT16= followed by lower 16 bits of integer.
\item $<$ \verb=0x100000000= $\rightarrow$ \verb=CODE_INT32= followed by lower 32 bits of integer.
\item $\geq$ \verb=0x100000000= $\rightarrow$ \verb=CODE_INT64= followed by all 64 bits of integer\footnote{Only supported on 64 bit platforms.}.
\end{itemize}

Appropriate exceptions will be raised if there is an overflow, e.g.\ if
a 64 bit encoding is read on a 32 bit platform, or if the 32 bit or 64
bit encoding overflowed the 30 bit or 62 bit capacity\footnote{One bit is
reserved by OCaml for GC-tagging, and the sign bit is lost.} of natural
numbers on their respective platforms.

\subsection{Unit values}

\begin{itemize}
\item \verb=()= $\rightarrow$ \verb=0x00=
\end{itemize}

\subsection{Booleans}

\begin{itemize}
\item \verb=false= $\rightarrow$ \verb=0x00=
\item \verb=true= $\rightarrow$ \verb=0x01=
\end{itemize}

\subsection{Strings}

First the length of the string is written out as a \verb=Nat0.t=.  Then the
contents of the string is copied verbatim.

\subsection{Characters}

Characters are written out verbatim.

\subsection{Integers}

This includes all integer types: \verb=int, int32, int64, nativeint=.
If the value is positive (including zero) and if it is:

\begin{itemize}
\item $<$ \verb=0x00000080= $\rightarrow$ lower 8 bit of the integer.
\item $<$ \verb=0x00008000= $\rightarrow$ \verb=CODE_INT16= followed by lower 16 bits of integer.
\item $<$ \verb=0x80000000= $\rightarrow$ \verb=CODE_INT32= followed by lower 32 bits of integer.
\item $\geq$ \verb=0x80000000= $\rightarrow$ \verb=CODE_INT64= followed by all 64 bits of integer.
\end{itemize}

\noindent If the value is negative and if it is:

\begin{itemize}
\item $\geq$ \verb=-0x00000080= $\rightarrow$ \verb=CODE_NEG_INT8= followed by lower 8 bit of integer.
\item $\geq$ \verb=-0x00008000= $\rightarrow$ \verb=CODE_INT16= followed by lower 16 bits of integer.
\item $\geq$ \verb=-0x80000000= $\rightarrow$ \verb=CODE_INT32= followed by lower 32 bits of integer.
\item $<$ \verb=-0x80000000= $\rightarrow$ \verb=CODE_INT64= followed by all 64 bits of integer.
\end{itemize}

All of the above branches will be considered when converting values of
type \verb=int64=.  The case for \verb=CODE_INT64= will only be considered
with types \verb=int= and \verb=nativeint= if the architecture supports it.
\verb=int32= will never be encoded as a \verb=CODE_INT64=.  Appropriate
exceptions will be raised if the architecture of or the type requested by the
reader does not support some encoding, or if there is an overflow\footnote{An
overflow can only happen with int values: one bit is reserved by OCaml for
the GC-tag.}.

\subsection{Floats}

Floats are written out according to the 64 bit IEEE 754 floating point
standard, i.e.\ their memory representation is copied verbatim.

\subsection{References and lazy values}

Same as the binary encoding of the value in the reference or of the value
calculated lazily.

\subsection{Option values}

If the value is:

\begin{itemize}
\item \verb=None= $\rightarrow$ \verb=0x00=
\item \verb=Some v= $\rightarrow$ \verb=0x01= followed by the encoding of \verb=v=.
\end{itemize}

\subsection{Tuples and records}

Values in tuples and records are written out one after the other in the
order as specified in the type specification without any extra information.
Polymorphic record fields are supported unless a value of the type bound by
the field were accessed, which would lead to an exception.

\subsection{Sum types}

Each tag is assigned an integer from $0$ to $n - 1$ in exactly the same order
as they occur in the type, where $n$ is the number of sum tags in the type.
If a value of this type needs to be written out, then if:

\begin{itemize}
\item $n \leq 256 \rightarrow$ the integer associated with the tag is written out as one character (lower 8 bits).
\item $n \leq 65536 \rightarrow$ the integer associated with the tag is written out as two characters (lower 16 bits).
\end{itemize}

Sum types with more tags are currently not supported and highly unlikely
to occur in practice.  Arguments to the tag are written out in the order of
occurrence without any extra information.

\subsection{Polymorphic variants}

The tags of these values are written out as four characters, more
precisely as the 32 bit hash value computed by OCaml for the given tag in
little-endian format.  Any arguments associated with the tag are written
out afterwards in the order of occurrence without any extra information.\\

When polymorphic variants are being read, they will be matched in order
of occurrence (left-to-right) in the type and depth-first in the case of
included polymorphic types.  The first type containing a match for the
variant will be used for reading.  Note that this only matters if semantic
invariants are used that may impose constraints on whether a value is legal.
It is strongly suggested to not use different constraints for polymorphic
variants with the same representation if it is used within the same type,
since this is inconsistent and hence confusing.

\subsection{Lists and arrays}

For lists and arrays the length is written out as a \verb=Nat0.t= first,
followed by all values in the same order as in the datastructure without
any extra information.

\subsection{Hash tables}

First the size of the hash table is written out as \verb=Nat0.t=.  Then the
writer iterates over each binding in the hash table and writes out the
key followed by the value without any extra information.  Note that this
makes reading somewhat slower than if we used the internal (extensional)
representation of the hash table, because all values have to be rehashed.
On the other hand, the format becomes more robust in case the hash table
implementation changes.

\subsection{Bigarrays of doubles and characters}

First the dimension(s) are written out as \verb=Nat0.t=.  Then the contents
is copied verbatim.

\subsection{Polymorphic values}

There is nothing special about polymorphic values as long as there are
conversion functions for the type parameters.  E.g.:

\begin{verbatim}
  type 'a t = A | B of 'a with bin_io
  type foo = int t with bin_io
\end{verbatim}

In the above case the conversion functions will behave as if \verb=foo= had
been defined as a monomorphic version of \verb=t= with \verb='a= replaced
by \verb=int= on the right hand side.

\subsection{Abstract datatypes}

If you want to convert an abstract datatype that may impose constraints on
the well-formedness of values, you will have to roll your own conversion
functions.  Use the functions in module \verb=Read_c= and \verb=Write_c=
to map between low-level and high-level representations, or implement those
manually for maximum efficiency.  The \verb=Utils= module may also come in
handy as described in earlier sections, e.g.\ if the value can be converted
to and from an intermediate representation that does not impose constraints,
or if some sort of iteration is supported by the datatype.

\section{Contact information}

\noindent In the case of bugs, feature requests and similar, you can contact
us here:\\\\

\hspace{2ex}\mail\\\\

\noindent Up-to-date information concerning this library should be available
here:\\\\

\hspace{2ex}\homeurl/ocaml\\\\

Enjoy!!\\\\

\end{document}