(* DO NOT EDIT (digest: 1166af0860afb5a6072298ba0ffe2804) *)
This is the README file for the cadastr distribution.

OCaml data structures: generic interfaces, implementations

See the file INSTALL.txt for building and installation instructions.


    What is it?

  Cadastr ( = "OCaml Data Structures") is a library that lets you
work with OCaml data structures in a uniform manner.

  It also offers some simple ways to work around string mutability
and Unicode characters.  These are completely non-restrictive, so you
can still mutate any string you want.  See section "Strings" below.

    What does Cadastr provide?

  Interfaces and implementations for common data structures:
containers (todo), maps from keys to values, and possibly other
data structures later.
  An interface or implementation is added on demand, when the
author or a contributor needs it in their work.

    How is it made?

  Data structures are wrapped in classes and objects, and each
method call is dispatched to the correct data structure's
implementation by ordinary OO method dispatch.

  So functions that work with Cadastr values do not depend on
their representation.

  As for creating values, you can define your own classes.
For example, you can use either

      class c_seg_map = Simp.map_rw_assoc [seg, disp_handler] ~keq:String.eq;

  to use assoc lists for values created with "new c_seg_map", or

      module Tr = Simp.Tree(String);
      class c_seg_map = Tr.map_rw_tree [disp_handler];

  to use trees (OCaml stdlib's Map module) for "new c_seg_map"
and all subsequent work with these values.
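
  The idea of hiding the representation behind method dispatch can be
sketched in plain OCaml syntax (not the revised syntax of the examples
above).  All names below (map_rw, get_opt, replace, length, Tree,
count_words, demo) are hypothetical and are not taken from Cadastr's
actual interfaces; this only illustrates the technique.

```ocaml
(* Hypothetical interface: a read-write map from 'k to 'v. *)
class type ['k, 'v] map_rw = object
  method get_opt : 'k -> 'v option
  method replace : 'k -> 'v -> unit
  method length : int
end

(* Implementation backed by an association list; ~keq compares keys. *)
class ['k, 'v] map_rw_assoc ~(keq : 'k -> 'k -> bool) = object
  val mutable items : ('k * 'v) list = []
  method get_opt k =
    match List.find_opt (fun (k', _) -> keq k k') items with
    | Some (_, v) -> Some v
    | None -> None
  method replace k v =
    items <- (k, v) :: List.filter (fun (k', _) -> not (keq k k')) items
  method length = List.length items
end

(* Implementation backed by the stdlib Map functor. *)
module Tree (K : Map.OrderedType) = struct
  module M = Map.Make (K)
  class ['v] map_rw_tree = object
    val mutable items : 'v M.t = M.empty
    method get_opt k = M.find_opt k items
    method replace k v = items <- M.add k v items
    method length = M.cardinal items
  end
end

module Tr = Tree (String)

(* Client code is written against the class type only, so it works
   with either representation. *)
let count_words (m : (string, int) map_rw) (words : string list) : unit =
  List.iter
    (fun w ->
      let n = match m#get_opt w with Some n -> n | None -> 0 in
      m#replace w (n + 1))
    words

(* Run the same client code on both implementations. *)
let demo () =
  let a = new map_rw_assoc ~keq:String.equal in
  let t = new Tr.map_rw_tree in
  count_words a ["x"; "y"; "x"];
  count_words t ["x"; "y"; "x"];
  (a#get_opt "x", a#length, t#get_opt "x", t#length)
```

  Switching the underlying structure changes only the "new ..." line;
count_words stays untouched.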

    Drawbacks (read carefully!)

  With Cadastr abstractions you can easily plug a different underlying
data structure into your algorithm without modifying its code.
But an abstraction usually has a price.  For Cadastr, the price
is performance.

  - Every value is wrapped in an object:
    - Each value uses more memory (the overhead is constant for each
      value and depends on the type of the value, i.e. on the size of
      the object)
    - Creating each value uses more CPU (the overhead is constant
      and depends on the implementation of the value's constructor;
      usually it is small)
  - Every operation on a value is a method call, which is slower than
    a direct function call.  The OCaml compiler optimizes it in some
    cases, but it is still slower.

  Do not use Cadastr in very resource-bound code!
  However, you can usually ignore the performance overhead, because
it is small enough for most applications.


    Strings



  OCaml is a language with a long history.  Many programs rely on
the fact that its strings are very fast: they are represented as
character arrays where each character has a fixed width (8 bits);
they are null-terminated but may still contain \0 characters; and
they use a clever encoding that lets the length be computed in two
memory reads.
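
  As an illustration of the "two memory reads", the sketch below
recomputes a string's length from its heap block.  It assumes the
layout used by the standard OCaml runtime (block size in the header,
padding count in the last byte) and uses the Obj module, so it is for
illustration only; the name length_from_layout is hypothetical, and
real code should simply call String.length.

```ocaml
(* Sketch: recompute a string's length from its runtime block layout.
   Assumes the standard OCaml runtime representation of strings. *)
let length_from_layout (s : string) : int =
  let bytes_per_word = Sys.word_size / 8 in
  (* Read 1: the block header gives the block size in words. *)
  let block_bytes = Obj.size (Obj.repr s) * bytes_per_word in
  (* Read 2: the last byte of the block holds the number of padding
     bytes minus one. *)
  let last = Char.code (String.unsafe_get s (block_bytes - 1)) in
  block_bytes - 1 - last
```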

  But time goes on.  Unicode is now the standard way to represent text.
There is a library that gives you the full power of Unicode: the
excellent "Camomile" [1] library.  There is also convenient Unicode
support in the "OCaml Batteries Included" [2] programming environment.

  But it is too heavy for most programs.  In the author's experience,
only 20% of the software written required any knowledge of character
encoding (ascii/latin1/one-byte or utf8/unicode), and less than 5%
required full Unicode support from libraries (Camomile was used in
those cases).  Of course, this is only one man's experience.

  Also, UTF-8 has properties that let you treat it much like ordinary
one-byte strings: you can input and output it, you can concatenate
UTF-8 strings to produce valid UTF-8 strings, and you can concatenate
it with any 7-bit ASCII string and the result will be perfectly valid.
UTF-8 never produces a \0 byte except for the NUL character itself,
so it is compatible with NUL-terminated strings.  However, you can't
get a character by its index in constant time (only in O(index)
time), and you can't overwrite a character in place by its index,
since UTF-8 represents different Unicode characters with byte
sequences of different lengths (1..4 bytes).
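
  A minimal sketch of why indexing is linear: to find the n-th
character you have to walk over all the preceding byte sequences.
The function names are hypothetical (they are not Cadastr functions),
and the code assumes the input is valid UTF-8.

```ocaml
(* Length in bytes of the UTF-8 sequence whose leading byte is [c].
   Assumes valid UTF-8 and that [c] really is a leading byte. *)
let seq_len (c : char) : int =
  let b = Char.code c in
  if b < 0x80 then 1
  else if b < 0xE0 then 2
  else if b < 0xF0 then 3
  else 4

(* Byte offset of the [n]-th character (0-based).  The walk starts at
   byte 0, so the cost grows linearly with [n]. *)
let byte_offset_of_char (s : string) (n : int) : int =
  let rec go off i =
    if i = n then off
    else if off >= String.length s then invalid_arg "byte_offset_of_char"
    else go (off + seq_len s.[off]) (i + 1)
  in
  go 0 0

(* The bytes that encode the [n]-th character. *)
let nth_char_bytes (s : string) (n : int) : string =
  let off = byte_offset_of_char s n in
  String.sub s off (seq_len s.[off])

(* "a", e-acute (2 bytes), euro sign (3 bytes), "z". *)
let sample = "a\xC3\xA9\xE2\x82\xACz"
```

  Here the fourth character of sample sits at byte offset 6, not 3,
which is also why a character cannot simply be overwritten in place.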

  So Cadastr uses a partial solution: if you work with UTF-8 encoded
strings, you can "open Cd_All;; open Strings.Utf8;;" and use only
a restricted set of functions (see file "test/"; it contains
examples of commented-out code that does not compile when uncommented).
The type "char" is then Chars.Unicode.Char.t = private int.
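
  What "private int" buys can be sketched as follows.  The module name
Uchar_sketch and its functions are hypothetical, and the range check
in of_int is an assumption of this sketch, not a statement about what
Cadastr's Chars.Unicode.Char does.

```ocaml
module Uchar_sketch : sig
  type t = private int          (* readable as int, not constructible *)
  val of_int : int -> t         (* the only way to build a value *)
  val to_int : t -> int
end = struct
  type t = int
  (* Assumption: accept only Unicode scalar values. *)
  let of_int n =
    if (n >= 0 && n < 0xD800) || (n >= 0xE000 && n <= 0x10FFFF) then n
    else invalid_arg "Uchar_sketch.of_int"
  let to_int c = c
end

let euro = Uchar_sketch.of_int 0x20AC

(* Going from t to int is allowed, but only with a visible coercion. *)
let code : int = (euro :> int)

(* The reverse direction does not type-check:
   let bad : Uchar_sketch.t = 0x20AC *)
```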

  Some functions like "length" or "sub" will be added later, on demand.
But there won't be any support for mutating UTF-8 strings:
use "ropes" instead (for example, from Batteries [2]).

[1] --
[2] --


  Apart from the Unicode issue, OCaml strings are not strings in the
common sense but "byte arrays".  This approach gives great performance,
but it has a safety drawback: other code can mutate YOUR string, and you
can't prevent it.  Of course, OCaml coders are very polite and they
probably won't write code that mutates YOUR strings.  Cadastr doesn't
give you a complete solution to this problem, but by using
"open Cd_All;; open Strings.Latin1;;" or "... open Strings.Utf8" you
guarantee that you will not mutate THEIR strings: there are simply
no Strings.Latin1.String.{set,fill,blit} operations (the operations
that would be named "String.{set,fill,blit}" after "open Strings.Latin1").

  The type Strings.Yourencoding.String.t = private string, so you can
easily mutate it after coercing it to the usual string type.  But any
such coercion is a visible operation that you won't write mindlessly
(just as you'd never use the Obj module without a very good reason).
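
  The typing discipline can be sketched like this.  Latin1_sketch and
its contents are hypothetical names, not Cadastr's modules.  Note also
that in current OCaml releases the stdlib "string" type is itself
immutable (mutable buffers live in the stdlib Bytes module), so the
sketch shows only the restricted interface and the explicit coercion,
not an actual mutation.

```ocaml
module Latin1_sketch : sig
  type t = private string
  val of_string : string -> t
  val ( ^ ) : t -> t -> t       (* concatenation stays inside t *)
  val length : t -> int
  (* deliberately no set, fill or blit *)
end = struct
  type t = string
  let of_string s = s
  let ( ^ ) (a : string) (b : string) = String.concat "" [a; b]
  let length = String.length
end

let greeting =
  let open Latin1_sketch in
  of_string "hello, " ^ of_string "world"

(* Leaving the restricted type takes an explicit, visible coercion. *)
let raw : string = (greeting :> string)
```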

  This open also redefines the type "char" and the operator "^" to work
with the characters and strings of the encoding you have opened.

  Of course there is a great need for byte arrays and bytes, so there are
modules Byte (just "type t = char") and Bytes (equal to stdlib's String
module) for any kind of byte array manipulation.  Any string value
can be created by copying a byte buffer (or a piece of it, in future
versions) with the function String.of_bytes (and with String.of_bytes_sub
in future versions).  Bytes.t = string, so there is no overhead; it is
only a naming matter (it could have been named "Strings.MutableOneByte",
but "Bytes" looks better).
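
  Why the buffer is copied can be shown with a small sketch.  It is
written against the Bytes module of the current OCaml stdlib (a mutable
byte sequence), used here as a stand-in for Cadastr's own Bytes; the
of_bytes below is a hypothetical equivalent of String.of_bytes, not
its actual implementation.

```ocaml
(* Hypothetical stand-in for String.of_bytes: Bytes.to_string returns
   a fresh copy, so the result does not share storage with [b]. *)
let of_bytes (b : Bytes.t) : string = Bytes.to_string b

let buf = Bytes.of_string "abc"
let s = of_bytes buf

(* Mutating the buffer afterwards leaves the string untouched. *)
let () = Bytes.set buf 0 'X'
```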


  - Dmitry Grebeniuk < gdsfh1 at gmail dot com >
      (commits are sponsored by Amatei)