HTTPS SSH
=====================================================================
Artifact for the ECOOP Paper "Making an Embedded DBMS JIT-friendly"
=====================================================================


by Carl Friedrich Bolz, Darya Kurilova, Laurence Tratt
=======================================================

The paper describes SQPyte: a composed VM allowing the efficient jitted
execution of SQLite queries from within Python programs. The artifact contains
SQPyte, the PyPy-SQPyte bridge as well as benchmarks for comparing the
performance of SQPyte, SQLite and the Java-based embedded database H2.

The abstract of the paper is:

    While DataBase Management Systems (DBMSs) are highly optimized,
    interactions across the Programming Language (PL) / DBMS boundary are
    costly, even for in-process embedded DBMSs. In this paper we show that
    programs that interact with the widely-used embedded DBMS SQLite can be
    significantly optimized – by a factor of 3.4 in our benchmarks – by
    inlining across the PL / DBMS boundary. We achieved this speed-up by
    replacing parts of SQLite’s C interpreter with RPython code and composing
    the resulting meta-tracing VM – called SQPyte – with the PyPy VM. SQPyte
    does not compromise stand-alone SQL performance: it is 2.2% faster than
    SQLite on the widely used TPC-H benchmark suite.

We have packaged the artifact as a script that downloads and builds all the
code. We also provide a VirtualBox VM that already contains everything needed
and can be used directly.

Building the Code
=================

If you don't want to build the relevant code yourself, you can use the
VirtualBox VM and jump to the next section.

Otherwise you need to install the following dependencies:

- python
- mercurial
- git
- Java, including a JDK
- the PyPy dependencies, as documented here: http://doc.pypy.org/en/latest/build.html
- latex, if you want to make a PDF showing the benchmark results

Then you can use the ``make.py`` script to download and build all the
needed software of the artifact. The script assumes a unixy environment, most
flavors of Unix (including Mac OS X) should work.

To download and build all the code, run::

    $ python make.py download
    ...
    $ python make.py build
    ...

This will take about two hours.


Using the VirtualBox VM
========================

You can run and start the VM with the VirtualBox GUI. The password is ``sqpyte``.

To start the VirtualBox VM from the commandline, first import it:: 

    $ VBoxManage import sqpyte_0.2.ovf

Then to start the VM run::

    $ VBoxHeadless --startvm sqpyte -e "TCP/Ports=4455"

The VM will then boot. You can then use a VNC client to connect to the host
port 4455, or you can SSH to host port 3122 (SSH port forward is already
configured in the ovf file):

    $ ssh sqpyte@localhost -p 3122

The password is 'sqpyte'. The sqpyte account has sudo rights, should you need
them.


Scenarios
==========

After building has finished, or immediately after starting the VirtualBox VM,
we envision the following scenarios when using the artifact. We have put the
sections of the paper that are relevant for each scenario.


Using SQPyte from PyPy
-----------------------

Section 7 in the paper.

You can start the PyPy with the SQPyte bridge built in using the ``./pypy``
command. This will start a completely standard Python interpreter, compatible
with Python 2.7.10 that contains an extra ``sqpyte`` module. The ``sqpyte``
module exposes a subset of the interface of the standard Python ``sqlite3``
module. It can be used to interact with SQLite databases in the normal ways from
Python, for example like this::

    $ ./pypy
    Python 2.7.10 (cbb1faa2a19f, Mar 08 2016, 14:03:27)
    [PyPy 4.1.0-alpha0 with GCC 4.7.2] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>>>
    >>>> import sqpyte
    >>>> conn = sqpyte.Connection("tpch/sqlite/tpch.db")
    >>>> # simple arithmetic calculation
    >>>> conn.execute("select 1 + 2").next()[0]
    3
    >>>> conn.execute("select avg(quantity) from lineitem").next()[0]
    25.507967136654827
    >>>>
    >>>> # iterate over query result
    >>>> for items in conn.execute("select * from orders limit 5"):
    ....     print items
    ....
    (1, 36901, u'O', 173665.47, u'1996-01-02', u'5-LOW', u'Clerk#000000951', u'0', u'nstructions sleep furiously among ')
    (2, 78002, u'O', 46929.18, u'1996-12-01', u'1-URGENT', u'Clerk#000000880', u'0', u' foxes. pending accounts at the pending, silent asymptot')
    (3, 123314, u'F', 193846.25, u'1993-10-14', u'5-LOW', u'Clerk#000000955', u'0', u'sly final accounts boost. carefully regular ideas cajole carefully. depos')
    (4, 136777, u'O', 32151.78, u'1995-10-11', u'5-LOW', u'Clerk#000000124', u'0', u'sits. slyly regular warthogs cajole. regular, regular theodolites acro')
    (5, 44485, u'F', 144659.2, u'1994-07-30', u'5-LOW', u'Clerk#000000925', u'0', u'quickly. bold deposits sleep slyly. packages use slyly')
    >>>>
    >>>> # access table metadata
    >>>> for name, sql in conn.execute("select name, sql from sqlite_master where type = 'table';"):
    ....     print name, sql
    ....
    Part CREATE TABLE Part(
            partkey INTEGER PRIMARY KEY, name, mfgr, brand, type, size INTEGER, container, retailprice real, comment)
    Supplier CREATE TABLE Supplier(
            suppKey INTEGER PRIMARY KEY, name, address, nationkey INTEGER, phone, acctbal REAL, comment)
    PartSupp CREATE TABLE PartSupp(
            partKey INTEGER, suppKey INTEGER, availqty  INTEGER, supplycost REAL, comment)
    Customer CREATE TABLE Customer(CustKey INTEGER PRIMARY KEY, name, address, nationkey INTEGER, phone, acctbal REAL, mktsegment, comment)
    Nation CREATE TABLE Nation(
            nationkey  INTEGER PRIMARY KEY, name, regionkey INTEGER, comment)
    Region CREATE TABLE Region(
            regionkey INTEGER PRIMARY KEY, name, comment)
    LineItem CREATE TABLE LineItem(
            orderKey INTEGER, partKey INTEGER, suppKey INTEGER, lineNumber INTEGER, quantity INTEGER, extendedPrice REAL, discount REAL, tax REAL, returnFlag, lineStatus, shipDate, commitDate, receiptDate, shipInstruct, shipMode, comment)
    Orders CREATE TABLE Orders(
            orderKey INTEGER PRIMARY KEY, custKey INTEGER, orderStatus, totalPrice REAL, orderDate, orderPriority, clerk, shipPriority, comment)


Producing the Tables from the Paper
-------------------------------------

Sections 6 & 8 in the paper.

The raw data for the tables in the paper are in the file
``sqpyte-benchmarks/paper-submission.json``. To process and reproduce the tables
in the paper, run::

    $ python make.py makepdf

This will produce the file ``results.pdf`` containing all the tables that appear
in the paper. The process takes about 15 minutes.

Within the JSON file, in the benchmark scripts and in the debug output on the
console the names for the various sqpyte variants are different that in the
paper. The mapping is:

- "sqpyte" in the files corresponds to SQPyte in the paper.
- "dontcache" in the files corresponds to SQPyte[no-flags] in the paper.
- "dontlookinside" in the files corresponds to SQPyte[no-inline] in the paper.
- "sqlite" in the files corresponds to SQLite in the paper.
- "java" in the files corresponds to H2 in the paper.



Running the Benchmarks
-----------------------

Sections 6 & 8 in the paper.

You can re-run the benchmarks of the paper. Doing a full re-run is a somewhat
lengthy process that takes a few days. To start it, run::

    $ python make.py runbench

You can also run the benchmarks with a lot less iterations, which takes only a
few hours (probably at least three), like this::

    $ python make.py quickbench

However, the quick results are not as meaningful, because they contain very few
iterations, thus not giving the JITs enough time to warm up. This gives a
strong advantage to unjitted SQLite, particularly in the TPC-H benchmark set.
The quick benchmark run also does not repeat benchmarks, so there is not enough
data to produce meaningful errors for the numbers.

Both of these commands will put the benchmark results into a file
``sqpyte-benchmarks/output-<DATE>.json``.

After running the benchmark, you can produce tables with your own results by
running::

    $ python make.py makepdf

(makepdf will use the data from the most recent JSON file).


Looking at the Source Code
-------------------------------------

Sections 4 & 7 in the paper.

If you are interested in looking at the source code, here are some entry points
for reading.

Directory Structure
++++++++++++++++++++

The directories created by the ``make.py`` are:

- ``sqpyte`` the SQPyte implementation:

  - ``sqpyte/sqlite`` a modified version of the sqlite source code to make it callable from RPython
  - ``sqpyte/sqpyte`` the RPython source code of SQPyte

- ``pypy-sqpyte`` a fork of PyPy that contains the PyPy-SQPyte bridge

- ``tpch`` the TPC-H benchmark database generator

- ``sqpyte-benchmarks`` the benchmark set for SQPyte


SQPyte
+++++++

- ``sqpyte/sqlite/src/vdbe.c`` contains the source code of the SQLite bytecode
  interpreter.

- ``sqpyte/sqlite/src/sqpyte.c`` contains the SQLite bytecodes broken into
  individual functions.

- ``sqpyte/sqpyte/translated.py`` contains the RPython versions of those
  bytecodes that were translated to RPython

- ``sqpyte/sqpyte/function.py`` contains the RPython versions of (aggregate)
  functions like ``sum`` and ``avg``.


PyPy-SQPyte bridge
+++++++++++++++++++

- ``pypy-sqpyte/pypy/module/sqpyte/interp_sqpyte.py`` contains the RPython
  implementation of the ``sqpyte`` module, which is mostly type-conversion
  code, and some code interacting with ``sqpyte``.


Licenses
=========


This virtual machine image is based upon Ubuntu. Information about Ubuntu
licensing can be found at:

http://www.ubuntu.com/about/about-ubuntu/licensing

A binary and the source for SQPyte is also included. Its constituent
interpreters are licensed separately, both under the MIT License. For more
information, please refer to:

- pypy-sqpyte/LICENSE
- sqpyte/LICENSE

The source code of SQLite is in the public domain:

https://www.sqlite.org/copyright.html


The TPC-H benchmark is from TPC:

http://www.tpc.org/tpch/default.asp