NebulOS is a Big Data platform that allows a datacenter to be treated as 
a single computer. With NebulOS, the process of writing a massively 
parallel program for a datacenter is no more complicated than writing a
Python script for a desktop computer. Users can run pre-existing 
analysis software on data distributed over thousands of machines with
just a few keystrokes. This greatly reduces the time required to develop
distributed data analysis pipelines. The platform is built upon Apache 
Mesos and the Hadoop Distributed File System, from which it inherits fast
data throughput and fault tolerance. NebulOS enhances these technologies 
by adding an intuitive user interface, automated task monitoring, and 
other usability features.

NebulOS is an effort of the Multidisciplinary Image Processing Laboratory 
(MIPLab) at the University of California, Riverside. Funding for the 
development of NebulOS was provided by the Vice Chancellor for Research 
at the University of California, Riverside (Aragon-Calvo M. A., 2013).

Refer to http://arxiv.org/abs/1503.02233 for more information.


Contents of this document:

    * Framework Concepts
    * Framework Components
        - General task launcher
        - Filename pattern-based task launcher
        - Template-based task launcher
        - Task monitoring, stopping, and restarting
    * Compiling
    * Installation



FRAMEWORK CONCEPTS

The framework can run any executable file, whether a compiled binary or a script.

Individual tasks are performed on the cluster in parallel, with
no communication among tasks (although automated inter-task communication
is planned). Upon completion, each task can:

  1) write its output data to the distributed file system (DFS),
  2) send its standard output stream to the NebulOS scheduler, or
  3) send data to both the DFS and the NebulOS scheduler.

The system is aware of data placement in the HDFS. Tasks whose commands
contain HDFS filenames are treated specially: they are scheduled to run
on the nodes that hold the largest amount of the relevant data. This
reduces network traffic and increases data throughput.

The system is resistant to node failure: if a node fails, its tasks
are re-scheduled and performed on other machines. The system also
monitors the progress of individual tasks and can detect when a
process is having problems. If a task is taking too long, or its
output data indicates a problem, the framework executor can kill the
process and either re-run it with different parameters or simply
inform the user that the task has an issue.

Two modes of operation are available: batch mode and streaming mode.
In batch mode, the calling thread blocks when tasks are submitted, and
no interaction with the system is possible until all tasks have completed.
Streaming mode allows interaction while tasks are still running. It is
also possible to run multiple streams at once, whereas only one batch
can run (from a single script) at any given instant.
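
For example, using the explicit launchers described under FRAMEWORK
COMPONENTS below (a minimal sketch, assuming the launchers are methods
of the Processor object; the program paths are placeholders):

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     from nebulos import Processor

     job = Processor(master = "foam:5050", job_id = "modes-demo")

     # Batch mode: the call blocks until both tasks have finished.
     job.explicit_batch("/path/to/program fileA\n"
                        "/path/to/program fileB")

     # Streaming mode: the call returns while the tasks are still
     # running, so the script can keep working or start more streams.
     job.explicit_stream("/path/to/program fileC\n"
                         "/path/to/program fileD")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~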



FRAMEWORK COMPONENTS

I)  General task launcher

      Signature: explicit_batch(commands)
                 explicit_stream(commands)
      Argument: 

                * 'commands' is an explicit listing of the full 
                  command for each task:

                  "path/to/program --option=arg --flag filename
                   path/to/program --option=arg --flag filename
                   path/to/program --option=arg --flag filename
                          .
                          .
                          .
                   path/to/program --option=arg --flag filename"

      To instruct the system to treat a particular substring as an HDFS 
      filename, enclose the filename in brackets: <[/path/to/file]>
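
      A minimal usage sketch (assuming the launchers are methods of the
      Processor object; the program and HDFS paths are placeholders):

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         from nebulos import Processor

         job = Processor(master = "foam:5050", job_id = "explicit-demo")

         # One task per line; the bracketed paths are treated as HDFS
         # files, so each task is scheduled near its data.
         commands = ("/path/to/program --option=arg --flag <[/data/file1]>\n"
                     "/path/to/program --option=arg --flag <[/data/file2]>\n"
                     "/path/to/program --option=arg --flag <[/data/file3]>")

         job.explicit_batch(commands)   # blocks until all tasks complete
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~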


II)  Filename Pattern-based task launcher

     Signature: glob_batch(command_template, filename_pattern)
                glob_stream(command_template, filename_pattern)
     Arguments: 

                * 'command_template' is a string containing the
                  command that will be performed on a list of files:

                  "/path/to/program --option=arg --flag %f%"

                  where %f% is replaced with a filename matching the
                  pattern given in the 'filename_pattern' argument. The
                  filename substituted for %f% is automatically wrapped
                  in HDFS brackets, as in: <[filename]>. If the files
                  are not on the HDFS, use the generic placeholder %c%
                  instead.

                * 'filename_pattern' is a string containing a filename 
                  pattern:

                  "/directory/name/file*"

                    where the wildcard character * is expanded as in the
                    Unix shell.
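
     A minimal usage sketch (same assumption as above that the launchers
     are methods of the Processor object; program and file paths are
     placeholders):

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         from nebulos import Processor

         job = Processor(master = "foam:5050", job_id = "glob-demo")

         # Run the program once per HDFS file matching the pattern.
         # Each %f% is filled in with a matching filename, wrapped in
         # HDFS brackets so the task is scheduled near its data.
         job.glob_batch("/path/to/program --option=arg --flag %f%",
                        "/directory/name/file*")
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~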

III) Template-Based task launcher

     Signature: template_batch(command_template, parameters)
                template_stream(command_template, parameters)

     Arguments:

                * 'command_template' is a string containing the command
                   that will be performed, with a placeholder (%c% or %f%) 
                   indicating parameters that will be replaced with the 
                   content of the 'parameters' variable.

                * 'parameters' is a list of strings or a list of lists
                   of strings. For example:

                   ["parameter1", "parameter2", "parameter3", ..., "parameterN"]

                   or

                   [["parameter1_1", "parameter1_2", "parameter1_3", ..., "parameter1_N"],
                    ["parameter2_1", "parameter2_2", "parameter2_3", ..., "parameter2_N"],
                           .
                           .
                           .
                    ["parameterN_1", "parameterN_2", "parameterN_3", ..., "parameterN_N"]]
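
     A minimal usage sketch (same assumption as above that the launchers
     are methods of the Processor object; the command and parameter
     values are placeholders):

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         from nebulos import Processor

         job = Processor(master = "foam:5050", job_id = "template-demo")

         # One task per list entry; %c% is replaced with the entry text.
         job.template_batch("/path/to/program --threshold=%c%",
                            ["0.1", "0.2", "0.5"])
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~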

IV) Task monitoring, stopping, and restarting

    NebulOS allows the user to write scripts or programs that monitor the 
    standard output and error streams as well as the resource usage of 
    each task. If certain user-defined conditions are met, the monitoring 
    script can terminate the task. The user also has the option of 
    writing a script to perform additional actions and then submit one or 
    more new tasks, based on the output of the monitoring script. Thus, 
    the user can create applications which automatically detect and fix 
    problems with individual tasks. Example syntax:

                 from nebulos import Processor
 
                 job = Processor(master = "foam:5050",
                                 monitor = "/path/to/monitor",
                                 relauncher = "/path/to/relauncher",
                                 update_interval = time_interval_in_seconds,
                                 job_id = "demo")

    For each task running on the cluster, the monitor is called every 
    update_interval seconds. For tasks that are expected to take hours to 
    complete, the user may choose to set the update interval to a few 
    minutes (e.g., update_interval = 300), while short-running tasks may
    be checked every second or two.    

    The monitor is provided with two command line arguments:    

        1) the process id number of the task on the host node (pid).
        2) a string containing the standard output and standard error 
           of the task. Only the output produced since the previous 
           time that the monitor was called is provided (task_output).    

    So, the signature is:    

         monitor pid task_output    

    The task_output string contains the standard output and standard 
    error, wrapped in identifying tags:    

         task_output = '<stdout>standard output</stdout><stderr>standard error</stderr>'    

    The monitor can then make use of this data to perform tests. 
    For example, the PID can be used to access the /proc/pid virtual 
    directory, which (on Linux) contains information about the process. 
    Of particular interest are the files:    

        /proc/pid/cmdline
        /proc/pid/stat
        /proc/pid/statm
        /proc/pid/status
        /proc/pid/io    

    If the memory usage or CPU usage is found to be unusual or the 
    output streams contain keywords that indicate a problem with the
    task, the monitor can write "nebulos_relaunch" to its standard 
    output stream. For example, if the monitoring code is written as
    a shell script,    

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         #!/bin/bash
          
         pid=$1
         output=$2
          
         # do some analysis here, using $pid and $output 
          
         # if the analysis reveals that the task needs to be stopped:
          
         echo "nebulos_relaunch"
          
         # if you want to provide additional information:
          
         echo "additional information goes here"    
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
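
    The monitor can be any executable. As a slightly fuller sketch, a
    Python monitor might parse the tagged output and read /proc as
    follows (the keyword test and memory limit are hypothetical examples,
    not part of the framework):

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
         #!/usr/bin/env python
         import re
         import sys

         pid = sys.argv[1]
         task_output = sys.argv[2]

         # Pull the new stdout/stderr text out of the tagged string.
         stdout = re.search(r"<stdout>(.*?)</stdout>", task_output, re.S).group(1)
         stderr = re.search(r"<stderr>(.*?)</stderr>", task_output, re.S).group(1)

         # Read the task's resident memory (VmRSS, in kB) from /proc.
         vm_rss_kb = 0
         with open("/proc/%s/status" % pid) as status:
             for line in status:
                 if line.startswith("VmRSS:"):
                     vm_rss_kb = int(line.split()[1])

         # Hypothetical tests: an error keyword or more than ~8 GB of RAM.
         if "ERROR" in (stdout + stderr) or vm_rss_kb > 8 * 1024 * 1024:
             print("nebulos_relaunch")
             print("task %s reported an error or used too much memory" % pid)
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~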

    When the framework detects "nebulos_relaunch" in the 
    monitor's output stream, it terminates the task. If a 
    relauncher is specified, it is then executed:

        relauncher output_from_monitor    

    where output_from_monitor is a string containing all of the 
    output generated by the monitor, except for the leading 
    "nebulos_relaunch" keyword. The relauncher uses the information 
    contained in this string to build a new command or list of new 
    commands that will be submitted to the scheduler. These commands 
    are written to the relauncher's standard output stream, one 
    command per line (i.e., delimited by newline characters). If 
    your relauncher is written in C++, it would look something 
    like this:    

        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
        #include <iostream>
         
        int main(int argc, char* argv[])
        {
           if (argc != 2)
           {
              return 1;
           }
           
           // read argv[1] here and create new commands
         
           std::cout << "first command\n";
           std::cout << "second command\n";
           std::cout << "third command\n";
         
           return 0;
        }
        ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    NebulOS then sends these new tasks to the scheduler. 
    This feature works in batch mode as well as streaming mode.



COMPILING

Requirements:

    * NebulOS currently targets Linux and has been tested with the 
      g++ 4.8 and clang++ 3.5 compilers. It has only been tested with Python 2.7. 
      Compiling on a system with a different software environment will 
      likely require some modifications to the Makefile and possibly the 
      source, as well.

    * Apache Mesos: You will need the Mesos source and the compiled
      library, libmesos.so, in order to compile NebulOS, so download and
      compile Mesos first. Use Mesos 0.20.0 or later (earlier versions
      have not been confirmed to work with NebulOS).

    * Hadoop Distributed File System library (libhdfs.so). Only versions 
      2.4.1 and later have been tested.

    * cURLpp (https://code.google.com/p/curlpp/)

    * Boost::Python and Boost::Algorithm are required. It is a good idea
      to install the entire collection of Boost libraries, but the 
      essentials are the headers and compiled libraries associated with
      boost/algorithm/string.hpp and boost/python.hpp.

    * The headers and library for SQLite are required. NebulOS has been 
      verified to work with SQLite3 version 3.8.2 (other versions may work).


Once the requirements are met, the framework can be compiled by navigating to the
build/ directory and issuing the command:

    $ make -j

This will generate files in build/bin and build/bin/Python.

If there are compilation or linking errors, read the Makefile; you will likely
need to modify it in order to compile the framework successfully. The 'install'
target will certainly need to be customized for your system if you wish to use
the "make install" command.



INSTALLATION

The installation process will eventually be automated, but at the moment, manual 
intervention is needed. Here are the basic steps: 

1) Install Apache Mesos, the Hadoop Distributed File System (HDFS), and FUSE-DFS
   (FUSE-DFS will eventually be unnecessary, but it is currently required).

   Once Apache Mesos, HDFS, and FUSE-DFS are installed and running, mount the
   HDFS on each node of the system using FUSE-DFS.

2) Place the files 'nebulos' and 'nebulos-exec' (found in build/bin/) into a directory 
   that is accessible to all nodes (e.g., on the HDFS or on an NFS). This directory is
   the NebulOS Home directory. Add it to your environment by placing 

      export NEBULOS_HOME=/path/to/nebulos/home

   in your .bashrc file.

3) Place build/bin/libnebulos.so in a directory on your LD_LIBRARY_PATH. Typically,
   putting it in /usr/local/lib/ works.

4) Copy the build/bin/python directory to your PYTHONPATH or modify the PYTHONPATH so
   that Python can find the nebulos module. On Ubuntu, the module is placed in:
   /usr/local/lib/python2.7/dist-packages/

Steps 3 and 4 can be automated by modifying the 'install' target in the Makefile and
issuing the command: 

   $ sudo make install

5) Optionally, you can add a file named mesos_master to the NEBULOS_HOME/etc/ directory.
   In this file, include the address:port or hostname:port of the Mesos Master
   (example: foam:5050). After doing this, you will no longer need to manually specify
   the Mesos master when constructing the Processor object. So, rather than typing:

       job = Processor(master="foam:5050")

   you would just need to type:

       job = Processor()
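
Once these steps are complete, a short Python session can be used to check that
everything is wired together. This is a minimal sketch, assuming the launchers
described under FRAMEWORK COMPONENTS are methods of the Processor object and
that the mesos_master file from step 5 exists:

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
     from nebulos import Processor

     # No master argument is needed because etc/mesos_master exists;
     # otherwise, pass master = "hostname:5050".
     job = Processor()

     # /bin/hostname exists on every Linux node; each task's standard
     # output is sent back to the NebulOS scheduler, so this simply
     # reports which machines ran the three tasks.
     job.explicit_batch("/bin/hostname\n/bin/hostname\n/bin/hostname")
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~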