Creating novel analyses with sabroso

Introduction

Analyses are represented in Sabroso by two major pieces: the visualization, and the data for the visualization. To build a visualization, we need to develop a configuration that tells Sabroso what we want. Sabroso uses Vega to render analysis visualizations, though it is not limited to Vega. Our visualization starts with the raw data from our project, but that data must be passed through a series of functions before the visualization can use it. To do this, we write functions in the python/application/functions.py file.

We will first go over the YAML configuration file, which registers our visualization and analysis type with Sabroso's back-end. Afterward, we will discuss the data transformation path for getting data to our visualization.

Visualization Configuration

The example visualization configuration used in this tutorial can be found in the python/examples/d3_scatter/config.yml file. The configuration file consists of two major entries: data_types and analysis_types. The data_type is used by the Sabroso back-end to compartmentalize data of differing types into distinct catalogs. An analysis (defined by the analysis_type configuration) must be designed to consume data from one or more of these existing data types.

Data Types

The data type given in our example configuration is:

#!yaml
data_types:
    - test_type
This means that when you add data to a project, you can specify that it is of test_type. Any number of unique data types can be added to the system.

Analysis Types

Our example configuration only has one analysis type defined: scatter. Let's take a look at the first elements of this configuration.

Identity and Consumable Data

#!yaml
analysis_types:
    -
        name: scatter
        data_types:
            - test_type
The name of our analysis type must be unique, as it acts as an identifier. The data_types field is a list, which allows you to assign multiple data types to this analysis. Here we attach our previously configured data type, test_type, telling Sabroso that this analysis can consume and display data of that type.

Plot Configuration

Plot configuration defines our plot controls and our plotting behavior. The controls we define interact with our plot; an interaction is either a round-trip data relay between client and server, or a manipulation of the existing plot state on the client side.

Controls

Three types of controls can be added: argument controls, filter controls, and configuration controls.

#!yaml
  controls: {}

Filter and Argument Controls

#!yaml
          -
            name: "xRangeControl"
            title: "X range"
            text:
              - ""
              - "to"
            placeholders:
              - 'Min'
              - 'Max'
            inputNumber: 2
            inputSize: "small"
            textSize: "small"
            inputType: "free-number"
            controlType: "filter"
Let's first break down how an argument or filter control works. This filter control creates a set of "free-number" fields (free-form number input boxes). The size of the text and the input fields may be specified as "small", "medium", or "large". inputNumber specifies the number of inputs to create. The text field places text before each input; it can be omitted by setting the value to null. placeholders provides the default text shown in each input. The name of the control specifies the argument name for the function on the back-end. The control type must be "filter", "args", or "config".
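For concreteness, here is the shape of the keyword arguments this filter control produces on the back-end, sketched as a Python literal (the values shown are hypothetical user input; see the buildQuery function below for how they are consumed):
#!python
# Hypothetical kwargs received by the back-end filter function when the
# user enters 0.5 and 10 into the two "xRangeControl" inputs.
kwargs = {
    'xRangeControl': {
        '0': '0.5',  # first input ("Min"), indexed as the string '0'
        '1': '10',   # second input ("Max"), indexed as the string '1'
    }
}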

Configuration Controls

#!yaml
          -
            name: "colorScheme"
            title: "Color scheme"
            text:
              - ""
            placeholders:
              - "Choose colors..."
            inputNumber: 1
            inputSize: "medium"
            textSize: "small"
            inputType: "dropdown"
            controlType: "config"
            options:
              -
                -
                  key: schemeCategory10
                  label: 10 color scheme
                -
                  key: schemeDark2
                  label: Dark 8 colors
Configuration controls adjust the plot configuration in real time. This control is similar to the argument and filter controls, but it uses a dropdown. Currently, "dropdown" and "free-number" are the allowed input types. The options field provides the list of choices for each input (hence the nested list). The value specified by key is what is passed to the plot in the case of a configuration control, and to the back-end in the case of a filter or argument control.
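Here, selecting "10 color scheme" would relay the key schemeCategory10 to the plot. Sketched as a Python literal for illustration (the actual relay happens client-side):
#!python
# Hypothetical plot-configuration state after the user picks "10 color scheme".
config_state = {'colorScheme': 'schemeCategory10'}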

Plot

#!yaml
    plot_configuration: {}
    plugin: D3
The plot section of our configuration follows the structure of the visualization plugin we are using; in this case, D3. Because we are using D3, no plot configuration is specified here. A custom D3 function named "scatter" (matching the analysis name) has been added in js/app/utils/d3.js. Some elements of our plot definition must match Sabroso's definitions in order to function properly. For an example of a Vega configuration, see the bubble example. To learn how to create a Vega configuration, refer to the links below for the Vega documentation and the Vega live editor.

Transformation Configuration

#!yaml
        transformation_configuration:
            "filterFunction": "buildQuery"
            "toDataObject": "testDataConverter"
            "transform": [{"function": "passThrough", "kwargs": {}}]
            "finalizeData": "dataFrameToBins"
        available_method: "scatter_available"
This section of the configuration will tell Sabroso how to manipulate the data in your project to prepare it for visualization.

The "filterFunction" attribute tells Sabroso which function will build the database query. It is expected that analysis developer will create a function that translates possible user behavior into appropriate queries for the intended data store.

The toDataObject attribute tells Sabroso which function to use to convert the data from our database into a state more convenient for us to work with. In this case, we are using the testDataConverter function, which simply strips the _id component from our records, and returns a pandas DataFrame of our data for each requested data set. This is the first function to be called before we begin whatever series of transforms we wish to perform on our data.

After we have a data object, we pass that data through the series of transformations listed in the transform attribute of our configuration. The transform attribute is a list of functions, with optional arguments, applied to our data in succession, in the order listed.

The data resulting from our transform series is then passed to finalizeData, where we make any final adjustments to our data before sending it back to our visualization. In our case, we call the dataFrameToBins function, which converts the data object (produced by our toDataObject function and modified by our transforms) into a dictionary of the requested data sets, ready for consumption by our visualization.

Finally, the available_method function specifies which of the fields available in the database should be offered as analysis options.
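To see how these attributes fit together, the following is a minimal sketch of the order in which Sabroso invokes the configured functions (all defined in the next section) for an analysis request. The run_analysis and fetch_from_database names are illustrative, not part of Sabroso's API:
#!python
# Illustrative sketch of the transformation pipeline, not Sabroso's actual dispatch code.
def run_analysis(bucket, columns, controls):
    queries = buildQuery(bucket, columns, **controls)       # filterFunction
    raw = fetch_from_database(queries)                      # data-store lookup (hypothetical)
    data = testDataConverter(raw)                           # toDataObject
    for step in [{"function": passThrough, "kwargs": {}}]:  # transform list
        data = step["function"](data, **step["kwargs"])
    return dataFrameToBins(data)                            # finalizeData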

Functions

In order for our visualization to have properly formatted data, we will stage our data through a series of functions. Each of these functions must be registered in the function registry, located at python/application/utils/registry.py. To register our functions, we'll need to create an instance of our Registry:

#!python
# Registry is defined in python/application/utils/registry.py (import path assumed).
from application.utils.registry import Registry

registry = Registry('TestFunctions')
The first step is to build the query from user parameters. For each requested data set (called column here), Sabroso expects a query that will locate that data. Because the default database is MongoDB, this function constructs two-part Mongo queries: a filter document and a projection document. Notice that user state from filters arrives named as specified by the filter control. Data from each input field arrives under its numeric index cast as a string, so '0' for the first field, '1' for the second, and so on.
#!python
def buildQuery(bucket, columns, **kwargs):
    queries = []
    for column in columns:
        # Each query is a (filter, projection) pair of Mongo documents.
        query = ({}, {})
        # Project the shared x field plus the column's own field (e.g. 'y' for 'data_y').
        query[1]['x'] = 1
        query[1][column.split('data_')[1]] = 1
        if 'xRangeControl' in kwargs:
            # Filter inputs arrive string-indexed; empty inputs are skipped.
            query[0]['x'] = {}
            if '0' in kwargs['xRangeControl'] and kwargs['xRangeControl']['0']:
                query[0]['x']['$gte'] = float(kwargs['xRangeControl']['0'])
            if '1' in kwargs['xRangeControl'] and kwargs['xRangeControl']['1']:
                query[0]['x']['$lte'] = float(kwargs['xRangeControl']['1'])
        queries.append(query)
    return queries
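As a quick sanity check, here is the (filter, projection) pair this function produces for a single requested column, assuming hypothetical bucket and range values:
#!python
queries = buildQuery('my_bucket', ['data_y'],
                     xRangeControl={'0': '0.5', '1': '10'})
# queries == [({'x': {'$gte': 0.5, '$lte': 10.0}},  # Mongo filter document
#              {'x': 1, 'y': 1})]                   # Mongo projection document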
The next function called to process and transform our data is the toDataObject function. In our example, we have defined this as testDataConverter, below:
#!python
def testDataConverter(data):
    ret = {}
    for oneSet in data:
        data_set = []
        for datum in data[oneSet]:
            # Strip Mongo's internal _id field from each record.
            del datum['_id']
            data_set.append(datum)
        # Key the result by the data-set name (the first element of the key).
        ret[oneSet[0]] = pandas.DataFrame(data_set)
    return ret
This function simply removes the _id field from our records, then converts each data set to a pandas DataFrame for our subsequent functions to consume. We then add our function to the registry:
#!python
registry.add('testDataConverter', testDataConverter)
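To make the shapes concrete, here is a hypothetical call. The tuple keys (whose first element is the data-set name) are an assumption based on the oneSet[0] indexing above:
#!python
raw = {('data_y', 0): [  # hypothetical tuple key from the fetch step
    {'_id': 'abc123', 'x': 1.0, 'y': 2.0},
    {'_id': 'def456', 'x': 2.0, 'y': 3.0},
]}
frames = testDataConverter(raw)
# frames == {'data_y': DataFrame with columns x and y}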
Next, our data makes its way through all of our transform functions. In this example, we are only passing our data through one transformer, but there is no limit on how many transforms you may chain. The passThrough function is shown below:
#!python
def passThrough(data, *args, **kwargs):
    for key in data:
        # Recover the column name from the data-set key (e.g. 'y' from 'data_y').
        col = key.split("data_")[1]
        try:
            # Demonstration transform: shift every value in the column by one.
            data[key][col] = data[key][col] + 1
        except KeyError:
            pass

    return data
The passThrough function receives our data together with any positional and keyword arguments supplied by our config. To demonstrate functionality, it adds one to every value in each data column. Values from args controls are passed into these functions in the same manner as the filter values above. We also add this function to the registry:
#!python
registry.add('passThrough', passThrough)
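Running passThrough over converted frames shifts each data column by one while leaving x untouched, as this self-contained sketch shows:
#!python
import pandas

frames = {'data_y': pandas.DataFrame([{'x': 1.0, 'y': 2.0},
                                      {'x': 2.0, 'y': 3.0}])}
frames = passThrough(frames)
# frames['data_y']['y'] is now [3.0, 4.0]; the x column is unchanged.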
The final step in our transformation pipeline is the finalizeData function. In our example, we're using the function dataFrameToBins:
#!python
def dataFrameToBins(data, **kwargs):
    ret = {}
    # Prepare an output list for each expected data set.
    for col in ['data_y', 'data_z']:
        if col in data:
            ret[col] = []
    for data_set in data:
        # Drop rows with missing values before emitting points.
        data[data_set] = data[data_set].dropna(axis=0)
        for i, row in data[data_set].iterrows():
            if data_set == 'data_y':
                ret['data_y'].append([row['x'], row['y']])
            if data_set == 'data_z':
                ret['data_z'].append([row['x'], row['z']])
    return ret
This function renames and bins the data into the data sets the visualization expects: a list of [x, value] pairs per data set.
#!python
registry.add('dataFrameToBins', dataFrameToBins)
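Continuing the passThrough example above, the finalized output is a plain dictionary ready for the scatter plot:
#!python
bins = dataFrameToBins(frames)
# bins == {'data_y': [[1.0, 3.0], [2.0, 4.0]]}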
The other required function is the available_method function. In our example, we only want the y and z fields to be offered as part of the analysis.
#!python
def scatter_available(name):
    # Exclude bookkeeping fields; everything else becomes an analysis option.
    if name not in ['x', 'a', 'title', 'project_id', '_id', 'owner']:
        return 'data_' + name
    else:
        return None

registry.add('scatter_available', scatter_available)
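For example, a field named y is surfaced as data_y, while an excluded field such as x returns None and is hidden:
#!python
scatter_available('y')  # returns 'data_y': offered as an analysis option
scatter_available('x')  # returns None: excluded from the analysis options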

Putting it together

The final step in loading a new analysis type into the Sabroso system is to load your configuration data into the administrative database. From the python/web directory, run:

python manage.py addadmindata <location of your config file>
In our example scenario, we would run (from the python/web directory):
python manage.py addadmindata ../examples/d3_scatter/config.yml
This will give the front-end access to your new configurations. You should see the response "Successfully added data." if all went well, or an error related to parsing your YAML file if not. Your new analysis should now be available to your projects. To check, simply start up Sabroso and browse to a project. If your analysis uses a new data type that you've created, you'll also need to upload some data matching this data type before you can create an analysis.
