Creating novel analyses with sabroso
Introduction
Analyses are represented in Sabroso by two major pieces: the visualization, and the data for the visualization. In order to build a visualization, we need to develop a configuration explaining to Sabroso what we want. Sabroso uses Vega (though it is not limited to it) to render visualizations of analyses. Our visualization starts with the raw data from our project, but this data must then be passed through various functions in order to be usable by our visualization. To do this, we write functions, which are contained in the python/application/functions.py file.
We will first go over the YAML configuration file, which registers our visualization and analysis type with Sabroso's back-end. Afterward, we will discuss the data transformation path for getting data to our visualization.
Visualization Configuration
The example visualization configuration used in this tutorial can be found in the python/examples/d3_scatter/config.yml file. The configuration file consists of two major entries: data_types and analysis_types. A data_type is used by the Sabroso back-end to compartmentalize data of differing types into distinct catalogs. An analysis (defined by an analysis_type configuration) must be designed to consume the data from one or more of these existing data types.
Data Types
The data type given in our example configuration is:
```
#!yaml
data_types:
  - test_type
```
This registers a single data type, test_type. Any number of unique data types can be added to the system.
Analysis Types
Our example configuration only has one analysis type defined: scatter. Let's take a look at the first elements of this configuration.
Identity and Consumable Data
```
#!yaml
analysis_types:
  - name: scatter
    data_types:
      - test_type
```
The name of our analysis type must be unique, as it acts as an identifier. The data_types field is a list, which allows you to assign multiple data types to this analysis; assigning a data type tells Sabroso that this analysis can consume and display data of that type. Here we attach our previously configured data type: test_type.
Plot Configuration
Plot configuration defines our plot controls and our plotting behavior. The controls we define will interact with our plot; interaction is either a round-trip data relay between client and server, or a manipulation of the existing plot state on the client side.
Controls
Three types of controls can be added: argument controls, filter controls and configuration controls.
```
#!yaml
controls: {}
```
#!yaml - name: "xRangeControl" title: "X range" text: - "" - "to" placeholders: - 'Min' - 'Max' inputNumber: 2 inputSize: "small" textSize: "small" inputType: "free-number" controlType: "filter"
Configuration Controls
#!yaml - name: "colorScheme" title: "Color scheme" text: - "" placeholders: - "Choose colors..." inputNumber: 1 inputSize: "medium" textSize: "small" inputType: "dropdown" controlType: "config" options: - - key: schemeCategory10 label: 10 color scheme - key: schemeDark2 label: Dark 8 colors
Plot
```
#!yaml
plot_configuration: {}
plugin: D3
```
The plot_configuration holds the plot definition, which must match Sabroso's definitions in order to function properly. For an example of a Vega configuration, see the bubble example. To learn how to create a Vega configuration, refer to the links below for the Vega documentation and the Vega live editor.
Transformation Configuration
```
#!yaml
transformation_configuration:
  "filterFunction": "buildQuery"
  "toDataObject": "testDataConverter"
  "transform": [{"function": "passThrough", "kwargs": {}}]
  "finalizeData": "dataFrameToBins"
  available_method: "scatter_available"
```
The "filterFunction" attribute tells Sabroso which function will build the database query. It is expected that analysis developer will create a function that translates possible user behavior into appropriate queries for the intended data store.
The toDataObject attribute tells Sabroso which function to use to convert the data from our database into a state more convenient for us to work with. In this case, we are using the testDataConverter function, which simply strips the _id component from our records and returns a pandas DataFrame of our data for each requested data set. This is the first function to be called before we begin whatever series of transforms we wish to perform on our data.
After we have a data object, we pass that data through a series of transformations, outlined by the transform attribute of our configuration. The transform attribute is a list of functions, with optional arguments, by which our data will be modified in succession. Data is transformed function by function, in the order listed in this parameter.
The data resulting from our transform series is then passed to finalizeData, where we make any final adjustments to our data before sending it back to our visualization. In our case, we call the dataFrameToBins function, which converts our initial data object (from the toDataObject function) into a dictionary specifying the requested data sets, ready for consumption by our visualization.
Finally, the available_method function specifies which of the data available in the database should be considered as analysis options.
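To make the flow concrete, here is a minimal sketch of how these stages chain together, assuming a plain dictionary of registered functions and a generic fetch callable standing in for the data-store access layer; the driver function itself is hypothetical, and only the stage order comes from the configuration above.
```
#!python
# Hypothetical driver illustrating the transformation flow described above.
# Sabroso's actual dispatch code may differ; only the stage order
# (filterFunction -> toDataObject -> transform -> finalizeData) comes from
# the transformation_configuration in this tutorial.

def run_analysis(functions, fetch, config, bucket, columns, **kwargs):
    # 1. Build the database queries from user input (filterFunction).
    queries = functions[config['filterFunction']](bucket, columns, **kwargs)

    # 2. Fetch raw records and convert them into a convenient object (toDataObject).
    raw = fetch(queries)
    data = functions[config['toDataObject']](raw)

    # 3. Apply each transform, in order, with its optional kwargs (transform).
    for step in config['transform']:
        data = functions[step['function']](data, **step.get('kwargs', {}))

    # 4. Final shaping before the data goes back to the visualization (finalizeData).
    return functions[config['finalizeData']](data, **kwargs)
```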
Functions
In order for our visualization to have properly formatted data, we will stage our data through a series of functions. Each of these functions must be registered in the function registry, located at python/application/utils/registry.py. To register our functions, we'll need to create an instance of our Registry:
```
#!python
registry = Registry('TestFunctions')
```
The first function we need is the filterFunction for our analysis, buildQuery:
```
#!python
def buildQuery(bucket, columns, **kwargs):
    queries = []
    for column in columns:
        # Each query is a (filter, projection) pair.
        query = ({}, {})
        query[1]['x'] = 1
        query[1][column.split('data_')[1]] = 1
        if 'xRangeControl' in kwargs:
            query[0]['x'] = {}
            if '0' in kwargs['xRangeControl'] and kwargs['xRangeControl']['0']:
                query[0]['x']['$gte'] = float(kwargs['xRangeControl']['0'])
            if '1' in kwargs['xRangeControl'] and kwargs['xRangeControl']['1']:
                query[0]['x']['$lte'] = float(kwargs['xRangeControl']['1'])
        queries.append(query)
    return queries
```
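The original example does not show a registration call for buildQuery; assuming it follows the same pattern as the functions registered below, it would look like this. The sample call also illustrates how the xRangeControl values defined earlier arrive in kwargs, keyed by control name with string-indexed inputs; the bucket name and sample values are made up for illustration.
```
#!python
# Assumed registration, mirroring the pattern used for the other functions below.
registry.add('buildQuery', buildQuery)

# Illustrative call: filter-control values arrive in kwargs under the control name.
queries = buildQuery('example_bucket', ['data_y', 'data_z'],
                     xRangeControl={'0': '1.5', '1': '10'})
# Each entry is a (filter, projection) pair, for example:
# ({'x': {'$gte': 1.5, '$lte': 10.0}}, {'x': 1, 'y': 1})
```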
Next, we need our toDataObject function. In our example, we have defined this as testDataConverter, below:
```
#!python
import pandas

def testDataConverter(data):
    ret = {}
    for oneSet in data:
        data_set = []
        for datum in data[oneSet]:
            del datum['_id']
            data_set.append(datum)
        ret[oneSet[0]] = pandas.DataFrame(data_set)
    return ret
```
This function strips the _id field from our data and then converts it to a pandas DataFrame for our subsequent functions to consume. We then need to add our function to the registry:
```
#!python
registry.add('testDataConverter', testDataConverter)
```
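The indexing ret[oneSet[0]] above suggests the incoming dictionary is keyed by tuples whose first element is the data-set name; the records below are invented solely to illustrate the shape of the conversion.
```
#!python
# Purely illustrative input; the tuple keys and record values are assumptions.
raw = {
    ('data_y', 'test_type'): [
        {'_id': 'abc123', 'x': 1.0, 'y': 2.0},
        {'_id': 'def456', 'x': 2.0, 'y': 4.0},
    ],
}

frames = testDataConverter(raw)
print(frames['data_y'])
#      x    y
# 0  1.0  2.0
# 1  2.0  4.0
```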
Next come our transform functions. The example transform, passThrough, is seen below:
```
#!python
def passThrough(data, *args, **kwargs):
    for key in data:
        col = key.split("data_")[1]
        try:
            data[key][col] = data[key][col] + 1
        except KeyError:
            pass
    return data
```
The passThrough function takes our data and, to demonstrate functionality, adds one to every value in the column named by each data set's key. Arguments from 'args' controls, along with any kwargs given in the transform configuration, are passed into these functions in the same manner as the filter values above. We also add this function to the registry:
```
#!python
registry.add('passThrough', passThrough)
```
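Continuing the invented frames from the testDataConverter example above, passing them through passThrough bumps the y column by one:
```
#!python
# 'frames' is the illustrative output of testDataConverter above.
shifted = passThrough(frames)
print(shifted['data_y'])
#      x    y
# 0  1.0  3.0
# 1  2.0  5.0
```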
After the transforms have run, the data is handed to our finalizeData function. In our example, we're using the function dataFrameToBins:
```
#!python
def dataFrameToBins(data, **kwargs):
    ret = {}
    for col in ['data_y', 'data_z']:
        if col in data:
            ret[col] = []
    for data_set in data:
        data[data_set] = data[data_set].dropna(axis=0)
        for i, row in data[data_set].iterrows():
            if data_set == 'data_y':
                ret['data_y'].append([row['x'], row['y']])
            if data_set == 'data_z':
                ret['data_z'].append([row['x'], row['z']])
    return ret
```
```
#!python
registry.add('dataFrameToBins', dataFrameToBins)
```
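Carrying the same invented data through the final step, the result is a dictionary of [x, value] pairs per data set, which is the shape handed to the visualization:
```
#!python
# 'shifted' is the illustrative output of passThrough above.
bins = dataFrameToBins(shifted)
print(bins)
# {'data_y': [[1.0, 3.0], [2.0, 5.0]]}
```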
Last, we need our available_method function. In our example, we only want to specify that y and z are to be part of the analysis.
```
#!python
def scatter_available(name):
    if name not in ['x', 'a', 'title', 'project_id', '_id', 'owner']:
        return 'data_' + name
    else:
        return None

registry.add('scatter_available', scatter_available)
```
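For example, field names coming from the database would map as follows (these calls simply exercise the function above):
```
#!python
print(scatter_available('y'))      # 'data_y' -> offered as an analysis option
print(scatter_available('z'))      # 'data_z' -> offered as an analysis option
print(scatter_available('x'))      # None -> excluded
print(scatter_available('owner'))  # None -> excluded
```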
Putting it together
The final step in adding a new analysis type to the Sabroso system is to load your configuration data into the administrative database. From the python/web directory, run:
python manage.py addadmindata <location of your config file>
For example:
python manage.py addadmindata ../examples/example_config.yml
You should see "Successfully added data." if all went well, or an error related to parsing your YAML file if not. Your new analysis should now be available to your projects. To check, simply start up Sabroso and browse to a project. If your analysis uses a new data type that you've created, you'll also need to upload some data matching this data type before you can create an analysis.