3.3 Schema Mappings

Schema mapping files are written in JSON format.

This section explains all the elements of a schema mapping file.


Name

Every schema mapping file must be identified by a unique name.

{
    "name": "satellites"
}

Resource

The resource attribute describes which resource descriptor file should be associated with this mapping file. By convention, resource descriptor file names correspond to the mapping file name, but you can name them whatever you like. The resource descriptor file itself must be located in the /resources folder of the import task.

{
    "name": "satellites",
    "resource": "satellites-resource.json"
}

Nodes

The nodes section defines how to map the rows of data provided by the adapter to nodes in the graph.

The example below shows two node mappings, both having the two mandatory attributes, 'type' and 'identity'.

{
    "nodes": [
        { "type" : "Satellite", 
          "identity": [ "Object" ]
        }, 
        { 
          "type" : "Orbit",  
          "identity": [ "Orbit" ]
        }
    ]
}

The mandatory attributes - type and identity

The type attribute defines the node's type or class. A label corresponding to the type will be created in the graph and associated with each node of that type.

The identity attribute consists of an array of values taken from the data, whose concatenated value can be used to generate a unique identity for an instance of this node type. In the example above, the identity value is taken from a single field in each case. This identity value is automatically mapped to a node property of the same name, and is used to build the schema index for the node's type.
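
As a quick sanity check before running an import, the two mandatory attributes can be verified with a few lines of Python. This is only a sketch: the helper name missing_mandatory_attributes is hypothetical and is not part of Databridge, and the second node mapping deliberately omits 'identity' to show what a problem report looks like.

```python
import json

# A mapping in which the second node is missing its mandatory 'identity'.
MAPPING_TEXT = """
{
    "name": "satellites",
    "resource": "satellites-resource.json",
    "nodes": [
        { "type": "Satellite", "identity": [ "Object" ] },
        { "type": "Orbit" }
    ]
}
"""

def missing_mandatory_attributes(mapping):
    """Return (node index, attribute) pairs for node mappings lacking 'type' or 'identity'."""
    problems = []
    for index, node in enumerate(mapping.get("nodes", [])):
        for attribute in ("type", "identity"):
            if attribute not in node:
                problems.append((index, attribute))
    return problems

problems = missing_mandatory_attributes(json.loads(MAPPING_TEXT))
print(problems)  # [(1, 'identity')]
```

Because json.loads rejects malformed JSON such as trailing commas, the same script also catches syntax slips in the mapping file.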

Dynamic node types

Although it is usual to specify node types directly in the mapping schema, this isn't the only option. It is also possible to have the node type generated at runtime from the data values. For example, our satellite data source might provide information about the satellite's intended purpose, e.g. Military, Commercial or Mixed-use. In this situation, where the number of type variants is small, you might choose to use this data value directly as the node's type.

To define a dynamic node type in this way the node's type attribute should be specified using curly brackets around the column that will provide the value:

{
   ...
   "nodes": [
       { "type" : "{Purpose}", "identity" : [ "Object" ] },
       ...
   ] 

}
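
To make the curly-bracket convention concrete, here is a minimal Python sketch of how a type specification might be resolved against a row of data. The function resolve_type and the sample row are hypothetical illustrations, not Databridge's actual implementation; the sketch assumes the whole value is a single bracketed column name.

```python
import re

# Matches a value that consists entirely of one bracketed column name, e.g. "{Purpose}".
PLACEHOLDER = re.compile(r"^\{(.+)\}$")

def resolve_type(type_spec, row):
    """Return the constant type, or the row's value when the type is written as {Column}."""
    match = PLACEHOLDER.match(type_spec)
    if match is None:
        return type_spec            # constant type such as "Satellite"
    return row[match.group(1)]      # dynamic type taken from the data

row = {"Object": "ISS", "Purpose": "Military"}
print(resolve_type("{Purpose}", row))   # Military
print(resolve_type("Satellite", row))   # Satellite
```

In this sketch a missing column raises a KeyError; how Databridge itself treats a missing column is not stated here.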

Defining more than one node of the same type

Sometimes the data row to be processed will contain more than one instance of the same node type. For example, suppose our satellite data contained information about two satellites, the current one, and an obsolete one it replaced. In this case, we need to disambiguate the two instances in our schema mapping, and this is done using dotted type notation:

{
    "nodes": [
        { "type" : "Satellite.1", 
          "identity": [ "Object" ]
        },
        { "type" : "Satellite.2", 
          "identity": [ "Replaced" ]
        }
    ]

}

Dotted types allow us to map (and connect if required) two or more nodes of the same type being presented in the same row of data. Note that the value to the right of the dot in a dotted type does not have to be a number - it can be any value that disambiguates it from another instance.
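
One way to picture dotted type notation is as a base type plus a free-form qualifier. The Python sketch below is illustrative only: split_dotted_type is a hypothetical helper, and it assumes the part to the left of the first dot is the node type shared by both instances.

```python
def split_dotted_type(type_spec):
    """Split 'Satellite.1' into ('Satellite', '1'); undotted types get a qualifier of None."""
    base, separator, qualifier = type_spec.partition(".")
    return base, (qualifier if separator else None)

print(split_dotted_type("Satellite.1"))         # ('Satellite', '1')
print(split_dotted_type("Satellite.OBSOLETE"))  # ('Satellite', 'OBSOLETE')
print(split_dotted_type("Orbit"))               # ('Orbit', None)
```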

Creating a node conditionally

It is also possible to skip creating a node entirely. This is done using a condition expression, which returns true or false, based on some evaluation of the current row of data. If it returns false, the mapping element associated with it (node, label, property or edge) will be skipped.

Here, a condition expression is used to exclude processing of satellites that are in a high-earth-orbit (HEO).

{
    "nodes": [
        { "type" : "Satellite", 
          "identity": [ "Object" ],
          "condition": "Alt != 'HEO'"
        }
    ]
}

Node properties

Mappings for node properties are specified using property definitions. Every property definition requires a name attribute by which the property will be known in the graph. Usually, the column name providing the value will be specified as well.

The following example shows how the property definitions for two Satellite node properties name and manned are written:

{
    "nodes": [
        { "type" : "Satellite", 
          ... 
          "properties" : [
              { "name": "name", "column": "Object" },
              { "name": "manned", "column": "Manned" }
          ] 
        }
    ]
}

Note: if the property name in the graph is the same as the column name in the data, the column attribute can be omitted, and the name attribute is then also used as the column name.

Conditional properties

In the same way as we can exclude nodes from being processed, we can also create a property only when a certain condition is met. And, just as with nodes, we use a condition expression in the property definition to achieve this. Here, the manned property in the graph will only be created if the value in the Manned column is equal to 'Y' or 'y'.

{
    "nodes": [
        { "type" : "Satellite", 
          ... 
          "properties" : [
              { "name": "manned", "column": "Manned", "condition": "Manned == 'Y' or Manned == 'y'"}
          ] 
        }
    ]
}

Using constants as property values

Instead of providing data for a property from a column in the normal way, we can instead provide a user-defined value by specifying a "value" attribute in place of the "column" attribute in the property definition.

Let's change the previous example so that when the condition is met, the property on the node will always have the value true, rather than the value from the Manned column:

{
    "nodes": [
        { "type" : "Satellite", 
          ... 
          "properties" : [
              { "name": "manned", "value": "true", "condition": "Manned == 'Y' or Manned == 'y'"}
          ] 
        }
    ]
}

Providing defaults for missing property values

When a property value is missing in the data, or is not generated in the graph because a condition has not been met, it is often useful to provide a default value instead. In the previous example, the manned property is not created at all if the condition is not met, but this is probably not what we want - we'd normally expect the property to have the value false in this case. To accomplish this we can specify a default value to be used if the condition fails:

{
    "nodes": [
        { "type" : "Satellite", 
          ... 
          "properties" : [
              { "name": "manned", "value": "true", "default": "false", "condition": "Manned == 'Y' or Manned == 'y'"}
          ] 
        }
    ]
}
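
The interplay of "column", "value", "default" and "condition" described above can be summarised in a short Python sketch. It is an illustration under stated assumptions, not Databridge code: resolve_property and is_manned are hypothetical names, the condition is evaluated outside the function as a plain Python predicate (the expression language itself is not modelled), and both None and the empty string are treated as a missing value.

```python
def resolve_property(definition, row, condition_met=True):
    """Return (name, value) for a property definition, or None if no property is created."""
    name = definition["name"]
    value = None
    if condition_met:
        if "value" in definition:
            value = definition["value"]                       # constant value
        else:
            value = row.get(definition.get("column", name))   # column defaults to the name
    if value is None or value == "":
        value = definition.get("default")                     # fall back to the default, if any
    return None if value is None else (name, value)

def is_manned(row):
    """Stand-in for the condition "Manned == 'Y' or Manned == 'y'"."""
    return row.get("Manned") in ("Y", "y")

manned = {"name": "manned", "value": "true", "default": "false"}
for row in ({"Manned": "Y"}, {"Manned": "N"}):
    print(resolve_property(manned, row, is_manned(row)))
# ('manned', 'true')
# ('manned', 'false')
```

Without the "default" attribute, the second row would produce None, meaning the property is simply not created.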

Property type conversion and extension

Databridge performs automatic type conversion for certain string formats:

Conversion        Action
string->double    Strings representing real-valued numbers will be converted to doubles
string->long      Strings representing integers will be converted to longs
string->boolean   Strings representing boolean values ("true/false") will be converted to booleans

Databridge also performs type extensions: real-number data values are extended to 64-bit double values, and integer values are extended to 64-bit longs.

Note that if a value appears that does not conform to the expected type as seen from a previous instance of the same data item, the value will be handled, but a warning will also be logged, as this situation usually indicates an error in the data source.
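
The conversion rules in the table can be approximated in Python as follows. This is a rough sketch rather than Databridge's converter: the function convert is hypothetical, boolean matching is assumed to be exact ("true" or "false" only), and Python's int and float stand in for 64-bit longs and doubles (Python's int is unbounded, and its parsers accept a few forms, such as "1_000" or "inf", that another implementation may reject).

```python
def convert(text):
    """Convert a string to bool, int or float where it clearly represents one; else keep it."""
    if text in ("true", "false"):
        return text == "true"
    try:
        return int(text)      # integers first, so "42" does not become 42.0
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        return text           # not numeric or boolean: leave as a string

for sample in ("42", "3.14", "true", "ISS"):
    print(repr(convert(sample)))
# 42
# 3.14
# True
# 'ISS'
```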

Finally, if the default type conversion does not produce the desired value for a data item, you can override it using one of the built-in type converters (or create your own). Please refer to the following section for more details: Using property type converters


Labels

It is possible to define extra labels for the nodes you are mapping, in addition to the one associated with the node type which is created automatically by Databridge. Additional labels are often created in conjunction with some condition expression, but they don't have to be.

The example below illustrates this: a 'SpaceStation' label will be associated with all Satellite nodes when the value of the "manned" data field is equal to Y.

{
    ...
    "nodes": [
        { 
          "type" : "Satellite", 
          ...
          "labels" : [ { "name": "SpaceStation", "condition": "manned == 'Y'" } ]
        }
        ...
    ]
}

Dynamic labels

Rather than using a constant for a label's name, it can instead be generated from a data value. In the same way as dynamic node types, dynamic labels are specified by using curly brackets around the required column name.

Here, we're using the value of the Object column in the data to generate the dynamic label's name.

{
    "nodes": [
        { "type" : "Satellite", 
          ...
          "labels" : [ { "name": "{Object}" } ]
        }
        ...
    ]
}

Indexes

You can also create schema indexes for any additional labels you define by using the indexes attribute in the label definition. Please note that because a schema index requires the node properties being indexed to exist in the graph, the values in the indexes attribute must correspond to property names in your property definitions.

{
    "nodes": [
        { "type" : "Satellite", 
           ...
          "properties": [
              {"name": "satellite", "column": "Object"}
          ],
          "labels" : [ {"name": "SpaceStation", "indexes": ["satellite"] } ]
        }
    ]
}

Defining the update strategy for a node

When Databridge is importing data that may contain duplicates, or re-importing the same data set a second time, or when it is processing multiple data sources that may contain copies of the same objects, it needs to know what to do with duplicates.

A duplicate node is one whose type and identity, as defined in a schema mapping, match the type and identity of a node previously imported into the graph. What happens when a duplicate is detected depends on the update strategy in use at the time.

Databridge defines four node update strategies, summarised below:

Update strategy   Action
UNIQUE            create this node if it does not exist, skip processing any duplicates
MERGE             create this node if it does not exist, update it when a duplicate occurs
VERSION           always create a new copy of this node
REFERENCE         do not check, create or update this node, we know it exists and we're providing a reference to it

If you don't define an update strategy yourself, Databridge will use the following simple rules to apply one.

  • if the node has no properties defined for it in the schema mapping, the UNIQUE strategy will be applied.

  • if the node does have properties, the MERGE strategy will be applied.

These simple rules are designed to ensure that you will never lose data during an import, but in the case where you have many actual duplicates (as opposed to genuine updates), continually invoking the default MERGE strategy involves a performance cost that is accompanied by no benefit. In this case, you'd want to avoid reprocessing the same data if possible. There may also be situations where you never want to MERGE, but instead create a new copy of the node in question.

To handle these different scenarios, you can define the update strategy you want Databridge to use, as shown below:

{
    "nodes": [
        { "type" : "Satellite", 
          "update_strategy": "unique",
           ...
           }
    ]
}
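
The default rules and the explicit override can be combined into one small decision function. The Python below is a sketch for illustration: effective_strategy is a hypothetical name, and it assumes the "update_strategy" value is case-insensitive, since the JSON example uses "unique" while the table lists UNIQUE.

```python
NODE_STRATEGIES = {"UNIQUE", "MERGE", "VERSION", "REFERENCE"}

def effective_strategy(node):
    """Return the update strategy for a node mapping, applying the default rules if none is set."""
    explicit = node.get("update_strategy")
    if explicit is not None:
        strategy = explicit.upper()
        if strategy not in NODE_STRATEGIES:
            raise ValueError(f"unknown update strategy: {explicit}")
        return strategy
    # Default rules: no properties -> UNIQUE, otherwise MERGE.
    return "MERGE" if node.get("properties") else "UNIQUE"

orbit = {"type": "Orbit", "identity": ["Orbit"]}
satellite = {"type": "Satellite", "identity": ["Object"],
             "properties": [{"name": "name", "column": "Object"}]}
pinned = dict(satellite, update_strategy="unique")

print(effective_strategy(orbit))      # UNIQUE
print(effective_strategy(satellite))  # MERGE
print(effective_strategy(pinned))     # UNIQUE
```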

You can read more about update strategies in general and how to best choose them here.


Edges

The edges section of the schema mapping is a list of edge definitions, each of which describes how to connect two nodes that have been defined in the nodes section.

Each edge definition in the edges section must contain at least the following three attributes:

Attribute   Meaning
type        the edge (relationship) type in the graph
source      a node type specified in the nodes section that represents the start node of this edge
target      a node type specified in the nodes section that represents the end node of this edge

There are a couple of points worth noting:

  • When you specify the source and target node types, Databridge maps these to the actual instances of the nodes that have already been processed on the current row.

  • If a node has been excluded from processing by virtue of some condition not being met, any edges relating to that node will also be skipped.

Below is a complete schema mapping that defines two node types, and an edge that should connect them.

{
    "name": "satellites",
    "resource": "satellites-resource.json",
    "nodes": [
        { "type" : "Satellite", 
          "identity": [ "Object" ]
        }, 
        { 
          "type" : "Orbit",  
          "identity": [ "Orbit" ]
        }
    ],
    "edges": [
       { "type": "IN_ORBIT", "source": "Satellite", "target": "Orbit" }
    ]
}
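
The two points noted above (edges connect the node instances from the current row, and are skipped when either node was excluded) can be sketched in Python as follows. The function edges_for_row and the dictionary shape are hypothetical and only model the described behaviour; they do not reflect how Databridge is implemented.

```python
def edges_for_row(edge_definitions, nodes_on_row):
    """Build (source identity, edge type, target identity) triples for one row.

    nodes_on_row maps each node type to the identity of the node created on
    this row; a type is absent if its node was excluded by a condition.
    """
    edges = []
    for definition in edge_definitions:
        source = nodes_on_row.get(definition["source"])
        target = nodes_on_row.get(definition["target"])
        if source is None or target is None:
            continue  # one endpoint was skipped, so the edge is skipped too
        edges.append((source, definition["type"], target))
    return edges

edge_definitions = [{"type": "IN_ORBIT", "source": "Satellite", "target": "Orbit"}]

print(edges_for_row(edge_definitions, {"Satellite": "ISS", "Orbit": "LEO"}))
# [('ISS', 'IN_ORBIT', 'LEO')]
print(edges_for_row(edge_definitions, {"Orbit": "HEO"}))
# []
```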

Creating an edge between two nodes of the same type

As described in the nodes section, two nodes of the same type must be disambiguated using dotted type notation. You use these dotted type names to specify the source and target when creating an edge between two nodes of the same type:

{
    "nodes": [
        { "type" : "Satellite.CURRENT", 
          "identity": [ "Current" ]
        }, 
        { 
          "type" : "Satellite.OBSOLETE",  
          "identity": [ "Obsolete" ]
        }
    ],
    "edges": [
       { "type": "REPLACED", "source": "Satellite.CURRENT", "target": "Satellite.OBSOLETE" }
    ]
}

Conditional edges

As with nodes, labels and properties, an edge can also be created conditionally via a condition expression in the edge definition. In the example below, a LIVE edge is created between a Space Agency and a Satellite it launched only if the satellite is still active:

{
    "nodes": [
        { "type" : "SpaceAgency", 
          "identity": [ "Agency" ]
        }, 
        { 
          "type" : "Satellite",  
          "identity": [ "Object" ]
        }
    ],
    "edges": [
       { "type": "LIVE", "source": "SpaceAgency", "target": "Satellite", "condition": "Status == 1" }
    ]
}

Dynamic edges

An edge's type can be generated directly from the data values of the current row, instead of being defined as a constant in the schema mapping. In an earlier example, we defined an IN_ORBIT edge between a satellite and its orbital location. Let's change that edge now so that its type is generated from the name of the actual orbital location (HEO, LEO, MEO).

A dynamic edge type is specified in exactly the same way as a dynamic node type or a dynamic label: by using curly brackets around the required column name.

{
    "name": "satellites",
    "resource": "satellites-resource.json",
    "nodes": [
        { "type" : "Satellite", 
          "identity": [ "Object" ]
        }, 
        { 
          "type" : "Orbit",  
          "identity": [ "Orbit" ]
        }
    ],
    "edges": [
       { "type": "{Orbit}", "source": "Satellite", "target": "Orbit" }
    ]
}

Edge Properties

Edge properties are defined in exactly the same manner as node properties, and support all the same features such as automatic type conversion, conditional processing and user-specified type converters.

{
    ...
    "edges": [
       { "type": "IN_ORBIT", "source": "Satellite", "target": "Orbit" ,
         "properties": [
             {"name": "launched", "column": "LAUNCH_DATE", "convert":"iso8601_date:dd MMM yyyy"}
         ]
       }
    ]
}

Please refer to the Node Properties section above for more details.

Edge update strategies

Databridge applies the same update strategy rules for edges as it does for nodes, and you can override this default strategy in the same way. The only difference is that the update strategies defined for edges do not include the REFERENCE strategy, as this is meaningless for edges:

Update strategy   Action
UNIQUE            create this edge if it does not exist, skip processing if it does
MERGE             create this edge if it does not exist, update it otherwise
VERSION           always create a new copy of this edge

Please refer to the section "Defining the update strategy for a node" above for further information.
