
8. Performance tips

The following tips should help you avoid common problems when using the importer.


Test first

When developing the schema and resource descriptors, it is often useful to test the import before creating the graph. Creating the graph each time requires writing to disk and is the slowest part of the import process.

To test the import, you can use the test command:

#!bash

bin/databridge test import/atlas

If your input data set is very large, you can also test against just a few rows of it. The example below test-imports only the first 1000 rows:

#!bash
bin/databridge test -l 1000 import/atlas

Prefer the 'unique' strategy

When creating nodes and edges, the unique strategy requires the fewest database writes to build the graph and is therefore the most efficient. You should prefer it as the default update_strategy unless you need to merge data for a single node from different sources, or you need to keep multiple copies of the same node object.

#!json

{
    "type": "Student",
    "update_strategy": "unique",
    ...
}

Please refer to Choosing the correct update strategy for more information on the different update strategies and on when to choose each.


Avoid creating duplicates

You should take care not to create the same node multiple times in the graph. This can happen if the underlying data for the node is identically defined in different resources. This is particularly important when an entity is very large or "heavy".

A "heavy" entity is any data item (row) containing a large number of fields and/or whose field contents are very large. As a general rule of thumb, a row of data is considered "heavy" if it contains more than 512 bytes of data. Heavy entities present significant challenges to overall performance, so you should avoid them wherever possible, particularly if you have a lot of data to load.

In these situations, the optimal strategy is to load the heavy entity first, then refer to it by its identity alone in subsequent resources.

In the following example, we will suppose that the same entity is identically described in two separate data resources, data-1 and data-2.

In the schema mapping for resource data-1, we import the entity, setting up its labels and properties:

schema-1.json

#!json

{
    "resource": "data-1",
    "nodes": [
        {
            "type": "Heavy",
            "identity": ["HEAVY_ID"],
            "update_strategy": "unique",
            "labels": ....
            "properties": .... // lots of properties
        }
    ],
    "edges": []
}

In the schema mapping for resource data-2, we reference the entity we previously loaded. To do this, we define only the identity of the entity in the second data resource, without re-specifying any of its labels or properties. This reference node is all we need to create edges in the graph between it and the other node types in the second schema. Note that because the update_strategy on the entity is set to "unique", it will not be accidentally recreated in the graph.

schema-2.json

#!json

{
    "resource": "data-2",
    "nodes": [
        {
            "type": "Heavy",
            "update_strategy": "unique",
            "identity": ["HEAVY_ID"]
        },
        {
            "type": "EntityOne",
            "update_strategy": "unique",
            "identity": ["ENTITY_ONE_ID"],
            "labels": ....
            "properties": ....
        }
    ],
    "edges": [
        { "name": "E1_HEAVY", "source": "EntityOne", "target": "Heavy" }
    ]
}

Finally, in our schema.json file, we order the two schema mappings so that schema-1.json, which loads the heavy entity in full, is processed first:

#!json
{
    "include": [
        "schema-1.json",
        "schema-2.json"
    ]
}
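
Once this ordering is in place, you can re-run the test command described above to verify the combined schema mappings before committing to a full, disk-writing import:

#!bash

bin/databridge test import/atlas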
