7. Update strategies

If more than one record resolves to the same node identity during the import (i.e. two or more records have the same key), the importer needs to know what action to take. You specify this with the "update_strategy" attribute in the schema mapping for each node type. There are three options:

  1. "update_strategy": "merge" - combines the data into a single node
  2. "update_strategy": "version" - creates a new node regardless of any others with the same key
  3. "update_strategy": "unique" - ignores any nodes that are duplicates of an existing node

These options are described in more detail below.
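
As a rough illustration of the attribute itself, the three permitted values can be modelled as an enumeration. This is only a sketch: the `UpdateStrategy` class and the `strategy_for` helper are hypothetical names, and the dictionary stands in for a single node mapping that contains nothing but the "update_strategy" attribute. It is not how databridge itself reads its schema.

```python
from enum import Enum


class UpdateStrategy(Enum):
    """The three values documented for "update_strategy"."""
    MERGE = "merge"
    VERSION = "version"
    UNIQUE = "unique"


def strategy_for(node_mapping):
    """Return the strategy named in a node mapping (hypothetical helper).

    Raises ValueError if the value is not one of the three documented options.
    """
    return UpdateStrategy(node_mapping["update_strategy"])


# A stand-in for one node type's mapping, reduced to the one attribute
# this page describes.
customer_mapping = {"update_strategy": "merge"}
chosen = strategy_for(customer_mapping)
```

Rejecting unknown values early is a design choice of this sketch; the page does not say how the importer treats an unrecognised strategy.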


Unique

Most of the time you will want to use the unique strategy. This strategy is the most efficient overall as it requires the fewest write operations to the graph. It is meant to handle the case where you're importing data and the same objects are represented on different rows, and where the object's properties do not change. In other words, for when you have duplicate data. Duplicate data often occurs in CSV extracts or when SQL queries return joins across tables. In this case, selecting the unique strategy will ensure only one copy of the node and its associated properties will be created during the import.
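
The behaviour described above can be imitated with a small in-memory model. The sketch below is an assumption-laden illustration, not the importer's code: `import_unique`, the `customer_id` key and the sample rows are all invented for the example, and a dictionary stands in for the graph.

```python
def import_unique(rows, key_field):
    """Create one node per key and skip any later row with the same key."""
    nodes = {}
    writes = 0
    for row in rows:
        key = row[key_field]
        if key in nodes:
            continue  # duplicate identity: ignored, so no write happens
        nodes[key] = dict(row)
        writes += 1
    return nodes, writes


# Rows as they might come back from a SQL join: the same customer
# appears once per order (made-up data).
rows = [
    {"customer_id": "c1", "name": "Alice"},
    {"customer_id": "c1", "name": "Alice"},
    {"customer_id": "c2", "name": "Bob"},
]
unique_nodes, unique_writes = import_unique(rows, "customer_id")
```

Three input rows produce two nodes and two writes, which is why this strategy is the cheapest when the repeated rows carry identical properties.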


Merge

The merge strategy is useful when the properties for a particular node are spread across multiple records during the import. The properties might be distributed over different rows in the same data resource, but more commonly they come from different resources.

For example, information about a specific customer might be partially held in one database and partially held in another. By importing the customer data from both resources and using the merge strategy, the data from the two sources will be combined into a single customer node. For this strategy to work, both resources must be able to agree on the identity of the customer.

Note: if a merge operation attempts to update an existing property, it will succeed and no warning will be generated. The merge strategy is very simple in this case: last write wins.
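
To make the last-write-wins rule concrete, here is a minimal in-memory sketch. It assumes two hypothetical resources, `crm` and `billing`, that share the key `customer_id`; `import_merge` and the sample records are invented for the example, and a dictionary stands in for the graph.

```python
def import_merge(resources, key_field):
    """Combine the properties of all records that share a key into one node."""
    nodes = {}
    for rows in resources:
        for row in rows:
            node = nodes.setdefault(row[key_field], {})
            # Existing properties are overwritten silently: last write wins.
            node.update(row)
    return nodes


# Two made-up resources that agree on the customer's identity.
crm = [{"customer_id": "c1", "name": "Alice", "email": "alice@old.example"}]
billing = [{"customer_id": "c1", "plan": "annual", "email": "alice@new.example"}]

merged = import_merge([crm, billing], "customer_id")
```

The resulting node holds `name` from the first resource, `plan` from the second, and the `email` from whichever resource was imported last, so the import order matters whenever two resources supply the same property.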


Version

The version strategy is meant to handle those situations where you don't want to or cannot use either of the other two strategies. There are two main scenarios where version is useful.

The first is where you want to maintain state over time. For example, you may want to track the high/low/close values of NASDAQ stocks over a month. By choosing the version strategy you can be sure that every time a particular stock is identified in your input data, a new version of it will be created in the graph, instead of an existing one being updated.

The second scenario is where you have data for an entity that is maintained in two different resources, but where those resources cannot agree on the identity of that entity. Taking the customer example above, you might have a SALES database and a MARKETING database which both contain the customer information for the same customer, but where the database keys for the same customer are different. In this situation, the version strategy will allow you to load two "versions" of the customer - one from each of the different resources.
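
Both scenarios come down to the same behaviour: every record becomes its own node, and no lookup for an existing node takes place. The sketch below illustrates this with a list standing in for the graph. The function `import_version`, the ticker "XYZ", the key formats and all values are made up for the example.

```python
def import_version(rows, graph=None):
    """Append one new node per record, never updating an existing node."""
    graph = [] if graph is None else graph
    for row in rows:
        graph.append(dict(row))
    return graph


# Scenario 1: state over time. The same stock appears on each day
# and each appearance becomes a separate version (made-up figures).
quotes = [
    {"symbol": "XYZ", "day": 1, "high": 10.5, "low": 9.8, "close": 10.1},
    {"symbol": "XYZ", "day": 2, "high": 10.9, "low": 10.0, "close": 10.7},
]
stock_nodes = import_version(quotes)

# Scenario 2: two resources hold the same customer under different keys,
# so the graph ends up with one "version" from each resource.
sales = [{"customer_key": "S-1001", "name": "Alice"}]
marketing = [{"customer_key": "M-77", "name": "Alice"}]
customer_nodes = import_version(marketing, import_version(sales))
```

In the first case the two nodes share a key and differ in their values; in the second they describe the same customer but carry different keys. Linking the versions to each other is outside the scope of this sketch.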
