Support incremental data updates

Issue #7 open
Vince Bickers repo owner created an issue

This is a huge topic, but it would be important for many deployments that rely on continuous batch processing of data as part of an ETL pipeline. Why reload hundreds of gigabytes of data when only a couple of things have changed?

The technical challenge to overcome is how to maintain difference files for nodes and edges that have been added, changed, or removed.

For additions and deletions we cannot rely on node/relationship ids, because in Neo4j these are not stable.
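One common workaround (not prescribed by this issue, just a sketch) is to key the diff on a stable business identifier supplied by the data source, and to detect changed records by fingerprinting their properties rather than by internal id. The `key` field name and the hashing scheme below are illustrative assumptions:

```python
import hashlib

def fingerprint(record):
    """Hash a record's properties so changed rows can be detected
    without relying on internal node/relationship ids."""
    payload = "|".join(f"{k}={record[k]}" for k in sorted(record))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def diff(old, new, key="key"):
    """Compare two batches keyed on a stable business identifier.

    Returns (added, changed, removed) sets of keys. The `key` field is
    an assumption: it must be supplied by the data source, since
    Neo4j's internal ids cannot be trusted across loads."""
    old_idx = {r[key]: fingerprint(r) for r in old}
    new_idx = {r[key]: fingerprint(r) for r in new}
    added = set(new_idx) - set(old_idx)
    removed = set(old_idx) - set(new_idx)
    changed = {k for k in old_idx.keys() & new_idx.keys()
               if old_idx[k] != new_idx[k]}
    return added, changed, removed
```

Because the fingerprint covers all properties, a property-only change shows up as `changed` rather than as a delete-plus-add.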

Comments (1)

  1. Vince Bickers reporter
    • edited description
    • changed status to open

    Persistent catalogs will be used to determine data sets already loaded.

    Deletions must be logical rather than physical. Either a property or a label can be used to annotate a node deletion.

    Edge deletions are harder to handle. If A is linked to B and subsequently re-linked to C, there doesn't seem to be an easy way to identify that the link between A and B should be deleted. (If B itself is removed, the link disappears automatically.) Needs more thought.
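The persistent catalog mentioned in the comment could, for example, record a content hash per loaded file and skip files whose hash is already known. The JSON-file storage and the method names below are illustrative assumptions, not the project's actual design:

```python
import hashlib
import json
from pathlib import Path

def file_digest(path):
    """Content hash of a data file, used as its catalog identity."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

class Catalog:
    """Persistent record of data sets already loaded.

    The issue only states that a persistent catalog will be consulted
    before loading; storing it as a JSON file is an assumption."""

    def __init__(self, path):
        self.path = Path(path)
        self.entries = (json.loads(self.path.read_text())
                        if self.path.exists() else {})

    def already_loaded(self, data_file):
        return file_digest(data_file) in self.entries

    def mark_loaded(self, data_file):
        self.entries[file_digest(data_file)] = str(data_file)
        self.path.write_text(json.dumps(self.entries))
```

Keying on the content hash rather than the file name means a re-exported file with identical contents is skipped even if it was renamed.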
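A logical node deletion as described in the comment can be sketched as annotating the record rather than removing it, so that downstream queries filter on the marker. The `_deleted` property name is a hypothetical choice; in Neo4j a label such as `:Deleted` would serve equally well:

```python
def mark_deleted(nodes, removed_keys, key="key"):
    """Logical deletion: annotate nodes rather than removing them.

    Adds a `_deleted` property (an illustrative name) so downstream
    queries can exclude deleted nodes while history is preserved."""
    for node in nodes:
        if node[key] in removed_keys:
            node["_deleted"] = True
    return nodes

def live(nodes):
    """Nodes that have not been logically deleted."""
    return [n for n in nodes if not n.get("_deleted")]
```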
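For the A-to-B re-linking problem, one possible convention (an assumption, not something the issue settles) is to require that each incremental batch states the complete set of outgoing edges for every source node it mentions. Under that assumption, an old edge whose source appears in the batch but which is absent from it can be inferred as a logical deletion:

```python
def stale_edges(old_edges, new_edges):
    """Infer edge deletions by diffing adjacency per source node.

    Edges are (source, rel_type, target) triples keyed on stable
    identifiers, not internal ids. Only sources mentioned in the new
    batch are considered, so untouched nodes keep their edges."""
    touched = {(src, rel) for src, rel, _ in new_edges}
    new_set = set(new_edges)
    return {e for e in old_edges
            if (e[0], e[1]) in touched and e not in new_set}
```

So if A was linked to B and the new batch links A only to C, the A-to-B edge is flagged as stale, while edges of nodes the batch never mentions are left alone.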
