
Data provenance, the history of data as it moves through and between systems, provides distributed system operators with a potentially rich source of information for a wide range of uses. Operators can use provenance for troubleshooting, auditing, and forensic analysis.

Curator is a data provenance toolkit for generating, logging, storing, visualizing, and analyzing provenance information. The philosophy behind Curator is that it should be easy to incorporate into a variety of existing and new distributed systems. We have therefore structured Curator as a toolkit in which the components of the architecture have different implementations, and a developer can select the implementations that best match their distributed system. The Curator deployment architecture is shown below.

[Figure: Curator deployment]

The top of the figure above shows that Curator includes helper functions that allow a developer to easily add provenance information to log files. Curator uses the W3C PROV model to represent provenance. The toolkit provides classes that implement this model, along with serializers and deserializers for these classes. The helper functions serialize provenance objects to, for example, PROV-JSON, and the resulting JSON documents are written to the logs of the distributed system.
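To make the logged format concrete, here is a minimal sketch of what a PROV-JSON document for a single "used" relation might look like when built by hand. It does not use Curator's own classes or serializers: the class `ProvJsonSketch`, the `ex` prefix with its example namespace, and the `_:u1` relation identifier are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of Curator: hand-builds a small PROV-JSON document.
public class ProvJsonSketch {

    /** Wraps a string in JSON quotes, escaping backslashes and quotes only. */
    public static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    /** Renders a flat string-to-string map as a JSON object. */
    public static String object(Map<String, String> fields) {
        StringJoiner json = new StringJoiner(",", "{", "}");
        fields.forEach((k, v) -> json.add(quote(k) + ":" + quote(v)));
        return json.toString();
    }

    /** Builds a PROV-JSON document stating that one activity used one entity. */
    public static String usedDocument(String entityId,
                                      Map<String, String> entityAttrs,
                                      String activityId) {
        Map<String, String> used = new LinkedHashMap<>();
        used.put("prov:activity", activityId);
        used.put("prov:entity", entityId);
        return "{"
            // Assumed example namespace for the "ex" prefix.
            + "\"prefix\":{\"ex\":\"http://example.org/\"},"
            + "\"entity\":{" + quote(entityId) + ":" + object(entityAttrs) + "},"
            + "\"activity\":{" + quote(activityId) + ":{}},"
            + "\"used\":{\"_:u1\":" + object(used) + "}"
            + "}";
    }

    public static void main(String[] args) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put("ex:filename", "IMG-0942.jpg");
        // The document is a single line of JSON, so it fits in one log record.
        System.out.println(usedDocument("ex:input", attrs, "ex:transform"));
    }
}
```

Keeping each document on one line is a design choice of this sketch; it lets a line-oriented log collector treat every provenance record as a single event.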

We take this approach because it is very common for a distributed system to already be logging its own activity and we want to reuse this infrastructure, rather than build our own. Accordingly, the toolkit supports multiple logging libraries so that the developer can use their preferred library, such as Log4j.

Curator also includes a service for gathering log information and writing that information into a provenance database. This service is relatively simple, but is useful for testing and perhaps smaller deployments. Many distributed systems already include a scalable distributed logging system to gather and store log data. We again want to reuse this infrastructure and recommend that a developer deploy a logging infrastructure, such as Logstash. Curator will include plugins for popular logging systems that retrieve provenance information from streaming logs and store this provenance information.
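The core step of any such ingest path, separating provenance records from ordinary log lines, can be sketched as follows. This is an illustration under stated assumptions rather than Curator's ingest code: the class `ProvLogScanSketch` and the `PROV ` marker that precedes each JSON payload are hypothetical, and a real deployment would depend on the log layout in use.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Curator: pulls provenance payloads out of log lines.
public class ProvLogScanSketch {

    /** Assumed marker that the logging helper writes before each provenance payload. */
    static final String MARKER = "PROV ";

    /** Returns the JSON payload of a provenance log line, or null for ordinary lines. */
    public static String extractPayload(String logLine) {
        int at = logLine.indexOf(MARKER);
        if (at < 0) {
            return null;
        }
        String payload = logLine.substring(at + MARKER.length()).trim();
        // Only accept payloads that at least look like a JSON object.
        return payload.startsWith("{") && payload.endsWith("}") ? payload : null;
    }

    /** Filters a batch of log lines down to the provenance payloads, preserving order. */
    public static List<String> scan(List<String> logLines) {
        List<String> payloads = new ArrayList<>();
        for (String line : logLines) {
            String payload = extractPayload(line);
            if (payload != null) {
                payloads.add(payload);
            }
        }
        return payloads;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "2015-03-01 12:00:00 INFO App - started",
            "2015-03-01 12:00:01 INFO App - PROV {\"entity\":{\"ex:input\":{}}}");
        // Each payload would next be deserialized and written to the provenance database.
        scan(lines).forEach(System.out::println);
    }
}
```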

The toolkit supports several different storage backends for provenance information. A developer can select a backend based on performance needs and on which databases are already in use in the distributed system or are familiar to the developer. Curator currently supports several relational databases and the Accumulo key-value store.

For analysis, the Curator query interface supports simple lookup operations as well as graph traversal and construction operations. For visualization, Curator provides web applications, HTML pages, and JavaScript that a developer can integrate into the web interfaces of their distributed system.
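The difference between a simple lookup and a graph traversal can be illustrated with a small in-memory sketch. It is not Curator's query interface: the class `LineageSketch` and its method names are hypothetical, and it models only wasDerivedFrom relations between entity identifiers.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch, not part of Curator: lineage queries over derivation edges.
public class LineageSketch {

    // Maps each derived entity ID to the IDs of the entities it was derived from.
    private final Map<String, List<String>> derivedFrom = new HashMap<>();

    /** Records a wasDerivedFrom(derived, source) relation. */
    public void addDerivation(String derived, String source) {
        derivedFrom.computeIfAbsent(derived, k -> new ArrayList<>()).add(source);
    }

    /** Simple lookup: the direct sources of one entity. */
    public List<String> sourcesOf(String entity) {
        return derivedFrom.getOrDefault(entity, Collections.emptyList());
    }

    /** Graph traversal: every ancestor of an entity, in breadth-first order. */
    public Set<String> ancestorsOf(String entity) {
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(entity);
        while (!queue.isEmpty()) {
            for (String source : sourcesOf(queue.remove())) {
                // The seen set avoids revisiting shared ancestors and stops on cycles.
                if (seen.add(source)) {
                    queue.add(source);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        LineageSketch graph = new LineageSketch();
        graph.addDerivation("thumbnail", "resized");
        graph.addDerivation("resized", "original");
        System.out.println(graph.ancestorsOf("thumbnail")); // prints [resized, original]
    }
}
```

Against a real backend, the same traversal would issue one lookup per visited entity, so the choice of storage backend affects how expensive deep lineage queries are.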

Code Organization

The Curator Java code is organized into a set of modules with dependencies between some of them. The modules are:

  • core: The base module that defines the classes that implement the W3C PROV model, along with common interfaces and exceptions used throughout Curator. We have tried to minimize the number of packages that this module depends on, for easy integration with other code.
  • serialize-w3c: A serializer from Curator provenance objects to W3C formats such as PROV-JSON.
  • deserialize-w3c: A deserializer from W3C formats to Curator provenance objects.
  • logging-log4j: Support for logging via version 1.x of Log4j.
  • logging-log4j2: Support for logging via version 2.x of Log4j.
  • db-sql: A provenance database implementation that supports relational databases. This module currently supports MariaDB/MySQL, PostgreSQL, H2, and Derby.
  • db-accumulo: A provenance database implementation atop the Accumulo key-value store.
  • ingest: The base implementation of a simple service to ingest log data, extract provenance records, and write provenance information to a provenance database. The service defined in this module is abstract.
  • ingest-log4j: A concrete ingest service that can receive Log4j 1.x records.
  • ingest-log4j2: A concrete ingest service that can receive Log4j 2.x records.


To build and install the Java components of Curator, simply execute:

$ mvn install


A Curator component is configured by passing a configuration class to it. These classes are written so that they can be initialized via YAML files. Below is one example:

  provJson: true

    driver: com.mysql.jdbc.Driver
    url: "jdbc:mysql://localhost:3306/provenance"
    username: prov
    password: 5P@d1N6

    port: 4560


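// Wrap a Log4j logger so that provenance objects are serialized to PROV-JSON.
ProvenanceLogger logger = new ProvenanceLogger(Logger.getLogger("App"), new ProvJsonSerializer());
// Describe the input file as a PROV entity.
Entity input = new Entity();
input.setAttribute("filename", "IMG-0942.jpg");
// Record that the transform activity used the input entity, then log the relation.
Activity transform = new Activity();
Used used = new Used(transform, input);
logger.log(used);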

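public class ProvChanInter extends ChannelInterceptorAdapter {
    @Override
    public Message<?> preSend(Message<?> message, MessageChannel channel) {
        // Represent the message as a PROV entity identified by its message ID.
        Entity msg = new Entity(message.getHeaders().getId());
        // Copy every message header onto the entity as an attribute.
        for (String name : message.getHeaders().keySet()) {
            msg.setAttribute(name, message.getHeaders().get(name));
        }
        // If the message names a predecessor, log the derivation.
        if (message.getHeaders().containsKey("previousId")) {
            logger.log(new WasDerivedFrom(message.getHeaders() ...
        }
        return MessageBuilder.fromMessage(message) ...
    }
}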

Technical Report

There is a technical report associated with Curator that can be found here.