Knowledge Expansion

The knowledge expansion algorithm is the inference engine of ProbKB, a PROBabilistic Knowledge Base system. It applies first-order inference rules to infer implicit knowledge from existing knowledge bases. ProbKB models knowledge bases as database relations; accordingly, knowledge expansion can be expressed as a few joins among the facts and rules tables, applying the rules in batches. This approach yields a 237x speedup on the TextRunner knowledge base over the state-of-the-art system, Tuffy. Furthermore, ProbKB runs on massively parallel processing (MPP) databases, including Pivotal Greenplum and Apache HAWQ, where the queries execute in parallel. ProbKB also uses semantic constraints to improve both quality and efficiency during the expansion task; applying these constraints improves the precision of the inferred facts by 0.61. This repository provides the knowledge expansion software and the datasets used in our experiments.
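To illustrate the join-based formulation, the sketch below applies all length-1 Horn rules of the form head(x, y) :- body(x, y) in a single batch. The table and column names (facts, rules, pred, subj, obj) are illustrative only, not ProbKB's actual schema:

```sql
-- Illustrative schema: facts(pred, subj, obj), rules(head_pred, body_pred).
-- One expansion step: apply every length-1 rule as a single batch join,
-- keeping only facts that are not already known.
INSERT INTO facts (pred, subj, obj)
SELECT r.head_pred, f.subj, f.obj
FROM rules r
JOIN facts f ON f.pred = r.body_pred
EXCEPT
SELECT pred, subj, obj FROM facts;
```

Repeating this step until it inserts no new rows reaches the inference fixpoint; rules with longer bodies simply add more joins against the facts table.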


License

ProbKB is released under the BSD license.

If you use ProbKB in your research, please cite our paper:

@inproceedings{chen2014knowledge,
  title={Knowledge expansion over probabilistic knowledge bases},
  author={Chen, Yang and Wang, Daisy Zhe},
  booktitle={Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data},
  year={2014}
}

Quick Start

To install the knowledge expansion software, you need to have PostgreSQL installed; the latest version can be downloaded from the PostgreSQL website. After installation, create a database and install the SQL scripts into it. Suppose the database is called probkb; then the following scripts install the probkb schema into the database, import the data, and create the core probkb.ground() and probkb.groundFactors() procedures that perform the grounding task.

$ createdb probkb
$ psql probkb -f sql/create.sql  # Create the probkb schema and tables.
$ psql probkb -f sql/qc.sql      # Create quality control procedures.
$ psql probkb -f sql/load.sql    # Load the files in CSV format.
$ psql probkb -f sql/ground.sql  # Create grounding procedures.

To apply the procedures, first login to the probkb database:

$ psql probkb

and make the procedure calls:

probkb=# SELECT probkb.ground();
probkb=# SELECT probkb.groundFactors();

It is useful to tune the PostgreSQL settings for better performance:

probkb=# SET work_mem = '8GB';
probkb=# SET enable_mergejoin = OFF;   -- Use hash joins.

On MPP databases, e.g., Pivotal Greenplum, the queries run in parallel and achieve better performance depending on the hardware. The installation steps are the same once you have an MPP database installed.
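On Greenplum, for instance, co-locating the join keys on the same segments lets the expansion joins run locally in parallel. The sketch below hash-distributes an illustrative facts table on its predicate column; the table and column names are assumptions, not ProbKB's actual schema:

```sql
-- Illustrative: distribute facts by predicate so that rule-fact joins
-- on the predicate column are co-located on each Greenplum segment.
CREATE TABLE facts (
  pred TEXT,
  subj TEXT,
  obj  TEXT
) DISTRIBUTED BY (pred);
```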


Datasets

This repository contains datasets for the experiments, drawn from the following works:

  • A. Fader, S. Soderland, and O. Etzioni. Identifying relations for open information extraction. In EMNLP, 2011.
  • S. Schoenmackers, O. Etzioni, D. S. Weld, and J. Davis. Learning first-order horn clauses from web text. In EMNLP, 2010.
  • T. Lin, O. Etzioni, et al. Identifying functional relations in web text. In EMNLP, 2010.

We include the original datasets in the data/ directory and the parsed CSV files in the csv/ directory.


Acknowledgments

The ProbKB project is partially supported by NSF IIS Award #1526753, DARPA under FA8750-12-2-0348-2 (DEFT/CUBISM), and a generous gift from Google. We also thank Dr. Milenko Petrovic and Dr. Alin Dobra for the helpful discussions on query optimization.


Contact

If you have any questions about ProbKB, please visit the project website or contact Yang Chen, Dr. Daisy Zhe Wang, DSR Lab @ UF.