


BioNLP2BioC / Home


BioNLP2BioC converts the BioNLP-ST 2011 and 2013 GE task corpora to BioC format, a simple XML format for sharing text documents and annotations.

In the converted data, text files (in .txt) in the BioNLP corpora are split by 'newlines' and stored into BioCPassages. Entities (in .a1) and event triggers (in .a2) are stored into separate passages based on their positions in the text files. Target annotations (in .a2), including event, relation, event modification, and equivalence, are annotated at the document level.
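The passage layout described above can be sketched as follows. This is a minimal illustration, not the converter's actual code: the class and method names are hypothetical, it avoids the BioC library entirely, and it assumes that empty lines produce no passage and that annotation offsets are character offsets into the whole text file.

```java
import java.util.ArrayList;
import java.util.List;

public class PassageSplitSketch {

    /** A passage: its text plus the document-level offset of its first character. */
    static final class Passage {
        final int offset;
        final String text;

        Passage(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }
    }

    /** Splits the document text at '\n' and records where each line starts. */
    static List<Passage> splitIntoPassages(String documentText) {
        List<Passage> passages = new ArrayList<>();
        int start = 0;
        while (start <= documentText.length()) {
            int end = documentText.indexOf('\n', start);
            if (end < 0) {
                end = documentText.length();
            }
            // Assumption: empty lines are skipped rather than stored as empty passages.
            if (end > start) {
                passages.add(new Passage(start, documentText.substring(start, end)));
            }
            start = end + 1;
        }
        return passages;
    }

    /** Returns the index of the passage that fully contains [begin, end), or -1 if none does. */
    static int findPassage(List<Passage> passages, int begin, int end) {
        for (int i = 0; i < passages.size(); i++) {
            Passage p = passages.get(i);
            if (begin >= p.offset && end <= p.offset + p.text.length()) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical two-line document; the entity span covers "NF-kappa B".
        String text = "IL-2 gene expression\nActivation requires NF-kappa B.";
        List<Passage> passages = splitIntoPassages(text);
        int index = findPassage(passages, 41, 51);
        System.out.println("Entity '" + text.substring(41, 51) + "' belongs to passage " + index);
    }
}
```

An entity or trigger is placed in whichever passage contains its span, while events and relations, which may refer to annotations in different passages, stay at the document level.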

This converter was created to participate in BioCreative IV Track 1.


You can use Git to get the code. The distribution includes:

  • Java source files;
  • Eclipse project files to import the whole project into Eclipse;
  • Converted corpora in the corpus folder.

The distribution doesn't include the original BioNLP-ST 2011 and 2013 GE task corpora, but they can be downloaded by following the links above. To reproduce the converted corpora:

  1. Download the corpus and extract it into [BioNLP corpus directory].
  2. Run java -cp "bin:lib/*" bionlp.BioNLP2BioC [BioNLP corpus directory] [BioC output file]
    • [BioNLP corpus directory] is the corpus folder.
    • [BioC output file] is the output BioC XML file name. Only one BioC XML file will be generated for the above folder.
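Because a single output file covers the whole folder, the converter has to group the .txt, .a1, and .a2 files in the corpus directory by document. The sketch below shows one way such grouping could work. It is an assumption about the input layout, not the converter's actual logic: it presumes that the files of one document share a base name, and the class name, method names, and file names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class CorpusListingSketch {

    /** Returns, in sorted order, the base names that have both a .txt and a .a1 file. */
    static List<String> documentIds(Collection<String> fileNames) {
        Set<String> names = new TreeSet<>(fileNames);
        List<String> ids = new ArrayList<>();
        for (String name : names) {
            if (!name.endsWith(".txt")) {
                continue;
            }
            String id = name.substring(0, name.length() - ".txt".length());
            // Assumption: a document is only usable when its entity file is present.
            if (names.contains(id + ".a1")) {
                ids.add(id);
            }
        }
        return ids;
    }

    /** True if the document also has a .a2 file with triggers and target annotations. */
    static boolean hasTargetAnnotations(Collection<String> fileNames, String id) {
        return fileNames.contains(id + ".a2");
    }

    public static void main(String[] args) {
        // Hypothetical directory listing.
        List<String> listing = Arrays.asList(
                "PMID-2.txt", "PMID-2.a1",
                "PMID-1.txt", "PMID-1.a1", "PMID-1.a2",
                "README");
        for (String id : documentIds(listing)) {
            System.out.println(id + " has .a2: " + hasTargetAnnotations(listing, id));
        }
    }
}
```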


  • bioc.jar in lib directory.


This work is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as a United States Government employee and thus cannot be copyrighted within the United States. The data is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.

Although all reasonable efforts have been taken to ensure the accuracy and reliability of the data and its source code, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using it. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.

Please cite the authors in any work or product based on this material:

BioC: A Minimalist Approach to Interoperability for Biomedical Text Processing. Donald C. Comeau, Rezarta Islamaj Dogan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C. Wiegers, Cathy H. Wu, and W. John Wilbur. Accepted, DATABASE, 2013.

  • commons-lang3-3.1.jar, commons-io-2.4.jar, commons-cli-1.2.jar

Apache License, Version 2.0, January 2004 http://www.apache.org/licenses/


Yifan Peng, yfpeng@udel.edu