BioNLP2BioC
Description
BioNLP2BioC is a simple converter from the BioNLP-ST 2011 and 2013 GE task corpora to BioC format. BioC is a simple XML format for sharing text documents and annotations.
In the converted data, text files (in .txt) in the BioNLP corpora are split by 'newlines' and stored into BioCPassages. Entities (in .a1) and event triggers (in .a2) are stored into separate passages based on their positions in the text files. Target annotations (in .a2), including event, relation, event modification, and equivalence, are annotated at the document level.
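The passage layout described above can be illustrated with a minimal sketch. It is not the converter's actual source: the class and method names (PassageSplitSketch, split, passageIndexFor) are hypothetical, and skipping empty lines is an assumption. It only shows the idea of splitting the text at newlines while tracking character offsets, then placing a .a1 entity into the passage that contains its span.

```java
import java.util.ArrayList;
import java.util.List;

public class PassageSplitSketch {

    /** A passage: its starting character offset in the document and its text. */
    static final class Passage {
        final int offset;
        final String text;

        Passage(int offset, String text) {
            this.offset = offset;
            this.text = text;
        }

        /** True if the span [start, end) lies entirely inside this passage. */
        boolean contains(int start, int end) {
            return start >= offset && end <= offset + text.length();
        }
    }

    /** Splits the document text at '\n', recording each line's starting offset. */
    static List<Passage> split(String documentText) {
        List<Passage> passages = new ArrayList<>();
        int offset = 0;
        for (String line : documentText.split("\n", -1)) {
            if (!line.isEmpty()) { // assumption: empty lines produce no passage
                passages.add(new Passage(offset, line));
            }
            offset += line.length() + 1; // +1 for the newline removed by split
        }
        return passages;
    }

    /** Returns the index of the passage containing the span, or -1 if none does. */
    static int passageIndexFor(List<Passage> passages, int start, int end) {
        for (int i = 0; i < passages.size(); i++) {
            if (passages.get(i).contains(start, end)) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String text = "IL-2 gene expression\nNF-kappa B is required.";
        List<Passage> passages = split(text);

        // A BioNLP-ST standoff entity line: id, "type start end", covered text (tab-separated).
        String a1Line = "T1\tProtein 21 31\tNF-kappa B";
        String[] fields = a1Line.split("\t");
        String[] span = fields[1].split(" ");
        int start = Integer.parseInt(span[1]);
        int end = Integer.parseInt(span[2]);

        int idx = passageIndexFor(passages, start, end);
        System.out.println(fields[0] + " -> passage " + idx
                + " at offset " + passages.get(idx).offset);
        // prints "T1 -> passage 1 at offset 21"
    }
}
```

Because offsets are kept relative to the whole document, the entity span from the .a1 file can be compared directly against each passage's range without rewriting it.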
This converter was created to participate in BioCreative IV Track 1.
Usage
You can use Git to get the code. The distribution includes:
- Java source files;
- Eclipse project files to import the whole project into Eclipse;
- Converted corpora in the corpus folder.
The distribution doesn't include the original BioNLP-ST 2011 and 2013 GE task corpora, but they can be downloaded by following the above links. To reproduce the converted corpora:
- Download the corpus and extract it into [BioNLP corpus directory].
- Run
java -cp "bin:lib/*" bionlp.BioNLP2BioC [BioNLP corpus directory] [BioC output file]
where [BioNLP corpus directory] is the corpus folder and [BioC output file] is the output BioC XML file name. Only one BioC XML file will be generated for the above folder.
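After the run, a quick sanity check of the generated file can help. The sketch below uses only the JDK's StAX parser rather than bioc.jar, whose classes are not described here. It assumes the output uses the BioC element names document, passage, and annotation; the class name BioCOutputCheck and its method are hypothetical.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class BioCOutputCheck {

    /** Counts document, passage, and annotation start tags in a BioC XML stream. */
    static Map<String, Integer> countElements(InputStream in) {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        // Do not process the DOCTYPE, so a missing BioC.dtd does not break the check.
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, false);

        Map<String, Integer> counts = new TreeMap<>();
        try {
            XMLStreamReader reader = factory.createXMLStreamReader(in);
            try {
                while (reader.hasNext()) {
                    if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                        String name = reader.getLocalName();
                        if (name.equals("document") || name.equals("passage")
                                || name.equals("annotation")) {
                            counts.merge(name, 1, Integer::sum);
                        }
                    }
                }
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Not well-formed XML", e);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java BioCOutputCheck [BioC output file]");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(countElements(in));
        }
    }
}
```

Since the converter writes a single BioC XML file per corpus folder, the document count should roughly match the number of .txt files in that folder.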
Libraries
- bioc.jar in lib directory.
PUBLIC DOMAIN NOTICE
This work is a "United States Government Work" under the terms of the United States Copyright Act. It was written as part of the authors' official duties as a United States Government employee and thus cannot be copyrighted within the United States. The data is freely available to the public for use. The National Library of Medicine and the U.S. Government have not placed any restriction on its use or reproduction.
Although all reasonable efforts have been taken to ensure the accuracy and reliability of the data and its source code, the NLM and the U.S. Government do not and cannot warrant the performance or results that may be obtained by using it. The NLM and the U.S. Government disclaim all warranties, express or implied, including warranties of performance, merchantability or fitness for any particular purpose.
Please cite the authors in any work or product based on this material:
BioC: A Minimalist Approach to Interoperability for Biomedical Text Processing. Donald C. Comeau, Rezarta Islamaj Dogan, Paolo Ciccarese, Kevin Bretonnel Cohen, Martin Krallinger, Florian Leitner, Zhiyong Lu, Yifan Peng, Fabio Rinaldi, Manabu Torii, Alfonso Valencia, Karin Verspoor, Thomas C. Wiegers, Cathy H. Wu, and W. John Wilbur. Accepted, DATABASE, 2013.
- commons-lang3-3.1.jar, commons-io-2.4.jar, commons-cli-1.2.jar
Apache License, Version 2.0, January 2004 http://www.apache.org/licenses/
Developer
Yifan Peng, yfpeng@udel.edu