Hiroyoshi Komatsu committed 33888c3

minor bugfix and add usage of raw_output option

Files changed (3)

 # A Python wrapper for the Java Stanford Core NLP tools
 ---------------------------
 
-This is a fork of [stanford-corenlp-python](https://github.com/dasmith/stanford-corenlp-python)
+This is a fork of Dustin Smith's [stanford-corenlp-python](https://github.com/dasmith/stanford-corenlp-python), a Python interface to [Stanford CoreNLP](http://nlp.stanford.edu/software/corenlp.shtml). It can either be imported as a Python package or run as a JSON-RPC server.
 
 ## Edited
    * Update to Stanford CoreNLP v3.2.0
     parsed = batch_parse(raw_text_directory, corenlp_dir)  # It returns a generator object
     print parsed  #=> [{'coref': ..., 'sentences': ..., 'file_name': 'new_sample.txt'}]
 
+The function uses the XML output feature of Stanford CoreNLP, and you can get all of the information in that output with the `raw_output` option. If it is true, CoreNLP's XML is returned as a dictionary without converting the format.
+
+    parsed = batch_parse(raw_text_directory, corenlp_dir, raw_output=True)
+
+(Note: the function now requires xmltodict; install it with `sudo pip install xmltodict`.)
+
 ## Developer
    * Hiroyoshi Komatsu [hiroyoshi.komat@gmail.com]
    * Johannes Castner [jac2130@columbia.edu]

corenlp/corenlp.py

     call(command, shell=True)
 
     #reading in the raw xml file:
-    result = []
+    # result = []
     try:
         for output_file in os.listdir(xml_dir):
             with open(xml_dir+'/'+output_file, 'r') as xml:
                 # parsed = xml.read()
                 file_name = re.sub('.xml$', '', os.path.basename(output_file))
-                result.append(parse_parser_xml_results(xml.read(), file_name,
-                                                       raw_output=raw_output))
+                # result.append(parse_parser_xml_results(xml.read(), file_name,
+                #                                        raw_output=raw_output))
+                yield parse_parser_xml_results(xml.read(), file_name,
+                                               raw_output=raw_output)
     finally:
         file_list.close()
         shutil.rmtree(xml_dir)
-    return result
+    # return result
 
 class StanfordCoreNLP:
     """
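
Note on the `corenlp/corenlp.py` hunk above: replacing `result.append(...)` and `return result` with `yield` turns the function into a generator, which changes when its body, including the `finally` cleanup, actually runs. The sketch below illustrates this with a hypothetical stand-in, `read_outputs`, that reads plain files from a temporary directory; it does not call CoreNLP or parse XML, and the names `read_outputs` and `make_work_dir` are assumptions made for the example, not part of this package. It is written to run on both Python 2 and Python 3.

```python
import os
import shutil
import tempfile


def read_outputs(work_dir):
    """Yield (name, contents) for each file in work_dir, then delete work_dir.

    Hypothetical stand-in for the patched function: same try/yield/finally
    shape, but no CoreNLP call and no XML parsing.
    """
    try:
        for name in sorted(os.listdir(work_dir)):
            with open(os.path.join(work_dir, name)) as handle:
                yield name, handle.read()
    finally:
        # Runs when the generator is exhausted, closed, or garbage-collected,
        # not at the moment read_outputs() is called.
        shutil.rmtree(work_dir)


def make_work_dir():
    """Create a temporary directory holding two small files."""
    work_dir = tempfile.mkdtemp()
    for name, text in [("a.txt", "first"), ("b.txt", "second")]:
        with open(os.path.join(work_dir, name), "w") as handle:
            handle.write(text)
    return work_dir


# Calling a generator function runs none of its body yet.
work_dir = make_work_dir()
results = read_outputs(work_dir)
exists_before_iteration = os.path.isdir(work_dir)

# Consuming every item lets the finally block run and remove the directory.
items = list(results)
exists_after_iteration = os.path.isdir(work_dir)

# A generator can be consumed only once; a second pass yields nothing.
second_pass = list(results)

# Stopping early: close() raises GeneratorExit at the paused yield,
# so the finally block still runs.
other_dir = make_work_dir()
partial = read_outputs(other_dir)
first_item = next(partial)
partial.close()
exists_after_close = os.path.isdir(other_dir)
```

Because the whole function body is deferred, the `call(command, shell=True)` line in the hunk would likewise not run until the first item is requested. Callers that relied on the previous list return value (indexing, `len`, or iterating twice) can wrap the call in `list(...)`.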
 AUTHOR = "Hiroyoshi Komatsu"
 AUTHOR_EMAIL = "hiroyoshi.komat@gmail.com"
 URL = "https://bitbucket.org/torotoki/corenlp-python"
-VERSION = "3.2.0-0"
+VERSION = "3.2.0-1"
 
 # Utility function to read the README file.
 # Used for the long_description.  It's nice, because now 1) we have a top level