ruscorpora-tools / README.rst


This package provides a Python interface to a free corpus subset available at


pip install ruscorpora-tools


Corpus downloading

Download and unpack the archive with XML files from

Corpus reading

The ruscorpora.parse_xml function parses a single XML file and returns an iterator over sentences; each sentence is a list of ruscorpora.Token instances, each annotated with a list of ruscorpora.Annotation instances.

ruscorpora.simplify simplifies the result of ruscorpora.parse_xml by removing ambiguous annotations, joining split tokens (along with their annotations) and removing accent information.

>>> import ruscorpora as rnc
>>> for sent in rnc.simplify(rnc.parse_xml('fiction.xml')):
...     print(sent)
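
Because the parser is described as returning an iterator, it can be handy to look at only the first few sentences before processing a whole file. The following is a minimal sketch of that idea using the standard library; ``fake_sentences`` and ``head`` are hypothetical names introduced here for illustration, and the stand-in generator does not contain real corpus data.

```python
from itertools import islice


def fake_sentences():
    # Stand-in for an iterator over sentences; NOT real corpus output.
    for i in range(1000):
        yield ['token%d' % i]


def head(sentences, n=3):
    """Return the first n items of any iterator without consuming the rest."""
    return list(islice(sentences, n))


first = head(fake_sentences(), 3)
print(first)
```

With the library installed, the same helper could presumably be given ``rnc.simplify(rnc.parse_xml('fiction.xml'))`` in place of the stand-in generator.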

Working with tags

The ruscorpora.Tag class is a convenient wrapper for the tags used in ruscorpora:

>>> tag = rnc.Tag('S,f,inan=sg,nom')
>>> tag.POS
>>> tag.gender
>>> tag.animacy
>>> tag.number
>>> tag.tense

(Other attributes are available as well.)

Check whether a grammeme is in a tag:

>>> 'S' in tag
>>> 'V' in tag
>>> 'Foo' in tag
Traceback (most recent call last):
    ...
ValueError: Grammeme is unknown: Foo
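
The membership check above raises on unknown grammemes rather than returning False, which helps catch typos. A rough sketch of how such behaviour could be built is shown below; ``SketchTag`` and ``KNOWN_GRAMMEMES`` are hypothetical names, the grammeme inventory is deliberately tiny, and none of this is taken from the ruscorpora.Tag implementation.

```python
import re

# Hypothetical, deliberately tiny inventory; the real tagset is far larger.
KNOWN_GRAMMEMES = frozenset(
    ['S', 'V', 'f', 'm', 'inan', 'anim', 'sg', 'pl', 'nom', 'acc'])


class SketchTag(object):
    """Illustrative stand-in, not the ruscorpora.Tag implementation."""

    def __init__(self, tag_string):
        # Grammemes are separated by ',' or '=' in strings
        # such as 'S,f,inan=sg,nom'.
        self.grammemes = frozenset(
            g for g in re.split('[,=]', tag_string) if g)

    def __contains__(self, grammeme):
        # Reject unknown names loudly instead of silently returning False.
        if grammeme not in KNOWN_GRAMMEMES:
            raise ValueError('Grammeme is unknown: %s' % grammeme)
        return grammeme in self.grammemes


tag = SketchTag('S,f,inan=sg,nom')
print('S' in tag)
print('V' in tag)
```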

Test tag equality:

>>> tag == rnc.Tag('S,f,inan=sg,nom')
>>> tag == 'S,f,inan=sg,nom'
>>> tag == rnc.Tag('S,f,inan=sg,acc')
>>> tag == 'S,f,inan=sg,acc'
>>> tag == 'Foo,inan'
Traceback (most recent call last):
    ...
ValueError: Unknown grammemes: frozenset({Foo})
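
Comparing a tag with a plain string, as in the examples above, can be sketched by converting the string into a tag first, so that unknown grammemes are rejected before any comparison happens. The names ``ComparableTag``, ``parse_grammemes`` and ``KNOWN`` below are hypothetical, and comparing grammemes as unordered sets is an assumption of this sketch, not a documented property of ruscorpora.Tag.

```python
import re

# Hypothetical, deliberately tiny inventory used only for this sketch.
KNOWN = frozenset(['S', 'V', 'f', 'm', 'inan', 'anim', 'sg', 'pl', 'nom', 'acc'])


def parse_grammemes(tag_string):
    """Split a tag string on ',' and '=' and validate every grammeme."""
    grammemes = frozenset(g for g in re.split('[,=]', tag_string) if g)
    unknown = grammemes - KNOWN
    if unknown:
        raise ValueError('Unknown grammemes: %s' % ', '.join(sorted(unknown)))
    return grammemes


class ComparableTag(object):
    """Illustrative stand-in, not the ruscorpora.Tag implementation."""

    def __init__(self, tag_string):
        self.grammemes = parse_grammemes(tag_string)

    def __eq__(self, other):
        if isinstance(other, str):
            # Wrapping the string validates it; may raise ValueError.
            other = ComparableTag(other)
        if not isinstance(other, ComparableTag):
            return NotImplemented
        # Assumption: grammeme order does not matter.
        return self.grammemes == other.grammemes

    def __ne__(self, other):
        result = self.__eq__(other)
        return result if result is NotImplemented else not result

    def __hash__(self):
        return hash(self.grammemes)


tag = ComparableTag('S,f,inan=sg,nom')
print(tag == 'S,f,inan=sg,nom')
print(tag == 'S,f,inan=sg,acc')
```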

Tags returned by rnc.simplify are wrapped with this class by default.


Development happens at GitHub and Bitbucket:

The issue tracker is at GitHub:

Feel free to submit ideas, bugs, pull requests (git or hg) or regular patches.

Running tests

Make sure tox is installed and run

$ tox

from the source checkout. Tests should pass under Python 2.6 through 3.3 and PyPy > 1.8.
