How to configure the text classification component in the Linked Media Framework

Introduction

The Linked Media Framework offers generic statistical text classification services (based on maximum entropy classification) that assign text content to URI resources according to the underlying classification model. This functionality can be used, for example, to automatically assign texts to SKOS thesaurus concepts, or for sentiment analysis, categorizing texts as "positive", "negative" or "neutral".

In general, maximum entropy classification works by comparing an input text against a trained model and assigning the text to the categories whose training data is most similar to it. Of all models consistent with the training data, the classifier uses the one that makes the fewest additional assumptions, i.e. the one with maximum entropy. Each assignment is given a probability value between 0.0 and 1.0, indicating how well the text matches the category.
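To make the probability values more concrete, here is a minimal sketch of how a maximum entropy classifier turns feature weights into per-category probabilities (a softmax over weighted word counts). The function name, the whitespace tokenization and the weights are invented for illustration; this is not LMF's internal implementation, and a real model learns its weights from the training data.

```python
import math
from collections import Counter


def maxent_probabilities(text, weights):
    """Return a probability for each category, summing to 1.0.

    `weights` maps category -> {word: weight}. In a trained model these
    weights come from the training data; here they are made up.
    """
    features = Counter(text.lower().split())  # naive bag-of-words features
    scores = {
        category: sum(w.get(word, 0.0) * count for word, count in features.items())
        for category, w in weights.items()
    }
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {category: math.exp(score - top) for category, score in scores.items()}
    total = sum(exps.values())
    return {category: value / total for category, value in exps.items()}


# Hypothetical weights for the sentiment example mentioned above.
WEIGHTS = {
    "positive": {"great": 1.5, "good": 1.0},
    "negative": {"bad": 1.2, "awful": 1.8},
    "neutral": {},
}

probabilities = maxent_probabilities("a great and good day", WEIGHTS)
for category, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {p:.3f}")
```

Because the scores are normalized over all categories, every value lies between 0.0 and 1.0 and the category with the highest weighted score receives the highest probability.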

LMF classification is available as web services under the /classifier endpoint.

Creating/Removing Classifiers

The LMF classification services let you define an arbitrary number of classifiers that can be trained and used individually. This allows users to classify the same content along different dimensions, e.g. topic, sentiment and kind.

Creating Classifiers

A new classifier can be created by issuing an HTTP POST request to the /classifier/{name} endpoint, where {name} is the name of the classifier to be created. After the POST request, a new untrained classifier will be available in the system. Before the classifier can be used for classifying texts, it needs to be trained as described below.
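As a minimal sketch, the creation request could be built with Python's standard library as follows. The base URL http://localhost:8080/LMF is the one used in the request examples on this page, and the helper name create_classifier_request is made up for illustration. The request is only constructed here, not sent.

```python
from urllib.parse import quote
from urllib.request import Request

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def create_classifier_request(name: str) -> Request:
    """Build (but do not send) the POST request that creates classifier `name`."""
    # Percent-encode the name so that spaces or slashes cannot break the path.
    url = f"{LMF_BASE}/classifier/{quote(name, safe='')}"
    return Request(url, method="POST")


request = create_classifier_request("sample")
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```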

Listing Classifiers

All classifiers can be listed by issuing an HTTP GET request to the /classifier/list endpoint. The result will be a JSON list of the classifier names currently registered in the system.

Getting Classifier Information

An HTTP GET request to an individual classifier (i.e. to /classifier/{name}) returns a JSON description of the classifier, including its name and a list of the concepts it manages.
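The two read-only requests can be sketched in Python as shown here. The helper names and the sample response body are hypothetical. Only the list response is parsed, since this page specifies it as a plain JSON list of names, while the exact field names of the classifier description are not given here.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def list_classifiers_request() -> Request:
    """Build the GET request that lists all registered classifiers."""
    return Request(f"{LMF_BASE}/classifier/list", method="GET")


def classifier_info_request(name: str) -> Request:
    """Build the GET request for the JSON description of one classifier."""
    return Request(f"{LMF_BASE}/classifier/{quote(name, safe='')}", method="GET")


def parse_classifier_names(body: str) -> list:
    """Parse the JSON list of classifier names returned by /classifier/list."""
    names = json.loads(body)
    if not isinstance(names, list):
        raise ValueError("expected a JSON list of classifier names")
    return [str(name) for name in names]


# Hypothetical response body, for illustration only.
names = parse_classifier_names('["sample", "sentiment"]')
print(names)
```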

Removing Classifiers

A classifier can be removed by issuing an HTTP DELETE request to the classifier endpoint (/classifier/{name}). When the optional query parameter removeData=true is given, all training and model data created by the classifier will also be removed.
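A removal request might be assembled as in the following sketch. It assumes that "the classifier endpoint" means /classifier/{name}, as in the previous section. The helper name and its remove_data argument are illustrative, and the request is only built, not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def delete_classifier_request(name: str, remove_data: bool = False) -> Request:
    """Build the DELETE request that removes classifier `name`.

    With remove_data=True the optional removeData=true query parameter is
    appended, asking the server to drop training and model data as well.
    """
    url = f"{LMF_BASE}/classifier/{quote(name, safe='')}"
    if remove_data:
        url += "?" + urlencode({"removeData": "true"})
    return Request(url, method="DELETE")


print(delete_classifier_request("sample", remove_data=True).full_url)
```

Keeping removeData opt-in mirrors the service's default, so training data is not discarded by accident.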

Training Classifiers

Before text classification can be used, classifiers need to be trained with sample data for each category managed by the classifier. Categories need to be URI resources already managed by the LMF system, e.g. concepts of a SKOS thesaurus that has previously been imported into the system. In simple cases, it is sufficient to create the resources with an HTTP POST request to the resource web service.

Training mainly involves uploading a number of text samples for each category. The classification service will then automatically take care of creating the model. Training can be carried out incrementally, i.e. when the quality of classification is insufficient, additional training data can be added and the classifier will automatically be retrained.

A new training sample is uploaded for a category with an HTTP POST to the /classifier/{name}/train?resource={category uri} endpoint, with content type "text/plain" and the sample text as the request body. {name} is the name of the classifier (as created above) and {category uri} is the URI of the category for which to add training data. For example, the following HTTP request would upload training data to the "sample" classifier for the concept http://localhost:8080/LMF/resource/Concept1 (note that the query parameter must be URL-encoded):

POST http://localhost:8080/LMF/classifier/sample/train?resource=http%3A%2F%2Flocalhost%3A8080%2FLMF%2Fresource%2FConcept1
Content-Type: text/plain

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam dolor purus, ...
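The same upload can be sketched in Python, which also takes care of URL-encoding the category URI. The helper name training_request and the choice of UTF-8 for the body are assumptions made for illustration; the request is built but not sent.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def training_request(classifier: str, category_uri: str, sample_text: str) -> Request:
    """Build the POST request that uploads one training sample for a category."""
    # urlencode percent-encodes ':' and '/' in the category URI.
    query = urlencode({"resource": category_uri})
    url = f"{LMF_BASE}/classifier/{quote(classifier, safe='')}/train?{query}"
    return Request(
        url,
        data=sample_text.encode("utf-8"),  # assumption: the body is sent as UTF-8
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = training_request(
    "sample",
    "http://localhost:8080/LMF/resource/Concept1",
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
)
print(request.full_url)
# To actually send it: urllib.request.urlopen(request)
```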

As a rule of thumb, provide at least 10 text samples for each category. The more sample texts are uploaded for a category, and the more representative they are of it, the better the classification results will be.

Classifiers are automatically retrained after a certain threshold is reached or a timeout has expired. It is also possible to trigger immediate retraining manually by sending a POST request to the /classifier/{name}/retrain endpoint.
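Manual retraining could be triggered from a script along these lines. The helper names retrain_request and send are hypothetical. The send function is defined but not called here, since it needs a running LMF server.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def retrain_request(name: str) -> Request:
    """Build the POST request that triggers immediate retraining of `name`."""
    return Request(f"{LMF_BASE}/classifier/{quote(name, safe='')}/retrain", method="POST")


def send(request: Request, timeout: float = 30.0) -> int:
    """Send a prepared request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    """
    with urlopen(request, timeout=timeout) as response:
        return response.status


request = retrain_request("sample")
print(request.get_method(), request.full_url)
# With a running server: status = send(request)
```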

Classifying Texts

When a classifier is sufficiently trained, it can be used to classify text into categories. Text classification is straightforward: a text is sent to the classifier endpoint, and the endpoint returns a list of category URIs, each with the probability that the text belongs to that category. The result list is always sorted by descending probability.

To classify a text using the classifier identified by "{name}", send it as plain text to the /classifier/{name}/classify endpoint using an HTTP POST request. The endpoint also accepts an optional threshold parameter that sets a minimum probability for the category URIs to return. For example, the following request would try to classify a sample text:

POST http://localhost:8080/LMF/classifier/sample/classify
Content-Type: text/plain

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Aliquam dolor purus, ...

The result will be a JSON list of classifications that could look as follows:

[
    {
        "concept": "http://localhost:8080/LMF/resource/Concept1",
        "probability": 0.7123
    },
    ...
]
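A client-side sketch of the classification call and of reading its result is given here. It assumes that threshold is passed as a query parameter and that the response is a JSON list of objects with "concept" and "probability" keys, as in the example above. The helper names and the second entry in the sample body (Concept2 with probability 0.2) are invented for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

# Base URL as used in the examples on this page; adjust for your installation.
LMF_BASE = "http://localhost:8080/LMF"


def classify_request(classifier: str, text: str, threshold=None) -> Request:
    """Build the POST request that classifies `text` with `classifier`."""
    url = f"{LMF_BASE}/classifier/{quote(classifier, safe='')}/classify"
    if threshold is not None:
        # Assumption: the optional threshold is sent as a query parameter.
        url += "?" + urlencode({"threshold": threshold})
    return Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def parse_classifications(body: str) -> list:
    """Turn the JSON response into (concept URI, probability) pairs."""
    return [(item["concept"], float(item["probability"])) for item in json.loads(body)]


# Hypothetical response body; Concept2 and its probability are made up.
SAMPLE_BODY = """[
    {"concept": "http://localhost:8080/LMF/resource/Concept1", "probability": 0.7123},
    {"concept": "http://localhost:8080/LMF/resource/Concept2", "probability": 0.2}
]"""

results = parse_classifications(SAMPLE_BODY)
best_concept, best_probability = results[0]  # the list is sorted by descending probability
print(best_concept, best_probability)
```

Since the service already returns the list in descending order, the first entry is the most likely category and no client-side sorting is needed.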
