
Getting Started

Follow these instructions to install the system and start using it.

Requirements

CATI requires elasticsearch-6.8.1, as well as scipy, numpy, and networkx; these scientific libraries come pre-installed with the Anaconda Python distribution. You can also install them manually via pip:

pip install scipy
pip install numpy
pip install networkx
...

You can install all the required libraries (listed in the file requirements.txt) by executing the command:

pip install -r requirements.txt

If something fails and PyCharm does not detect the dependencies, install them through the UI (File > Settings > Project > Project Interpreter > +).

Note that on Ubuntu and other Linux distributions, python and pip are bound to Python 2, so you should explicitly call python3 and pip3.
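For example, to install the requirements on such a system:

#!bash
pip3 install -r requirements.txt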

Usage

Provided a set of tweets, CATI can (i) perform event detection and (ii) generate a visualization of the detected events.

Import tweets into Elasticsearch

Import from local file

We provide two ways of importing data for CATI into Elasticsearch from local files: by using a Logstash script, or by using a Python script.
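As a rough illustration of the Python route, here is a minimal sketch assuming the local file holds one JSON-encoded tweet per line (the file name, index name, and document layout here are hypothetical):

#!python
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()  # adjust host/port to your setup


def actions(path, index):
    # Assumption: one JSON tweet per line in the local file.
    with open(path) as f:
        for line in f:
            yield {"_index": index, "_type": "tweet", "_source": json.loads(line)}


bulk(es, actions("tweets.jsonl", "lyon2017"))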

Import from existing Elasticsearch index

To create an index for CATI from an existing index (for example, an index filled by the Twitter API collect tool), follow these instructions

Import newspaper articles into Elasticsearch

Refer to Import newspaper articles

As newspaper articles in CATI currently don't embed image clustering, ignore the "Generate the image clusters" and "Generate the text_images field" sections.

Interfacing data for new document types in CATI

Data in CATI is read from an ElasticSearch index containing documents.

We currently provide methods for inserting tweets and newspaper articles, as described below. However, CATI is designed to be compatible with any kind of multimodal document containing text, images, and metadata.

To interface new document indices properly with CATI, you must ensure that your data structure contains the following attributes (this may require specific preprocessing steps, depending on the raw format of your data).

Mandatory attributes used by CATI

  • full_text (text) : contains the full text of the document
  • created_at (date with format EEE MMM dd HH:mm:ss Z yyyy) : document creation date
  • user.name (text) : the name of the author if it exists; should be set to None otherwise

You also need to generate the following attributes by running text_images_preprocess.py, passing your index name as the input parameter:

  • clean-text-no-tag (text) : cleaned text (stopwords, punctuation, and URLs removed)
  • clean-text (text) : same as clean-text-no-tag, but lemmatized, and with the cluster ID + object detection labels if applicable

Optional attributes used by CATI

  • imagesCluster (int) : image cluster number
  • image_tags (text list) : list of items detected in the image (example : ["dog", "car"])
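For illustration, here is a minimal sketch of indexing a document carrying these attributes with the official elasticsearch Python client (the index name my_docs and all field values are hypothetical):

#!python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # adjust host/port to your setup

doc = {
    "full_text": "Flooding reported near the bridge https://example.org/photo",
    "created_at": "Mon Oct 09 14:31:07 +0000 2017",  # EEE MMM dd HH:mm:ss Z yyyy
    "user": {"name": "some_author"},  # or None if there is no author
    # Optional attributes:
    "imagesCluster": 42,
    "image_tags": ["dog", "car"],
}

es.index(index="my_docs", doc_type="tweet", body=doc)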

Set the source in the config.json file

Before running the application, it is important to properly configure the list of indexes you want to use from the application. To do so, edit the config.json file at the root of the project folder. To add a new index, add a new entry to elastic_search_sources:

{
    "elastic_search_sources":[
        {
          "host": ,
          "port": ,
          "user": ,
          "password": ,
          "timeout": ,
          "index": [the name of the index, as used in the logstash_tweets_importer. E.g. "lyon2017"],
          "doc_type": "tweet",
          "images_folder": [name of the folder containing the images related to the dataset. E.g. "lyon2017-images"],
          "image_duplicates": [full path to the duplicates file. E.g. /home/user/mabed/browser/static/images/image-clusters-lyon2017.json or C:\\Users\\...\\image-clusters-lyon2017.json]
        }
    ],
    "default": {
        "index": [this can be the first index],
        "session": "",
        "sessions_index": {
          "host": ,
          "port": ,
          "user": ,
          "password": ,
          "timeout": ,
          "index": "mabed_sessions",
          "doc_type": "session"
        }
    }
}
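For instance, a filled-in configuration might look like this (all values here are hypothetical and must be adapted to your setup):

{
    "elastic_search_sources":[
        {
          "host": "localhost",
          "port": 9200,
          "user": "elastic",
          "password": "changeme",
          "timeout": 30,
          "index": "lyon2017",
          "doc_type": "tweet",
          "images_folder": "lyon2017-images",
          "image_duplicates": "/home/user/mabed/browser/static/images/image-clusters-lyon2017.json"
        }
    ],
    "default": {
        "index": "lyon2017",
        "session": "",
        "sessions_index": {
          "host": "localhost",
          "port": 9200,
          "user": "elastic",
          "password": "changeme",
          "timeout": 30,
          "index": "mabed_sessions",
          "doc_type": "session"
        }
    }
}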
Restart the CATI server if it is already running, with something like:

#!bash
systemctl restart cati

The default values are the default index and session you want the application to load. The sessions_index entry is for the mabed_sessions index, which will contain the list of sessions created with the system. It is automatically created the first time you run the system.

Preprocessing

Some preprocessing must then be applied in certain cases. For instance, if your documents have associated images, you may want to generate image clusters.

Get the index images

Before generating the image clusters, you need to gather the images related to the documents of your index.

For that purpose, you can run the following script to copy the target images into a new directory. Note: if you pass a directory with subdirectories as input, the script will crawl the subdirectories to find images as well. (If you are on the server, use sudo su and then conda activate cati to enter the right context.)

python preprocess_copy_images_from_index.py -source <original_image_directory> -i <index> --output <destination_directory>

Example :

python preprocess_copy_images_from_index.py -source  /..../Tweet_collector/images/Twitter_Stream/ -i cati_index_tweets_20211008_20220117 --output /...../Tweet_collector/images/images_20211008_20220117

Generate the image clusters

Generate the duplicate-image clusters by using the DuplicateFinder.exe application. Make sure that the images to be analyzed are in a folder placed at:

mabed/browser/static/images

E.g. mabed/browser/static/images/lyon2017-images

Once you have analyzed the images and generated the clusters, export the JSON file and keep track of its filename. Let's say we name it and save it as: mabed/browser/static/images/image-clusters-lyon2017.json

You can also place the images in another folder and create a symlink in this location:

ln -s <target file or directory> <symlink name>

So, e.g.

 cd browser/static/images/
 ln -s ../../../../IMAGES/lyon2017-images lyon2017-images

Import image clusters into Elasticsearch

Make sure you didn't forget to set the image_duplicates entry in the config.json file. Then run the following (if needed, first activate the right environment, which is sometimes required on the idenum server, with: conda activate /root/anaconda3/envs/cati/):

python images.py -i <index name> -d <images directory>

Where -i is the parameter for the Elasticsearch index you want to associate the image clusters to.

This process adds a new field to the tweets in elasticsearch, called "imagesCluster", which is used by es_corpus.py to retrieve the tweet corpus with an extra feature integrated to the textual value:

tweet_text = tweet_text + cluster_str

If you execute the images.py script more than once, the values are updated, not duplicated. You can delete the generated field from Kibana by executing:

POST twitterfdl2017/_update_by_query?conflicts=proceed
{
    "script" : "ctx._source.remove('imagesCluster')",
    "query" : {
        "exists": { "field": "imagesCluster" }
    }
}
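Equivalently, a sketch of the same removal from Python, using the elasticsearch client (the index name follows the Kibana example above):

#!python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # adjust host/port to your setup

# Remove the imagesCluster field from every document that has it.
es.update_by_query(
    index="twitterfdl2017",
    conflicts="proceed",
    body={
        "script": "ctx._source.remove('imagesCluster')",
        "query": {"exists": {"field": "imagesCluster"}},
    },
)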

Generate the text_images field

This field is the combination of the text field in the documents with the image cluster numbers. It is used by certain Active Learning strategies, so if you are planning to use them, please consider generating this field. To do so, run text_images_preprocess.py, indicating the name of your index. E.g.:

#!bash
python text_images_preprocess.py -i lyon2017

Server configuration

First, configure the URL that the client should use to communicate with the server. To do so, set the environment variable SERVER_NAME. E.g.:

On Debian-based systems:

export SERVER_NAME=[your address; it defaults to localhost otherwise]

On Windows:

set SERVER_NAME=[your address; it defaults to localhost otherwise]

The value can also be a full URL, like: https://your_sub_domain.your_domain.fr/
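On the server side, picking up this variable typically amounts to something like the following (a sketch; the actual logic in server.py may differ):

#!python
import os

# Fall back to localhost when SERVER_NAME is not set, as described above.
server_name = os.environ.get("SERVER_NAME", "localhost")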

Then, start the server:

python3 server.py

And visit localhost:5000, preferably using Google Chrome. The first time the system runs, a new "mabed_sessions" index will be automatically created. In case a read-only error arises, run the following in Kibana:

PUT mabed_sessions/_settings
{ "index": { "blocks": { "read_only_allow_delete": "false" } } }
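The same setting can also be applied from Python (a sketch using the elasticsearch client):

#!python
from elasticsearch import Elasticsearch

es = Elasticsearch()  # adjust host/port to your setup

# Lift the read-only block on the sessions index.
es.indices.put_settings(
    index="mabed_sessions",
    body={"index": {"blocks": {"read_only_allow_delete": "false"}}},
)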

Once the application is running, go to Settings > Create Session, choose a name and an existing index in Elasticsearch, and click Save. The process may take a while.

Once the new session is created, select it from the Switch Sessions combo box and click the "Switch session" button. The information presented in the "Current Session" section should be updated.

Once you are working with the right session, you can generate as many ngrams as you want, to be further used in the "Tweets Search" tab. E.g. choose "2" and press the "(Re) generate" button. If you execute the process with the same parameters more than once, the ngrams are updated, not duplicated.

Production

To deploy the application on a remote server, read the deployment guide.
