CATI / Removing_duplicate_documents

When working with tweets, an index can end up with duplicated documents that contain exactly the same tweet (same id_str and text). To remove these documents, run:

python remove_duplicated_by_str_id.py -i your_index_name

For example:

python remove_duplicated_by_str_id.py -i lyon_2017
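
This page does not show how the script works internally. Purely as an illustration of what "duplicated" means here, the sketch below picks out the documents that repeat an earlier (id_str, text) pair. It assumes each hit carries the Elasticsearch `_id` and a `_source` with `id_str` and `text`; the function name `find_redundant_ids` and the sample hits are hypothetical and are not taken from the script.

```python
def find_redundant_ids(hits):
    """Return the _id of every hit that repeats an earlier (id_str, text) pair."""
    seen = set()       # (id_str, text) pairs already encountered
    redundant = []     # _id values of the later copies
    for hit in hits:
        source = hit["_source"]
        key = (source["id_str"], source["text"])
        if key in seen:
            redundant.append(hit["_id"])   # later copy: candidate for deletion
        else:
            seen.add(key)                  # first occurrence: keep it
    return redundant


# Hypothetical hits, hand-written for this example.
hits = [
    {"_id": "a", "_source": {"id_str": "1", "text": "hello"}},
    {"_id": "b", "_source": {"id_str": "1", "text": "hello"}},
    {"_id": "c", "_source": {"id_str": "2", "text": "world"}},
]
print(find_redundant_ids(hits))  # ['b']
```

The first copy of each tweet is kept and only the later copies are reported, so deleting the returned ids would leave exactly one document per tweet.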

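To check for duplicates in Kibana, run the following query. It returns up to 20 id_str values that appear in at least two documents; an empty list of buckets indicates that none were found: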

GET your_index_name/_search
{
  "size":0,
  "query": {
    "match_all" : {}
  },
  "aggs" : {
    "wordcounts":{
      "terms":{
        "field" : "id_str.keyword",
        "min_doc_count": 2,
        "size": 20
      }
    }
  }
}
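
The same check can be scripted. The following is a minimal sketch that builds the request body shown above and reads the duplicated ids out of a terms-aggregation response. The helper names `build_duplicate_query` and `duplicated_ids` are hypothetical, and the sample response is hand-written for illustration, not real output. Sending the body to the cluster is left out; any Elasticsearch client or HTTP library could be used for that step.

```python
import json


def build_duplicate_query(field="id_str.keyword", size=20):
    """Build the aggregation body that lists values occurring in 2+ documents."""
    return {
        "size": 0,                      # no hits needed, only the aggregation
        "query": {"match_all": {}},
        "aggs": {
            "wordcounts": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


def duplicated_ids(response):
    """Map each duplicated id_str to its document count."""
    buckets = response["aggregations"]["wordcounts"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Hand-written sample in the shape of a terms-aggregation response.
sample_response = {
    "aggregations": {
        "wordcounts": {
            "buckets": [
                {"key": "1001", "doc_count": 2},
                {"key": "1002", "doc_count": 3},
            ]
        }
    }
}

print(json.dumps(build_duplicate_query(), indent=2))
print(duplicated_ids(sample_response))  # {'1001': 2, '1002': 3}
```

An empty dictionary from `duplicated_ids` corresponds to an empty bucket list in Kibana.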
