CATI / Removing_duplicate_documents
When dealing with tweets, an index can end up with duplicated documents that contain exactly the same tweet (same id_str and text). To remove these documents, run:
python remove_duplicated_by_str_id.py -i your_index_name
For example:
python remove_duplicated_by_str_id.py -i lyon_2017
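The script's implementation is not shown on this page, so the following is only a minimal sketch of what deduplicating by id_str can look like. It assumes documents arrive shaped like Elasticsearch search hits (an `_id` plus a `_source` holding `id_str`); the function name `find_duplicate_doc_ids` and the sample data are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def find_duplicate_doc_ids(hits: Iterable[Dict]) -> List[str]:
    """Return the _id of every hit whose id_str was already seen.

    The first document seen for each id_str is kept; later ones are
    reported as duplicates to delete.
    """
    seen: Set[str] = set()
    to_delete: List[str] = []
    for hit in hits:
        tweet_id = hit["_source"]["id_str"]
        if tweet_id in seen:
            to_delete.append(hit["_id"])
        else:
            seen.add(tweet_id)
    return to_delete


# Hypothetical sample: documents "a1" and "c3" hold the same tweet.
sample_hits = [
    {"_id": "a1", "_source": {"id_str": "100", "text": "hello"}},
    {"_id": "b2", "_source": {"id_str": "200", "text": "world"}},
    {"_id": "c3", "_source": {"id_str": "100", "text": "hello"}},
]
print(find_duplicate_doc_ids(sample_hits))  # prints ['c3']
```

Keeping the first occurrence is an arbitrary choice made for this sketch; since the duplicates carry the same id_str and text, any one copy could be kept.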
To check for duplicates in Kibana, run the following aggregation. It returns up to 20 id_str values that appear in at least two documents:
GET your_index_name/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "aggs": {
    "wordcounts": {
      "terms": {
        "field": "id_str.keyword",
        "min_doc_count": 2,
        "size": 20
      }
    }
  }
}
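If the check has to be scripted rather than read off the Kibana console, the same request body can be built in Python and the response parsed for duplicated values. The sketch below is hedged: the helper names `build_duplicate_query` and `duplicated_ids` and the sample response are hypothetical, and it only relies on the standard shape of a terms aggregation response (`aggregations.<name>.buckets`, each bucket with `key` and `doc_count`).

```python
from typing import Dict, List, Tuple


def build_duplicate_query(field: str = "id_str.keyword", size: int = 20) -> Dict:
    """Build the same request body as the Kibana query above."""
    return {
        "size": 0,  # no hits needed, only the aggregation
        "query": {"match_all": {}},
        "aggs": {
            "wordcounts": {
                "terms": {"field": field, "min_doc_count": 2, "size": size}
            }
        },
    }


def duplicated_ids(response: Dict) -> List[Tuple[str, int]]:
    """Extract (id_str, number of documents) pairs from the response."""
    buckets = response["aggregations"]["wordcounts"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hypothetical response fragment: tweet "100" is stored three times.
sample_response = {
    "aggregations": {
        "wordcounts": {"buckets": [{"key": "100", "doc_count": 3}]}
    }
}
print(duplicated_ids(sample_response))  # prints [('100', 3)]
```

An empty bucket list means the aggregation found no id_str shared by two or more documents.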