number of tweets read from a collection is too high

Issue #17 resolved

Ali Hürriyetoglu repo owner created an issue 2016-06-27

The total number of tweets for the collection floodtags20140729 is around 130k, but the screen shows around 230k. What is the reason for this? Which is correct, 130k or 230k?

Comments (11)

ugur ozcan
Where do you see 230k? We did not understand. If you show us, we can find the solution. We checked in the terminal and on mlab, and we saw 134.567.
- 2016-06-27T08:52:26+00:00
Ali Hürriyetoglu reporter
You see 230k on the overview after selecting users and hashtags to be excluded.
- 2016-06-27T13:16:11+00:00
Erkan Başar
It reports more tweets than it should (e.g. 230k instead of fewer than 130k) because the total number of tweets is currently calculated by summing the number of tweets per option (urls, photos, no_photos_urls) before rendering it in the template.

See the code that causes this on line 1161 of views.py:
```
'numtwt_total': len(query.tweets_no_photos_urls) + len(query.tweets_photos) + len(query.tweets_urls),
```
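
To illustrate why a summed total can exceed the collection size: if a tweet ID can appear in more than one subset, or more than once in the same subset, it is counted every time. A minimal sketch, assuming each subset is a plain list of tweet IDs. The IDs and the helper name `count_unique_tweets` are made up for illustration and are not taken from views.py:

```python
def count_unique_tweets(*subsets):
    """Count distinct tweet IDs across all given subsets."""
    unique_ids = set()
    for subset in subsets:
        unique_ids.update(subset)
    return len(unique_ids)


# Hypothetical ID lists; tweet 3 appears in two subsets.
tweets_no_photos_urls = [1, 2]
tweets_photos = [3, 4]
tweets_urls = [3, 5]

# Summing the lengths counts tweet 3 twice.
summed = len(tweets_no_photos_urls) + len(tweets_photos) + len(tweets_urls)
unique = count_unique_tweets(tweets_no_photos_urls, tweets_photos, tweets_urls)
print(summed, unique)  # 6 5
```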
- 2016-06-27T14:01:06+00:00
Ali Hürriyetoglu reporter
Yes, we spotted this line. But 230k is too much. A somewhat larger number is normal, because tweets that contain several of the features we use to divide the collection are counted more than once. I suspect that the subset that does not have any tweets in it erroneously affects the result. Ugur is checking this.
- 2016-06-27T14:06:56+00:00
Ali Hürriyetoglu reporter
We printed the size of each subset, and each one is doubled. Are we sure we are not inserting the same tweet IDs into the filter twice? Do we need the "query.save()" in line 1126 of https://bitbucket.org/hurrial/relevancer/src/d20fe6372c824aa3b8d99ca133b6183f4db9ed4b/RelevancerDjango/main/views.py?at=master&fileviewer=file-view-default?
- 2016-06-27T17:50:51+00:00
Erkan Başar
The "query.save()" in line 1126 is there to make sure the list of excluded tweets is saved in case of any interruption in the exclude_hashtags() method. Saving an item multiple times should not duplicate the items in a list field; however, I am not sure about the exact behaviour of the method.

The issue may instead arise when querying with hashtags in the exclude_hashtags() method. In the following code, we query the database for every hashtag one by one and then append the results of those queries to each other with chain, without checking for uniqueness. Since each query retrieves the tweets that don't contain the current hashtag, many of the tweets in the database are added to the 'tweets' list on every iteration.

Here is the code that causes this:
```
tweets = []
for hst in hashtaglist: 
    tweets_tmp = model.objects(Q(pk__in = tweet_ids) & Q(text__not__icontains = hst)).only('id')
    tweets = list(chain(tweets, tweets_tmp))
```
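
The effect of the loop can be reproduced without a database. The sketch below is a plain-Python stand-in, not the MongoEngine code: `tweets_db`, the sample texts, and both function names are hypothetical. It compares the per-hashtag chaining with a single pass that keeps a tweet only if it contains none of the hashtags:

```python
from itertools import chain

# Hypothetical in-memory stand-in for the collection: tweet ID -> text.
tweets_db = {
    1: "flood in #jakarta",
    2: "heavy rain #flood",
    3: "no hashtags here",
}
hashtaglist = ["#jakarta", "#flood"]


def exclude_per_hashtag(tweet_ids, hashtags):
    """Mimic the loop above: one 'does not contain' pass per hashtag, results chained."""
    kept = []
    for hst in hashtags:
        tmp = [i for i in tweet_ids if hst.lower() not in tweets_db[i].lower()]
        kept = list(chain(kept, tmp))
    return kept


def exclude_any_hashtag(tweet_ids, hashtags):
    """Keep a tweet only if it contains none of the hashtags (case-insensitive)."""
    lowered = [h.lower() for h in hashtags]
    return [i for i in tweet_ids
            if not any(h in tweets_db[i].lower() for h in lowered)]


ids = [1, 2, 3]
chained = exclude_per_hashtag(ids, hashtaglist)
single_pass = exclude_any_hashtag(ids, hashtaglist)
print(chained)      # [2, 3, 1, 3]
print(single_pass)  # [3]
```

In this toy data, tweet 3 is added once per hashtag, which is the duplication. Tweets 1 and 2 also survive the chained version because each contains only one of the two hashtags, so removing duplicates alone would still leave them in the result.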
- 2016-06-27T18:12:23+00:00

ugur ozcan
I added a line and the error was corrected. I added the line in views.py after the "query.tweets_urls = docids_url" line.

            #############  The Tweets with Urls
            docids_url = exclude_hashtags(collname, query.tweets_urls, hashtaglist_urls)
            query.tweets_urls = docids_url
            query.tweets_urls = list(set(query.tweets_urls))  ### remove duplication line

- 2016-07-04T14:45:05+00:00

Ali Hürriyetoglu reporter
Well done! We can use this solution for now.

As a further investigation, we have to check:

- Can you see where the duplication happens? Could you print the size of this list at each step? We may be able to prevent duplicate entries. That way, we can handle big data sets more easily. Otherwise, duplicating a big collection will cause problems.
- Does this problem happen in all subsets?
- At which step does this occur?
- Do the duplicates come from the database? In that case, how can we prevent duplicate writes? Should we use something like: http://stackoverflow.com/questions/2801008/mongodb-insert-if-not-exists
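
One way to answer the "where" and "at which step" questions is to log the total and unique counts of each ID list after every step. A minimal sketch: `duplicate_report` and the sample IDs are hypothetical, and the helper would have to be called by hand at the relevant points in views.py:

```python
from collections import Counter


def duplicate_report(label, ids):
    """Print total vs. unique counts and return the IDs that occur more than once."""
    counts = Counter(ids)
    duplicated = {i: n for i, n in counts.items() if n > 1}
    print(f"{label}: total={len(ids)} unique={len(counts)} "
          f"duplicated_ids={len(duplicated)}")
    return duplicated


# Hypothetical ID list standing in for query.tweets_urls after one step.
dups = duplicate_report("tweets_urls after exclude_hashtags",
                        [10, 11, 10, 12, 11, 10])
```

If the total and unique counts already differ right after the IDs are read from the database, the duplicates come from storage; if they diverge only after a later step, that step is the place to look.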

Good luck ;)

Ali
- 2016-07-04T15:15:13+00:00
Erkan Başar
This solution looks nice; however, it is currently applied only to the tweets with URLs. There can also be duplicates in the other tweet datasets: tweets with photos and tweets with no photos or URLs. If we are going to use this solution, I suggest applying it in exclude_hashtags(), at the return of the method.
```
def exclude_hashtags(collname, tweet_ids, hashtaglist):
...
        return list(set(docids))
```
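
Note that `list(set(...))` does not keep the original order of the IDs. If the order matters downstream (an assumption; nothing in this thread says it does), an order-preserving variant is possible. A minimal sketch with a hypothetical helper name:

```python
def dedupe_keep_order(docids):
    """Remove duplicates while keeping the first occurrence of each ID in place."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(docids))


print(dedupe_keep_order(["b", "a", "b", "c", "a"]))  # ['b', 'a', 'c']
```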
- 2016-07-04T18:44:31+00:00

ugur ozcan

def exclude_hashtags(collname, tweet_ids, hashtaglist):

    cntr = Counter()

    model = get_document(collname)

    if(hashtaglist):
        tweets = []
        for hst in hashtaglist: 
            tweets_tmp = model.objects(Q(pk__in = tweet_ids) & Q(text__not__icontains = hst)).only('id')
            tweets = list(set(chain(tweets, tweets_tmp))) #my solution

This is a better solution for now. I am working on the code to find a better one, but I think we can use this solution to solve the duplicate problem.

- 2016-07-06T07:58:07+00:00

Ali Hürriyetoglu reporter
- changed status to resolved
The logic of retrieving and merging tweets which do not contain a hashtag was wrong. Ugur corrected it. But since we are not working with MongoEngine any more, this bug is not relevant for now.
- 2016-08-24T12:22:38+00:00

Assignee: ugur ozcan

Type: bug

Priority: minor

Status: resolved
