number of tweets read from a collection is too high

Issue #17 resolved
Ali Hürriyetoglu repo owner created an issue

The total number of tweets in the collection floodtags20140729 is around 130k, but the screen shows around 230k. What is the reason for this? Which is correct, 130k or 230k?

Comments (11)

  1. ugur ozcan

    Where do you see 230k? We did not understand. If you show us, we can find the solution. We checked in the terminal and on mlab, and we saw 134,567.

  2. Erkan Başar

    It reports more tweets than there should be (e.g. 230k instead of fewer than 130k) because the total number of tweets is currently calculated by summing the number of tweets per option (urls, photos, no_photos_urls) before rendering it in the template.

    See the code that causes this, on line 1161 of views.py:

    'numtwt_total': len(query.tweets_no_photos_urls) + len(query.tweets_photos) + len(query.tweets_urls),
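
A minimal sketch of the difference between summing the per-option counts and counting distinct tweets. The helper name `count_unique_tweets` and the sample IDs are hypothetical, and the three lists only stand in for `query.tweets_urls`, `query.tweets_photos` and `query.tweets_no_photos_urls`. It assumes tweet IDs are hashable.

```python
def count_unique_tweets(*id_lists):
    """Count distinct tweet IDs across any number of ID lists."""
    unique_ids = set()
    for ids in id_lists:
        unique_ids.update(ids)
    return len(unique_ids)


# Hypothetical sample data: tweet 3 has both a URL and a photo,
# and tweet 1 appears twice in the URL list.
tweets_urls = [1, 1, 2, 3]
tweets_photos = [3, 4]
tweets_no_photos_urls = [5]

# Summing the list lengths counts overlaps and duplicates more than once.
summed_total = len(tweets_urls) + len(tweets_photos) + len(tweets_no_photos_urls)
unique_total = count_unique_tweets(tweets_urls, tweets_photos, tweets_no_photos_urls)
print(summed_total, unique_total)  # 7 5
```

If the per-option lists are meant to overlap, the summed value can still be shown, but it should probably not be labelled as the collection total.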
    
  3. Ali Hürriyetoglu reporter

    Yes, we spotted this line. But 230k is too much. A somewhat larger number is normal, because some tweets contain several of the features we use to divide the collection. I suspect that the subset that does not have any tweets in it erroneously affects the result. Ugur is checking this.

  4. Erkan Başar

    The "query.save()" on line 1126 is there to make sure the list of excluded tweets is saved in case of an interruption in the exclude_hashtags() method. Common sense says that saving an item multiple times should not duplicate the items in a list field; however, I am not sure about the exact behaviour of the method.

    A possible cause of this issue is the querying with hashtags in the exclude_hashtags() method. In the following code, we query the database for every hashtag one by one and then append the results of those queries to each other with chain, without checking for uniqueness. Since we retrieve the tweets that don't contain the current hashtag, many of the tweets in the database are added to the 'tweets' list again in each iteration.

    Here is the code that causes this:

    tweets = []
    for hst in hashtaglist: 
        tweets_tmp = model.objects(Q(pk__in = tweet_ids) & Q(text__not__icontains = hst)).only('id')
        tweets = list(chain(tweets, tweets_tmp))
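
The effect described above can be reproduced without a database. The sketch below is an illustration only: `TWEETS`, `not_containing`, `exclude_chained` and `exclude_any` are hypothetical names, and a plain dictionary stands in for the MongoEngine collection. It also shows one possible reading of the intended behaviour, namely keeping only tweets that contain none of the hashtags. The thread does not show the final corrected code, so that part is an assumption.

```python
from itertools import chain

# Hypothetical in-memory stand-in for the tweet collection: id -> text.
TWEETS = {
    1: "flood in the city #flood",
    2: "heavy rain today #rain",
    3: "#flood and #rain everywhere",
    4: "nothing to report",
}


def not_containing(tweet_ids, hashtag):
    """Mimic the per-hashtag query: IDs whose text does not contain the hashtag."""
    return [i for i in tweet_ids if hashtag.lower() not in TWEETS[i].lower()]


def exclude_chained(tweet_ids, hashtags):
    """Reproduce the loop above: chain the per-hashtag results together."""
    tweets = []
    for hst in hashtags:
        tweets = list(chain(tweets, not_containing(tweet_ids, hst)))
    return tweets


def exclude_any(tweet_ids, hashtags):
    """Keep only IDs whose text contains none of the hashtags (assumed intent)."""
    lowered = [h.lower() for h in hashtags]
    return [i for i in tweet_ids if not any(h in TWEETS[i].lower() for h in lowered)]


ids = [1, 2, 3, 4]
tags = ["#flood", "#rain"]
chained = exclude_chained(ids, tags)
filtered = exclude_any(ids, tags)
print(chained)   # [2, 4, 1, 4]
print(filtered)  # [4]
```

In this example tweet 4 is returned twice by the chained version, and tweets 1 and 2 are returned even though each contains one of the excluded hashtags. Removing duplicates alone would still leave tweets 1 and 2 in the result.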
    
  5. ugur ozcan

    I added a line and the error was corrected. I added the line in views.py after the "query.tweets_urls = docids_url" line. (Attachment: dublicate_solution.png)

                #############  The Tweets with Urls 
    
                docids_url = exclude_hashtags(collname, query.tweets_urls, hashtaglist_urls)
    
                query.tweets_urls = docids_url
    
                query.tweets_urls= list(set(query.tweets_urls)) ### remove duplication line
    
  6. Ali Hürriyetoglu reporter

    Well done! We can use this solution for now.

    As a further investigation, we have to check:

    - Can you see where the duplication happens? Could you print the size of this list at each step? We may be able to prevent duplicate entries. That way, we can handle big data sets more easily. Otherwise, duplicating a big collection will cause problems.
    - Does this problem happen in all subsets?
    - At which step does this occur?
    - Do the duplicates come from the database? In that case, how can we prevent duplicate writes? Should we use something like: http://stackoverflow.com/questions/2801008/mongodb-insert-if-not-exists

    Good luck ;)

    Ali
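
The size check asked for in this comment could look roughly like the sketch below. `duplicate_report` and the step labels are hypothetical, and the ID lists are made-up sample values. The idea is to call it after each processing step and compare the numbers.

```python
from collections import Counter


def duplicate_report(label, ids):
    """Print and return (total, unique, ids seen more than once) for a list of IDs."""
    counts = Counter(ids)
    total = len(ids)
    unique = len(counts)
    duplicated = sum(1 for n in counts.values() if n > 1)
    print(f"{label}: total={total} unique={unique} ids_seen_more_than_once={duplicated}")
    return total, unique, duplicated


# Hypothetical ID lists captured before and after one processing step.
before = ["a", "b", "c"]
after = ["a", "b", "c", "a", "a", "b"]

report_before = duplicate_report("before exclude_hashtags", before)
report_after = duplicate_report("after exclude_hashtags", after)
```

The first step where `total` and `unique` start to differ is the step that introduces the duplicates.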

  7. Erkan Başar

    This solution looks nice; however, it is currently applied only to the tweets with URLs. There can also be duplicates in the other tweet subsets: tweets with photos, and tweets with no photos or URLs. If we are going to use this solution, my suggestion would be to apply it in exclude_hashtags(), in the return statement of the method.

    def exclude_hashtags(collname, tweet_ids, hashtaglist):
    ...
            return list(set(docids))
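
One detail to keep in mind with `list(set(...))` is that a set does not guarantee the original order of the IDs. If the order matters, an order-preserving variant could be used instead. This is a sketch with a hypothetical helper name, and it assumes the IDs are hashable.

```python
def unique_in_order(ids):
    """Drop repeated IDs while keeping the first occurrence of each, in order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(ids))


docids = [3, 1, 3, 2, 1]
deduplicated = unique_in_order(docids)
print(deduplicated)  # [3, 1, 2]
```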
    
  8. ugur ozcan

    def exclude_hashtags(collname, tweet_ids, hashtaglist):
        cntr = Counter()
        model = get_document(collname)

        if hashtaglist:
            tweets = []
            for hst in hashtaglist:
                tweets_tmp = model.objects(Q(pk__in=tweet_ids) & Q(text__not__icontains=hst)).only('id')
                tweets = list(set(chain(tweets, tweets_tmp)))  # my solution
    

    This is a better solution for now. I am working on the code to find a better one, but I think we can use this solution to solve the duplicate problem.

  9. Ali Hürriyetoglu reporter

    The logic for retrieving and merging tweets that do not contain a hashtag was wrong. Ugur corrected it. But since we are no longer working with MongoEngine, this bug is not relevant for now.
