Improve near-duplicate detection by temporal characteristics

Do not shuffle all the tweets to pick a bucket.

1 - Sort the tweets by time.
2 - Take a bucket of tweets starting from the first: [0:bucket_size]. After near-duplicates are eliminated, the tweet that was at index bucket_size will be at a lower index, bucket_size_2.
3 - Take the second bucket after eliminating near-duplicates in the previous bucket. The second bucket should be [bucket_size_2:bucket_size+bucket_size_2].

Repeat until the end of the sorted list. One batch can be enough.
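
A rough sketch of the steps above, in Python. The `(timestamp, text)` tweet representation, the function names, and the token-level Jaccard similarity with a 0.8 threshold are all assumptions made for illustration; the actual near-duplicate check in the project may differ and can be swapped in for `drop_near_duplicates`.

```python
from typing import List, Tuple

# Hypothetical representation: (timestamp, text).
Tweet = Tuple[float, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (a stand-in for the real measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def drop_near_duplicates(bucket: List[Tweet], threshold: float = 0.8) -> List[Tweet]:
    """Keep a tweet only if it is not too similar to one already kept."""
    kept: List[Tweet] = []
    for tweet in bucket:
        if all(jaccard(tweet[1], k[1]) < threshold for k in kept):
            kept.append(tweet)
    return kept


def dedup_by_time(tweets: List[Tweet], bucket_size: int,
                  threshold: float = 0.8) -> List[Tweet]:
    """Walk time-sorted tweets in consecutive buckets, without shuffling."""
    tweets = sorted(tweets, key=lambda t: t[0])  # step 1: sort by time
    start = 0
    while start < len(tweets):
        end = start + bucket_size
        kept = drop_near_duplicates(tweets[start:end], threshold)
        # Replace the bucket with its survivors; later tweets shift left.
        tweets[start:end] = kept
        # The tweet that was at `end` now sits at start + len(kept),
        # which plays the role of bucket_size_2 for the next bucket.
        start += len(kept)
    return tweets
```

For example, `dedup_by_time(tweets, bucket_size=3)` makes a single pass over the sorted list. Note that in this sketch tweets are only compared within a bucket, so near-duplicates that land in different buckets are not removed.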
