Improve near-duplicate detection by temporal characteristics
Issue #9
new
Do not shuffle all the tweets to pick a bucket.
1 - Sort the tweets by their time.
2 - Use a bucket of tweets starting from the first: [0:bucket_size]. After eliminating near-duplicates, the tweet that was at index bucket_size will be at a lower index, bucket_size_2.
3 - Get the second bucket after eliminating near-duplicates in the previous bucket. The second bucket should be [bucket_size_2:bucket_size+bucket_size_2].
Repeat until the end of the list. One batch can be enough.
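
The steps above can be sketched roughly as follows. This is only an illustration, not the project's actual code: the tweet representation (dicts with `time` and `text` keys), the function names, the `threshold` parameter, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def is_near_duplicate(a, b, threshold=0.8):
    # Hypothetical similarity test; the real detector may work differently.
    return SequenceMatcher(None, a, b).ratio() >= threshold


def drop_near_duplicates(bucket, threshold=0.8):
    # Keep the earliest tweet of each group of near-duplicates in the bucket.
    kept = []
    for tweet in bucket:
        if not any(is_near_duplicate(tweet["text"], k["text"], threshold) for k in kept):
            kept.append(tweet)
    return kept


def dedupe_by_time(tweets, bucket_size, threshold=0.8):
    # Step 1: sort by time instead of shuffling (sorted() copies the input).
    tweets = sorted(tweets, key=lambda t: t["time"])
    start = 0
    while start < len(tweets):
        # Steps 2 and 3: clean the bucket [start:start + bucket_size] in place.
        kept = drop_near_duplicates(tweets[start:start + bucket_size], threshold)
        tweets[start:start + bucket_size] = kept
        # The tweet that followed the bucket now sits at start + len(kept),
        # which plays the role of bucket_size_2 for the first bucket.
        start += len(kept)
    return tweets
```

Because each bucket only covers tweets that are adjacent in time, near-duplicates that are far apart in time are never compared in a single pass of this sketch.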