Source

probCounting / src / bitbucket.org / joshi4 /

Filename Size Date modified Message
..
getHash
loglogCount
1.3 KB
5.3 MB

Probabilistic 'Log-Log' Counting

I came across this blog post by a friend and thought it would be fun to implement it in Go.

The algorithm, specified in Durande et al's log-log paper is able to provide an accurate estimate of the cardinality of large multisets using a very very small amount of auxillary memory.

Results

Shakespeare

  • The logLogCount estimate is: 70280.703003
  • The actual numer of unique words is: 83696.000000
  • The error percentage is: 16.028600%

Patent Citation Network Edge List- Estimate Total Number of Nodes

  • The estimated number of nodes is: 3522826.369602
  • The actual numer of nodes is: 3774768.000000
  • The error percentage is: 6.674361%

Patent Citation Network Edge List- Estimate Total Number of Edges

  • The estimated number of edges is: 16509780.050316
  • The actual numer of edges is: 16518948.000000
  • The error percentage is: 0.055500%

The accuracy improves significantly as the cardinality of the multiset increases.

The files I used for evaluation can be found here: