Commits

Author Commit Message Labels Comments Date
Stephen Roller
Initial code for permuting a large text file using hadoop.
Stephen Roller
Filter to only include tweets in North America. Delete code that ranks users.
Stephen Roller
Add support for centroid to the regular grid cell method. Massively improves results. Also, a cleaner method of making sure training documents are not counted twice within a cell.
Stephen Roller
Filter by number of tweets to remove some more spammers and inadequate documents.
Stephen Roller
Update scoobi again. (Sorry). Keep track of follower/following statistics and do some filtering this way.
Stephen Roller
Add in a patched version of Scoobi that doesn't die on long strings. Use Twokenize for word counting.
Stephen Roller
Downgrade scoobi (blah) and update to TwitterPull that kind of works, but dies when strings get too long due to a Scoobi/Java nuance.
Stephen Roller
Initial Twitter Pull processor.
Stephen Roller
Add in a scala JSON parser. Update to scoobi-0.2.0.
Stephen Roller
Fix a terribly subtle bug introduced by processing the KD tree in two passes; the global counts were off, causing the PGT distribution to be wrong. Took me a week to find this :(
Stephen Roller
Fix this silly error where UnigramWordDist wouldn't print.
Stephen Roller
Remove the use of the Remembering train as per Ben's Dec 9 email in 'Possible other KD-tree breakage, due to incorporating my streaming code'.
Stephen Roller
Use new system for building KD-tree based on entire data set.
Stephen Roller
Fix whitespace.
sarat
Added a constructor which can be used to build KD Tree using all nodes from start
Ben Wing
In FrobCorpus, move remove before add so we can remove a value field and add a fixed field of the same name
Ben Wing
Rename the 'create' param of find_best_cell_for_coord to 'create_non_recorded' to clarify the non-recordedness of any cells created this way
Ben Wing
Use ExperimentMeteredTask in more places instead of calling heartbeat, and put heartbeat calls in more places in ExperimentMeteredTask, even if this may be overkill
Ben Wing
Changed --kd-tree and --kd-use-backoff to flags, i.e. you don't need to add 'true' to them; BEWARE this may break some scripts
Ben Wing
Use ExperimentMeteredTask instead of directly calling driver.heartbeat
Ben Wing
Automatic merge
Ben Wing
Fix problems with GenerateKML due to ordering issues when creating word-dist factory
Ben Wing
Also track records skipped due to error
Ben Wing
Automatic merge
Ben Wing
Automatic merge
Ben Wing
Move stack trace to where it will be more useful
Ben Wing
typo
Ben Wing
Clean up the statistics, old stats no longer relevant with various changes made
Ben Wing
Further separation of generic (Abhimanu-style) stuff from stuff specific to our applications
Ben Wing
One more step towards not recording documents before using them: Don't load eval set during training, and make an executive decision to eliminate dependencies on this (specifically, in incoming-link count); doing this has unwanted side effects on K-d tree generation. Correspondingly, we need to handle the possibility that an eval document might not have a cell to go in. It also means we need to redo the way we do evaluation -- we have to load and process the eval set during evaluation. In the process, remove the (badly handled and in any case unused) possibility that initializing the distribution might fail to actually create a distribution. Also remove `t…