Commits

Author / Commit Message
Ben Wing
Create ExperimentMeteredTask for a metered task that also calls `heartbeat` every time an item is processed, to tell Hadoop that progress is being made. Use it in various places and eliminate unnecessary driver parameter passed in (retrievable from table parameter or other parameter).
Ben Wing
Fix a bug in format_float where zero and negative floating-point values led to an infinite loop
Ben Wing
Delete commented-out no-longer-necessary code in MultiRegularCell.scala
Ben Wing
Build up cell distributions in MultiRegularCells incrementally rather than remembering documents and doing it at the end -- a step towards not remembering training documents at all
Ben Wing
Eliminate junky empty classes EvaluationDocument, EvaluationResult, instead making these types into type parameters; make DefaultEvaluationOutputter a trait on GeolocateDocumentEvaluator and such
Ben Wing
Rename placeholder types so that they consistently begin with T instead of ending in Type, to make them easier to identify and distinguish from class types, and shorten them whenever possible
Ben Wing
Use DocType instead of DocumentType in the type signature of generic classes for conciseness
Ben Wing
A bit of cleanup -- use imp_* instead of do_* for abstract functions that actually implement an operation and are wrapped by a concrete function that adds some extra functionality, e.g. verifying that things are done in the right order
Ben Wing
Some cleanup of WordDist and related code, adding more asserts to check attempts to change a distribution after it's been 'frozen', or attempts to use a distribution before it's been 'frozen'
Ben Wing
A bit more CombinedWordDist cleanup
Ben Wing
Various cleanups of RegularCellIndex and related things, preparing for moving to stream handling of documents
Ben Wing
Add comment describing CellGrid in more detail
Stephen Roller
Add another heartbeat and keep the dictionaries from growing too large.
Stephen Roller
Fix some variables so things run on Hadoop.
Ben Wing
Fix last error preventing Hadoop from working; now works at least in non-distributed mode
Ben Wing
Automatic merge
Ben Wing
Automatic merge
Ben Wing
Remerge after within-repository split
Ben Wing
Fix problems causing slowness in Wikipedia tables -- we weren't recording the document title
Ben Wing
It turned out that many Wikipedia documents had no distribution computed during pre-processing because of paren/bracket-related errors in the docs; try to fix/work around this
Ben Wing
Fix some silly errors preventing Hadoop from working
Ben Wing
Some more coding in an attempt to get Hadoop to work again (not tested)
Ben Wing
Merge GeoDocument and DistDocument into the latter; rename old DistDocumentFileProcessor to DistDocumentTableFileProcessor
Ben Wing
Minor comment tweak
Ben Wing
Some fixes for getting Hadoop working better (not tested, probably needs more work)
Ben Wing
Fixes in comments describing how to convert the old GeoText corpus to new style
Ben Wing
Some hacking on preprocessing scripts to convert GeoText corpus and such to newest, split-training/dev/test format
Ben Wing
Automatic merge
Ben Wing
Minor cleanup of pseudo-Good-Turing dist code
Ben Wing
A few more changes related to not always recording all training docs