Commits

Author / Message
Ben Wing
Add FIXME comments about eliminating remaining Params
Ben Wing
Eliminate most global Params references in Toponym.scala
Ben Wing
Eliminate most cases of use of global Params
Ben Wing
Begin to rewrite document loading so we don't necessarily load the entire training/eval set into memory.
Stephen Roller
Mega merge.
Stephen Roller
Fix an error where tg-copy-data wasn't copying the right stuff. Liberally add context.progress throughout the code so Hadoop doesn't think nodes go down when they're just initializing cells.
Stephen Roller
Whoops, compile error in the last merge.
Stephen Roller
Whoops, compile error in the last merge.
Stephen Roller
Merge.
Stephen Roller
Implement interpolation with parent nodes. Hooray.
Stephen Roller
Get rid of a record_oracle_result that was missed in a refactor.
Ben Wing
manual merge
Ben Wing
Redo corpus loading in DistDocument.scala so it expects split corpora
Ben Wing
Modify FrobCorpus.scala so that it can split a corpus into sub-corpora based on the 'split' field; merge functionality of ConvertTextToUnigramCounts.scala into FrobCorpus.scala, and eliminate ProcessCorpus.scala, with relevant functionality merged into ioutil.scala; replace FieldTextWriter with CorpusWriter for writing out a corpus in the expected format; various related changes
Ben Wing
Expand corpus schemas so that fixed fields (fields that have the same value for all rows) can be specified; create a Schema class to hold schemas, including field names and fixed fields+values; split TwitterDocument into TwitterTweetDocument (when a document is a tweet) and TwitterUserDocument (when a document is the set of tweets for a user); make FileIterator throw an error when calling next() when no more lines available, like other iterators
Ben Wing
Generalize corpus handling so that any corpus in .../corpora can be used; existing corpus names are recognized as aliases; remove special handling of twitter-geotext doc-thresh
Ben Wing
Minor comment changes
Ben Wing
Split out Sphere-specific stuff from SphereEvaluation.scala
Ben Wing
Separate SphereDocument into subtypes TwitterDocument, WikipediaDocument, GenericSphereDocument. Create corresponding subtables/factories for each of these types. Extract out all the Wikipedia-specific code into the WikipediaDocument and WikipediaDocumentSubtable classes. This separation into corpus-specific subclasses has the effect of ensuring that we only record information relevant to a particular corpus type, and in the most efficient way possible, since out-of-memory errors were a major problem. Rename (un)memoize_word to (un)memoize_string and memoize all the strings that may occur more than once. Add support for half-primitive hash tables in Trove-Scala 0.0.2 (our own version), and use them in the memoization tables. Also clean up the debugging code in the memoization code. Move much of the Sphere-specific stuff into SphereDocument.scala.
Ben Wing
Automatic merge
Ben Wing
Fix problems when exiting a file-reading loop: we need to actually break, not just set a stop flag, since otherwise we continue reading the whole file to the end; also make sure we close the file if we break early; also add the debug flag --debug sleep-at-docs=5000 (or whatever) to sleep for 5 secs after reading a certain number of docs, so we can easily use jmap to get a heap map right then; also add --debug stop-after-reading-dists to stop abruptly, also to help with heap-walking
Ben Wing
An apparent bug in Scala recorded class params 'keys' and 'values' as local variables, thereby retaining a reference to these arrays. Work around by using a non-default constructor to handle this.
Ben Wing
Add debug flag stop-after-reading-dists to abruptly stop after reading dists, for heap debugging purposes and such
Ben Wing
Fix broken oracle results
Ben Wing
Fix bugs in KML generation while preserving memory-saving optimization
Ben Wing
Add debug flags 'rethrow' and 'stacktrace' to help debug errors in parsing a corpus
Ben Wing
Fix a bug in param handling in GenerateKML.scala
Ben Wing
Add debug flag 'all-scores' to output the scores (e.g. KL divergence) of all cells when evaluating a test document
Ben Wing
Rename DocumentCorpusFileProcessor to CorpusFileProcessor
Ben Wing
Consistently use corpus suffix without leading hyphen