mallet / mallet2-changes.txt

MALLET 2.0 Changes

* SVN instead of CVS

MALLET 2.0 source code will be maintained in an SVN repository.

* Package Names

Moved edu.umass.cs.mallet.base.* to cc.mallet.*
This results in much shorter package names.
The 'users' and 'projects' packages all appear off cc.mallet; 
these will likely be placed in a different SVN repository, however.

The 'maximize' package has been renamed 'optimize'.

* Optimizers Tied to their Optimizable objects

Maximizers (the old name, now renamed Optimizers) had a stateless interface: calls 
to maximizer.maximize(maximizableObject) were supposed to complete their 
optimization work before returning, maintain no state, and then support 
another call to the same maximizer with a different maximizableObject.

Optimizers are now stateful and associated with a particular 
maximizableObject at construction time.

Thus old code like this:
		MaximizableTrainer mt = new MaximizableTrainer (trainingSet, (MaxEnt)initialClassifier);
		Maximizer.ByGradient maximizer = new LimitedMemoryBFGS();
		maximizer.maximize (mt);
should be changed to
		MaximizableTrainer mt = new MaximizableTrainer (trainingSet, (MaxEnt)initialClassifier);
		Optimizer optimizer = new LimitedMemoryBFGS(mt);
		optimizer.optimize ();

This also means that there is no more need for those awkward "eval" objects
with so many arguments.  You can now safely and easily call an optimizer or 
a trainer for just one or a few iterations, and then evaluate whatever you 
want---without a complicated call-back.

Optimizable.ByGradient -> Optimizable.ByGradientValue
There is a new Optimizable.Gradient, but it has only the gradient method, not the value method.
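The stateful-optimizer pattern above can be sketched as follows. This is a minimal illustration using simplified stand-in types (Optimizable, Quadratic, and GradientAscent here are hypothetical stand-ins, not MALLET's real classes):

```java
// Stand-in for an optimizable objective: maximize f(x) = -(x-3)^2.
interface Optimizable {
    double getValue();
    double getParameter();
    void setParameter(double v);
    double getGradient();
}

class Quadratic implements Optimizable {
    private double x = 0.0;
    public double getValue() { return -(x - 3.0) * (x - 3.0); }
    public double getParameter() { return x; }
    public void setParameter(double v) { x = v; }
    public double getGradient() { return -2.0 * (x - 3.0); }
}

// The optimizer is bound to one Optimizable at construction time and keeps
// its state across calls to optimize(), mirroring the MALLET 2.0 design.
class GradientAscent {
    private final Optimizable opt;
    GradientAscent(Optimizable opt) { this.opt = opt; }

    // Run a bounded number of iterations; returns true when converged.
    boolean optimize(int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            double g = opt.getGradient();
            if (Math.abs(g) < 1e-8) return true;
            opt.setParameter(opt.getParameter() + 0.1 * g);
        }
        return false;
    }
}

public class OptimizerSketch {
    public static double run() {
        Quadratic q = new Quadratic();
        GradientAscent optimizer = new GradientAscent(q);
        // Because the optimizer is stateful, we can run a few iterations,
        // evaluate whatever we want in between, and then resume.
        while (!optimizer.optimize(10)) {
            // e.g. inspect q.getValue() here -- no callback object needed
        }
        return q.getParameter();
    }
    public static void main(String[] args) {
        System.out.println("x* = " + run());
    }
}
```

Because the optimizer keeps its state between calls, the caller can interleave evaluation with optimization; this is exactly what removes the need for the old "eval" callback objects.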

* Pipes

In MALLET 0.x the work of transforming objects in feature extraction
pipelines was performed entirely in the Pipe.pipe(Instance) method,
which takes an Instance argument, transforms the fields of the Instance
however it sees fit, and returns the transformed Instance.  This scheme is
simple and has been widely sub-classed; however, it only allows
"one-to-one" transformations of Instances.  There was no way for a pipeline
to produce more than one Instance from a single source Instance.  For 
example, there was no way to take a document (as an Instance) and produce
Instances representing all pairs of mention-coreference decisions as output. 

Pipe objects now have two "standard" ways to be invoked: the old one, 
plus a new one that allows one-to-many and many-to-one flow of instances
through the Pipe.

The old one-to-one mapping, provided by the pipe(Instance) method is still
available, and all your old Pipe subclasses should continue to work unchanged. 

The one-to-many and many-to-one mappings are provided by
a new method, called newIteratorFrom(InstanceIterator source).   
Given an iterator, this method will return a new InstanceIterator over the 
result of pipe-processing the elements of the first iterator.  
The standard way to run many instances through a Pipe is now to use this
iteration scheme.  For example:
  InstanceIterator ii = pipe.newIteratorFrom (source);

Because a Pipe object may be asked to produce more than one iterator,
Pipe objects themselves should not be stateful, but their iterators may be.

The most fundamental way to run many instances through a Pipe is now to
use this new iteration scheme.  Backwards compatibility to the 
one-to-one pipe(Instance) method is provided by a standard implementation
of the newIteratorFrom() method that calls pipe().  You can override 
newIteratorFrom() or pipe(), but cannot meaningfully override both.

The Pipe.pipe(instance) method is now protected.
You should not call this yourself; instead use newIteratorFrom() (or the
instanceFrom()/instancesFrom() convenience methods).

Instance no longer has a five-argument constructor, the last of which was
a pipe object.  The standard way to get raw data, through a pipe, into
an instance is
  Instance inst = pipe.instancesFrom(new Instance(data,target,name,source))[0];
This method returns an array of Instance objects, since in general a
pipe may not map instances one-to-one.  Or, more briefly:
  Instance inst = pipe.instanceFrom(new Instance(data,target,name,source));
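The one-to-many flow that newIteratorFrom() enables can be illustrated with simplified stand-in types (SplitOnSpacePipe and the String elements are hypothetical stand-ins for MALLET's Pipe and Instance):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Stand-in pipe: one input string may yield many output tokens.
class SplitOnSpacePipe {
    // Given a source iterator, return a new iterator over the result of
    // pipe-processing the source's elements.  The pipe itself holds no
    // state; all state lives in the returned iterator.
    Iterator<String> newIteratorFrom(final Iterator<String> source) {
        return new Iterator<String>() {
            private final List<String> pending = new ArrayList<String>();
            public boolean hasNext() {
                while (pending.isEmpty() && source.hasNext())
                    for (String w : source.next().split(" "))
                        if (!w.isEmpty()) pending.add(w);
                return !pending.isEmpty();
            }
            public String next() { hasNext(); return pending.remove(0); }
            public void remove() { throw new UnsupportedOperationException(); }
        };
    }
}

public class PipeIteratorSketch {
    public static List<String> run() {
        SplitOnSpacePipe pipe = new SplitOnSpacePipe();
        Iterator<String> ii =
            pipe.newIteratorFrom(Arrays.asList("a b", "c d e").iterator());
        List<String> out = new ArrayList<String>();
        while (ii.hasNext()) out.add(ii.next());
        return out;  // two source elements become five output elements
    }
    public static void main(String[] args) { System.out.println(run()); }
}
```

Note that the pipe object is stateless and can hand out any number of independent iterators, matching the rule below that Pipes must not be stateful but their iterators may be.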

The interface PipeInputIterator and the abstract class
AbstractPipeInputIterator have been replaced with cc.mallet.types.InstanceIterator.
You will have to make this change in your code:
s/PipeInputIterator/InstanceIterator/

Pipe instances no longer have a parent.  This enables reuse of Pipes as 
sub-parts of other Pipes.

Pipe instances no longer try to be so clever (and fragile) about automatically
creating Alphabets or automatically obtaining them from elsewhere when 
necessary.  If a Pipe will need an Alphabet, the Alphabet must be constructed 
and passed to the Pipe's constructor.  Thus, code like this:
  super (null, LabelAlphabet.class);
should be changed to   
  super (null, new LabelAlphabet());
The Pipes' Alphabets still get properly shared through a SerialPipes collection.
See the code for details.

* Alphabets and Pipes for Weak Type Checking

In Mallet 0.x Instances and InstanceLists made many safety checks by 
ensuring that Instances, InstanceLists, Classifiers, etc used exactly the
same Pipe.  For example, InstanceLists ensured that all their Instances
came from the same Pipe.  Furthermore, to ensure that final Instance 
contents were indeed produced by the Pipe they recorded, virtually the only 
way to create an Instance was by a constructor that took this Pipe as an
argument.  A corollary of this arrangement is that Instances didn't store
their Alphabets; they stored their Pipe, from which the Alphabet was obtained.

Although well-intentioned, this checking and the inflexibility in Pipe usage
this entailed proved to be more trouble than it was worth.  

In Mallet 2.0, type checking does not happen by checking Pipe equality---
it happens by merely checking Alphabet equality.  Thus, as long as two
Instances have the same Alphabets, they will be declared compatible (for
the purposes of training classifiers, etc), even if they came from different
Pipes.  One concrete use of this is that there might be one Pipe for processing
a plain text document into a FeatureVector and another Pipe for processing
a web page into a FeatureVector; as long as they share the same Alphabet,
a classifier can use Instances from a mixture of these Pipes.

There is a new interface for objects that store Alphabets called AlphabetCarrying.
This can return a single Alphabet, or an arbitrarily-sized array of Alphabets.  
There are also convenience methods for asking if two AlphabetCarrying objects
have matching alphabets---this is the primary method of doing "type checking".

The standard way to ensure that it is safe to use an Instance
in some context is to compare their Alphabets.  There are several
methods for doing this; the most general is obj1.alphabetsMatch(obj2),
where both obj1 and obj2 both implement AlphabetCarrying.
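A minimal sketch of this Alphabet-based "type checking", using simplified stand-in classes (Alphabet, AlphabetCarrying, and FeatureVectorStandIn here are illustrations, not MALLET's real implementations):

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in Alphabet: a growable mapping from feature names to indices.
class Alphabet {
    private final List<String> entries = new ArrayList<String>();
    int lookupIndex(String s) {
        int i = entries.indexOf(s);
        if (i < 0) { entries.add(s); i = entries.size() - 1; }
        return i;
    }
}

// Stand-in for the AlphabetCarrying interface.
interface AlphabetCarrying {
    Alphabet getAlphabet();
}

class FeatureVectorStandIn implements AlphabetCarrying {
    private final Alphabet alphabet;
    FeatureVectorStandIn(Alphabet a) { this.alphabet = a; }
    public Alphabet getAlphabet() { return alphabet; }
    // Compatibility is decided by Alphabet identity, not by pipe identity.
    boolean alphabetsMatch(AlphabetCarrying other) {
        return this.alphabet == other.getAlphabet();
    }
}

public class AlphabetMatchSketch {
    public static boolean[] run() {
        Alphabet shared = new Alphabet();
        // Two vectors produced by (conceptually) different pipes, but
        // sharing one Alphabet: compatible.
        FeatureVectorStandIn fromTextPipe = new FeatureVectorStandIn(shared);
        FeatureVectorStandIn fromHtmlPipe = new FeatureVectorStandIn(shared);
        // A vector with a different Alphabet: incompatible.
        FeatureVectorStandIn fromOther = new FeatureVectorStandIn(new Alphabet());
        return new boolean[] {
            fromTextPipe.alphabetsMatch(fromHtmlPipe),
            fromTextPipe.alphabetsMatch(fromOther)
        };
    }
    public static void main(String[] args) {
        boolean[] r = run();
        System.out.println(r[0] + " " + r[1]);
    }
}
```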

A consequence is that all places in your code in which you previously asked
  instance.getPipe() == myPipe
you should instead compare Alphabets, for example
  instance.alphabetsMatch(myPipe)

* Pipes, Instances and InstanceLists

Instances no longer store the Pipe that produced them.  (Since Pipes have become
more flexible, it may even be hard to determine or store a single Pipe that
produced them.)  Instead Instances store their Alphabets.  Thus the getPipe()
method no longer exists.  Rather than
  instance.getPipe().getDataAlphabet(), etc
use
  instance.getDataAlphabet(), etc
The implementation of getDataAlphabet() simply calls getAlphabet() on the
instance.data Object, if it implements the AlphabetCarrying interface.

This infrastructure lays the groundwork for implementing data objects that are
based on more than one Alphabet, and Instances that have more than two Alphabets.

Previously, there was a version of Instance.getData() that took a Pipe
argument.  (This allowed different users of the instance different
"views" of the instance through different pipes.)  This method has been removed.
There is now only an Instance.getData() with zero arguments.

InstanceList is now a subclass of ArrayList, so it supports all the standard
Java collection operations.

InstanceList.add(InstanceList) has been removed.  Instead you might want
instanceList1.add(instanceList2.iterator()) if you want them to be piped with 
instanceList1's pipe.  To avoid the piping, you'll have to roll your own loop, 
calling instanceList1.add(Instance).

ROUGH NOTES: ??  Forces data through pipe:
InstanceList.addUsingPipe(InstanceIterator) or addWithPipe, addThruPipe, addUsingPipe
Puts in raw, does not send through pipe:
//InstanceList.add(InstanceIterator)  Add this 6 months from now

There are new facilities for "editing" SerialPipes and causing skipping behavior:

First, SerialPipes has a few new methods for returning a new SerialPipes that has a 
subset of the Pipe steps in the original SerialPipes.  These include
newSerialPipesFromRange(int start, int end)
newSerialPipesFromSuffix(SerialPipes.Predicate testForStartingNewPipes)

Second, Pipe has a new method  
  public boolean precondition(Instance).  
Before the Pipe processes an instance it tests the precondition.  If the precondition
fails, the Instance gets passed through the pipe unchanged.  This can be used to build
SerialPipes that skip early processing steps if they are not necessary.  In particular, 
the standard use case is to define an anonymous inner sub-class:
		SerialPipes sp = new SerialPipes (new Pipe[] {
			new CharSequence2TokenSequence() {
				public boolean precondition (Instance inst) { return ! (inst.getData() instanceof TokenSequence); } },
			new TokenSequence2FeatureSequence(), });

* ClassifierTrainers are stateful

In most cases, you can replace MyTrainer.train (instanceList) with MyTrainer.Factory.train(instanceList).
But some arguments for some of the more complex methods have changed, and evaluation is no longer done by callback.
		public abstract Classifier train (InstanceList trainingSet,
				InstanceList validationSet,
				InstanceList testSet,
				ClassifierEvaluating evaluator,
				Classifier initialClassifier);
		public abstract Classifier train (InstanceList trainingSet,
				InstanceList validationSet,
				Classifier initialClassifier);

ClassifierTrainer.classifier is a new inherited variable.
You must set this variable in your train methods before you return.  So rather than
    return new NaiveBayes (...);
write
    this.classifier = new NaiveBayes (...);
    return this.classifier;

* Transducer and Friends get Re-factored for Flexibility and Clarity

In MALLET 0.x, Transducer and its subclasses such as CRF encapsulated model 
parameters, inference functionality and training functionality.  Transducer (and 
its TransitionIterator in particular) was implemented in such a way that it was 
impossible to split these apart.  The result was that anyone who wanted to 
implement a new inference method (e.g. beam-search based Viterbi, etc), or a 
new learning method (pseudo-likelihood, etc) needed to either add this code to 
CRF.java or copy all the CRF.java code to a new class.

In MALLET 2.0, the model, inference and learning have all been separated.

Inference classes are no longer inner classes, but distinct classes that take
a Transducer as an argument to their constructor.  For example, rather than 
  Lattice foo = crf.forwardBackward (input, output);
instead use
  ForwardBackwardLattice foo = new FBLattice (crf, input, output);
  BeamFBLattice foo = new BeamFBLattice (crf, input, output, 3);
Similarly for Viterbi.  Old style
  ViterbiPath bar = crf.viterbiPath (input);
must be changed to something like
  ViterbiPath bar = new ViterbiPath (crf, input);
Also available are variants such as
  ViterbiPath bar = new ViterbiPathNBest (crf, input, 5);  

Unlike ClassifierTrainer, but like Optimizers, TransducerTrainer is tied to a Transducer at constructor time.

CRF has control over feature selection and feature freezing.
CRFTrainer has control over weight values and their sparsity (implying that CRFTrainer actually allocates the memory for CRF.weights) and induction.

TransitionIterator no longer has complicated methods for incrementCount.  
This is now handled by the Transducer.Incrementor interface, passed as an argument to inference methods.
The objects implementing these interfaces are typically anonymous inner classes.
This is nice because you can see the effect and style of incrementing sufficient statistics right next to the call to the inference algorithm.
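The callback pattern described above can be sketched with simplified stand-in types (Incrementor, ToyLattice, and IncrementorSketch are illustrations, not MALLET's real Transducer.Incrementor or inference classes):

```java
// Stand-in for the Transducer.Incrementor callback interface.
interface Incrementor {
    void incrementTransition(int from, int to, double count);
}

class ToyLattice {
    // Stand-in "inference": walk a fixed state path and report a unit
    // count for each transition through the supplied incrementor.
    static void accumulate(int[] path, Incrementor inc) {
        for (int t = 1; t < path.length; t++)
            inc.incrementTransition(path[t - 1], path[t], 1.0);
    }
}

public class IncrementorSketch {
    public static double[][] run() {
        final double[][] counts = new double[2][2];
        // Anonymous inner class: the way sufficient statistics are
        // accumulated is visible right next to the inference call.
        ToyLattice.accumulate(new int[] {0, 1, 1, 0}, new Incrementor() {
            public void incrementTransition(int from, int to, double c) {
                counts[from][to] += c;
            }
        });
        return counts;
    }
    public static void main(String[] args) {
        double[][] c = run();
        System.out.println(c[0][1] + " " + c[1][1] + " " + c[1][0]);
    }
}
```

The same inference routine can thus serve many training methods; only the anonymous Incrementor changes.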

TransducerEvaluator is gone.  You can now write your own eval code in between calls to train(one iteration).

public void evaluate (TransducerEvaluator eval, InstanceList testing) is gone.

public Instance transduce (Instance) is new.

Training by proportions used to always train on 100% of the data at the end.  It no longer does that.
You can train on only a fraction of the data with new double[] {0.2}.
If you want to end by training on all the data, you must say new double[] {0.1, 0.5, 1.0}.
The last round does get special behavior, though: numIterationsPerRound is ignored?  No, verify this...

Old:
  Maximizable.ByValue mcrf = crf.getMaximizableCRF (instances);
New:
  CRFTrainerByLikelihood crft = new CRFTrainerByLikelihood (crf);
  Optimizable.ByValue mcrf = crft.getOptimizableCRF (instances);
TransducerEvaluator is no longer used as a call out from training() methods, 
and no longer manages its own schedule of which iterations to eval on or not.
There was previously a lot of code repetition, long lists of difficult to read
arguments, control of evaluation frequency far from its caller.  New scheme makes
the timing logic more visible.
Old:
  crf.train (trainSet, validationSet, testSet, evaluator)
New:
  TokenAccuracyEvaluator eval = new TokenAccuracyEvaluator (new InstanceList[] {trainSet, validationSet, testSet},
    new String[] {"training", "validation", "testing"});
  while (crft.train(trainingSet)) {
    if (crft.getIteration() % 5 == 0)
      eval.evaluate (crft);
    if (crft.getIteration() % 10 == 0)
      new ViterbiWriter("experiment1-", new InstanceList[] {training, testing}, new String[] {"training", "testing"}).evaluate(crft);
  }

  crf.setWeightsDimensionAsIn (one, false); now takes an additional (second) argument: boolean useUnsupportedTrick.

Old:
		Sequence predicted = new ViterbiPath (model, input).output ();
New:
		Sequence predicted = new MaxLatticeDefault (model, input).bestOutputSequence();
Old:
		new ViterbiPath (model, sequence).getCost();
New:
		new MaxLatticeDefault (model, sequence).bestOutputAlignment().getCost();

Old:
    lattice.getStateAtRank (ip+1, k);
New:
    List<Sequence<Transducer.State>> stateSequences = lattice.bestStateSequences(max);
    Transducer.State state = stateSequences.get(k).get(ip+1);

 Evaluators take trainers as arguments, but have InstanceLists implicit.  Trainers take InstanceLists as arguments.  Is this right?

* Minor Name Changes and Other Member Changes.

Instance.setLock() has been renamed to lock().

Instances no longer have parents.

pipe.iterator.SimpleFileIterator -> pipe.iterator.SimpleFileLineIterator

The method Instance.Iterator.nextInstance() has been renamed
Instance.Iterator.next(), since InstanceList now uses generics.

mallet.util.Random has been renamed to mallet.util.Randoms (to help avoid confusion with java.util.Random).
The naming scheme parallels that of mallet.util.Maths.