==Overview==
AMI2 is a general approach to Open Content Mining of STM (scientific, technical and medical) documents.
It is designed to extract semantic information from both textual and diagrammatic content (the latter
only if present as graphical primitives, not bitmaps).

==Installing==
AMI2 is a Java/Maven project that uses a number of third-party and PMR libraries. The latest version is at:
https://bitbucket.org/petermr/ami2
You will need Mercurial to download and sync the code.
Then run
    hg clone https://petermr@bitbucket.org/petermr/ami2
which downloads the latest version into a new directory, ami2, under the current directory.

==Libraries==
AMI2 depends on the following PMRGroup libraries:
https://petermr@bitbucket.org/wwmm/euclid-testutils
https://petermr@bitbucket.org/wwmm/euclid
https://petermr@bitbucket.org/wwmm/cmlxom
https://petermr@bitbucket.org/wwmm/jumbo-testutils
https://petermr@bitbucket.org/wwmm/jumbo6
https://petermr@bitbucket.org/petermr/svg
https://petermr@bitbucket.org/petermr/html
During the alpha phase the safest approach is to clone each of these and, in each cloned directory (foo here), run:
    cd foo
    mvn clean install
(At a later stage we shall include these in the wwmm repo at http://wwmm.ch.cam.ac.uk)
Each install should generate entries in your local Maven repository, e.g.:
  .m2\repository\org\xml-cml
with components similar to the following (those marked * may be absent at this stage):
  16/06/2012  15:04    <DIR>          cifxom*
  10/10/2012  06:35    <DIR>          cmlxom
  10/10/2012  06:35    <DIR>          euclid
  15/06/2012  13:18    <DIR>          euclid-testutil
  16/06/2012  18:40    <DIR>          html
  22/06/2012  16:43    <DIR>          jc*
  14/10/2012  01:29    <DIR>          jtmt1*
  15/06/2012  13:56    <DIR>          jumbo
  10/10/2012  06:35    <DIR>          jumbo-testutil
  24/06/2012  14:06    <DIR>          svg
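
As a quick check that the installs above succeeded, the expected directories can be looked up programmatically. The following is only a sketch: the class RepoCheck and its method are hypothetical (not part of AMI2), and the directory names are taken from the unstarred entries in the listing above.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of AMI2): reports which expected directories are absent. */
final class RepoCheck {

    /**
     * @param groupDir the group directory, e.g. ~/.m2/repository/org/xml-cml
     * @param names    directory names expected beneath it
     * @return the names that do not exist as directories, in the order given
     */
    static List<String> missing(Path groupDir, List<String> names) {
        List<String> absent = new ArrayList<>();
        for (String name : names) {
            if (!Files.isDirectory(groupDir.resolve(name))) {
                absent.add(name);
            }
        }
        return absent;
    }
}
```

A typical call would pass Paths.get(System.getProperty("user.home"), ".m2", "repository", "org", "xml-cml") together with the names cmlxom, euclid, euclid-testutil, html, jumbo, jumbo-testutil and svg. This assumes the default location of the local Maven repository; an empty result means every library was installed.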


==Running from commandline==
At a later stage we expect to provide an installed jar file, but while the project is changing daily
you need to run from the ami2 directory:
  cd ami2
and run PDF2XMLConverter with arguments (enter the command as a single line; it is wrapped here for width):
  mvn -e exec:java -Dexec.mainClass="org.xmlcml.graphics.pdf2svg.PDF2XMLConverter" 
    -Dexec.args="-c src/main/resources/org/xmlcml/graphics/styles/basic.xml -i src/test/resources/test"

-c  the commandfile (the commands to execute, here "basic.xml").
-i  the directory containing the PDFs to convert.
Note that the Maven "home" is the directory containing pom.xml, so paths such as src/ are relative to it;
alternatively you can use absolute filenames.

==Operation and output==
The process involves two steps:
 * convert PDF to SVG with no normalization. Output is written to ./raw relative to the -i directory.
    One file ({n}page.svg) is generated for each PDF page. n starts at ZERO (we shall probably
    change this to ONE).
 * convert ./raw/{n}page.svg to ./out/{n}start.svg and ./out/{n}end.svg. The exact filenames depend
    on the write statements in the commandfile.
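
The naming scheme in the two steps above can be sketched as follows. PagePaths is a hypothetical helper, not an AMI2 class. It assumes that ./out sits beside ./raw under the -i directory and that the commandfile writes the start/end names shown above; a different commandfile may produce different names.

```java
import java.nio.file.Path;

/** Hypothetical helper (not part of AMI2): files expected for one PDF page. */
final class PagePaths {
    final Path raw;    // step 1 output: un-normalized SVG
    final Path start;  // step 2 output, written at the start of processing
    final Path end;    // step 2 output, written when processing completes

    PagePaths(Path inputDir, int n) {
        if (n < 0) {
            throw new IllegalArgumentException("page numbering starts at 0, got " + n);
        }
        this.raw = inputDir.resolve("raw").resolve(n + "page.svg");
        // ASSUMPTION: ./out is resolved against the same -i directory as ./raw.
        Path out = inputDir.resolve("out");
        this.start = out.resolve(n + "start.svg");
        this.end = out.resolve(n + "end.svg");
    }
}
```

For the first page of a PDF in src/test/resources/test, new PagePaths(Paths.get("src/test/resources/test"), 0) gives raw/0page.svg, out/0start.svg and out/0end.svg under that directory.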
    
Note:
 * each page is converted separately. A thread is launched with a timeout (approximately 15 seconds) in case
   the conversion hangs or takes too long. If the thread fails to complete, no *end*.svg is written.
 * images are discarded from the output to save space and processing time. This is expected to change later.
 * the conversion is highly controllable through the commandfile.
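
The per-page timeout described in the notes above can be illustrated with the standard java.util.concurrent classes. This is a sketch of the general technique, not the AMI2 implementation: PageTimeout and runWithTimeout are hypothetical names, and the actual converter may manage its threads differently.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch (not AMI2 code): run one page conversion with a time limit. */
final class PageTimeout {

    /**
     * Runs pageTask on its own thread.
     * Returns the task's result, or null if it timed out, failed, or was interrupted.
     */
    static <T> T runWithTimeout(Callable<T> pageTask, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<T> future = executor.submit(pageTask);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // Interrupt the worker; it stops only if the task responds to interruption.
            future.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return null;
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A caller would pass 15000 for the timeout mentioned above and write the end file only when the result is not null, which matches the behaviour that no *end*.svg appears for a page that fails to complete.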

==Control==
The commandfile is designed for precise control but will normally be generic at some level. Where possible,
specific input (e.g. the directory to convert) should be given on the commandline and generic material
(e.g. page normalization) in the controlfile. The controlfile is designed to be as generic as possible,
but it may be necessary to specify stylefiles for different document types (e.g. publishers). In general,
the commandline overrides the controlfile and allows values to be supplied for symbolic variables.
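
The override rule can be illustrated with a small sketch. Everything here is an assumption made for illustration: the Settings class, the flat key/value representation and the ${name} syntax for symbolic variables are hypothetical, and the real controlfile syntax may well differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not AMI2 code): commandline values override controlfile values. */
final class Settings {
    // ASSUMPTION: symbolic variables are written ${name}; AMI2 may use another syntax.
    private static final Pattern VARIABLE = Pattern.compile("\\$\\{([^}]+)\\}");

    /** Returns a new map in which commandline entries replace controlfile entries with the same key. */
    static Map<String, String> merge(Map<String, String> controlFile, Map<String, String> commandLine) {
        Map<String, String> merged = new HashMap<>(controlFile);
        merged.putAll(commandLine);
        return merged;
    }

    /** Replaces each ${name} that has a value; unknown names are left untouched. */
    static String substitute(String text, Map<String, String> values) {
        Matcher m = VARIABLE.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String value = values.get(m.group(1));
            String replacement = (value != null) ? value : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(sb);
        return sb.toString();
    }
}
```

With a controlfile default for a (hypothetical) inputDir key and a commandline value of src/test/resources/test, merge keeps the commandline value, and substitute then expands ${inputDir} wherever it appears in the generic settings.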