Overview

CrystaleyeTestdata README

This is a subset of publications automatically downloaded from the Acta Crystallographica E
Open Subset. It currently contains:
README.txt	this file
LICENSE.txt
parseActa.xml	control file for parsing

* download
    ... 254 folders (uuid names generated by Nick Day's downloader)
        ... entry.xml metadata on each publication
        ... aadddd.html supplementary data for each pub
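
As a quick sanity check on the layout above, the following Python sketch reports which
download folders contain an entry.xml. The function name and the "download" path in the
usage comment are illustrative assumptions; nothing here is part of the crystaleye code.

```python
import os

def entries_with_metadata(download_dir):
    """Return (folders containing entry.xml, folders lacking it) under download_dir."""
    with_meta, without = [], []
    for name in sorted(os.listdir(download_dir)):
        folder = os.path.join(download_dir, name)
        if not os.path.isdir(folder):
            continue  # skip stray files at the top level
        if os.path.isfile(os.path.join(folder, "entry.xml")):
            with_meta.append(name)
        else:
            without.append(name)
    return with_meta, without

# Usage (assumes the current directory is the top of this test data set):
#   with_meta, without = entries_with_metadata("download")
#   print(len(with_meta), "folders with entry.xml;", len(without), "without")
```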
        
 To create the parsed files, check out and compile:
     http://bitbucket.org/petermr/crystaleye-moieties (contains the code for CrystaleyeProcessor)
     (This requires Maven 2.0.)

      Work out the location of crystaleyeTestData relative to the top directory of your
      CrystaleyeProcessor checkout. If the two are siblings, the path is
      ../crystaleyeTestData/parseActa.xml

      Edit the path in parseActa.xml to reflect this relative path (REALLY tacky, sorry).

      Run org.xmlcml.cml.crystaleye.CrystaleyeProcessor with arg: ../crystaleyeTestData/parseActa.xml
      (This arg assumes the sibling relationship.)

 This should then create /html and fill it...
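
If the two checkouts are not siblings, the relative path for the argument can be worked
out mechanically. The Python sketch below is only an illustration: the function name and
the example directories are assumptions, and it computes the path only; it does not
edit parseActa.xml.

```python
import os

def control_file_arg(processor_dir, testdata_dir, control_name="parseActa.xml"):
    """Path of the control file relative to the CrystaleyeProcessor checkout."""
    rel = os.path.relpath(testdata_dir, start=processor_dir)
    # Use forward slashes so the result matches the form used in this README.
    return os.path.join(rel, control_name).replace(os.sep, "/")

# Hypothetical sibling checkouts under /work:
print(control_file_arg("/work/crystaleye-moieties", "/work/crystaleyeTestData"))
```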
 
* html
    ... all folders will be generated by crystaleye-processor. Some of their contents will be downloaded.
    The directory structure will be of the form /x/x/x/uuid, where x/x/x are the first 3
        characters of the uuid.
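
The /x/x/x/uuid layout can be sketched in Python as follows. This is a minimal
illustration, not the processor's own code: the function name and the example uuid are
made up, and it assumes "first 3 letters" means the first three characters of the
folder name, whatever they are.

```python
from pathlib import PurePosixPath

def sharded_dir(html_root, uuid):
    """Return html_root/x/x/x/uuid, where x/x/x are the first 3 characters of uuid."""
    if len(uuid) < 3:
        raise ValueError("uuid must have at least 3 characters")
    return PurePosixPath(html_root, uuid[0], uuid[1], uuid[2], uuid)

# Hypothetical uuid, for illustration only:
print(sharded_dir("html", "0a1b2c3d-example"))  # html/0/a/1/0a1b2c3d-example
```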
        
        AFTER BUILDING, each directory should contain...
        
	annotated.xml		probably obsolete
	chemicalTagger.xml	result of chemicalTagger
	chemicalTreeBank.xml	result of adding semantics with ChemicalTreeBank
	cifmetadata.xml		bibliographic metadata from CIF file
	data.cif		CIF file downloaded from Acta site
	data.cif.cml		CIF processed to CML stage 1
	data.cif.xml		CIF processed to CML stage 0
	data.complete.cml	CIF processed to CML stage 2
	data.morganized.cml	possibly obsolete
	data.png		probable 2D chemical structure from 3D coordinates
	experiment.xml		experimental paragraph (to be parsed by chemical tagger)
	full.html		full-text of paper (downloaded)
	image.png		chemical structure of compound as image (roundtripping)
	imageMorgan.cml		converted to chemical identifier
	imageStructure.cml	converted to chemical structure
	metadata.xml		bibliographic metadata from Acta site
	morganized.xml		experiment with resolved chemical names
	opsin.png		parsing of title as compound (image)
	opsin.xml		parsing of title into chemistry
	opsinCoords.xml		coordinates of opsin structure
	opsinMorgan.xml		identifier for opsin structure
	resolved.xml		obsolete?
	scheme.gif		image of chemical structure (from Acta site)
	summary.html		summary page on crystaleye
	summaryPageUrl.xml	summary page URL
	suppText.html		supplemental data (tidied)
	title.xml		title of compound (chemical name)
        
        Some files may be absent. Failures in parsing result in zero-length files.
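
Because a failed parse leaves a zero-length file rather than no file, a built directory
can be triaged by file size. The Python sketch below is an assumption-laden
illustration: the function name is made up, and EXPECTED is only a small subset of the
file names listed above, chosen as an example.

```python
import os

# Illustrative subset of the per-entry files listed in this README.
EXPECTED = ["data.cif", "data.cif.xml", "data.cif.cml", "data.complete.cml"]

def classify_outputs(entry_dir, expected=EXPECTED):
    """Sort expected file names into ok / empty (parse failure) / missing."""
    report = {"ok": [], "empty": [], "missing": []}
    for name in expected:
        path = os.path.join(entry_dir, name)
        if not os.path.isfile(path):
            report["missing"].append(name)
        elif os.path.getsize(path) == 0:
            report["empty"].append(name)  # zero-length: parsing failed
        else:
            report["ok"].append(name)
    return report

# Usage, with a placeholder path:
#   print(classify_outputs("html/x/x/x/uuid"))
```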