These tasks will be added as tickets and issues.
We shall crawl BMC (complete journal selection), PLOS ONE (complete journal selection), and Molecular Phylogenetics and Evolution (looking for OA).
- where to put output?
- when to start and stop?
- what problems occur?
- Ross: get phylo articles, figure images & captions from PLOS ONE articles
- Ross: ask Bath if we can store OA processed data on their infrastructure
- Ross: get Weka text classification going on sample figure captions
- Ross & PMR: blog project progress frequently
- need to locate URLs for as many file types as possible
- download files
- transform PDF -> HTML+SVG+PNG
- package as epub [check readability]
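
A rough sketch of the first and last steps above (sorting located URLs by file type, and packaging outputs as an epub-style zip), using only the Python standard library. The function names `group_urls_by_type` and `write_epub_container`, the `OEBPS/` layout and the example URLs are assumptions made for illustration, not project decisions. The sketch does not generate `content.opf` or a navigation document, so on its own it will not give a fully valid epub; downloading and the PDF transform are left out entirely.

```python
import io
import posixpath
import zipfile
from collections import defaultdict
from urllib.parse import urlparse


def group_urls_by_type(urls):
    """Group URLs by the lowercase file extension of their path ('' if none)."""
    groups = defaultdict(list)
    for url in urls:
        # Use only the path, so query strings like ?size=large are ignored.
        path = urlparse(url).path
        ext = posixpath.splitext(path)[1].lower().lstrip(".")
        groups[ext].append(url)
    return dict(groups)


# Standard OCF container file pointing at the package document.
# The OEBPS/content.opf location is an assumed layout.
CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def write_epub_container(target, members):
    """Write an epub-style zip to `target` (a path or a binary file object).

    `members` maps archive paths (e.g. 'OEBPS/article.html') to bytes.
    The 'mimetype' entry is written first and uncompressed, as epub requires.
    The caller must supply OEBPS/content.opf; this sketch does not build it.
    """
    with zipfile.ZipFile(target, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
        for name, data in sorted(members.items()):
            zf.writestr(name, data, compress_type=zipfile.ZIP_DEFLATED)


if __name__ == "__main__":
    # Hypothetical URLs, for illustration only.
    found = [
        "http://example.org/article/paper.PDF",
        "http://example.org/article/fig1.png?size=large",
        "http://example.org/article/fulltext",
    ]
    for ext, items in sorted(group_urls_by_type(found).items()):
        print(ext or "(no extension)", len(items))

    buf = io.BytesIO()
    write_epub_container(buf, {"OEBPS/article.html": b"<html/>"})
    print(zipfile.ZipFile(buf).namelist())
```

Grouping by extension is only a first pass: URLs with no extension land under the empty key and would need a content-type check once downloaded. The readability check on the packaged epub still has to be done in a real reader.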