PLUTo / task_page

These tasks will be added as tickets and issues.

crawler

We shall crawl BMC (complete journal selection), PLOS ONE (complete journal selection), and Molecular Phylogenetics and Evolution (looking for open-access (OA) articles).

Tasks:

  • where to put the output?
  • timing: how often to crawl, and how long to wait between requests?
  • when to start and stop?
  • what problems occur?
  • logging
  • Ross: get phylo articles, figure images and captions from PLOS ONE articles
  • Ross: ask Bath if we can store processed OA data on their infrastructure
  • Ross: get Weka text classification going on sample figure captions
  • Ross & PMR: blog project progress frequently
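
The open questions above (output location, timing, stop condition, problems, logging) could be pinned down in a small crawl loop. The following is a minimal sketch only, not the project's crawler: the function name `crawl`, the logger name, the `page_NNNNN.bin` file naming, and the default delay and page limit are all assumptions made for illustration. The `fetch` callable is passed in so the loop can be tried without network access.

```python
import logging
import time
from pathlib import Path
from typing import Callable, Iterable

log = logging.getLogger("pluto.crawler")  # hypothetical logger name


def crawl(urls: Iterable[str],
          fetch: Callable[[str], bytes],
          output_dir: Path,
          delay_seconds: float = 1.0,
          max_pages: int = 100) -> dict:
    """Fetch each URL in turn, save the body, and log what happened."""
    output_dir.mkdir(parents=True, exist_ok=True)   # where to put output
    stats = {"saved": 0, "failed": 0}
    for index, url in enumerate(urls):
        if index >= max_pages:                      # when to stop
            log.info("stopping: reached max_pages=%d", max_pages)
            break
        if index > 0:
            time.sleep(delay_seconds)               # timing between requests
        try:
            body = fetch(url)
        except Exception as exc:                    # record the problem, keep going
            log.warning("failed %s: %s", url, exc)
            stats["failed"] += 1
            continue
        target = output_dir / f"page_{index:05d}.bin"  # assumed naming scheme
        target.write_bytes(body)
        log.info("saved %s -> %s", url, target)
        stats["saved"] += 1
    return stats
```

In real use, `fetch` might be a thin wrapper such as `lambda u: urllib.request.urlopen(u).read()`, and `logging.basicConfig(level=logging.INFO)` would make the log messages visible. The counts returned in `stats` give a first answer to "what problems occur?".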

scraper

  • need to locate URLs for as many file types as possible
  • download files
  • transform PDF -> HTML + SVG + PNG
  • package as EPUB [check readability]
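
The first scraper step, locating URLs for different file types, might look roughly like the sketch below. It uses only the Python standard library and groups links found on an article page by file extension. The names `LinkCollector`, `urls_by_type`, and the `WANTED` set of extensions are hypothetical, chosen for illustration; real publisher pages may serve files from URLs with no extension, which this approach would miss.

```python
from collections import defaultdict
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

# Assumed set of interesting file types; extend as needed.
WANTED = {".pdf", ".xml", ".html", ".png", ".jpg", ".svg"}


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href and src attributes."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def urls_by_type(html, base_url):
    """Return {extension: [urls]} for links whose extension is in WANTED."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    grouped = defaultdict(list)
    for link in collector.links:
        # Use only the path part so query strings do not hide the extension.
        suffix = PurePosixPath(urlparse(link).path).suffix.lower()
        if suffix in WANTED:
            grouped[suffix].append(link)
    return dict(grouped)
```

The resulting lists could then be handed to the download step, one file type at a time.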

document structuring
