PLUTo / scraping

Scraping is the process of extracting components from a single page. These include:

  • links
  • boilerplate text (possibly with embedded named entities)
  • downloaded supplementary (linked) files
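
As an illustration of the first and last items, the following is a minimal sketch of link extraction using only the Python standard library. The names `LinkCollector`, `extract_links` and `supplementary_links`, and the list of file extensions, are assumptions made for this example and are not part of PLUTo.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from the href attribute of <a> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found on a single page, in document order."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


def supplementary_links(links, extensions=(".pdf", ".zip", ".csv")):
    """Keep links that look like downloadable files (hypothetical extension list)."""
    return [url for url in links if url.lower().endswith(extensions)]
```

The filtered list could then be passed to whatever download step is in use; choosing by file extension is only a heuristic and will miss files served from URLs without one.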

Scraping is normally done on HTML, possibly after cleaning or tidying it. Techniques include:

  • regular expressions
  • XPath run on well-formed XHTML
  • bespoke code
  • machine learning and pattern recognition
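
The first two techniques can be combined, as in the sketch below: an XPath expression selects a component from well-formed XHTML, and a regular expression then pulls a structured value out of its text. The sample page, the `abstract` class name and the "Received ..." date pattern are invented for illustration. Note that `xml.etree.ElementTree` supports only a limited subset of XPath, so a fuller XPath engine may be needed for more complex expressions.

```python
import re
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Invented sample page; real input would be tidied into well-formed XHTML first.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="abstract"><p>Received 12 March 2010</p></div>
    <div class="footer"><p>Copyright notice</p></div>
  </body>
</html>"""

# Hypothetical pattern for a "Received <day> <month> <year>" line.
DATE_RE = re.compile(r"Received\s+(\d{1,2})\s+([A-Z][a-z]+)\s+(\d{4})")


def text_by_class(xhtml, class_name):
    """Return the text of every <div> whose class attribute equals class_name."""
    root = ET.fromstring(xhtml)
    path = ".//{%s}div[@class='%s']" % (XHTML_NS, class_name)
    return ["".join(div.itertext()).strip() for div in root.findall(path)]


def received_date(text):
    """Return (day, month, year) as strings, or None if the pattern is absent."""
    match = DATE_RE.search(text)
    return match.groups() if match else None
```

Here `text_by_class(SAMPLE, "abstract")` selects the component and `received_date` is applied to each string it returns.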

Part of the skill is recognising standard or repeated components and perhaps guessing how they were created. Scraping is most suitable for machine-generated pages. It can fail on unusual language and may stop working completely if the publisher changes their template.
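
Because a template change can silently produce empty results, one possible safeguard is to check that the expected components were actually found and to fail loudly otherwise. The sketch below assumes the scraper returns a dictionary of components; the names `TemplateMismatch` and `check_scrape` and the required keys are hypothetical.

```python
class TemplateMismatch(Exception):
    """Raised when expected components are missing from a scraped page."""


def check_scrape(result, required=("title", "links")):
    """Return result unchanged, or raise if any required component is empty."""
    missing = [key for key in required if not result.get(key)]
    if missing:
        raise TemplateMismatch("missing components: " + ", ".join(missing))
    return result
```

A run of such failures across pages from the same publisher suggests that the template has changed and the scraper needs updating, rather than that individual pages are unusual.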