Most publications have a subset of the following:

  • Tables of contents. This may be for the current issue only, or may be a hierarchical list of journals / issues / articles
  • splash/landing page for a single article. These are usually in HTML, which is often not well formed.
  • PDF for full text. BioMedCentral has a "provisional" PDF, which has a simpler format than the final typeset version. It may be worth capturing and keeping.
  • XML (rare in toll-access (TA) publishers except behind the paywall)
  • epub (used by BioMedCentral)
  • images linked from text. These include the following image types:
      1. PNG or GIF
      2. JPG
      3. SVG, TIFF (these are rarer)
  • supplemental files. These are usually outside the paywall (if there is one). They include:
      1. PDF, which may be text, figures or tabular data. These files are not trivially parsable
      2. DOC(x)
      3. TXT
      4. CSV (simple tables)
      5. XLS(x) spreadsheet
      6. domain-specific (NexML, FASTA, Newick, etc.). There are many of these and normally a specific reader/viewer is required.
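
As a rough illustration of how the components above might be identified, the following Python sketch collects the links from a landing page and sorts them by file extension. It uses only the standard library; html.parser is tolerant of HTML that is not well formed. The names EXTENSION_KINDS, classify, LinkCollector and components, the category labels, and the extension-to-category mapping are all assumptions made for this example, not part of any existing tool. Real publisher sites often serve files from URLs without a telling extension, so a working scraper would probably also need to check the Content-Type header.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to the component types listed above.
# Domain-specific formats (NexML, FASTA, Newick, ...) are deliberately left out:
# their extensions vary, and each needs its own reader anyway.
EXTENSION_KINDS = {
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".xml": "xml",
    ".epub": "epub",
    ".png": "image", ".gif": "image", ".jpg": "image", ".jpeg": "image",
    ".svg": "image", ".tif": "image", ".tiff": "image",
    ".doc": "document", ".docx": "document", ".txt": "document",
    ".csv": "table", ".xls": "table", ".xlsx": "table",
}


def classify(url):
    """Return a coarse component type for a URL, based only on its extension."""
    path = urlparse(url).path  # drops any query string or fragment
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_KINDS.get(suffix, "other")


class LinkCollector(HTMLParser):
    """Collect href/src attribute values, even from badly formed HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def components(html_text):
    """Return (link, component type) pairs for every link found in the page."""
    collector = LinkCollector()
    collector.feed(html_text)
    collector.close()
    return [(link, classify(link)) for link in collector.links]


if __name__ == "__main__":
    # A deliberately sloppy landing page: unclosed <p> and <a>, no closing tags.
    page = (
        '<html><body><p>Full text: <a href="/content/pdf/article.pdf">PDF'
        '<img src="fig1.PNG"><a href="supp/table1.csv?download=1">data</a>'
    )
    for link, kind in components(page):
        print(kind, link)
```

Anything classified as "other" would need a closer look, since that bucket includes both the domain-specific supplemental files and links that simply carry no extension.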