Most publications have a subset of the following:
- Tables of contents. This may cover the current issue only, or may be a hierarchical list of journals / issues / articles.
- Splash/landing page for a single article. These are usually in HTML, often not well formed.
- PDF for full text. BioMedCentral has a "provisional" PDF, which has a simpler format than the final typeset version. It may be worth capturing and keeping.
- XML (rare in TA publishers except behind paywall)
- EPUB (used by BioMedCentral)
- Images linked from text. These include image types:
  1. PNG or GIF
  2. JPG
  3. SVG, TIFF (these are rarer)
- Supplemental files. These are usually outside the paywall (if there is one). They include:
  1. PDF, which may be text, figures or tabular data. These files are not trivially parsable.
  2. DOC(x)
  3. TXT
  4. CSV (simple tables)
  5. XLS(x) spreadsheet
  6. Domain-specific (NexML, FASTA, Newick, etc.). There are many of these and normally a specific reader/viewer is required.
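
As a rough illustration of how a scraper might sort downloaded links into the categories listed above, here is a minimal Python sketch that classifies a URL by its file extension. The function name `classify_resource`, the category labels, and the extension table are all assumptions made for this example; in particular, the extensions given for the domain-specific formats (`.nexml`, `.fasta`, `.nwk`) are common conventions, not something every publisher follows.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical mapping from file extension to a resource category.
# The labels and the extension choices are illustrative assumptions.
EXTENSION_KINDS = {
    ".html": "html", ".htm": "html",
    ".pdf": "pdf",
    ".xml": "xml",
    ".epub": "epub",
    ".png": "image", ".gif": "image",
    ".jpg": "image", ".jpeg": "image",
    ".svg": "image", ".tif": "image", ".tiff": "image",
    ".doc": "document", ".docx": "document",
    ".txt": "text",
    ".csv": "table",
    ".xls": "spreadsheet", ".xlsx": "spreadsheet",
    ".nexml": "domain-specific",
    ".fasta": "domain-specific",
    ".nwk": "domain-specific",
}


def classify_resource(url: str) -> str:
    """Return a category for a URL or file name, based only on its extension."""
    # Use only the path component so query strings and fragments are ignored.
    path = urlparse(url).path
    suffix = PurePosixPath(path).suffix.lower()
    return EXTENSION_KINDS.get(suffix, "unknown")


if __name__ == "__main__":
    for link in (
        "https://example.org/articles/fig1.PNG?size=large",
        "supp/table_s1.xlsx",
        "https://example.org/article/123",
    ):
        print(link, "->", classify_resource(link))
```

The extension alone cannot tell a full-text PDF from a supplemental PDF, and landing pages often have no extension at all, so a real scraper would probably also need the link context or the HTTP `Content-Type` header.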