BibSonomy / development / modules / scraper / Scraper

general information

A Scraper extracts publication metadata from a web page.

As most websites are written differently, we need to write a scraper for each site. A list of supported sites is available at http://www.bibsonomy.org/scraperinfo.

On sites for which we have created a scraper, you only need to click the "post publication" bookmarklet. The scraper then extracts the metadata, and all that is left for you is to enter tags and save the post.

If you found a web page that is currently not supported by the scrapers, i.e., its publication metadata is not automatically extracted upon pressing the "post publication" bookmarklet in your web browser, please report it on the issue tracker.

Please help to extend this web page and document scraper development - add and update text!

Resources

How does it work?

  1. the user presses the "post publication" bookmarklet/add-on button on a web page
  2. the URL and the selected text (if any) are sent to BibSonomy
  3. BibSonomy passes this input to the scraper chain
  4. one of the scrapers feels "responsible", i.e., is able to extract BibTeX metadata from the URL (or the selected text)
  5. that scraper extracts the metadata
  6. the BibTeX metadata is parsed and filled into the web forms, which are returned to the user

Types of Scrapers

URL-based

  • sub-classes of AbstractUrlScraper or GenericBibTeXURLScraper
  • Given a URL, they find a way to get the metadata of the publication described by that URL. Sometimes this is simple, sometimes not (for Springer and ACM it's not so simple).

GenericBibTeXURLScraper

For some pages, writing a scraper is very simple: given the URL of a publication's web page, the scraper can create another URL that points to the BibTeX metadata for that publication. E.g., the ApsScraper receives the URL http://physrev.physiology.org/content/91/4/1281.short, extracts the article id from it, and constructs and returns the URL http://physrev.physiology.org/citmgr?type=bibtex&gca=physrev%3B91%2F4%2F1281, which points to the article's BibTeX metadata (which is then scraped by the superclass AbstractUrlScraper).

  • Subclasses of GenericBibTeXURLScraper must implement the method getBibTeXURL().
  • Please use this kind of scraper whenever possible!
  • Since the class is rather new, check whether existing scrapers can be migrated to it.
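As a minimal sketch of this idea (not the actual ApsScraper code), the following stand-alone class derives the BibTeX export URL from the example URL above. The class name, the regular expression, and the plain-string signature of getBibTeXURL are assumptions made for illustration; the real method in GenericBibTeXURLScraper may look different.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; it does not extend the real GenericBibTeXURLScraper.
class BibTeXUrlSketch {
    // matches e.g. http://physrev.physiology.org/content/91/4/1281.short
    // group 1: journal (host prefix), groups 2-4: the three numeric parts of the article id
    private static final Pattern ARTICLE_URL = Pattern.compile(
            "^https?://(\\w+)\\.physiology\\.org/content/(\\d+)/(\\d+)/(\\d+)");

    /** Returns the URL of the BibTeX export, or null if the URL is not recognized. */
    static String getBibTeXURL(final String url) {
        final Matcher m = ARTICLE_URL.matcher(url);
        if (!m.find()) {
            return null;
        }
        // article id as it appears in the export URL, e.g. "physrev;91/4/1281"
        final String gca = m.group(1) + ";" + m.group(2) + "/" + m.group(3) + "/" + m.group(4);
        return "http://" + m.group(1) + ".physiology.org/citmgr?type=bibtex&gca="
                + URLEncoder.encode(gca, StandardCharsets.UTF_8);
    }
}
```

No page content has to be downloaded to find the export link, which is why this kind of scraper is cheap and comparatively robust against layout changes.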

Generic

These scrapers are not made for a particular URL (or web site) but can extract metadata from any web site, as long as it has some "features" that the scraper can detect. Since we assume that our more specific scrapers (i.e., the URLScrapers) extract better metadata than the generic scrapers, the generic scrapers appear after the URL scrapers in the scraper chain.

  • DOIScraper: Checks if the selected text (!) contains a DOI and if so, resolves that DOI to a URL and forwards the URL to the remaining scrapers. (Hence, this scraper is the first in the scraper chain).
  • ContentNegotiationDOIScraper: If the URL is a DOI URL or the selected text contains a DOI, the DOIScraper redirects the URL and the URLScrapers try to get information from the redirected page. If that fails, the ContentNegotiationDOIScraper sends a request, trying to get BibTeX directly from the DOI by content negotiation.
  • UnAPIScraper: looks for unAPI metadata within the web page and if it finds it, extracts it.
  • HighwireScraper: generic scraper for Highwire-based web sites (do these still exist/work?).
  • SnippetScraper: Checks if the selected text (!) contains BibTeX and if so, extracts it.
  • CoinsScraper: looks for COinS metadata within the web page and if it finds it, extracts it.
  • ISBNScraper: Checks if the selected text (!) contains an ISBN and gets the metadata for the ISBN from WorldCat.
  • BibtexScraper: Checks if the web page (!) contains BibTeX and extracts it.
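The "feature detection" these scrapers perform can be sketched for the DOI case as follows. This is a simplified illustration, not the real DOIScraper: the class name, the method names, the DOI pattern, and the use of https://doi.org/ as resolver prefix are assumptions.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the real DOIScraper.
class DoiSketch {
    // simplified DOI pattern: "10.", a registrant code, a slash, and a suffix without whitespace
    private static final Pattern DOI = Pattern.compile("\\b10\\.\\d{4,9}/\\S+");

    /** Returns the first DOI found in the selected text, if any. */
    static Optional<String> findDoi(final String selectedText) {
        if (selectedText == null) {
            return Optional.empty();
        }
        final Matcher m = DOI.matcher(selectedText);
        if (!m.find()) {
            return Optional.empty();
        }
        // strip punctuation that usually belongs to the surrounding sentence
        return Optional.of(m.group().replaceAll("[.,;)]+$", ""));
    }

    /** Builds the resolver URL that can be forwarded to the remaining scrapers. */
    static Optional<String> toResolverUrl(final String selectedText) {
        return findDoi(selectedText).map(doi -> "https://doi.org/" + doi);
    }
}
```

If no DOI is found, the result is empty and the next scraper in the chain gets its turn.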

The following scrapers have special purposes:

(TODO: describe)

Information Extraction

The IEScraper tries to extract publication metadata from the selected text by employing machine learning (namely the Conditional Random Field implementation of MALLET).

It was described in this blog post.

Scraper Chain

The scrapers are called in a specific order, which must be carefully designed so as not to trigger the "wrong" scraper. The current order is as follows:

The rationale behind that order can be understood by reading the part about generic scrapers.

Most importantly, the most specific scrapers (i.e., URLScrapers) should come first and the least specific ones (i.e., the IEScraper) last. The IEScraper extracts strings from any selected text you give it, but it also has the worst metadata quality.

Scrapers that are not contained in the chain will not be called. URLScrapers should be added to the KDEUrlCompositeScraper.
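The chain behaviour described above can be sketched as follows. The interface and class names are hypothetical and much simpler than the real scraper API; the sketch only shows that scrapers are asked in order and that the first one that returns a result wins.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical, simplified type; the real Scraper interface differs.
interface SketchScraper {
    /** Returns BibTeX if this scraper feels responsible for the input, else empty. */
    Optional<String> scrape(String url, String selectedText);
}

// Hypothetical chain; the order of the list is the order in which scrapers are asked.
class ScraperChainSketch {
    private final List<SketchScraper> scrapers;

    ScraperChainSketch(final List<SketchScraper> scrapersInOrder) {
        this.scrapers = scrapersInOrder;
    }

    /** Asks each scraper in order; the first non-empty result wins. */
    Optional<String> scrape(final String url, final String selectedText) {
        for (final SketchScraper scraper : scrapers) {
            final Optional<String> result = scraper.scrape(url, selectedText);
            if (result.isPresent()) {
                return result;
            }
        }
        // no scraper felt responsible
        return Optional.empty();
    }
}
```

Because the first match ends the search, putting a very general scraper early in the list would hide all more specific scrapers behind it.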

Uses of the Scrapers

Posting Publications

on the /editPublication page

see this blog post

Posting a DOI/ISBN/etc.

see this blog post

Posting Bookmarks

see this blog post

Scraping Service

see this blog post

Scraper Info

http://www.bibsonomy.org/scraperinfo

JabRef

Scraper Development

Guidelines

General

  • Try to use the existing super classes, if possible.
  • Check whether the web page supports one of our existing generic scrapers (unAPI, COinS, etc.) or another standard which we could support.

Efficiency

  • All extraction patterns should be private static final Pattern class variables.
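A short sketch of this guideline, with a hypothetical class, pattern, and method name: the pattern is compiled once when the class is loaded instead of on every call.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical scraper fragment; the pattern and names are for illustration only.
class PatternSketch {
    // compiled once when the class is loaded and shared by all calls
    private static final Pattern ARTICLE_ID = Pattern.compile("/content/(\\d+/\\d+/\\d+)");

    /** Returns the article id contained in the URL, or null if there is none. */
    static String extractArticleId(final String url) {
        // a Matcher is created per call; only the compiled Pattern is shared
        final Matcher m = ARTICLE_ID.matcher(url);
        return m.find() ? m.group(1) : null;
    }
}
```

Pattern instances are immutable and safe to share between threads, whereas Matcher instances are not, so only the Pattern belongs in a static field.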

Accessing the Web

Scrapers naturally have to access other web sites. This should be done using the static methods of the WebUtils class, which use the Apache Commons HttpClient to access the web. The preferred way is to call WebUtils.getContentAsString(url) and work with the returned content. However, some pages require cookies or POST content, and this is where it gets ugly. :-(

Cookies

Typically, one has to call one page to get the cookie and then call the page with the data we want, sending the cookie along (see this page - which you can't access without a cookie).

Therefore, the method WebUtils.getHttpClient() provides a correctly configured HttpClient. Never create an instance of the HttpClient on your own, since it will not be correctly configured. (We also want to bundle all web access in WebUtils to keep an overview of which classes access the web and to allow easier refactoring.)

Post Content
Redirects

TODO

Which approach is better?

In order of preference:

  1. build the link based on ids in the scraping URL
  2. download the page content to get a link to a bibtex/endnote/… file

Helper Classes

Converter

Not all publishers provide BibTeX, so sometimes we must convert the metadata we get using one of the following converters. These converters are contained in the package org.bibsonomy.scraper.converter.

This list of BibTeX types and fields helps you to write a converter that correctly fills all required and optional fields.

EndnoteToBibtexConverter

The format looks similar to RIS. Example:

%A Knox, John A.
%T Recent and Future Trends in U.S. Undergraduate Meteorology Enrollments, Degree Recipients, and Employment Opportunities
%0 Journal Article
%D 2008
%J Bulletin of the American Meteorological Society
%P 873-883
%V 89
%N 6
%U http://dx.doi.org/10.1175%2F2008BAMS2375.1
%8 June 01, 2008
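How such a tagged format maps to BibTeX can be sketched as follows. This is a strongly simplified illustration and not the real EndnoteToBibtexConverter: the class name, the tag-to-field table, the fixed entry type, and the placeholder key "noKey" are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical, much simplified sketch; not the real EndnoteToBibtexConverter.
class EndnoteSketch {
    // Endnote tag -> BibTeX field, covering only some tags from the example above
    private static final Map<String, String> FIELDS = new LinkedHashMap<>();
    static {
        FIELDS.put("%A", "author");
        FIELDS.put("%T", "title");
        FIELDS.put("%J", "journal");
        FIELDS.put("%D", "year");
        FIELDS.put("%V", "volume");
        FIELDS.put("%N", "number");
        FIELDS.put("%P", "pages");
        FIELDS.put("%U", "url");
    }

    static String toBibtex(final String endnote) {
        final Map<String, String> values = new LinkedHashMap<>();
        for (final String line : endnote.split("\\r?\\n")) {
            if (line.length() < 4) {
                continue; // need a tag, a blank, and at least one character
            }
            final String field = FIELDS.get(line.substring(0, 2));
            if (field == null) {
                continue; // tag not handled by this sketch
            }
            // several %A lines are joined with " and ", as BibTeX expects
            values.merge(field, line.substring(3).trim(), (a, b) -> a + " and " + b);
        }
        // the entry type is fixed here; a real converter would derive it from %0
        final StringBuilder bib = new StringBuilder("@article{noKey,\n");
        values.forEach((k, v) -> bib.append("  ").append(k).append(" = {").append(v).append("},\n"));
        return bib.append("}\n").toString();
    }
}
```

The fields are emitted in the order in which they first appear in the input, and tags that the table does not know (such as %0 and %8 in the example) are silently ignored in this sketch.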

RisToBibtexConverter

Example input:

TY  - JOUR
AU  - Gosse, Philippe
TI  - Regression of Left Ventricular Hypertrophy: Should We Echo Echo[quest]
JA  - Am J Hypertens
PY  - 2008/03/18/print
VL  - 21
IS  - 4
SP  - 373
EP  - 373
PB  - American Journal of Hypertension, Ltd.
SN  - 0895-7061
UR  - http://dx.doi.org/10.1038/ajh.2008.9
ER  -

OAIConverter

PicaToBibtexConverter

Pica is the format of the OPAC of many university libraries.

CslToBibtexConverter

Example input (the CSL data is structured as JSON):

{
   "authors":[
       {"forename":"Christoph","surname":"Schmitz"},
       {"forename":"Andreas","surname":"Hotho","profile":{}},         
       {"forename":"Robert","surname":"J\u00e4schke"},
       {"forename":"Gerd","surname":"Stumme"},  
       {"forename":"Dominik","surname":"Benz"},
       {"forename":"Miranda","surname":"Grahl"},
       {"forename":"Beate","surname":"Krause"}
   ],
   "editors":[
       {"forename":"Andreas","surname":"Blumauer"},
       {"forename":"Tassilo","surname":"Pellegrini"}
   ],
   "title":"Social Bookmarking am Beispiel BibSonomy",
   "identifiers":{"eid":"2-s2.0-84864170657",
       "issn":"1439-3107",
       "isbn":"978-3-540-72215-1",   
       "sgr":"84864170657",
       "doi":"10.1007\/978-3-540-72216-8"
   },
   "type":"book_section",
   "published_in":"Social Semantic Web",
   "publisher":"Springer",
   "year":2009,
   "oa_journal":false,
   "website":"http:\/\/dx.doi.org\/10.1007\/978-3-540-72216-8_18",
   "pages":"363-391",
   "url":"http:\/\/www.mendeley.com\/catalog\/social-bookmarking-beispiel-bibsonomy\/", 
   "path":"\/catalog\/social-bookmarking-beispiel-bibsonomy\/",
   "canonicalId":"8bae9c19-d394-3df4-a52d-b7cac3595d1e"
}

Scraper Testing

For each scraper, at least one JUnit test should exist. Better: a test should exist for every URL with which the scraper once had particular problems.

All tests are run every morning and an e-mail with the test results is sent to the scraper developers.

Most test classes must be flagged with the

@Category(RemoteTest.class)

annotation. Otherwise, every automatic build would run them, which would a) make the build take very long and b) get us into trouble with the publishers for requesting their web pages so often. To run such tests manually with Maven, you must activate the corresponding profile: on the command line, append -P remoteTests to the mvn command; in Eclipse, the m2e plugin has a text field for setting the profiles to use (just insert remoteTests). You can run single tests as usual in Eclipse.

Writing Scraper Tests

A scraper test for a specific scraper and URL consists of the following parts:

  1. A BibTeX file in src/test/resources/org/bibsonomy/scraper/data/ that contains the BibTeX that corresponds to the publication that is shown on the URL.
  2. A Java test file in src/test/java/org/bibsonomy/scraper/ that contains a method for the test. You can use our assertScraperResult method in org.bibsonomy.scraper.junit.RemoteTestAssert to compare the file with the scraper result.

Hints:

  • The BibTeX file should be committed to CVS in binary mode (not ASCII) to prevent operating-system-specific encoding issues.
  • The BibTeX file should contain UNIX-style line separators (\n, not \r\n), so make sure that your line separators are set correctly if you use Windows.
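To illustrate why the line separators matter: two BibTeX strings that differ only in their line separators are not equal when compared as plain strings. The helper below is hypothetical and is not part of RemoteTestAssert; it merely sketches a normalization one could apply before comparing.

```java
// Hypothetical helper for illustration; it is not part of RemoteTestAssert.
class BibtexCompareSketch {
    /** Converts Windows and old Mac line separators to the UNIX line feed. */
    static String normalizeLineSeparators(final String text) {
        // replace CR LF first, then any remaining single CR
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Compares expected and scraped BibTeX, ignoring line separator differences. */
    static boolean sameBibtex(final String expected, final String scraped) {
        return normalizeLineSeparators(expected).trim()
                .equals(normalizeLineSeparators(scraped).trim());
    }
}
```

Keeping the expected file in UNIX format in the first place avoids having to rely on such a normalization.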

Disable Scraper Tests

If a scraper is temporarily disabled, its tests should be disabled too, to prevent them from failing (an @Ignore annotation on the class or method is not sufficient). This is possible by setting the value of the Enabled tag of the corresponding test element in UnitTestData.xml to false (and to true to enable them again).

Fixing Scraper Tests

There are different types of errors, and each one requires a different strategy to repair the test. The general guideline is: do not just change the BibTeX data without thinking about it. It does not help to change the data in order to "fix" the test if the data itself is then wrong or broken. Our goal should be to extract the best possible data.

Web Page is Down

Try to find out whether the web page is permanently down. If so, monitor the web page for two weeks and, if it stays down, disable the scraper (comment it out in the scraper chain).

Returned BibTeX has Changed

Try to find the difference. Is the new BibTeX correct? Does it refer to the same entry as the one before? If unsure, ask Robert Jäschke.

The following list contains some typical types of changes and questions you should ask yourself when inspecting the data:

  1. changes within one BibTeX field
       • Was it just a small error correction or completely new data?
  2. addition of a BibTeX field
  3. removal of a BibTeX field
       • Why has the field been removed? Did it contain errors? Is the removed information important for the entry? Could we get the data otherwise?
  4. changed BibTeX key
  5. changed BibTeX entry type
  6. changes in several BibTeX fields
  7. a completely different entry
       • Normally, this should not happen - why did it happen? Maybe the URL no longer works?
  8. empty BibTeX field removed or added
       • Since this does not change the content of the entry, the change can safely be adopted in the expected BibTeX.

Broken BibTeX

Special care must be taken when the BibTeX returned by the publisher is broken (e.g., BibTeX key is missing, year is missing, etc.). Broken BibTeX is not supported by BibSonomy, hence, the scraper should repair it, if it is easily possible. The rule is: Try to use a simple and robust heuristic to fix the BibTeX.

Example: when the BibTeX key is missing, the following could help:

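// fix missing BibTeX key: the opening brace is directly followed by a line break
final int indexOfBrace = bibTeXResult.indexOf('{') + 1;
// indexOfBrace is 0 if there is no '{' at all; only repair when a brace was found
if (indexOfBrace > 0 && indexOfBrace == bibTeXResult.indexOf('\n')) {
   // insert a placeholder key right after the opening brace
   bibTeXResult = bibTeXResult.substring(0, indexOfBrace) + "noKey," + bibTeXResult.substring(indexOfBrace);
}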

Test URL Has Changed

Find the new URL for the same article (e.g., search for the title/authors on the publisher's web page) and then change the URL. Does this change the returned BibTeX? Why?

Web Page Layout Has Changed

This most often requires fixing the scraper.
