Scraper
General Information
A Scraper extracts publication metadata from a web page.
Since every website is structured differently, we need to write a separate scraper for each site. A list of supported sites is available at http://www.bibsonomy.org/scraperinfo.
On sites for which a scraper exists, you only need to click the "post publication" bookmarklet. The scraper then extracts the information, and you only need to enter tags and save the post.
If you find a web page that is currently not supported by the scrapers, i.e., its publication metadata is not automatically extracted when pressing the "post publication" bookmarklet in your web browser, please report it on the issue tracker.
Please help extend this page and document scraper development: add and update text!
Resources
How does it work?
- user presses the "post publication" bookmarklet/add-on-button on a web page
- URL + selected text (if it exists) is sent to BibSonomy
- BibSonomy sends input to the scraper chain
- one of the scrapers feels "responsible", i.e., it is able to extract BibTeX metadata from the URL (or the selected text)
- that scraper extracts the metadata
- the BibTeX metadata is parsed and filled into the web forms which are returned to the user.
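In code, the chain logic might look roughly like this (a minimal sketch; the interface and method names are simplified assumptions based on the description above, not the actual API):
#!java
// Minimal sketch of the scraper chain; names are simplified assumptions.
for (final Scraper scraper : scraperChain) {
    // a scraper that feels "responsible" extracts the metadata and
    // stores it in the scraping context
    if (scraper.scrape(scrapingContext)) {
        return scrapingContext.getBibtexResult();
    }
}
return null; // no scraper felt responsible for this input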
Types of Scrapers
URL-based
- sub-classes of AbstractUrlScraper or GenericBibTeXURLScraper
- Given a URL, they find a way to get the metadata of the publication described by that URL. Sometimes this is simple, sometimes not (for Springer and ACM it's not so simple).
GenericBibTeXURLScraper
For some pages, writing scrapers is very simple (as this commit shows): given the URL of a publication's web page, they can construct another URL that points to the BibTeX metadata for that publication. E.g., the ApsScraper receives the URL http://physrev.physiology.org/content/91/4/1281.short, extracts the article id from it, and constructs and returns the URL http://physrev.physiology.org/citmgr?type=bibtex&gca=physrev%3B91%2F4%2F1281, which points to the article's BibTeX metadata (which is then scraped by the superclass AbstractUrlScraper).
- Subclasses of GenericBibTeXURLScraper must implement the method getBibTeXURL().
- Please use this kind of scraper whenever possible!
- Since the class is rather new, check whether existing scrapers can be migrated.
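A hedged sketch of such a scraper (class name, pattern, and URLs are made up for illustration; the exact signature of getBibTeXURL() may differ):
#!java
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example scraper; only the idea of getBibTeXURL() is taken
// from this page, the exact signature is an assumption.
public class ExampleJournalScraper extends GenericBibTeXURLScraper {
    private static final Pattern ID_PATTERN = Pattern.compile("/content/(\\d+/\\d+/\\d+)");

    @Override
    protected String getBibTeXURL(final URL url) {
        final Matcher m = ID_PATTERN.matcher(url.toString());
        if (m.find()) {
            // construct the URL that points to the article's BibTeX metadata
            return "http://example.org/citmgr?type=bibtex&id=" + m.group(1);
        }
        return null; // not responsible for this URL
    }
}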
Generic
These scrapers are not made for a particular URL (or web site) but can extract metadata from any web site, as long as it has some "features" that the scraper can detect. Since we assume that our more specific scrapers (i.e., the URLScrapers) extract better metadata than the generic scrapers, the generic scrapers come after the URL scrapers in the scraper chain.
- DOIScraper: Checks if the selected text (!) contains a DOI and if so, resolves that DOI to a URL and forwards the URL to the remaining scrapers. (Hence, this scraper is the first in the scraper chain).
- ContentNegotiationDOIScraper: If the URL is a DOI URL or the selected text contains a DOI, the DOIScraper redirects the URL and the URLScrapers try to get information from the redirected page. If that fails, the ContentNegotiationDOIScraper sends a request, trying to get BibTeX directly from the DOI via content negotiation.
- UnAPIScraper: looks for unAPI metadata within the web page and, if found, extracts it.
- HighwireScraper: generic scraper for Highwire-based web sites (do these still exist/work?).
- SnippetScraper: Checks if the selected text (!) contains BibTeX and if so, extracts it.
- CoinsScraper: looks for COinS metadata within the web page and, if found, extracts it.
- ISBNScraper: Checks if the selected text (!) contains an ISBN and gets the metadata for the ISBN from WorldCat.
- BibtexScraper: Checks if the web page (!) contains BibTeX and extracts it.
The following scrapers have special purposes:
- CitationManagerScraper: (TODO: describe)
- EprintScraper: checks whether an eprint BibTeX link exists and, if found, extracts the BibTeX via that link
- IEScraper: see next section
Information Extraction
The IEScraper tries to extract publication metadata from the selected text by employing machine learning (namely the Conditional Random Field implementation of MALLET).
It was described in this blog post.
Scraper Chain
The scrapers are called in a specific order, which must be carefully designed in order not to trigger the "wrong" scraper. The current order is as follows:
- DOIScraper
- KDEUrlCompositeScraper
- CiteBaseScraper
- OpacScraper
- IEEEXploreScraper
- SpringerLinkScraper
- ... (around 90 more URLScrapers)
- ContentNegotiationDOIScraper
- EPrintScraper
- UnAPIScraper
- HighwireScraper
- SnippetScraper
- CoinsScraper
- ISBNScraper
- BibtexScraper
- IEScraper
The rationale behind that order can be understood by reading the part about generic scrapers.
Most importantly, the most specific scrapers (i.e., URLScrapers) should come first and the least specific ones (i.e., the IEScraper) last (the IEScraper extracts strings from any selected text you give it, but it also has the worst metadata quality).
Scrapers that are not contained in the chain will not be called. URLScrapers should be added to the KDEUrlCompositeScraper.
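How a new URL scraper is registered might look roughly like this (hypothetical sketch; the actual registration code in KDEUrlCompositeScraper may differ):
#!java
// Hypothetical sketch: the composite scraper delegates to its children
// in order; addScraper() and MyNewUrlScraper are assumptions.
final KDEUrlCompositeScraper composite = new KDEUrlCompositeScraper();
composite.addScraper(new MyNewUrlScraper());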
Uses of the Scrapers
Posting Publications
on the /editPublication page
see this blog post
Posting a DOI/ISBN/etc.
see this blog post
Posting Bookmarks
see this blog post
Scraping Service
see this blog post
Scraper Info
http://www.bibsonomy.org/scraperinfo
JabRef
Scraper Development
Guidelines
General
- Try to use the existing super classes, if possible.
- Check whether the web page supports one of our existing generic scrapers (unAPI, COinS, etc.) or another standard that we could support.
Efficiency
- All extraction patterns should be private static final Pattern class variables.
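For example (illustrative sketch; the pattern itself is made up):
#!java
// compile each extraction pattern once per class instead of on every call
private static final Pattern TITLE_PATTERN = Pattern.compile("<title>(.+?)</title>");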
Accessing the Web
Scrapers naturally have to access other web sites. This should be done using the static methods of the WebUtils class, which use the Apache Commons HttpClient to access the web. The preferred way is to call WebUtils.getContentAsString(url) and get the content. However, some pages require cookies or POST content, and this is where it gets ugly. :-(
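The simple case looks like this (sketch; exception handling omitted):
#!java
// fetch the page content via the central WebUtils helper
final String pageContent = WebUtils.getContentAsString(url);
// ... then apply the extraction patterns to pageContent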
Cookies
Typically, one has to call one page to get the cookie and then request the page with the desired data, including the cookie (see this page, which you can't access without a cookie).
Therefore, the method WebUtils.getHttpClient() provides a correctly configured HttpClient. Never create an instance of the HttpClient on your own, since it will not be correctly configured. (And we want to bundle all web access in WebUtils, both to get an overview of which classes access the web and to allow easier refactoring.)
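A hedged sketch of this cookie dance with Commons HttpClient 3.x (the URLs are hypothetical; HttpClient and GetMethod come from org.apache.commons.httpclient):
#!java
// the first request obtains the cookie; the second request (with the
// same client instance) sends it automatically
final HttpClient client = WebUtils.getHttpClient(); // correctly configured instance
client.executeMethod(new GetMethod(entryPageUrl));  // entryPageUrl is hypothetical
final GetMethod dataRequest = new GetMethod(dataPageUrl); // dataPageUrl is hypothetical
client.executeMethod(dataRequest);
final String content = dataRequest.getResponseBodyAsString();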
Post Content
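A hedged sketch of a POST request with Commons HttpClient 3.x (the URL and parameter names are hypothetical):
#!java
// hedged sketch: request a BibTeX export via POST; exportUrl and the
// parameter names are hypothetical
final PostMethod post = new PostMethod(exportUrl);
post.addParameter("format", "bibtex");
WebUtils.getHttpClient().executeMethod(post);
final String bibtex = post.getResponseBodyAsString();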
Redirects
TODO
Which approach is better? The following is the preferred order:
- build a link based on IDs in the scraping URL
- download the page content to get a link to a BibTeX/EndNote/… file
Helper Classes
Converter
Not all publishers provide BibTeX, so sometimes we must convert the metadata we get using one of the following converters. These converters are contained in the package org.bibsonomy.scraper.converter.
This list of BibTeX types and fields helps you to write a converter that correctly fills all required and optional fields.
EndnoteToBibtexConverter
The format looks similar to RIS. Example:
%A Knox, John A.
%T Recent and Future Trends in U.S. Undergraduate Meteorology Enrollments, Degree Recipients, and Employment Opportunities
%0 Journal Article
%D 2008
%J Bulletin of the American Meteorological Society
%P 873-883
%V 89
%N 6
%U http://dx.doi.org/10.1175%2F2008BAMS2375.1
%8 June 01, 2008
RisToBibtexConverter
Example input:
TY - JOUR
AU - Gosse, Philippe
TI - Regression of Left Ventricular Hypertrophy: Should We Echo Echo[quest]
JA - Am J Hypertens
PY - 2008/03/18/print
VL - 21
IS - 4
SP - 373
EP - 373
PB - American Journal of Hypertension, Ltd.
SN - 0895-7061
UR - http://dx.doi.org/10.1038/ajh.2008.9
ER -
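A hedged usage sketch (we assume the converter offers a string-to-string conversion method such as toBibtex(); the exact API may differ):
#!java
// hedged sketch; toBibtex(String) is an assumption about the converter API
final RisToBibtexConverter converter = new RisToBibtexConverter();
final String bibtex = converter.toBibtex(risRecord); // risRecord holds the RIS input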
OAIConverter
PicaToBibtexConverter
Pica is the format of the OPAC of many university libraries.
CslToBibtexConverter
Example input (CSL data is structured as JSON):
{ "authors":[ {"forename":"Christoph","surname":"Schmitz"}, {"forename":"Andreas","surname":"Hotho","profile":{}}, {"forename":"Robert","surname":"J\u00e4schke"}, {"forename":"Gerd","surname":"Stumme"}, {"forename":"Dominik","surname":"Benz"}, {"forename":"Miranda","surname":"Grahl"}, {"forename":"Beate","surname":"Krause"} ], "editors":[ {"forename":"Andreas","surname":"Blumauer"}, {"forename":"Tassilo","surname":"Pellegrini"} ], "title":"Social Bookmarking am Beispiel BibSonomy", "identifiers":{"eid":"2-s2.0-84864170657", "issn":"1439-3107", "isbn":"978-3-540-72215-1", "sgr":"84864170657", "doi":"10.1007\/978-3-540-72216-8" }, "type":"book_section", "published_in":"Social Semantic Web", "publisher":"Springer", "year":2009, "oa_journal":false, "website":"http:\/\/dx.doi.org\/10.1007\/978-3-540-72216-8_18", "pages":"363-391", "url":"http:\/\/www.mendeley.com\/catalog\/social-bookmarking-beispiel-bibsonomy\/", "path":"\/catalog\/social-bookmarking-beispiel-bibsonomy\/", "canonicalId":"8bae9c19-d394-3df4-a52d-b7cac3595d1e" }
Scraper Testing
For each scraper, at least one JUnit test should exist. Even better: for every URL the scraper once had particular problems with, a test should exist.
All tests are run every morning and an e-mail with the test results is sent to the scraper developers.
Most test classes must be flagged with the @Category(RemoteTest.class) annotation, because otherwise every automatic build would run them, which would a) make the build time really long and b) get us into trouble with the publishers for calling their web pages so often. To run such a test manually with Maven, you must activate the profile allTests. On the command line, you can activate the profile by appending -P allTests to the mvn command. In Eclipse, the m2e plugin has a text field for setting the profiles to use (just insert allTests). You can run single tests as usual in Eclipse.
Writing Scraper Tests
A scraper test for a specific scraper and URL consists of the following parts:
- A BibTeX file in src/test/resources/org/bibsonomy/scraper/data/ that contains the BibTeX corresponding to the publication shown at the URL.
- A Java test file in src/test/java/org/bibsonomy/scraper/ that contains a method for the test. You can use our assertScraperResult method in org.bibsonomy.scraper.junit.RemoteTestAssert to compare the file with the scraper result.
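A hedged sketch of such a test (URL, file name, and class names are placeholders; the exact signature of assertScraperResult is an assumption):
#!java
import static org.bibsonomy.scraper.junit.RemoteTestAssert.assertScraperResult;

import org.junit.Test;
import org.junit.experimental.categories.Category;

// hedged sketch of a scraper test; the assertScraperResult signature is
// an assumption, URL and file name are placeholders
@Category(RemoteTest.class)
public class ExampleJournalScraperTest {

    @Test
    public void testExampleJournalScraper() {
        // URL, selected text (none), scraper class, expected BibTeX file
        assertScraperResult("http://example.org/content/91/4/1281", null, ExampleJournalScraper.class, "ExampleJournalScraperTest.bib");
    }
}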
Hints:
- The BibTeX file should be committed in binary mode (not ASCII) to CVS to prevent operating system specific encoding issues.
- The BibTeX file should contain UNIX-conformant line separators (\n, not \r\n), so make sure that your line separators are set correctly if you use Windows.
Disable Scraper Tests
If a scraper is temporarily disabled, its tests should be disabled, too, to prevent them from failing (the @Ignore annotation on the class or method is not sufficient). This is possible by setting the value of the Enabled tag of the corresponding test element in UnitTestData.xml to false (and back to true to enable them again).
Fixing Scraper Tests
There are different types of errors, and each one requires a different strategy to repair the test. The general guideline is: do not just change the BibTeX data without thinking about it. It is not helpful if we just change the data in order to "fix" the test but the data itself is then wrong or broken. Our goal should be to extract the best possible data.
Web Page is Down
Try to find out whether the web page is permanently down. If so, monitor the web page for two weeks, and if it stays down, disable the scraper (comment it out in the scraper chain).
Returned BibTeX has Changed
Try to find the difference. Is the new BibTeX correct? Does it refer to the same entry as the one before? If unsure, ask @jaeschke.
The following list contains some typical types of changes and questions you should ask yourself when inspecting the data:
- changes within one BibTeX field
    * Was it just a small error correction or completely new data?
- addition of a BibTeX field
- removal of a BibTeX field
    * Why has the field been removed? Did it contain errors? Is the removed information important for the entry? Could we get the data otherwise?
- changed BibTeX key
- changed BibTeX entry type
- changes in several BibTeX fields
- a completely different entry
    * Normally, this should not happen. Why did it happen? Maybe the URL no longer works?
- empty BibTeX field removed or added
    * Since this does not change the content of the entry, the change can safely be adopted into the expected BibTeX.
Broken BibTeX
Special care must be taken when the BibTeX returned by the publisher is broken (e.g., BibTeX key is missing, year is missing, etc.). Broken BibTeX is not supported by BibSonomy, hence, the scraper should repair it, if it is easily possible. The rule is: Try to use a simple and robust heuristic to fix the BibTeX.
Example: when the BibTeX key is missing, the following could help:
#!java
// fix missing bibtex key
final int indexOfBrace = bibTeXResult.indexOf('{') + 1;
if (indexOfBrace == bibTeXResult.indexOf('\n')) {
    bibTeXResult = bibTeXResult.substring(0, indexOfBrace) + "noKey," + bibTeXResult.substring(indexOfBrace);
}
Test URL Has Changed
Find the new URL for the same article (e.g., search for the title/authors on the publisher's web page) and then change the URL. Does this change the returned BibTeX? Why?
Web Page Layout Has Changed
This most often requires fixing the scraper.