scraping paper from Nature

Issue #2246 resolved
Robert Jäschke created an issue

Scraping the URL http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711 currently does not work. The error returned is "can't get download url".

Having a look at the HTML source code reveals that we should be able to extract metadata:

<meta name="citation_authors" content="Van Noorden, Richard; "/>
<meta name="citation_journal_title" content="Nature News"/>
<meta name="citation_doi" content="doi:10.1038/512126a"/>
<meta name="citation_title" content="Online collaboration: Scientists and the social network"/>
<meta name="citation_firstpage" content="126"/>
<meta name="citation_date" content="2014-08-14"/>
<meta name="citation_volume" content="512"/>
<meta name="citation_issue" content="7513"/>    

with the following mapping to BibTeX fields:

  • citation_authors to author (split at the ; character and join with and, e.g., s.replace(";", " and "))
  • citation_journal_title to journal
  • citation_doi to doi
  • citation_title to title
  • citation_firstpage to pages
  • citation_date: extract year, month, and day and fill the fields year, month, and day; in addition, store the complete value as date
  • citation_volume to volume
  • citation_issue to number
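
The mapping above could be sketched roughly as follows. This is only an illustration in Python using the standard library, not the actual scraper code; the names FIELD_MAP, CitationMetaParser, and to_bibtex_fields are made up for this example. It splits the author list instead of using a plain replace, on the assumption that a trailing ; (as in the sample above) should not produce a dangling "and".

```python
from html.parser import HTMLParser

# Mapping from the issue: meta tag name -> BibTeX field (one-to-one cases only).
FIELD_MAP = {
    "citation_journal_title": "journal",
    "citation_doi": "doi",
    "citation_title": "title",
    "citation_firstpage": "pages",
    "citation_volume": "volume",
    "citation_issue": "number",
}


class CitationMetaParser(HTMLParser):
    """Collects <meta name="citation_*" content="..."> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.meta[name] = attributes.get("content") or ""


def to_bibtex_fields(html):
    """Return a dict of BibTeX fields built from the citation_* meta tags."""
    parser = CitationMetaParser()
    parser.feed(html)
    meta = parser.meta

    fields = {FIELD_MAP[k]: v for k, v in meta.items() if k in FIELD_MAP}

    if "citation_authors" in meta:
        # Split at ";" and drop empty parts so a trailing ";" is harmless.
        names = [n.strip() for n in meta["citation_authors"].split(";") if n.strip()]
        fields["author"] = " and ".join(names)

    if "citation_date" in meta:
        date = meta["citation_date"]
        fields["date"] = date  # keep the complete value
        parts = date.split("-")
        if len(parts) == 3:  # assumes the YYYY-MM-DD form shown in the sample
            fields["year"], fields["month"], fields["day"] = parts

    return fields
```

In this sketch the month is still the number taken from the date and the doi value is copied unchanged.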

Please extend the Nature scraper to handle this type of URL.

Comments (8)

  1. Robert Jäschke reporter

    In addition:

    1. Please ensure that only the part after doi: (i.e., the actual DOI) is stored in the doi field
    2. Map the month number to a month abbreviation (there should be a method for this; please reuse it. If you can't find it, let me know).
  2. Robert Jäschke reporter

    Thanks for the first commit. Please check all the above-mentioned requirements again (including the comments). The commit does not fully implement them; in particular, the format of the author, month, and year fields is still wrong:

    1. the year should be without the {}, e.g., year = 2007
    2. the month should be without the {} and all lower case, e.g., month = aug
    3. the author names should not be separated by ; but by and
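
The formatting requirements from the comments (bare DOI, lower-case month abbreviation, unbraced year and month, authors joined by and) could look roughly like this. It is a minimal Python sketch under stated assumptions, not the project's code: all function names and the entry key are hypothetical, and month_abbreviation is only a stand-in for the existing method mentioned in the first comment.

```python
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")

# Fields that should be written without surrounding {} (per the comments).
UNBRACED_FIELDS = {"year", "month"}


def strip_doi_prefix(value):
    """Keep only the part after a leading 'doi:' (hypothetical helper)."""
    value = value.strip()
    if value.lower().startswith("doi:"):
        value = value[len("doi:"):].strip()
    return value


def month_abbreviation(number):
    """Map a month number such as '08' or 8 to a lower-case abbreviation."""
    index = int(number)
    if not 1 <= index <= 12:
        raise ValueError("month out of range: %r" % (number,))
    return MONTHS[index - 1]


def join_authors(raw):
    """Turn 'A; B; ' into 'A and B', ignoring empty parts."""
    return " and ".join(n.strip() for n in raw.split(";") if n.strip())


def render_bibtex(key, fields, entry_type="article"):
    """Render a BibTeX entry; year and month are emitted without braces."""
    lines = []
    for name, value in fields.items():
        if name in UNBRACED_FIELDS:
            lines.append(f"  {name} = {value}")
        else:
            lines.append(f"  {name} = {{{value}}}")
    return "@%s{%s,\n%s\n}" % (entry_type, key, ",\n".join(lines))


# Usage with the values from the issue (the key is made up for illustration).
entry = render_bibtex("vannoorden2014", {
    "author": join_authors("Van Noorden, Richard; "),
    "doi": strip_doi_prefix("doi:10.1038/512126a"),
    "year": "2014",
    "month": month_abbreviation("08"),
})
print(entry)
```

The entry type "article" is a guess for this page; the issue does not say which type the scraper should emit.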