scraping paper from Nature
Scraping the URL http://www.nature.com/news/online-collaboration-scientists-and-the-social-network-1.15711 currently does not work. The error returned is "can't get download url".
Having a look at the HTML source code reveals that we should be able to extract metadata:
<meta name="citation_authors" content="Van Noorden, Richard; "/>
<meta name="citation_journal_title" content="Nature News"/>
<meta name="citation_doi" content="doi:10.1038/512126a"/>
<meta name="citation_title" content="Online collaboration: Scientists and the social network"/>
<meta name="citation_firstpage" content="126"/>
<meta name="citation_date" content="2014-08-14"/>
<meta name="citation_volume" content="512"/>
<meta name="citation_issue" content="7513"/>
with the following mapping to BibTeX fields:
- citation_authors to author (split at the ; character and merge with and: s.replace(";", " and "))
- citation_journal_title to journal
- citation_doi to doi
- citation_title to title
- citation_firstpage to pages
- citation_date: extract month, year, and day and fill the fields month, year, and day; in addition, store the complete value as date
- citation_volume to volume
- citation_issue to number
Please extend the Nature scraper to handle this type of URL.
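For illustration, the extraction and mapping described above could look roughly like the following. This is a minimal Python sketch, not the actual NatureScraper code; the names FIELD_MAP, CitationMetaParser, join_authors, and to_bibtex_fields are hypothetical. It stores the DOI and date values unchanged, and it drops empty pieces when splitting the authors, because a plain s.replace(";", " and ") would leave a dangling "and" after the trailing ";" in the example above.

```python
from html.parser import HTMLParser

# Mapping from the issue description: citation_* meta name -> BibTeX field.
FIELD_MAP = {
    "citation_journal_title": "journal",
    "citation_doi": "doi",
    "citation_title": "title",
    "citation_firstpage": "pages",
    "citation_date": "date",
    "citation_volume": "volume",
    "citation_issue": "number",
}


class CitationMetaParser(HTMLParser):
    """Collects <meta name="citation_*" content="..."> pairs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta .../> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name") or ""
        if name.startswith("citation_"):
            self.meta[name] = attributes.get("content") or ""


def join_authors(raw):
    # Split at ";" and skip empty pieces so a trailing ";" adds no extra "and".
    names = [part.strip() for part in raw.split(";") if part.strip()]
    return " and ".join(names)


def to_bibtex_fields(html):
    parser = CitationMetaParser()
    parser.feed(html)
    fields = {FIELD_MAP[k]: v for k, v in parser.meta.items() if k in FIELD_MAP}
    if "citation_authors" in parser.meta:
        fields["author"] = join_authors(parser.meta["citation_authors"])
    return fields
```

Calling to_bibtex_fields on the page source returns a plain dictionary keyed by BibTeX field name, which keeps the parsing step separate from any later formatting.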
Comments (8)
-
reporter - edited description
-
reporter - changed status to open
@misgna let me know when you are able to finish this task.
-
reporter And, of course, add the scraped URL as the field url.
-
reporter In addition:
- Please ensure that only the part after doi: (i.e., the actual DOI) is stored in the doi field.
- Map the month number to a month abbreviation (there should be a method to do it; reuse it. If you can't find it, let me know).
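The two points above could be sketched as follows, in Python for illustration only. The function names and the MONTH_ABBREVIATIONS table are assumptions; the table merely stands in for the existing month-conversion method mentioned above, which should be reused in the real scraper. The sketch also assumes citation_date is always in YYYY-MM-DD form, as in the example page.

```python
from datetime import date

# Stand-in for the project's existing month helper (an assumption, not its API).
MONTH_ABBREVIATIONS = (
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec",
)


def strip_doi_prefix(value):
    """Return only the actual DOI, without a leading "doi:" prefix."""
    value = value.strip()
    if value.lower().startswith("doi:"):
        return value[len("doi:"):].strip()
    return value


def split_citation_date(value):
    """Split a YYYY-MM-DD date into the date, year, month, and day fields."""
    value = value.strip()
    parsed = date.fromisoformat(value)  # raises ValueError on other formats
    return {
        "date": value,
        "year": str(parsed.year),
        "month": MONTH_ABBREVIATIONS[parsed.month - 1],
        "day": str(parsed.day),
    }
```

For the page in this issue, strip_doi_prefix("doi:10.1038/512126a") yields 10.1038/512126a, and split_citation_date("2014-08-14") yields year 2014, month aug, and day 14.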
-
reporter Thanks for the first commit. Please check all the above-mentioned requirements again (including the comments). The commit does not fully implement them, in particular the format of the author, month, and year fields:
- the year should be without the {}, e.g., year = 2007
- the month should be without the {} and all lower case, e.g., month = aug
- the author names should not be separated by ; but by and
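To make the expected output format concrete, here is a small Python sketch of how an entry could be rendered with these rules. It is only an illustration under assumptions: render_entry and format_field are hypothetical names, the entry type and citation key in the test values are made up, and the real scraper may assemble the BibTeX string differently.

```python
# Fields that are written without surrounding braces, per the comment above.
UNBRACED_FIELDS = {"year", "month"}


def format_field(name, value):
    if name in UNBRACED_FIELDS:
        # e.g. "year = 2007" and "month = aug" (lower case, no braces)
        return f"  {name} = {value.lower()}"
    # All other fields keep their braces, e.g. "author = {A and B}".
    return f"  {name} = {{{value}}}"


def render_entry(entry_type, key, fields):
    body = ",\n".join(format_field(name, value) for name, value in fields.items())
    return f"@{entry_type}{{{key},\n{body}\n}}"
```

The author value passed in is expected to be joined with and already, so the renderer only has to decide which fields are braced.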
-
Account Deleted NatureScraper has been modified to support the above requirements.
-
Account Deleted - changed status to resolved
Issue #2246 is resolved.
-
reporter Please have a look at the refactoring in commit #0a18428