Different scraping results when using http and https DOI

Til Barthel

This should already be fixed on the master branch.

The cause was that the old WebUtils.getContentasString (from ca. 2018) returned an empty string for the http URL. The HTMLMetaDataDOIScraper then correctly extracted the DOI from the URL itself and set it in the scrapingContext. For the https URL, the DOI was scraped from the page's metadata instead.

As a result, the plain DOI was set in the scrapingContext for the http URL, while the DOI URL was set for the https URL. The ContentNegotiationDOIScraper checks whether the selected text contains a DOI or a DOI URL. The check for a DOI URL was based on Pattern(".*dx.doi.org"), which does not match https://doi.org/10.5194/wcd-2020-32. So the https URL was handled by the HighwirePressScraper and produced a different BibTeX entry.
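To illustrate the mismatch, here is a minimal, self-contained sketch. It is not the project's code: the class DoiUrlCheck, both method names, and the broader pattern are hypothetical, and it assumes the old pattern was applied with find(). It only shows why a check limited to dx.doi.org misses the newer doi.org form.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the scraper code base.
class DoiUrlCheck {
    // The pattern quoted above: it only recognises the legacy dx.doi.org host.
    static final Pattern OLD_DOI_URL = Pattern.compile(".*dx.doi.org");

    // Assumed broader pattern (illustration only, not the actual fix):
    // optional scheme, optional "dx." prefix, then doi.org/ and a DOI starting with "10.".
    static final Pattern BROAD_DOI_URL = Pattern.compile(
            "^(?:https?://)?(?:dx\\.)?doi\\.org/(10\\.\\d{4,9}/\\S+)$",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the old check searched for the pattern anywhere in the text.
    public static boolean matchesOld(String text) {
        return OLD_DOI_URL.matcher(text).find();
    }

    // The broader check requires the whole text to be a DOI URL.
    public static boolean matchesBroad(String text) {
        return BROAD_DOI_URL.matcher(text).matches();
    }

    public static void main(String[] args) {
        String legacy = "http://dx.doi.org/10.5194/wcd-2020-32";
        String current = "https://doi.org/10.5194/wcd-2020-32";
        for (String url : new String[] {legacy, current}) {
            System.out.println(url + " old=" + matchesOld(url) + " broad=" + matchesBroad(url));
        }
    }
}
```

With the old pattern, only the dx.doi.org form is recognised as a DOI URL, so the https://doi.org/ form falls through to whichever scraper comes next. The broader pattern accepts both forms but still rejects a bare DOI, which would need its own check.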


2022-05-12T07:29:35+00:00

Comments (3)

Daniel Zoller reporter
- changed status to open
- 2020-08-09T17:04:42+00:00
Til Barthel
- 2022-05-12T07:29:35+00:00
Jan Pfister
- changed status to resolved
- 2022-05-12T10:48:27+00:00