Different scraping results when using http and https DOI

Issue #2906 resolved
Daniel Zoller created an issue

e.g. the scraping result of https://doi.org/10.5194/wcd-2020-32 differs from http://doi.org/10.5194/wcd-2020-32

Please check why the results differ, and why the http version gives better results than the https version.

Comments (3)

  1. Til Barthel

    This should already be fixed on the master branch.

    The problem was that the old WebUtils.getContentasString (from ca. 2018) returned an empty string for the http URL. The HTMLMetaDataDOIScraper then correctly extracted the DOI from the URL itself and set it in the scrapingContext. For the https URL, the page's metadata was scraped for the DOI instead.

    As a result, for the http URL the DOI was set in the scrapingContext, while for the https URL the DOI URL was set. The ContentNegotiationDOIScraper checks whether the selected text is a DOI or a DOI URL. The check for a DOI URL was based on Pattern(".*dx.doi.org"), which does not match https://doi.org/10.5194/wcd-2020-32. So the https URL was handled by the HighwirePressScraper and got a different BibTeX entry.
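
The pattern mismatch described above can be reproduced in isolation. The following is only a minimal sketch, not the actual fix on master: the class name DoiUrlCheck, the method names, and the replacement pattern are hypothetical, and it assumes the old check applied the quoted pattern with find().

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DoiUrlCheck {
    // Pattern as quoted in the comment above; it only recognises the dx.doi.org host.
    // Assumption: the old check used find() rather than matches().
    static final Pattern OLD_DOI_URL = Pattern.compile(".*dx.doi.org");

    // Hypothetical replacement: accepts http and https, with or without the "dx." prefix,
    // and captures the DOI itself in group 1.
    static final Pattern DOI_URL = Pattern.compile(
            "^https?://(?:dx\\.)?doi\\.org/(10\\.\\d{4,9}/\\S+)$",
            Pattern.CASE_INSENSITIVE);

    // True if the old pattern is found anywhere in the text.
    static boolean oldCheck(String text) {
        return OLD_DOI_URL.matcher(text).find();
    }

    // Returns the bare DOI for a DOI URL, or null if the text is not a DOI URL.
    static String extractDoi(String text) {
        Matcher m = DOI_URL.matcher(text.trim());
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String https = "https://doi.org/10.5194/wcd-2020-32";
        String http = "http://doi.org/10.5194/wcd-2020-32";
        System.out.println(oldCheck(https));
        System.out.println(extractDoi(https));
        System.out.println(extractDoi(http));
    }
}
```

Normalising both URL variants to the bare DOI before choosing a scraper would let the http and https forms take the same path.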
