- changed status to open
Different scraping results for http and https DOI URLs
E.g., the scraping result of https://doi.org/10.5194/wcd-2020-32 differs from that of http://doi.org/10.5194/wcd-2020-32
Please check why the results differ, and why the http version gives better results than the https version.
Comments (3)
-
reporter -
This should already be fixed on the master branch.
The problem was that the old WebUtils.getContentasString (from ca. 2018) returned an empty string for the http URL. The HTMLMetaDataDOIScraper then correctly extracted the DOI from the URL and set it in the scrapingContext. For the https URL, the page's metadata was scraped for the DOI instead. As a result, the scrapingContext held the plain DOI for the http URL but the DOI URL for the https URL.
The ContentNegotiationDOIScraper checks whether the selected text is a DOI or a DOI URL. The check for a DOI URL was based on Pattern(".*dx.doi.org"), which does not match https://doi.org/10.5194/wcd-2020-32. So the https URL was handled by the HighwirePressScraper and produced a different BibTeX entry.
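To illustrate the pattern mismatch, here is a minimal sketch. It assumes the check used Matcher.find(), which may not be how the scraper actually applied it. The class DoiUrlPatternDemo, its helper methods and the broader pattern are hypothetical and are not taken from the actual fix on master.

```java
import java.util.regex.Pattern;

final class DoiUrlPatternDemo {
    // Pattern quoted in the comment above. The unescaped dots match any
    // character, but the literal "dx" is still required.
    static final Pattern OLD_DOI_URL = Pattern.compile(".*dx.doi.org");

    // Hypothetical broader pattern (an assumption, not the real fix):
    // accepts http or https, with or without the "dx." prefix.
    static final Pattern BROADER_DOI_URL =
            Pattern.compile("^https?://(dx\\.)?doi\\.org/");

    static boolean matchesOld(String text) {
        return OLD_DOI_URL.matcher(text).find();
    }

    static boolean matchesBroader(String text) {
        return BROADER_DOI_URL.matcher(text).find();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://dx.doi.org/10.5194/wcd-2020-32",
            "http://doi.org/10.5194/wcd-2020-32",
            "https://doi.org/10.5194/wcd-2020-32",
        };
        for (String url : urls) {
            System.out.println(url
                    + " old=" + matchesOld(url)
                    + " broader=" + matchesBroader(url));
        }
    }
}
```

Under these assumptions, the old pattern only recognizes the dx.doi.org form, so a doi.org URL in the scrapingContext is not treated as a DOI URL and falls through to another scraper.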
-
- changed status to resolved