dx-doi redirect to jama hangs in DxDoi

Issue #10 on hold
Tim Laurent
created an issue

Hi,

We ran into this problem when trying to access an article while ordering a paper.

The identifiers:

pmid = 7580483
doi = '10.1001/jamapsychiatry.2016.1728'

In the DxDoi _query_api method, the URL being contacted is http://dx.doi.org/10.1001/jamapsychiatry.2016.1728 . In a browser this redirect works fine; however, fetching that URL with requests.get or curl hangs.

One approach to fixing this is to add a timeout to the request and raise an error if that timeout is exceeded. Another approach is to use allow_redirects=False and then get the redirect address from the Location header.

Here is what we have:

    def _query_api(self, doi):
        try:
            response = requests.get(DX_DOI_URL % doi, timeout=10.0, allow_redirects=False)
            if response.status_code not in [200, 401, 301, 302, 307, 308, 416]:
                raise DxDOIError('dx.doi.org lookup failed for doi "%s" (HTTP %i returned)' %
                                (doi, response.status_code))
            url = response.headers.get('Location')
            if url is None:
                raise DxDOIError('dx.doi.org lookup failed, unable to get a redirect link for doi: {}.'.format(doi))
            return url
        except requests.RequestException as e:
            # Format the exception itself; the .message attribute does not exist on Python 3.
            raise DxDOIError('dx.doi.org lookup or redirect failed for doi "{}" (Error msg: {})'.format(
                doi, e
            ))

One last question: why are 401 and 416 acceptable status codes for the dx.doi.org response?

Comments (3)

  1. Tim Laurent reporter

    The specific reason metapub has problems accessing JAMA papers seems to be that JAMA's robots.txt blocks the default user agent used by requests (and curl's user agent as well); see https://github.com/kennethreitz/requests/issues/3743. We resolved this by creating a metapub-specific requests session in metapub/utils.py that can be imported in place of the regular requests module.

    It changes the user agent to 'metapub/0.4.x' and sets a 20-second timeout on the session object:

    import requests as r
    
    requests = r.Session()
    requests.headers.update({'User-Agent': 'metapub/0.4.x'})
    requests.timeout = 20
    
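One caveat about the session snippet above: as far as I know, requests does not read a `timeout` attribute from a `Session`, so `requests.timeout = 20` stores a value but does not by itself limit how long a request can hang. A minimal sketch of one way to get a default timeout, assuming a hypothetical `TimeoutSession` subclass (this name is made up for illustration and is not part of metapub or requests):

```python
import requests

DEFAULT_TIMEOUT = 20  # seconds; same value as in the snippet above


class TimeoutSession(requests.Session):
    """Hypothetical helper: a Session with a custom User-Agent and a default timeout."""

    def __init__(self, timeout=DEFAULT_TIMEOUT, user_agent='metapub/0.4.x'):
        super(TimeoutSession, self).__init__()
        self.default_timeout = timeout
        self.headers.update({'User-Agent': user_agent})

    def request(self, method, url, **kwargs):
        # Fill in the default only when the caller did not pass timeout= explicitly.
        kwargs.setdefault('timeout', self.default_timeout)
        return super(TimeoutSession, self).request(method, url, **kwargs)


# Drop-in replacement for the module-level name used in the snippet above.
requests_session = TimeoutSession()
```

Because `Session.get`, `Session.post`, and the other verb helpers go through `Session.request`, overriding that one method should cover them all. A caller can still override the default per call, for example `requests_session.get(url, timeout=5)`.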