SimpleScrapingLocator generates thread errors for projects with invalid (older) versions

Issue #54 resolved
Paul Moore
created an issue
>>> from distlib.locators import SimpleScrapingLocator
>>> pypi = SimpleScrapingLocator('https://pypi.python.org/simple/')
>>> p = pypi.get_project('setuptools')
Exception in thread Thread-23:
Traceback (most recent call last):
  File "C:\Apps\Python34\Lib\threading.py", line 921, in _bootstrap_inner
    self.run()
  File "C:\Apps\Python34\Lib\threading.py", line 869, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\locators.py", line 662, in _fetch
    if (not self._process_download(link) and
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\locators.py", line 613, in _process_download
    self._update_version_data(self.result, info)
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\locators.py", line 303, in _update_version_data
    dist = make_dist(name, version, scheme=self.scheme)
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\database.py", line 1299, in make_dist
    md.version = version
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\metadata.py", line 764, in __setattr__
    self._validate_value(key, value)
  File "C:\Users\Gustav\AppData\Local\Temp\VE-1\lib\site-packages\distlib\metadata.py", line 761, in _validate_value
    key))
distlib.metadata.MetadataInvalidError: '0.6c12dev-r89000' is an invalid value for the 'version' property

Rather than cluttering up the output with errors that are of no relevance to the user (the code I was writing was going to ignore all keys except the latest version) can invalid versions not be simply skipped?

(Also, it would be nice to have a method that only returned data for a given version, which would be more efficient for locators such as the PyPI XMLRPC and JSON locators that can provide that information).

Comments (13)

  1. Vinay Sajip

    This can be avoided by specifying the legacy version scheme:

    pypi = SimpleScrapingLocator('https://pypi.python.org/simple/', scheme='legacy')
    

    This is documented here and here, but I see the formatting is slightly messed up. Will rectify shortly.

  2. Paul Moore reporter

    Hmm, OK. But I would have tended (incorrectly, I concede) to assume that versions that weren't valid under the scheme would simply be ignored. I don't see a way to get that information from distlib (without using legacy and then manually checking each version). And I'm not sure what I can do with the errors that get produced, in practice (they are thread errors, so aren't trappable in the main code).
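The "not trappable" point follows from how Python threads work in general: an exception raised inside a worker thread is not re-raised in the thread that started it, so a `try`/`except` around `join()` never sees it. The sketch below illustrates one common way around that, assuming the worker itself catches the error and passes it back on a queue. It uses only the standard library; `fetch` and `run_workers` are hypothetical names, not distlib functions, and the `ValueError` merely stands in for `MetadataInvalidError`.

```python
import queue
import threading


def fetch(version, errors):
    """Hypothetical worker that fails on a legacy-style version string."""
    try:
        if "dev-r" in version:
            # Stand-in for distlib.metadata.MetadataInvalidError.
            raise ValueError(
                f"{version!r} is an invalid value for the 'version' property"
            )
    except Exception as exc:
        # Hand the exception to the main thread instead of letting the
        # threading machinery print a traceback to stderr.
        errors.put(exc)


def run_workers(versions):
    """Run one worker per version and return the exceptions they raised."""
    errors = queue.Queue()
    threads = [
        threading.Thread(target=fetch, args=(v, errors)) for v in versions
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # join() never re-raises a worker's exception
    collected = []
    while not errors.empty():
        collected.append(errors.get())
    return collected


caught = run_workers(["5.8", "0.6c12dev-r89000"])
print([str(e) for e in caught])
```

With this pattern the calling code decides what to do with the collected errors, rather than having them written to the console.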

    The problem is that unless I'm 100% sure that I know none of the historical versions of a project has ever used a legacy-format version, I have to specify legacy just to be safe. So what I'll actually do is just use it regardless (which effectively negates the value of having a newer standard).

    And my other point, that there's no way to efficiently get the data for a specific version from locators that work that way by default, remains relevant.

  3. Vinay Sajip

    versions that weren't valid under the scheme would simply be ignored

    "Errors should never pass silently. Unless explicitly silenced."

    I accept that I don't provide an easy way of just skipping invalid versions, but there would be situations where one wouldn't want silent ignoring to happen.

    which effectively negates the value of having a newer standard

    As long as loosely defined versions hang about on PyPI, we're going to run into this problem. For now, just use the legacy scheme, which you pretty much have to do in general when working with arbitrary distributions on PyPI.

    And my other point ... remains relevant.

    Yes, which is why I didn't close the issue - I'll look into it as soon as I get a chance. I assume you're aware of the locate API?

    >>> pypi = SimpleScrapingLocator('https://pypi.python.org/simple/', scheme='legacy')
    >>> pypi.locate('setuptools==5.8')
    <Distribution setuptools (5.8) [https://pypi.python.org/packages/source/s/setuptools/setuptools-5.8.zip]>
    

    Efficiency is less of a concern for the other locators you mentioned. For example, the Red Dove metadata provides the metadata for all versions in a single hit, and JSONLocator makes use of this. Admittedly the PyPIRPCLocator doesn't provide a short-cut for a single version, which is not optimal, but I see this and some of the other locators as just stop-gaps until PyPI provides a better querying infrastructure (like the Red Dove metadata does, though that can no doubt be improved, too).

  4. Paul Moore reporter

    Versions that weren't valid under the scheme would simply be ignored

    "Errors should never pass silently. Unless explicitly silenced."

    I accept that I don't provide an easy way of just skipping invalid versions, but there would be situations where one wouldn't want silent ignoring to happen.

    Agreed. The problem for me is that I can't do anything except ignore them, and leave the output cluttering what the user sees. If I could deal with the exceptions in my calling code that would be a different matter. But agreed it's a less important issue.

    I assume you're aware of the locate API?

    Yes, I am. I reported this issue while I was trying various options, but locate is actually just as good for the use case I have (unfortunately, due to issue #58, "just as good" is still "not good enough") which is to get the downloadable files for (the latest version of) a project.

  5. Vinay Sajip

    Given that using a legacy scheme eliminates the problem initially posted, and given that the default locator uses the legacy scheme, is it OK to close this issue? I'm not sure it's worth adding a separate API for locating just a single version, given that (a) it's currently efficient using the default locator, (b) it can't be made efficient using the SimpleScrapingLocator or the other locators that need to scan everything, and (c) the other locators are stop-gaps pending a decent JSON API appearing in PyPI.

  6. Paul Moore reporter

    I'm OK with closing this issue. To be honest, I've ended up writing my own locators code, because the limitations of the distlib API made it unsuitable. So I don't need this any more.

    I will say that I would never use an API that had the possibility of dumping error data to stdout/stderr that I couldn't handle within the app. So as far as I'm concerned the message I take away is "always specify legacy".

    And what's wrong with the PyPI JSON API, in your view? It seems fine to me (with the one exception of not having a "list all distributions" method, but that's easy enough to get by falling back to XMLRPC or scraping the simple index).

  7. Vinay Sajip

    because the limitations of the distlib API made it unsuitable

    Do you mean the limitations we're discussing here, or something else? Did you subclass an existing locator, or write new code entirely?

    I would never use an API that had the possibility of dumping error data to stdout/stderr that I couldn't handle within the app

    That's a fair point. One approach might be to add a handleError method on locators (or an error_handler callable attribute that defaults to a handleError method) that's called when exceptions occur. How does that sound?
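One way the proposed hook could look is sketched below. This is purely illustrative and is not distlib's implementation: `SketchLocator`, `_process`, and `get_versions` are hypothetical names, the version check is a crude placeholder, and the file names are made up. Only the `handleError` method and the `error_handler` attribute come from the suggestion above.

```python
import logging
import threading

logger = logging.getLogger("sketch.locators")


class SketchLocator:
    """Hypothetical locator showing an overridable error hook."""

    def __init__(self, error_handler=None):
        # error_handler is a callable attribute that defaults to the
        # handleError method, as suggested in the comment above.
        self.error_handler = error_handler or self.handleError
        self.result = {}
        self._lock = threading.Lock()

    def handleError(self, exc):
        # Default: report the problem through logging, so it is neither
        # silently dropped nor dumped to stderr as a raw thread traceback.
        logger.warning("skipping entry: %s", exc)

    def _process(self, version):
        try:
            if "-" in version:  # placeholder for real version validation
                raise ValueError(f"{version!r} is an invalid version")
            with self._lock:
                # Placeholder file name, for illustration only.
                self.result[version] = f"example-{version}.zip"
        except Exception as exc:
            self.error_handler(exc)

    def get_versions(self, versions):
        threads = [
            threading.Thread(target=self._process, args=(v,))
            for v in versions
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        return self.result


# The caller collects errors itself instead of seeing them on stderr.
errors = []
locator = SketchLocator(error_handler=errors.append)
found = locator.get_versions(["5.8", "0.6c12dev-r89000"])
print(sorted(found), [str(e) for e in errors])
```

Passing a callable keeps the default behaviour explicit while letting an application choose to collect, log, or re-raise the errors.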

    So as far as I'm concerned the message I take away is "always specify legacy"

    In general that seems the right answer for now, because what if all versions of a distribution are invalid according to PEP 440? Ignoring errors would give you ... no distributions.
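The edge case can be made concrete with a small sketch. It assumes a deliberately simplified regular expression as a stand-in for a strict version check; this is not the full PEP 440 grammar and not distlib's validation code, and `split_versions` is a hypothetical helper.

```python
import re

# Simplified stand-in for a strict scheme: dotted numbers with optional
# pre-release, post-release and dev suffixes. NOT the full PEP 440 grammar.
STRICT = re.compile(r"^\d+(\.\d+)*((a|b|c|rc)\d+)?(\.post\d+)?(\.dev\d+)?$")


def split_versions(versions):
    """Partition version strings into (valid, skipped) lists."""
    valid, skipped = [], []
    for version in versions:
        if STRICT.match(version):
            valid.append(version)
        else:
            skipped.append(version)
    return valid, skipped


# A project with a mix of versions still yields usable results...
mixed = split_versions(["5.8", "0.6c11", "0.6c12dev-r89000"])

# ...but a project whose versions are all loosely formatted yields nothing.
all_bad = split_versions(["0.6c12dev-r89000", "2004-03-01"])

print(mixed)
print(all_bad)
```

Returning the skipped versions alongside the valid ones lets the caller notice the "nothing left" case instead of mistaking it for a project with no releases.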

    It seems fine to me (with the one exception of not having a "list all distributions" method, but that's easy enough to get by falling back to XMLRPC or scraping the simple index).

    That's precisely it: scraping the simple index or making multiple RPC calls are suboptimal, given that PyPI has all the relevant information in its database and could return all the requested info in one hit (i.e. by supporting multiple URLs to return different levels of information - such as whole project, or a specific version).

  8. Paul Moore reporter

    Do you mean the limitations we're discussing here, or something else?

    Basically what we're discussing here. I was trying to do a quick "grab the latest downloadable files for a distribution and do something with them" script, and found I couldn't. It was a throwaway script, so I mentioned the issue here and moved on.

    The thing that bothers me is that I find this happens often with the locators API. I don't quite know why - when I roll my own code with one of the PyPI APIs, it always seems pretty straightforward to get what I want, so I decided to write my own, to see if I could work out why the distlib API seems to be a bad fit for me.

    Did you subclass an existing locator, or write new code entirely?

    Just used the existing locators to do some PyPI queries. It was when PyPI was having maintenance, so I used the simple scraping locator directly because the JSON and XMLRPC interfaces weren't working.

    That's precisely it: scraping the simple index or making multiple RPC calls are suboptimal

    OK, so asking for a JSON API to get all the distributions would suit you? I'd find it useful, too, so maybe I'll ask for it...

  9. Vinay Sajip

    The thing that bothers me is that I find this happens often with the locators API ... so I decided to write my own, to see if I could work out why the distlib API seems to be a bad fit for me.

    Any conclusions? I'd like the locators API to be easy to use, obviously, so specific feedback to improve it would be gratefully received.

    a JSON API to get all the distributions would suit you?

    Yes. I'd update PyPIJSONLocator to incorporate any improvements, when feasible.

  10. Paul Moore reporter

    Any conclusions? I'd like the locators API to be easy to use, obviously, so specific feedback to improve it would be gratefully received.

    So far, not much. I've got the basic internals, but not done much on the API yet. My code so far makes the following decisions:

    1. Locators have 3 main APIs - list all distribution names, list all version numbers for a distribution, list all files for a version. The rest can be layered on that. Pretty much the same as distlib (and the XMLRPC and JSON APIs, tbh).
    2. I am only locating files, I'm not even trying to return metadata (yet...). In my experience, name, version and URL are by far the most important data for "quick scripts". Directory and scraper locators can't give metadata anyway.
    3. I'm not trying to make the scraper comprehensive - I've restricted it to rel=internal links. Again, good enough for my (quick scripts) purposes so far.
    4. I'm parsing filenames with permissive URLs, not trying to enforce the standards at this point. My instinct is to return versions as strings, and let the caller do any parsing needed. (A locate-style API would obviously need to parse versions, but it can then ignore unparseable versions - the user can use the low-level API if that's not suitable).

    I also want to keep the locator API self-contained. Both setuptools and distlib link it to distribution, metadata, and other APIs. I'd rather keep things loosely coupled. Of course, working only with name, version and url makes that easier :-)

    My principle is that I'm not trying to write pip, but I am trying to make it easier for people (me) to write one-off scripts to interact with the packaging ecosystem.
