Issue #97 open

find_package() doesn't find 3.3-style no-__init__.py packages

Graham Wideman
created an issue
setup(...
    packages=find_packages(...)
    )

on the developer machine will fail to find packages that lack an __init__.py file, as is allowed in Python 3.3. However, such packages, when listed explicitly (packages=['mypkg']), do appear to get included and later installed.

Note: When testing this, before each test be sure to delete all generated metadata, including that which setup may previously have placed in the original source directory, as it seems that setup may use metadata created on a previous run in order to include files.

This is part of a general problem reported in issue #83, but I've logged it separately as it's specifically about setuptools.
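For reference, the failure is easy to reproduce with a throwaway layout (the directory and module names here are invented for the example, not taken from the report):

```python
import os
import tempfile

from setuptools import find_packages

# Build:  tmp/mypkg/mod.py  with no __init__.py (PEP 420 style).
tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "mypkg"))
open(os.path.join(tmp, "mypkg", "mod.py"), "w").close()

# The __init__-less package is invisible to find_packages():
print(find_packages(where=tmp))   # -> []
```

Dropping an __init__.py into mypkg/ makes it appear in the result, which is the behavior gap being reported.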

Comments (38)

  1. Jason R. Coombs

    I'd like to say that Setuptools should support PEP 420 packages. The harder question is how should find_packages discover such packages? PEP 420 allows importing of any importable filename ending in .py in any directory in sys.path. However, sys.path isn't relevant when building distributions and discovering packages.

    I imagine setuptools could include all suitable directories, possibly omitting explicitly excluded paths. This change would be backwards-incompatible, but could provide packagers with options to produce compatible dists.

    I considered using Python's import machinery to find packages, but I don't believe the import machinery provides any discovery tools.

    The other thing to consider - if setuptools supports this mode, it will not work on other versions of Python, so will require Python 3.3+.

    This gives me an idea - perhaps setuptools can use the package metadata to determine when PEP 420 package discovery is enabled. It could inspect the Trove classifiers and if Python versions are specified and only include versions >= 3.3, the feature would be enabled.

    That technique would be somewhat implicit, but properly-documented, it could provide an elegant way to allow a packager to both include PEP 420 packages and declare the Python requirement at the same time.
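A sketch of that classifier gate, assuming Trove strings of the standard form "Programming Language :: Python :: X.Y" (the helper name and logic are illustrative, not setuptools API):

```python
def pep420_discovery_enabled(classifiers):
    """Enable PEP 420 discovery only if Python version classifiers are
    present and every declared version is >= 3.3 (made-up helper)."""
    prefix = "Programming Language :: Python :: "
    versions = []
    for classifier in classifiers:
        if not classifier.startswith(prefix):
            continue
        tail = classifier[len(prefix):]
        try:
            versions.append(tuple(int(part) for part in tail.split(".")))
        except ValueError:
            continue  # e.g. "... :: Implementation :: CPython"
    return bool(versions) and all(v >= (3, 3) for v in versions)

print(pep420_discovery_enabled([
    "Programming Language :: Python :: 3.3",
    "Programming Language :: Python :: 3.4",
]))   # -> True
```

A project declaring any version below 3.3 (or no versions at all) would keep the legacy behavior under this scheme.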

  2. Graham Wideman reporter

Yes, the more I look at PEP 420-style packages, the less I can see any distinction between a package and simply a directory tree containing Python files at any level, a tree that import plus sys.path can navigate arbitrarily.

Packages become whatever directories can be reached by import combined with sys.path. And neither the imports nor sys.path are readily available to packaging software.

    It may be that "packages" as a special level of modularity were an unnecessary idea in the first place, and it's as well to discard them.

    But if packages are thought to be an important level of modularity, then it should be up to the package developer to positively ordain them (as with __init__.py), not left up to the caller and whatever the caller's sys.path happens to be. At least, so I'm thinking at the moment.

    So I'm a bit mystified what the big plan is regarding packages.

  3. Wyatt Baldwin

    I ran into this issue yesterday and took a quick stab at a patch.

    It works pretty much the same as before, but it's necessary to specify what to include when there are top level, non-package directories that contain .py files.

setup(
    # ...
    packages=find_packages(include=['my_pep420_ns_package']),
    # ...
)
    

    It's slightly more complicated than I'd prefer because of the need to specify include in some cases. I was thinking about putting the where directory on sys.path and then trying to import subdirectories, but I wanted to try a simple(r) approach first.

    On the other hand, maybe allowing includes to be specified is generally useful. It also makes find_packages() work in a way that's more similar to how the basic packages list works (i.e., you only specify what to include). In fact, if distutils.core.setup() would allow packages to be specified using wildcards, find_packages() probably wouldn't be necessary in most cases.

    Anyway, here's the patch:

    diff -r f19978009ac8 setuptools/__init__.py
    --- a/setuptools/__init__.py    Thu Feb 06 09:31:48 2014 -0500
    +++ b/setuptools/__init__.py    Thu Feb 06 13:09:25 2014 -0800
    @@ -1,5 +1,6 @@
     """Extensions to the 'distutils' for large or complex distributions"""
    
    +import glob
     import os
     import sys
     import distutils.core
    @@ -27,31 +28,58 @@
     # Standard package names for fixer packages
     lib2to3_fixer_packages = ['lib2to3.fixes']
    
    -def find_packages(where='.', exclude=()):
    +def find_packages(where='.', exclude=(), include=()):
         """Return a list all Python packages found within directory 'where'
    
         'where' should be supplied as a "cross-platform" (i.e. URL-style) path; it
    -    will be converted to the appropriate local path syntax.  'exclude' is a
    -    sequence of package names to exclude; '*' can be used as a wildcard in the
    -    names, such that 'foo.*' will exclude all subpackages of 'foo' (but not
    -    'foo' itself).
    +    will be converted to the appropriate local path syntax.
    +
    +    'exclude' is a sequence of package names to exclude; '*' can be used
    +    as a wildcard in the names, such that 'foo.*' will exclude all
    +    subpackages of 'foo' (but not 'foo' itself).
    +
    +    'include' is a sequence of package names to include. If it's
    +    specified, only the named packages will be included. If it's not
    +    specified, all found packages will be included (this is the same as
    +    specifying include='*'). 'include' can include wild card patterns
    +    just like 'exclude'.
    +
    +    The list of includes is built up first and then any excludes are
    +    removed from it.
    +
         """
    +    from fnmatch import fnmatchcase
    +    include = list(include or '*')
         out = []
         stack=[(convert_path(where), '')]
    +    pep420 = sys.version_info[:2] >= (3, 3)
    +    in_regular_pkg = False
         while stack:
             where,prefix = stack.pop(0)
             for name in os.listdir(where):
                 fn = os.path.join(where,name)
    +            candidate = prefix + name
    +            for pat in include:
    +                if fnmatchcase(candidate, pat):
    +                    break
    +            else:
    +                continue
    +            in_regular_pkg = os.path.isfile(os.path.join(fn, '__init__.py'))
    +            # Namespace packages can't be nested inside regular packages
    +            in_pep420_ns_pkg = (
    +                not in_regular_pkg
    +                and pep420
    +                and glob.glob(os.path.join(fn, '*.py'))
    +            )
                 looks_like_package = (
                     '.' not in name
                     and os.path.isdir(fn)
    -                and os.path.isfile(os.path.join(fn, '__init__.py'))
    +                and (in_regular_pkg or in_pep420_ns_pkg)
                 )
                 if looks_like_package:
    -                out.append(prefix+name)
    -                stack.append((fn, prefix+name+'.'))
    +                out.append(candidate)
    +                stack.append((fn, candidate + '.'))
         for pat in list(exclude)+['ez_setup']:
    -        from fnmatch import fnmatchcase
             out = [item for item in out if not fnmatchcase(item,pat)]
         return out
    
  4. Jason R. Coombs

    Add support for PEP 420 namespace packages to find_packages()

    On Python 3.3+, find_packages() now considers any subdirectory of the start directory that's not a regular package (i.e., that doesn't have an __init__.py) to be a namespace package.

    The other way this supports PEP 420 is by making sure __pycache__ directories are never added to the list of packages.

    Fixes issue #97

    → <<cset 9322e85ed4f8>>

  5. Jason R. Coombs

    I've merged the pull request, but I find that the output includes quite a few directories not previously covered. Consider the comparison on setuptools itself:

    > python
    Python 3.4.0rc3 (v3.4.0rc3:8a81cdab3e9d, Mar  9 2014, 20:24:50) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import setuptools
    >>> setuptools.find_packages()
    ['build', 'command', 'dist', 'docs', 'foo', 'setuptools', 'temp', 'tests', '_markerlib', 'build.lib', 'docs.build', 'docs._templates', 'docs._theme', 'setuptools.command', 'setuptools.tests', 'tests.shlib_test', 'build.lib.setuptools', 'build.lib._markerlib', 'docs.build.doctrees', 'docs.build.html', 'docs._theme.nature', 'setuptools.tests.indexes', 'setuptools.tests.svn_data', 'build.lib.setuptools.command', 'build.lib.setuptools.tests', 'docs.build.html._sources', 'docs.build.html._static', 'docs._theme.nature.static', 'setuptools.tests.indexes.test_links_priority', 'setuptools.tests.indexes.test_links_priority.simple', 'setuptools.tests.indexes.test_links_priority.simple.foobar']
    
    > py -2.7
    Python 2.7.6 (default, Nov 10 2013, 19:24:24) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import setuptools
    >>> setuptools.find_packages()
    ['setuptools', '_markerlib', 'setuptools.command', 'setuptools.tests']
    

    I think perhaps the example above using the heuristic of testing for the presence of *.py files may be better. On the other hand, Python will import any directory.

    > py
    Python 3.4.0rc3 (v3.4.0rc3:8a81cdab3e9d, Mar  9 2014, 20:24:50) [MSC v.1600 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> os.listdir('foo')
    []
    >>> import foo
    >>>
    

    If we stick with the generous (and probably more correct) implementation to accept any directory as a package, package maintainers are going to have to be a lot more explicit about which directories are packages. It almost renders find_packages useless.

    Some more consideration is going to have to go into this idea.

  6. Jason R. Coombs

    Add support for PEP 420 namespace packages to find_packages()

    On Python 3.3+, find_packages() now considers any subdirectory of the start directory that's not a regular package (i.e., that doesn't have an __init__.py) to be a namespace package. Because this will often include non-package directories, a new include argument has been added to find_packages().

    include can make it easier to narrow which directories are considered packages instead of having to specify numerous excludes. In particular, it's an easy way to keep top-level, non-source directories from being considered packages.
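Concretely, the include argument narrows discovery like this (the package names and the sibling build/ directory are invented for the example):

```python
import os
import tempfile

from setuptools import find_packages

# A project with a real package plus a build/ directory that also
# happens to contain an __init__.py.
tmp = tempfile.mkdtemp()
for name in ("mypkg", "build"):
    os.makedirs(os.path.join(tmp, name))
    open(os.path.join(tmp, name, "__init__.py"), "w").close()

# Only names matching an include pattern are kept:
print(find_packages(where=tmp, include=["mypkg*"]))   # -> ['mypkg']
```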

    The other way this supports PEP 420 is by making sure __pycache__ directories are never added to the list of packages.

    Refs issue #97

    → <<cset 4515db47ccf0>>

  7. Graham Wideman reporter

    Jason:

    If we stick with the generous (and probably more correct) implementation to accept any directory as a package, package maintainers are going to have to be a lot more explicit about which directories are packages. It almost renders find_packages useless.

    I think you are reaching a level of head scratching similar to my 2013-11-05 comment. It seems to me that PEP 420's implications weren't fully thought through, as it undermines the ideas about what the package concept facilitates and what it protects against (not just causing problems for find_packages() ).

    Perhaps someone at a higher pay grade could distill a little better what the big-picture goal is for packages, and figure out whether PEP 420 gets there. To that end, it might be worthwhile recording in more detail the implications that you've encountered.

    For example, your discussion of package maintainers needing "to be a lot more explicit" -- what form do you envision that taking?

  8. Jason R. Coombs

    By the way, I don't think there's a "big plan" regarding packages. PEP-420 was the plan and where it's deficient, more discussion should probably happen in python-dev.

    I was there for the implementation of PEP-420. I feel like it's a good thing to simply and directly allow folders to represent packages. In the past, I was always bothered by the fact that it took two entities (a directory and an __init__ file) to represent a package.

    Regardless, the implementation for Python 3.3 and 3.4 is set, so setuptools should support those models to the best of its ability. I think for find_packages, it might be worthwhile to do more heuristics, namely -- only inferring a directory is a package if it has modules somewhere within it.

  9. Arfrever Frehtes Taifersar Arahesis

    I suggest to skip directories matching some hardcoded patterns by default (unless such directories are matched by element of include parameter):

    • build* (when e.g. python setup.py build is run multiple times, the contents of the build directory get copied into a subdirectory of itself: build/lib/build/lib/build/lib/... :( )
    • doc* (Often with Sphinx-specific conf.py file)
  10. Wyatt Baldwin

    The heuristics are tricky. Here's what I came up with (my original patch included similar heuristics):

    • include directories that contain one or more .py files
    • include directories that contain only directories
    • exclude directories that are subdirectories of "regular" packages and don't have an __init__.py (these directories could contain .py files, e.g. a directory tree for a "scaffold").

    As far as excluding certain hard coded directories, it seems like that shouldn't be necessary, since that's basically the purpose of the new include arg. I.e., you say you want to include 'mypkg*' and then the sibling directories of mypkg (like build and docs) will automatically be excluded.

    In addition, you can use a little trick to exclude non-package directories that are inside a package: add an extension to them (e.g., mypkg/scaffolds.data). These will be excluded automatically by find_packages(), and they're also not importable.

    I also wrote a blog post about this: http://wyattbaldwin.com/2014/02/07/pep-420-namespace-packages-somewhat-surprising/.
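The dotted-directory trick mentioned above can be checked directly, since find_packages() rejects any directory whose name contains a '.' (layout invented for the example):

```python
import os
import tempfile

from setuptools import find_packages

tmp = tempfile.mkdtemp()
pkg = os.path.join(tmp, "mypkg")
os.makedirs(os.path.join(pkg, "scaffolds.data"))
open(os.path.join(pkg, "__init__.py"), "w").close()
open(os.path.join(pkg, "scaffolds.data", "template.py"), "w").close()

# 'scaffolds.data' contains a .py file, but the dot in its name keeps it
# out of the package list (and makes it unimportable as a package):
print(find_packages(where=tmp))   # -> ['mypkg']
```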

  11. Jason R. Coombs

    After further consideration on this issue, particularly around backward-compatibility, I'm inclined to back out the changes as published and instead create a new function, designed to supersede find_packages, with PEP-420 support.

    This approach would give opt-in capability for projects that use PEP-420 package structures, but would leave the existing functionality unchanged.

    I've been brainstorming over what this new function could be called. I'm leaning toward "discover_packages", "locate_packages", or "find_420_packages". The latter provides better clarity of its original purpose, but will appear like cruft in the future (once Python 3.2 is the baseline).

  12. Graham Wideman reporter

    @ Jason: Where function names suggest an overlap of functionality, I strongly favor that the distinctions in the names actually indicate what the functional distinctions are.

    Cases where functions are distinguished only by arbitrary differences in their names tend to present endless confusion, searches of docs, and so on. The difference between "find", "locate" and "discover" does not pertain to scope, which is the salient difference in functionality.

    Also, having two overlapping functions with names that are not alphabetically adjacent (in docs and autocomplete) misses the opportunity to alert the unwary programmer that there's a choice to be made.

    If the distinctness of meaning is: "find_distribution_and_namespace_packages", then I, for one, am in favor of calling it that. It's long, but we have copy/paste and IDE autocomplete.

    If that's considered too long, then "find_420_packages", while somewhat cryptic, is at least precise. If it suggests cruft, that's perhaps a good thing -- that here some complexity lies, please be fully informed of the nuances of this function.

    Side question: "once Python 3.2 is the baseline" <-- I'm puzzling what's the salience of 3.2 to this topic, or did you mean 3.3?

  13. Jason R. Coombs

Graham Wideman I do agree with your sentiment. I am concerned about the counter-sentiment, which is that this function is meant to be a convenience, providing a shortcut for specifying the package names explicitly (and sometimes redundantly). A secondary advantage is that the function adds to the simplicity and elegance of the setup call. A name like 'find_distribution_and_namespace_packages' might auto-complete in some environments, but in many (most?) it won't. Furthermore, for a function like this, the aesthetic is relevant: a clumsy name, while functional, would detract from use (and thus from the appeal of setuptools).

    Let me elaborate on my thoughts about the alternative names. Yes, they do mirror find_packages semantically without reflecting the disparity. This aspect was by design, with the intention that (a) it would be documented clearly (probably with a show-by-default deprecation warning), and (b) the new function would entirely supersede the old name. So, I'm somewhat reluctant to have a new function carry the baggage of a legacy function that's to be deprecated.

    That said, you make some good arguments, and you've pushed me closer to the 'find_420_packages' name.

    I should mention, I've considered 'find_all_packages' but rejected it because 'all' is vague yet still clumsy. I've also considered the Microsoft technique, find_packages_ex, but that's clumsy too and even more vague.

    Given this ticket/PR is your initiative, I'm inclined to defer some extra weight to your judgment, so please consider these aspects and give me your favorite (or two).

    Thanks again for your help on this.

  14. Graham Wideman reporter

Jason R. Coombs: "defer some extra weight [to me]" Hahaha -- that's a lot of responsibility!

    First, I thought I posted this additional comment, but perhaps not:

    Is adding an optional named argument to the existing function to select the 420-inclusive behavior a way forward that avoids the naming issue? Or is this beyond the pale for some reason?

    If a new function is the way to go, then I certainly see the reasonableness of all your points.

If long names are out, then personally I lean strongly to the find_420_packages name. Like I mentioned, it precisely captures the intent of the function, and its "technical" appearance correctly conveys that there are some nuances to understand.

    I strongly dislike the names that change just "find" to "locate" or "discover".

I would also be quite happy with find_packages_ex or find_packages2. These aren't beautiful, but at least they convey that this function is newer in some way than the old one.

  15. Wyatt Baldwin

    I was just thinking about the idea of adding a keyword arg, too. It would only be relevant/active on Python 3.3 and up. Perhaps it would be disabled by default on 3.3 too so that people who aren't using namespace packages--which is probably the common case--wouldn't need to think about it.

    Something like find_packages(include=['mypkg'], include_ns_packages=True) seems like a decent API. And for projects that aren't using namespace packages in 3.3 and up, find_packages() with no args would work just like it does today.

  16. Jason R. Coombs

    In general, I avoid parameters, especially simple booleans, that select behavior. I'd prefer accepting/passing strings or actual functions to select varying behaviors.

    But even if one assumes a keyword argument as Wyatt suggests, I'd be inclined to have an alias where the default is True.

  17. Graham Wideman reporter

    @"Jason R. Coombs":

    In general, I avoid parameters, especially simple booleans, that select behavior

    I don't have a preference, but I'm curious what is the rationale behind your avoidance?

  18. Jason R. Coombs

    Especially when I see this:

    def do_something(switch=False):
      if switch:
        do_one_thing()
      else:
        do_another()
    

    Of course, it's usually more complicated than that, but that function could be re-written to just have the caller call the function he cares about.
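The flag-free rewrite of that snippet is just the two behaviors exposed as separate functions, with the branch moved to the call site (trivial bodies added here so the example runs):

```python
def do_one_thing():
    # one self-contained behavior, no mode flag
    return "one thing"

def do_another():
    # the other behavior, equally self-contained
    return "another"

# The caller simply invokes the function it cares about; there is no
# flag whose meaning must be traced through an internal branch:
result = do_one_thing()
```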

    More importantly, a lesson I've taken from cleanroom development, functional programming, and test-driven design is that functions should be small, isolated, and have as few branches and side-effects as possible. Adding a boolean flag that creates an internal branch ties the external interface to a branch in the internal behavior. This means that to fully comprehend the meaning of the flag, one must comprehend the entire function and all of the states it might be in.

    Proper variable naming can help indicate intent, but it won't overcome the increasing cognitive burden of the behavior.

    This becomes increasingly apparent when one attempts to refactor such a function only to find that the single boolean parameter must be passed unused to reach its inner function.

    I realize my explanation probably sounds esoteric, but it stems from my attempt to explain what seems intuitive to me after reading books like Refactoring.

  19. Jason R. Coombs

    Case in point is the recent refactorings I made to the find_packages function.

    I wanted to be rid of the 'stack' concept (which mutated a list for the purposes of iteration) and the 'out list' (which initialized an empty list and populated it incrementally for iteration). I wanted instead something that leveraged clean iteration over the essential objects.

    Once I did that, though, it became obvious why an include_ns_packages parameter in find_packages would significantly complicate the implementation.

  20. pje

    The easiest way to fix this problem is to simply auto-add namespace_packages to the packages list in the Distribution object. find_packages should not be used for this, because of the possibility of false positives.

    In other words, if you want to include namespace packages, list them in namespace_packages, and don't list them in the regular packages. This means that find_packages should only find traditional packages, not namespace packages.

    The explicit listing of namespace_packages needs to happen for backward compatibility anyway, and setuptools already installs those without __init__.py on older Pythons.

  21. pje

    I forgot to mention/may not have made this clear: currently, namespace packages must be explicitly listed in the namespace_packages argument to setup(), which means there is no real reason from an end-user perspective to have find_packages() detect them. That's why it would be simplest to change setuptools.dist:check_nsp to automatically add the namespace packages to the list of packages, instead of giving an error -- at that point, nobody should even notice that find_packages() isn't finding the packages in question. (Though the docs should still be updated to clarify that any __init__-less packages must still be manually listed in namespace_packages.)
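A sketch of the check_nsp change pje describes. The (dist, attr, value) signature follows the setup-keyword validator convention, but the body here is purely illustrative of "add instead of error", not the actual setuptools.dist code:

```python
def check_nsp(dist, attr, value):
    """Instead of raising an error when a declared namespace package is
    missing from dist.packages, silently add it (sketch only; the real
    validator in setuptools.dist performs other checks as well)."""
    packages = list(dist.packages or [])
    for nsp in value:
        if nsp not in packages:
            packages.append(nsp)
    dist.packages = packages

# Minimal stand-in for a Distribution object:
class FakeDist:
    packages = ["myns.mypkg"]

dist = FakeDist()
check_nsp(dist, "namespace_packages", ["myns"])
print(dist.packages)   # -> ['myns.mypkg', 'myns']
```

With this in place, find_packages() could continue to find only traditional packages, and the namespace packages would flow in from the explicit namespace_packages listing.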

  22. Jason R. Coombs

    pje excellent suggestion to use namespace_packages directly.

    I do think, however, that PEP 420 doesn't just apply to namespace packages but to any package, in which case the use of namespace_packages only serves as another improvement to the heuristic.

  23. pje

    A PEP 420 package is a namespace package - that's the whole concept behind PEP 420.

    One thing that was discussed in relation to PEP 420 and its predecessors on python-dev and import-sig, is that there is no a priori way to tell that an implicit namespace package is a namespace package from the file system alone. If you have a directory tree ./foo/bar/baz/spam.py, where ./foo and ./foo/bar don't have any importable files, then it's possible that foo.bar.baz and its parents are all namespace packages. So you'd have to scan all possible subdirectories to unlimited depth (while detecting symlink loops) to correctly "detect" all possible namespace packages... and then you can have false positives.

    In the presence of this ambiguity, the tools should refuse the temptation to guess.

    (Also, if you do make find_packages() guess, you'll have a counter-problem, where Distribution will have to find out on its own that those packages are namespace packages -- my guess is that they won't be handled properly in all installation cases if you don't. In other words, guessing will be required on both sides of the fence!)

  24. Graham Wideman reporter

    "that there is no a priori way to tell that an implicit namespace package is a namespace package from the file system alone"

    That's the conclusion I've come to as well.

    I know a lot of smart people put a lot of thought into this. I've read some of the reams of discussion on import-sig (notably including Nick Coghlan's https://mail.python.org/pipermail/import-sig/2012-March/000423.html and subsequent).

    But I am puzzled how the package concept is viable if it lacks a way for packages to specify (explicitly or implicitly) their own structure unambiguously. I didn't reach any explanation of how everything that interacts with packages (programmers, IDEs, tools) is intended to work with packages that lack a way to unambiguously specify their structure. Is there one?

  25. pje

    Heh. If you think this is hard in Python, you should check out JavaScript's module systems, or lack thereof. ;-)

    But no, essentially, there is no way to be unambiguous without adding __init__.py, and it's still recommended that you do, unless you're using namespace packages.

    And at the moment, setup.py and distribution metadata are supposed to catalog the existence of same. So that is the only other way to know: use .egg-info or .dist-info. Any packages that don't use __init__.py and aren't declared in distribution info can't be picked up unambiguously.

  26. Graham Wideman reporter

    Could someone clarify whether there was explicit thinking/spec regarding whether PEP 420 package rules apply within a package that has an __init__.py file? From Wyatt's earlier comment and blog post, I believe he thinks that PEP 420 applies even within an __init__.py-marked directory tree, at least as far as import's behavior goes. He finds that surprising (as do I), and proposed heuristics that disable that behavior. Was there some deliberation on that we can read about? I'd rather be persuaded that it's a good thing than whinge about it in ignorance :-).

  27. pje

    It was once briefly discussed in relation to PEP 402 (420's predecessor) that the net effect of having a namespace package inside a non-namespace package was to make the subpackage a non-namespace package, since its path is limited to one item (the subdirectory of the parent). But neither PEP specs this explicitly; it's just implied. There didn't seem to be any reason to restrict this.
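That net effect is observable directly on 3.3+ with a throwaway layout (names invented for the example):

```python
import importlib
import os
import sys
import tempfile

tmp = tempfile.mkdtemp()
os.makedirs(os.path.join(tmp, "pkg", "sub"))
open(os.path.join(tmp, "pkg", "__init__.py"), "w").close()
open(os.path.join(tmp, "pkg", "sub", "mod.py"), "w").close()

sys.path.insert(0, tmp)

# 'pkg' is a regular package; 'pkg.sub' has no __init__.py, yet it
# imports as a namespace subpackage whose search path is pinned to the
# single subdirectory of the parent:
mod = importlib.import_module("pkg.sub.mod")
print(list(sys.modules["pkg.sub"].__path__))
```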

    Another issue that's come up recently on distutils-sig is that nspkg.pth support is broken for PEP 420 currently -- the magic really shouldn't be running if PEP 420 support is available.

    I'm beginning to think that an update to the PEP is advisable, to both specify how to detect support, and spelling out package detection heuristics in a better way. At any rate, the discussion needs to expand beyond the ticket tracker, probably to import-sig and distutils-sig.

  28. Graham Wideman reporter

    pje

    the net effect of having a namespace package inside a non-namespace package was to make the subpackage a non-namespace package

    OK, but then there seems to be no way to specify, either to Python, or to package-user programmers, that a subdir is not intended to be a package.

    "update to the PEP is advisable" -- I have been leaning that way for some time, but as a non-luminary I don't carry much weight :-).

    If this comes up for more deliberation, I do advocate that the overall package behavior be not just deterministic and implementable, but also comprehensible and explicable as a whole. That is, __init__.py and non-__init__.py packages should be described as a coherent whole, and not as two distinct things whose interactions and side effects are left to the imagination.

    Nick Coghlan proposed an "explanation requirement" with respect to directly executing a module from a package. I think it applies to the whole package concept. https://mail.python.org/pipermail/import-sig/2012-March/000423.html
