dependencies could be streamlined

Issue #52 new
Thomas Gilgenast created an issue

currently lib5c lists relatively few optional dependencies and a long list of required dependencies

many of the required dependencies are used very rarely

lib5c is increasingly depended on by other libraries/code, most of which use just a few utility or plotting functions that don’t touch the majority of lib5c’s dependencies

in light of this, it seems to make sense to streamline lib5c’s dependencies

there are two major options:

  1. make more of lib5c's dependencies optional, moving them from install_requires to extras_require. we can always install the full package with pip install lib5c[full]. this would require adding some code in the modules that import rarely-used dependencies to catch ImportError and instruct the user to install the package (via the same mechanism as the current bsub_avail).
  2. break lib5c into subpackages, leaving the dependency-light code in lib5c-core

(1) feels simpler so it might be preferred
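
a minimal sketch of what option (1) could look like inside a module. the helper names `optional_import` and `require_extra` are hypothetical, and this is only a guess at the pattern; it is not a copy of how `bsub_avail` is implemented today.

```python
import importlib


def optional_import(module_name):
    """Try to import a module; return (module, available)."""
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        return None, False


def require_extra(available, module_name, extra):
    """Raise an informative ImportError if an optional dependency is missing."""
    if not available:
        raise ImportError(
            "%s is required for this feature; install it with "
            "`pip install lib5c[%s]`" % (module_name, extra)
        )


# module-level flag, set once at import time
matplotlib, matplotlib_avail = optional_import('matplotlib')


def plot_something():
    # hypothetical plotting function gated on the optional dependency
    require_extra(matplotlib_avail, 'matplotlib', 'plotting')
    return matplotlib.__version__
```

importing the module never fails; the error only surfaces when a function that actually needs the optional package is called.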

Comments (7)

  1. Thomas Gilgenast reporter

    one possible issue could be that it may be difficult to decide what dependencies should be considered "core"

    one nice thing is that by always making dependencies optional whenever possible, we can keep the experience of users who directly install lib5c relatively similar while greatly streamlining things for client libraries. client libraries will be able to list explicit dependencies on the optional lib5c dependencies they require. for example, a client library that needs to plot gene tracks can list cruzdb and pymysql in its own install_requires, guaranteeing that lib5c's gene plotting functionality will be available. meanwhile, users who directly install lib5c can either remember to pip install lib5c[full] or pip install lib5c[windows] to recapitulate their current experience. even if they forget the extras_require groups that they want, these users will be prompted to install missing dependencies as they access features of lib5c that depend on those dependencies.
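
for illustration, a client library's setup.py could look roughly like this. the package name `my-gene-plots` is made up, and no version pins are given for cruzdb or pymysql because none have been decided here.

```python
# setup.py fragment for a hypothetical client library that plots gene tracks
setup_kwargs = dict(
    name='my-gene-plots',  # hypothetical package name
    install_requires=[
        'lib5c',     # pulls in only lib5c's slimmed-down core dependencies
        'cruzdb',    # optional for lib5c, required by this client
        'pymysql',   # optional for lib5c, required by this client
    ],
)

# in a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**setup_kwargs)
```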

  2. Thomas Gilgenast reporter

    as a first guess for what we could cut, here's a list of the current lib5c dependencies

    install_requires=[
        'python-daemon>=2.1.1,<2.2.0',  # pipeline
        'numpy>=1.10.4',                # core
        'scipy>=0.16.1',                # core
        'matplotlib>=1.4.3',            # plotting
        'pandas>=0.18.0',               # dataset, plotting (distance dependence, PCA, bias heatmaps)
        'seaborn>=0.8.0',               # plotting
        'statsmodels>=0.6.1',           # qvalues, lowess (expected models)
        'dill>=0.2.5',                  # parallelization
        'decorator>=4.0.10',            # plotting and parallelization
        'luigi>=2.1.1',                 # pipeline
        'scikit-learn>=0.17.1',         # PCA, thresholding (confusion matrix)
        'interlap>=0.2.3',              # convergency
        'powerlaw>=1.4.3',              # powerlaw expected models
    ],
    

    the use of sklearn during thresholding is very weak - we could easily lift sklearn.metrics.confusion_matrix()
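
a rough idea of how small the lifted function could be. this is a pure-Python sketch written from scratch, not sklearn's code; it follows the same convention (rows are true labels, columns are predicted labels, labels in sorted order) and returns nested lists, which lib5c would presumably wrap in a numpy array.

```python
def confusion_matrix(y_true, y_pred, labels=None):
    """Count (true, predicted) label pairs.

    Entry [i][j] is the number of samples whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        # pairs involving a label outside `labels` are ignored
        if t in index and p in index:
            matrix[index[t]][index[p]] += 1
    return matrix
```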

    it is somewhat unfortunate that we use pandas for certain plotting functions but probably not worth trying to avoid

    dataset, powerlaw, and parallelization are basically no longer recommended (i.e., client libraries won't want to touch them) so it doesn't make sense to promote them

    convergency, PCA, and pipeline are niche enough that it doesn't make sense to promote them either

    statsmodels is perhaps the most debatable dependency since qvalues and lowess are both quite common

    despite being an incredibly useful package, interlap seems to only be used directly by the convergency code, so it doesn't make sense to promote

    we probably could quite easily get to

    install_requires=[
        'numpy>=1.10.4',
        'scipy>=0.16.1',
        'statsmodels>=0.6.1',
    ],
    

    and

    extras_require = {
        'bsub': ['bsub>=0.3.5'],
        'iced': ['iced>=0.4.0'],
        'pyBigWig': ['pyBigWig>=0.3.4'],
        'plotting': ['matplotlib>=1.4.3', 'seaborn>=0.8.0', 'decorator>=4.0.10', 'pandas>=0.18.0'],
        'pipeline': ['luigi>=2.1.1', 'python-daemon>=2.1.1,<2.2.0'],
        'parallel': ['decorator>=4.0.10', 'dill>=0.2.5'],
        'powerlaw': ['powerlaw>=1.4.3'],
        'interlap': ['interlap>=0.2.3'],
        'dataset': ['pandas>=0.18.0'],
        'pca': ['scikit-learn>=0.17.1'],
        'test': ['nose>=1.3.7', 'nose-exclude>=0.5.0', 'flake8>=3.4.1'],
        'docs': ['Sphinx>=1.7.2', 'sphinx-rtd-theme>=0.3.0', 'nbsphinx>=0.3.5', 'ipykernel>=4.10.0'],
    }
    extras_require['complete'] = sorted(set(sum(extras_require.values(), [])))
    extras_require['windows'] = sorted(set(sum((v for k, v in extras_require.items() if k not in {'bsub', 'iced', 'pyBigWig', 'complete'}), [])))
    

    with the expectation that we'll need to add a 'genes' group for cruzdb and pymysql

    each extras group should get an XXX_avail module-level boolean variable located in a place that makes sense for it. we limit our liability to checking only entire groups of dependencies. the XXX_avail checks should perhaps enforce version levels (they currently just watch for ImportError) via the pkg_resources API (https://stackoverflow.com/a/16298328), ideally looking up the exact package and version spec declared in setup.py's extras_require (might be tricky, more comments coming below)
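
a sketch of a version-aware group check. note the swap: instead of the pkg_resources API mentioned above, this uses the stdlib `importlib.metadata` (Python 3.8+) plus a deliberately simplified comparison that only understands a `>=` lower bound and leading numeric version components. upper bounds such as `<2.2.0` are ignored. all function names are hypothetical.

```python
import re
from importlib import metadata  # stdlib in Python 3.8+


def version_tuple(version):
    """Leading numeric components only, e.g. '1.4.3rc1' -> (1, 4, 3)."""
    match = re.match(r'\d+(?:\.\d+)*', version)
    return tuple(int(x) for x in match.group().split('.')) if match else ()


def parse_min_spec(spec):
    """Split 'name>=x.y.z' into (name, (x, y, z)); a bare name gives (name, ())."""
    name, _, rest = spec.partition('>=')
    return name.strip(), version_tuple(rest.split(',')[0].strip())


def at_least(installed, minimum):
    """Compare version tuples after zero-padding them to equal length."""
    n = max(len(installed), len(minimum))
    pad = lambda t: t + (0,) * (n - len(t))
    return pad(installed) >= pad(minimum)


def group_avail(specs):
    """True if every spec in the group is installed at a new enough version."""
    for spec in specs:
        name, minimum = parse_min_spec(spec)
        try:
            installed = version_tuple(metadata.version(name))
        except metadata.PackageNotFoundError:
            return False
        if not at_least(installed, minimum):
            return False
    return True


# e.g. plotting_avail = group_avail(['matplotlib>=1.4.3', 'seaborn>=0.8.0'])
```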

  3. Thomas Gilgenast reporter

    we probably can't import setup.py's extras_require from the package code, and we probably can't move extras_require into the package and still be able to import it from setup.py

    it's possible to do

    import pkg_resources
    
    directory = pkg_resources.working_set.find(pkg_resources.Requirement.parse('lib5c'))._provider.egg_info
    

    directory is then a string path to a directory that contains either requires.txt or metadata.json (depending on whether or not it was a dev mode install)

    either file can be parsed (though their formats are different) to reconstruct extras_require
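
for the requires.txt case, a parser could be as simple as the sketch below. it assumes the usual egg-info layout: unconditional requirements first, then one `[group]` header per extras group. the sample text is invented for illustration, and headers that carry environment markers (e.g. `[group:marker]`) are kept verbatim as keys rather than interpreted. metadata.json would need a separate parser.

```python
def parse_requires_txt(text):
    """Rebuild (install_requires, extras_require) from requires.txt contents."""
    install_requires = []
    extras_require = {}
    current = install_requires  # lines before the first header are core deps
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('[') and line.endswith(']'):
            # a new extras group starts here
            current = extras_require.setdefault(line[1:-1], [])
        else:
            current.append(line)
    return install_requires, extras_require


# invented sample, mirroring the proposed groups
sample = """numpy>=1.10.4
scipy>=0.16.1

[pca]
scikit-learn>=0.17.1

[pipeline]
luigi>=2.1.1
python-daemon>=2.1.1,<2.2.0
"""
core, extras = parse_requires_txt(sample)
```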

  4. Thomas Gilgenast reporter

    the main advantage of going through the work of finding and parsing this file would be to enable checking features at the group level without having to maintain two separate lists of the packages required by each group. an extra advantage is that we will catch and correctly warn on outdated optional package versions.

  5. Thomas Gilgenast reporter

    while the discussion on this issue seems to favor option 1, serious discussion is currently under way on “balkanizing” lib5c into separate packages (option 2)
