dependencies could be streamlined

Issue #52 new

Thomas Gilgenast created an issue 2019-05-30

currently lib5c lists relatively few optional dependencies and a long list of required dependencies

many of the required dependencies are used very rarely

lib5c is increasingly depended on by other libraries/code, most of which use just a few utility or plotting functions that don’t touch the majority of lib5c’s dependencies

in light of this, it seems to make sense to streamline lib5c’s dependencies

there are two major options:

1. make more of lib5c's dependencies optional, moving them from install_requires to extras_require. we can always install the full package with pip install lib5c[full]. this would require adding some code in the modules that import rarely-used dependencies to catch ImportError and instruct the user to install the package (via the same mechanism as the current bsub_avail).
2. break lib5c into subpackages, leaving the dependency-light code in lib5c-core

(1) feels simpler so it might be preferred
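
a minimal sketch of what the ImportError handling in option 1 could look like. the helper names (module_available, require_extra) and the 'plotting' group name are hypothetical and only illustrate the idea; this is not lib5c's existing bsub_avail code.

```python
import importlib


def module_available(module_name):
    """Return True if module_name can be imported, False otherwise."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# hypothetical module-level flag, in the spirit of the existing bsub_avail
seaborn_avail = module_available('seaborn')


def require_extra(available, extra_name):
    """Raise a helpful ImportError if an optional dependency group is missing."""
    if not available:
        raise ImportError(
            'this feature needs optional dependencies; install them with '
            '`pip install lib5c[%s]`' % extra_name
        )


def plot_something(data):
    # fail early with install instructions instead of a bare ImportError
    require_extra(seaborn_avail, 'plotting')
    import seaborn as sns
    return sns.heatmap(data)
```

the import of the optional package happens inside the function, so simply importing the module never fails for users who don't have it installed.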

Comments (7)

Thomas Gilgenast reporter
combined with #46 this could dramatically reduce the install overhead of the package
- 2019-05-30T20:02:42+00:00
Thomas Gilgenast reporter
one possible issue could be that it may be difficult to decide what dependencies should be considered "core"

one nice thing is that by always making dependencies optional whenever possible, we can keep the experience of users who directly install lib5c relatively similar while greatly streamlining things for client libraries. client libraries will be able to list explicit dependencies on the optional lib5c dependencies they require. for example, a client library that needs to plot gene tracks can list cruzdb and pymysql in its own install_requires, guaranteeing that lib5c's gene plotting functionality will be available. meanwhile, users who directly install lib5c can either remember to pip install lib5c[full] or pip install lib5c[windows] to recapitulate their current experience. even if they forget the extras_require groups that they want, these users will be prompted to install missing dependencies as they access features of lib5c that depend on those dependencies.
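
to make the client-library side concrete, here is a sketch of the relevant part of a client's setup.py. the package name "geneplots" and its version are made up for illustration; only the cruzdb and pymysql entries come from the example above.

```python
# setup.py of a hypothetical client library that plots gene tracks
SETUP_KWARGS = dict(
    name='geneplots',      # made-up name, for illustration only
    version='0.1.0',
    install_requires=[
        'lib5c',           # would pull in only lib5c's slimmed-down core dependencies
        'cruzdb',          # optional lib5c dependencies this client relies on,
        'pymysql',         # listed explicitly so gene plotting is guaranteed to work
    ],
)

# in a real setup.py this would be followed by:
#     from setuptools import setup
#     setup(**SETUP_KWARGS)
```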
- 2019-05-30T20:15:17+00:00

Thomas Gilgenast reporter

as a first guess for what we could cut, here's a list of the current lib5c dependencies

install_requires=[
    'python-daemon>=2.1.1,<2.2.0',  # pipeline
    'numpy>=1.10.4',                # core
    'scipy>=0.16.1',                # core
    'matplotlib>=1.4.3',            # plotting
    'pandas>=0.18.0',               # dataset, plotting (distance dependence, PCA, bias heatmaps)
    'seaborn>=0.8.0',               # plotting
    'statsmodels>=0.6.1',           # qvalues, lowess (expected models)
    'dill>=0.2.5',                  # parallelization
    'decorator>=4.0.10',            # plotting and parallelization
    'luigi>=2.1.1',                 # pipeline
    'scikit-learn>=0.17.1',         # PCA, thresholding (confusion matrix)
    'interlap>=0.2.3',              # convergency
    'powerlaw>=1.4.3',              # powerlaw expected models
],

the use of sklearn during thresholding is very weak - we could easily lift sklearn.metrics.confusion_matrix()
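
as a rough idea of how small a lifted replacement could be, here is a numpy-only sketch. it follows sklearn's convention (rows are true labels, columns are predicted labels) but it is not sklearn's implementation and skips options such as sample_weight and normalize.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, labels=None):
    """Count how often each true label (rows) was predicted as each label (columns)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if labels is None:
        # default to the sorted set of labels seen in either array
        labels = np.unique(np.concatenate([y_true, y_pred]))
    else:
        labels = np.asarray(labels)
    index = {label: i for i, label in enumerate(labels.tolist())}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for t, p in zip(y_true.tolist(), y_pred.tolist()):
        # pairs involving a label outside `labels` are ignored
        if t in index and p in index:
            matrix[index[t], index[p]] += 1
    return matrix
```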

it is somewhat unfortunate that we use pandas for certain plotting functions but probably not worth trying to avoid

dataset, powerlaw, and parallelization are basically no longer recommended (i.e., client libraries won't want to touch them) so it doesn't make sense to promote them

convergency, PCA, and pipeline are niche enough that it doesn't make sense to promote them either

statsmodels is perhaps the most debatable dependency since qvalues and lowess are both quite common

despite being an incredibly useful package, interlap seems to only be used directly by the convergency code, so it doesn't make sense to promote

we probably could quite easily get to

install_requires=[
    'numpy>=1.10.4',
    'scipy>=0.16.1',
    'statsmodels>=0.6.1',
],

and

extras_require = {
    'bsub': ['bsub>=0.3.5'],
    'iced': ['iced>=0.4.0'],
    'pyBigWig': ['pyBigWig>=0.3.4'],
    'plotting': ['matplotlib>=1.4.3', 'seaborn>=0.8.0', 'decorator>=4.0.10', 'pandas>=0.18.0'],
    'pipeline': ['luigi>=2.1.1', 'python-daemon>=2.1.1,<2.2.0'],
    'parallel': ['decorator>=4.0.10', 'dill>=0.2.5'],
    'powerlaw': ['powerlaw>=1.4.3'],
    'interlap': ['interlap>=0.2.3'],
    'dataset': ['pandas>=0.18.0'],
    'pca': ['scikit-learn>=0.17.1'],
    'test': ['nose>=1.3.7', 'nose-exclude>=0.5.0', 'flake8>=3.4.1'],
    'docs': ['Sphinx>=1.7.2', 'sphinx-rtd-theme>=0.3.0', 'nbsphinx>=0.3.5', 'ipykernel>=4.10.0'],
}
# build 'windows' before 'complete' so that it is computed from the base groups only
extras_require['windows'] = sorted(set(sum((v for k, v in extras_require.items() if k not in {'bsub', 'iced', 'pyBigWig'}), [])))
extras_require['complete'] = sorted(set(sum(extras_require.values(), [])))

with the expectation that we'll need to add a 'genes' group for cruzdb and pymysql

each extras group should get an XXX_avail module-level boolean variable located in a place that makes sense for it. we limit our liability to checking only entire groups of dependencies. the XXX_avail checks should perhaps enforce version levels (they currently just watch for ImportError) via the pkg_resources API (https://stackoverflow.com/a/16298328), looking up the exact package and version spec declared in setup.py's extras_require if possible (might be tricky, more comments coming below)

- 2019-05-30T21:37:27+00:00

Thomas Gilgenast reporter
we probably can't import setup.py's extras_require from the package code, and we probably can't move extras_require into the package and still be able to import it from setup.py

it's possible to do
```
import pkg_resources

directory = pkg_resources.working_set.find(pkg_resources.Requirement.parse('lib5c'))._provider.egg_info
```
directory is then a string reference to a path that contains either requires.txt or metadata.json (depending on whether or not it was a dev mode install)

either file can be parsed (though their formats are different) to reconstruct extras_require
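
for the requires.txt case, a parser could be sketched roughly as below. it assumes the usual egg-info layout (unconditional requirements first, then one [group] section per extra); the function name is hypothetical and the metadata.json case is not covered.

```python
def parse_requires_txt(text):
    """Split the contents of an egg-info requires.txt into
    (install_requires, extras_require)."""
    install_requires = []
    extras_require = {}
    current = install_requires  # lines before the first [section] are core requirements
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('[') and line.endswith(']'):
            # section header; may also carry an environment marker such as
            # [group:marker], which is kept as part of the key here
            current = extras_require.setdefault(line[1:-1], [])
        else:
            current.append(line)
    return install_requires, extras_require


example = """
numpy>=1.10.4
scipy>=0.16.1

[pca]
scikit-learn>=0.17.1

[pipeline]
luigi>=2.1.1
python-daemon<2.2.0,>=2.1.1
"""
core, extras = parse_requires_txt(example)
```

the resulting extras dict could then feed the group-level availability checks without a second hand-maintained list of packages.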
- 2019-05-30T21:55:19+00:00
Thomas Gilgenast reporter
the main advantage of going through the work of finding and parsing this file would be to enable checking features at the group level without having to maintain two separate lists of the packages required by each group. an extra advantage is that we will catch and correctly warn on outdated optional package versions.
- 2019-05-30T22:01:26+00:00
Thomas Gilgenast reporter
- changed milestone to 0.6.0
- 2019-06-21T19:08:18+00:00
Thomas Gilgenast reporter
while the discussion on this issue seems to favor option 1, serious discussion is currently under way on “balkanizing” lib5c into separate packages (option 2)
- 2020-02-07T19:46:46+00:00

Assignee: –

Type: proposal

Priority: major

Status: new

Milestone: 0.6.0

Votes: 0

Watchers: 1