dependencies could be streamlined
currently lib5c lists relatively few optional dependencies and a long list of required dependencies
many of the required dependencies are used very rarely
lib5c is increasingly depended on by other libraries/code, most of which use just a few utility or plotting functions that don’t touch the majority of lib5c’s dependencies
in light of this, it seems to make sense to streamline lib5c’s dependencies
there are two major options:
- make more of lib5c's dependencies optional, moving them from `install_requires` to `extras_require`. we can always install the full package with `pip install lib5c[full]`. this would require adding some code in the modules that import rarely-used dependencies to catch ImportError and instruct the user to install the package (via the same mechanism as the current `bsub_avail`).
- break lib5c into subpackages, leaving the dependency-light code in `lib5c-core`
the first option feels simpler, so it might be preferred
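a minimal sketch of what the ImportError-catching code for the first option could look like. it is only loosely modeled on the `bsub_avail` idea; the helper names `optional_import` and `requires_extra` and the wrapped function `fit_powerlaw` are hypothetical and do not exist in lib5c:

```python
import functools
import importlib


def optional_import(module_name):
    """Try to import a module; return (module, True) or (None, False)."""
    try:
        return importlib.import_module(module_name), True
    except ImportError:
        return None, False


def requires_extra(avail, extra):
    """Decorator that raises a helpful ImportError when an extras group is missing."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not avail:
                raise ImportError(
                    "this feature needs optional dependencies; "
                    "try: pip install lib5c[%s]" % extra)
            return func(*args, **kwargs)
        return wrapper
    return decorator


# module-level flag in the style of bsub_avail (name is an assumption)
powerlaw, powerlaw_avail = optional_import('powerlaw')


@requires_extra(powerlaw_avail, 'powerlaw')
def fit_powerlaw(values):
    """Hypothetical feature that only works when the 'powerlaw' extra is installed."""
    return powerlaw.Fit(values)
```

with this shape, importing the module never fails; the user only sees the install hint when they call a function whose extras group is missing.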
Comments (7)
-
reporter one possible issue could be that it may be difficult to decide what dependencies should be considered "core"
one nice thing is that by always making dependencies optional whenever possible, we can keep the experience of users who directly install lib5c relatively similar while greatly streamlining things for client libraries. client libraries will be able to list explicit dependencies on the optional lib5c dependencies they require. for example, a client library that needs to plot gene tracks can list cruzdb and pymysql in its own `install_requires`, guaranteeing that lib5c's gene plotting functionality will be available. meanwhile, users who directly install lib5c can either remember to `pip install lib5c[full]` or `pip install lib5c[windows]` to recapitulate their current experience. even if they forget the `extras_require` groups that they want, these users will be prompted to install missing dependencies as they access features of lib5c that depend on those dependencies.
-
reporter as a first guess for what we could cut, here's a list of the current lib5c dependencies
    install_requires=[
        'python-daemon>=2.1.1,<2.2.0',  # pipeline
        'numpy>=1.10.4',  # core
        'scipy>=0.16.1',  # core
        'matplotlib>=1.4.3',  # plotting
        'pandas>=0.18.0',  # dataset, plotting (distance dependence, PCA, bias heatmaps)
        'seaborn>=0.8.0',  # plotting
        'statsmodels>=0.6.1',  # qvalues, lowess (expected models)
        'dill>=0.2.5',  # parallelization
        'decorator>=4.0.10',  # plotting and parallelization
        'luigi>=2.1.1',  # pipeline
        'scikit-learn>=0.17.1',  # PCA, thresholding (confusion matrix)
        'interlap>=0.2.3',  # convergency
        'powerlaw>=1.4.3',  # powerlaw expected models
    ],
the use of sklearn during thresholding is very weak - we could easily lift `sklearn.metrics.confusion_matrix()`
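to give a sense of how small the lifted function could be, here is a dependency-free sketch of a confusion matrix. it follows the same convention as sklearn (rows are true labels, columns are predicted labels, labels sorted by default) but it is a rewrite for illustration, not sklearn's actual code, and it returns nested lists rather than an array:

```python
def confusion_matrix(y_true, y_pred, labels=None):
    """Count (true, predicted) label pairs; rows are true labels, columns predicted."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[index[true_label]][index[pred_label]] += 1
    return matrix


# two classes: one sample of class 1 is misclassified as class 0
counts = confusion_matrix([0, 1, 1, 0], [0, 1, 0, 0])
```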
it is somewhat unfortunate that we use pandas for certain plotting functions but probably not worth trying to avoid
dataset, powerlaw, and parallelization are basically no longer recommended (i.e., client libraries won't want to touch them) so it doesn't make sense to promote them
convergency, PCA, and pipeline are niche enough that it doesn't make sense to promote them either
statsmodels is perhaps the most debatable dependency since qvalues and lowess are both quite common
despite being an incredibly useful package, interlap seems to only be used directly by the convergency code, so it doesn't make sense to promote it
we probably could quite easily get to
    install_requires=[
        'numpy>=1.10.4',
        'scipy>=0.16.1',
        'statsmodels>=0.6.1',
    ],
and
    extras_require = {
        'bsub': ['bsub>=0.3.5'],
        'iced': ['iced>=0.4.0'],
        'pyBigWig': ['pyBigWig>=0.3.4'],
        'plotting': ['matplotlib>=1.4.3', 'seaborn>=0.8.0', 'decorator>=4.0.10', 'pandas>=0.18.0'],
        'pipeline': ['luigi>=2.1.1', 'python-daemon>=2.1.1,<2.2.0'],
        'parallel': ['decorator>=4.0.10', 'dill>=0.2.5'],
        'powerlaw': ['powerlaw>=1.4.3'],
        'interlap': ['interlap>=0.2.3'],
        'dataset': ['pandas>=0.18.0'],
        'pca': ['scikit-learn>=0.17.1'],
        'test': ['nose>=1.3.7', 'nose-exclude>=0.5.0', 'flake8>=3.4.1'],
        'docs': ['Sphinx>=1.7.2', 'sphinx-rtd-theme>=0.3.0', 'nbsphinx>=0.3.5', 'ipykernel>=4.10.0'],
    }
    # build 'windows' first: if 'complete' already existed, its contents
    # (including bsub, iced, pyBigWig) would be folded back into 'windows'.
    # sum() needs [] as its start value to concatenate lists.
    extras_require['windows'] = sorted(set(sum(
        (v for k, v in extras_require.items()
         if k not in {'bsub', 'iced', 'pyBigWig'}), [])))
    extras_require['complete'] = sorted(set(sum(extras_require.values(), [])))
with the expectation that we'll need to add a 'genes' group for cruzdb and pymysql
each extras group should get a `XXX_avail` module-level boolean variable located in a place that makes sense for it. we limit our liability to checking only entire groups of dependencies. the `XXX_avail` checks should perhaps enforce version levels (they currently just watch for ImportError instead) via the `pkg_resources` API (https://stackoverflow.com/a/16298328), looking up the exact package and version spec declared in `setup.py`'s `extras_require` if possible (might be tricky, more comments coming below)
-
reporter we probably can't import `setup.py`'s `extras_require` from the package code, and we probably can't move `extras_require` into the package and still be able to import it from `setup.py`
it's possible to do

    import pkg_resources
    directory = pkg_resources.working_set.find(pkg_resources.Requirement.parse('lib5c'))._provider.egg_info

`directory` is then a string reference to a path that contains either `requires.txt` or `metadata.json` (depending on whether or not it was a dev mode install). either file can be parsed (though their formats are different) to reconstruct `extras_require`
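as a rough sketch of the parsing step for the `requires.txt` case: the code below assumes the usual egg-info layout, where base requirements come first and each extras group follows under a `[group]` header (possibly `[group:marker]`). the function name `parse_requires_txt` and the `SAMPLE` text are made up for illustration; the `metadata.json` case would need a different parser.

```python
def parse_requires_txt(text):
    """Rebuild (install_requires, extras_require) from requires.txt-style text."""
    install_requires = []
    extras_require = {}
    current = install_requires
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith('#'):
            continue
        if line.startswith('[') and line.endswith(']'):
            # headers look like "[extra]" or "[extra:marker]"; drop the marker
            extra = line[1:-1].split(':', 1)[0]
            # an empty extra name means a conditional base requirement
            current = (extras_require.setdefault(extra, [])
                       if extra else install_requires)
        else:
            current.append(line)
    return install_requires, extras_require


# illustrative content only, not the real lib5c metadata
SAMPLE = """\
numpy>=1.10.4
scipy>=0.16.1

[pca]
scikit-learn>=0.17.1

[plotting]
matplotlib>=1.4.3
seaborn>=0.8.0
"""

base, extras = parse_requires_txt(SAMPLE)
```

the resulting `extras` dict could then drive the group-level availability checks without keeping a second copy of the package lists in the library code.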
-
reporter the main advantage of going through the work of finding and parsing this file would be to enable checking features at the group level without having to maintain two separate lists of the packages required by each group. an extra advantage is that we will catch and correctly warn on outdated optional package versions.
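a sketch of what a group-level check with version enforcement might look like. note this swaps in the stdlib `importlib.metadata` instead of the `pkg_resources` API discussed above, only understands `>=` lower bounds, and compares plain numeric release tuples, so it is a simplification rather than a full requirement parser. all function names here are hypothetical:

```python
import re
from importlib import metadata


def version_tuple(version):
    """Turn the leading numeric part of a version string into a tuple of ints."""
    match = re.match(r'\d+(?:\.\d+)*', version.strip())
    return tuple(int(p) for p in match.group(0).split('.')) if match else ()


def parse_requirement(spec):
    """Split 'name>=1.2.3[,<...]' into (name, minimum version tuple)."""
    name, _, rest = spec.partition('>=')
    return name.strip(), version_tuple(rest.split(',')[0])


def at_least(installed, minimum):
    """Compare version tuples after padding the shorter one with zeros."""
    width = max(len(installed), len(minimum))
    pad = lambda v: v + (0,) * (width - len(v))
    return pad(installed) >= pad(minimum)


def group_available(specs):
    """True only if every package in the group is installed and new enough."""
    for spec in specs:
        name, minimum = parse_requirement(spec)
        try:
            installed = version_tuple(metadata.version(name))
        except metadata.PackageNotFoundError:
            return False
        if not at_least(installed, minimum):
            return False
    return True


# e.g. a module-level flag for the 'pipeline' group (value depends on the environment)
pipeline_avail = group_available(['luigi>=2.1.1', 'python-daemon>=2.1.1,<2.2.0'])
```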
-
reporter - changed milestone to 0.6.0
-
reporter while the discussion on this issue seems to favor option 1, serious discussion is currently underway on “balkanizing” lib5c into separate packages (option 2)
-
combined with #46 this could dramatically reduce the install overhead of the package