Is there a way of storing dependencies (in my case python packages) between builds à la travis? (https://docs.travis-ci.com/user/caching/). Or if not is there any other similar way to speed up the builds?
We also had a similar problem with Maven and NPM dependencies, which made our builds significantly longer and polluted the logs. Using a custom Docker image with a prepopulated cache helped, but a large percentage of the builds are still too long, as the Docker image is not always cached.
@Warren O'Neill Yes. Works for us with Maven and NPM; I don't see why it shouldn't work for Python packages or other dependencies. We ended up creating a Dockerfile that clones our repo, builds it once to populate the dependency cache directories, and removes the code.
Note that Pipelines does not always cache Docker images and download takes time, so it's useful to make sure the image is as small as possible. Particularly, we had to ensure cloning and removing the code is in the same Dockerfile RUN step.
You can also take a look at Alpine Linux docker image which is only 5 MB and see if that works for you. I used it to build my Java project and so far so good. The Oracle JDK + Ant image is roughly 160 MB.
If the image is not in the cache, then the pipeline needs to download it over the network, and every byte counts :)
Of course, if you have a long-running build process, the time to download an image may be negligible.
It might also be interesting to see whether the number of layers in the image affects download time, and by how much.
I agree, making the image smaller is always a good idea, but I'd rather have it contain the necessary dependencies already rather than having to download a gazillion packages during each build that don't get cached anyhow.
I agree. All I want to say is that you can take a 5 MB base image + dependencies, or you can have a 150 MB base image + dependencies.
It also depends on your particular use case and what dependencies we are talking about.
In my case I'm building a Java application, and I only need the smallest possible base image capable of running the JDK and Ant. We don't use Maven to manage dependencies here, so I don't have to add them to the image.
Also, maybe Atlassian could provide some common images that could be reused by many projects and would always be in the cache. Then I could create my images on top of them, and only the new layers would need to be downloaded.
RUN mkdir .ssh && \
    ID=.ssh/id_rsa && \
    echo "$KEY" > $ID && \
    ssh-keygen -y -f $ID > $ID.pub && \
    ssh-keyscan -t dsa -H bitbucket.org > .ssh/known_hosts && \
    REPO=<repo> && \
    git clone email@example.com:<account>/$REPO.git && \
    <command to run a typical build with as many optional profiles as possible> && \
    rm -rvf $REPO .ssh <any other directories created by the build that are not needed in the image> && \
    echo "Cache size:\n$(du -hs .npm)"
KEY is provided as a build argument (e.g. docker build --build-arg KEY="$(cat ~/.ssh/bitbucket)") and should have no password.
To force NPM to use the cache you will need --cache-min=Infinity. It is better to use it with the latest version of NPM to avoid bumping into #8581. Also, if you need Shrinkwrap you will bump into #3581 and might need a script to clean up npm-shrinkwrap.json like the one Angular uses. --loglevel=http is also useful to see if any dependencies are still downloaded because of bugs like this.
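For convenience, those flags can also be set in the project's .npmrc instead of being passed on every command line. A sketch with the same settings as above:

```ini
; .npmrc in the repo root: same effect as --cache-min=Infinity --loglevel=http
cache-min=Infinity
loglevel=http
```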
@Dave Van den Eynde @Igor Nikolaev
On the image size: we extend from the official Maven image, which is Debian-based and is indeed large. However, we do find the parent image layers to be in the cache most of the time, so I guess Atlassian does have some logic to cache popular images. It would, however, be helpful to cache more aggressively, as most of our builds still need to download our custom layers.
This is more or less a must-have for us. Installing all dependencies currently takes 3 minutes which means that it's a minimum of 3 minutes waiting time to even see if the linters passed. Not really feasible at the moment I'm afraid :(
1. Download the cache-repo and compare the cached package.json (for example; this can be configured) to the actual one
2. If, and only if, changes are detected: run yarn (for example; this can be configured) and save the node_modules to the cache
3. Commit the changes to the cache
4. Copy the cache to the build dir
As step 2 only runs in the event that something changed, builds sped up a lot.
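A minimal sketch of the approach, assuming the cache repo is already checked out locally (step 1's clone and step 3's commit are shown only as comments; the repo URL and function name are hypothetical):

```shell
# restore_or_refresh <cache_dir>: reuse cached node_modules unless package.json changed.
# Step 1 (clone) would happen before calling this:
#   git clone git@bitbucket.org:<team>/build-cache.git cache
restore_or_refresh() {
    cache_dir="$1"
    if cmp -s package.json "$cache_dir/package.json"; then
        # Step 4: no dependency changes, copy the cached modules into the build dir
        cp -r "$cache_dir/node_modules" .
    else
        # Step 2: changes detected, reinstall and refresh the cache
        yarn install
        rm -rf "$cache_dir/node_modules"
        cp -r node_modules "$cache_dir/"
        cp package.json "$cache_dir/package.json"
        # Step 3 (commit) would go here:
        #   (cd "$cache_dir" && git add -A && git commit -m "update cache" && git push)
    fi
}
```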
If you have any comments, please let me know. Today we support npm, yarn and bower, and everything works with the built-in tools in Bitbucket Pipelines.
Maybe some of you can use it and save some valuable build-time.
Thanks, @iammichiel. Need to try drone.io; looks cool, and < 1 min is nice as well. Must admit CircleCI build times are not very good: for a Scala Play 2.5 project with tests, around 8 min; for React with a few tests, 4 min.
Thanks for all the feedback. I just wanted to provide an update. Fast builds are important not just for CI but also to us. Dependency caching is on the roadmap and I hope to give you an update really soon when we've started work on it.
+1 for dependency caching!
For Node.js projects, since dependencies come over HTTP (npm), set up a caching proxy server (Squid cache works great) and point all containers to use that proxy for outbound communication! You should be able to enable caching for most scenarios in a heartbeat! It should support a lot of other platforms too :)
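As a sketch, pointing npm at such a proxy is only a couple of config entries in .npmrc (the host and port here are hypothetical; any Squid instance reachable from the build container would do):

```ini
; route npm's outbound traffic through the caching proxy
proxy=http://squid.internal:3128
https-proxy=http://squid.internal:3128
```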
My expectation for caching is possibly slightly different to other people's. I'm not sure that proxying is going to make enough difference to speed up our builds, since the HTTP connection time is also significant.
I would like to configure a folder that is preserved across builds. (I agree with Simon Petersen)
In my case I would like the maven repository folder preserved across builds.
A build would be able to add additional files to the preserved folder.
I'd like a sensible amount of free storage, however I would be open to paying for extra storage if I required it.
I would need to be able to clear the cache in one go via the UI.
My expectations are the same as Thomas Turrell-Croft.
I am working with scala/sbt/ivy and I would like my pipelines to keep dependencies in a cache folder across builds, like CircleCI or Travis do.
Thank you for all your good work!
This is a must-have feature; a lot of people have talked about a custom Docker image that pulls dependencies and caches them. As someone pointed out, images are not always cached, so that's not the most reliable way to achieve this.
Just wanted to give another update. We appreciate everyone's patience. We've been working on many features including #12790 and #12757 which are now available to Alpha customers. Dependency caching is at the top of our priorities and we're now in the planning stages for it. I'll post another update in a few weeks.
@JoshuaT Survey link doesn't work. Users need permissions.
The yml looks good to me.
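For anyone who hasn't seen it, the proposed bitbucket-pipelines.yml syntax was roughly along these lines (the custom cache name and path here are illustrative; the exact spec may differ):

```yaml
pipelines:
  default:
    - step:
        caches:
          - node            # predefined cache for node_modules
          - mymavenrepo     # custom cache defined below
        script:
          - npm install
definitions:
  caches:
    mymavenrepo: ~/.m2/repository
```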
How would I clear the cache? I can see myself wanting to clear the cache to remove stale dependencies, especially if there was some variable charge for caches. I would also want to clear the cache if a corrupted dependency got in it. If the cache is going to be downloaded at the start of the build then stale dependencies are going to slow the build down. I realise I could create my own script to clear the cache but a UI would be better.
Where is the cache being downloaded from? How is downloading the cache going to be faster than not having a cache? I'm just curious, because I suspect that most caches will contain lots of old dependency versions, which means there may come a point where it is faster not to use the cache. To use an extreme example, a 1 TB cache is going to take longer to download than getting a single 1 MB dependency from source.
What is the scope of the cache? repository, project, team?
Yep. You will be able to invalidate caches in the UI
Where is the cache being downloaded from?
We are still working out the details, but most likely S3. It should be faster just because of locality (everything comes from S3 rather than GitHub, registries, etc.) and AWS's links between EC2 and S3. We will definitely measure this once we do more technical investigation, and there will be room for improvement post-MVP.
contain lots of old dependency versions which will mean that there may come a point where it is faster not the use the cache
Good point. We considered adding a max-use or max-time lifespan for each cache to help with this problem, but considered it not a huge problem for the MVP if the cache can be invalidated in the UI.
What is the scope of the cache? repository, project, team?
Scope of the cache will be per repository. Can you elaborate your use case if you think there are benefits for project / team scoped caches?
+1 to the general outline. Thanks for working on this!
For things like npm and yarn, I want to flag that there are valuable caches outside of the repository/build folder. For example, npm has a cache used to calculate dependencies (which also includes source tarballs), so for many builds it is more valuable to persist the internal npm cache than to cache the node_modules folder. Would be great to have an easy way to support that.
(I know we could do something like the following, but that's complex to manage and hard to repeat consistently. Note that I'm using npm config set cache to move the cache folder into the accessible build directory so that it can be cached using the proposed syntax.)
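For reference, one way to do that workaround is a .npmrc entry that relocates the cache, plus a custom cache pointing at it (the folder and cache names here are arbitrary, and the caches syntax is the proposed one):

```yaml
# .npmrc in the repo root would contain:  cache=.npm-cache
definitions:
  caches:
    npmcache: .npm-cache
```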
Thanks for all the responses to the survey, really helps us build this feature to support your needs.
Update on spec
After some preliminary analysis, uploading/updating the cache on every successful build eats into the time that caching saves. This means more build minutes consumed and slower feedback (unless we can upload in a post-step, but that is a lot more work).
For example, a node repo we tested normally took ~35s to download dependencies. With caching, this was reduced to ~6s with an additional ~23s to update the cache.
We are now proposing:
a cache directory will only be uploaded if no cache exists yet
a cache will automatically be invalidated 14 days after it was uploaded
cache can be manually cleared from the UI
In the future, we could:
add an option to update cache on change
have a commit message trigger like [ci clear_cache] - for the commits where you update dependencies
What are your thoughts on this?
Good feedback @Benj Kamm! You will be able to cache directories outside of the build directory. When we implement a predefined node cache, we will make sure we cache ~/.npm as well.
I take your feedback about having multiple directories per cache. That's a good suggestion however we probably won't support that right away for the first version of caching.
The spec is a very good start for this caching quest. For your future spec, personally, I prefer the first option: I don't think a commit-message trigger would be great (think about the beginning of a project, when dependencies often change).
Only having a time based expiry on the cache directories seems somewhat clumsy, and will result in incrementally slower build times between any dependency change and the next time the cache is purged and rebuilt from scratch.
Would it be feasible to store a recursive hash of the nominated cache directories, and automatically re-upload them only if the recursive hash changed during the build? That should be enough to automatically add new versions of dependencies and completely new dependencies to the cache as they're added to the project (it may even be feasible to use git to manage the cache directories, rather than transferring tarballs around).
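A recursive hash like the one described can be computed with standard coreutils, e.g. (a sketch; the function name is hypothetical and node_modules is just an example directory):

```shell
# hash_tree <dir>: content hash of a directory tree. Hash every file, sort for a
# stable order, then hash the combined list; any change in the tree changes the result.
hash_tree() {
    find "$1" -type f -print0 | sort -z | xargs -0 sha256sum | sha256sum | awk '{print $1}'
}
```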
The time based cache invalidation would then still be useful for clearing out old, no longer used, dependencies from the cache, but it wouldn't be part of picking up new dependencies and new versions.
This does raise a different question though: will the caches be distinct per-branch? Or will all branches share a single cache? (a shared cache presumably makes the most sense, since it's just a cache, but it would still be good to document that explicitly)
@JoshuaT I suggest having an option to only update cache from master and/or keep branch level caches. This would give faster build performance of branches as well as avoid growing the cache with things that are not relevant for building the master.
In my current workflow, with Bamboo node builds, I always start new branches by rsyncing in the node_modules folder if it doesn't exist (similar to this cache idea), and update the node_modules cache on successful master builds only. This saves a huge amount of time on the first build of each branch.
Oh, and +1 for caching things outside the work folder. I'm testing with sbt and it's painfully slow compared to Bamboo, as the ivy2 cache is always repopulated.
To note, Gemfile.lock always contains exact versions, even for ranges and repositories:
the Gemfile.lock makes your application a single package of both your own code and the third-party code it ran the last time you know for sure that everything worked. Specifying exact versions of the third-party code you depend on in your Gemfile would not provide the same guarantee, because gems usually declare a range of versions for their dependencies.
@Jecelyn Yeen awesome! So we were thinking of providing yarn as a pre-defined cache. However, we thought if node_modules is already cached, then caching the yarn cache directory (~/.cache/yarn) might be redundant? Might even be a tad bit slower as 2 caches must be restored.
@JoshuaT: We currently have a git mono repo for one of our projects. That said, our repo structure doesn't have just one node_modules folder at the root, but also several node_modules folders in subfolders under a Packages folder. Does the default "node" config search for all node_modules folders, or does it only pick up the one at the root?
If only root, would you mind optimizing this for people like us with many npm packages in the same git repo?
Excellent. You said caches are "cleared automatically 1 week from creation". What's the reason for this? Would it be possible to clear them automatically after some idle period of several days, perhaps? With this, we all have to download everything once a week...
@David Poetzsch-Heffter wasn't sure if bower was popular anymore. We'll gauge usage a bit more but a custom cache will always work. Any reason why you aren't using npm packages?
@Richard Simko it's something we thought about but would like users to try the Alpha first before adding it to the roadmap. Also gets a little complicated with many different languages and build tools. Curious to know how often your dependencies change?
@Francois Germain it only caches the root one. We've had a request to allow a custom cache to accept a pattern. How many npm packages do you have in your repo? You can still cache the ~/.npm directory or define multiple custom caches.
@Daryl Stultz the reason for clearing the cache 1 week from creation is to strike a balance between keeping the cache up to date and minimizing the cost (build time) of updating it. E.g. for a team pushing 20 times a week, only the first build pays the cost of saving the cache. Each week when the cache is cleared, new dependencies are picked up for the new cache. If dependencies change mid-week, then builds for the rest of the week incur a smaller marginal cost of picking up the changed dependencies on each build, until the cache is cleared. We're considering making this expiry period configurable or increasing it to 14 days, depending on the feedback we receive on how often teams update their dependencies.
@Bruno RZN that won't work because the Docker daemon isn't running in the build image. We've been investigating caching Docker layers but discovered it's more complicated than caching the directory.
@JoshuaT We use bower for our front-end, originally because npm did not allow for flat hierarchies. Since then, technology has advanced (bower has been deprecated, and I hear npm now also has some kind of support for deduplication), but so far no one has had the time to replace the dependency manager.
@Francois Germain have you considered caching the ~/.npm directory (if using npm) or ~/.cache/yarn (if using yarn)? They're the global cache which means a module used by multiple packages will be only cached once but installed to each package's node_modules. Also means you will only need to define 1 cache directory.
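Defining those global cache directories as custom caches would look something like this (the cache names are arbitrary):

```yaml
definitions:
  caches:
    npmglobal: ~/.npm          # npm's global cache
    yarnglobal: ~/.cache/yarn  # yarn's global cache (location may vary by distro)
```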
Is there an issue with the pipeline service atm? Since being admitted to the Alpha, I've tried to run my build with caches, but starting the agent and pulling the image takes ages (I stopped the first try at 42 minutes in).
@JoshuaT Are you sure about the yarn cache being located at ~/.cache/yarn? I configured my build to cache that path and got no cache after the build. It looks like it didn't find the folder, so it cached nothing.
@JoshuaT Re: clearing cache on package.json change.
I see now that the cache is cleared 1 week after creation, and not 1 week after the latest build as I originally thought, so what I brought up is really a non-issue, since dependencies will be updated once per week anyway?
@Marc-Andre Roy Apologies. As James linked, the yarn cache dir will be in different locations depending on distro. You can run yarn cache dir in your build to find out where it will be.
@Richard Simko that is correct! Maybe I could have described it better. We thought about the design and wanted to balance updating dependencies with the cost of updating it since updating the cache too often partially negates the benefits of caching. Thanks for the feedback!
@JoshuaT I am doing a Gradle build and want to cache Gradle dependencies using the predefined gradle cache, but the cache shows only 93 bytes. Even after explicitly configuring the dependency folder, dependencies are not getting cached. Could you please let me know why?
@Anthony Lazam while it's not advertised, apparently it's already there. We are not part of the Alpha (AFAIK) and I just used the caches successfully for my project.
Restoring caches seems to take quite a while (2 minutes in my case, though that time varies a lot), but neither Maven nor node downloaded any dependencies.
Our team tried the feature today, and it looks handy. The only problem we noticed are the transfer speeds: about 14 MBps download and 1.5 MBps upload. It's adding quite a bit of time once the cache grows larger, and is surprisingly slow, in the range of average DSL speed. Are the S3 buckets and container VMs in the same DC?
One more comment: our build is using Gradle, which recently released a new Build Cache feature. It works well for building large projects on the CI servers by reusing previous build results, only rerunning the tasks when corresponding input changes. It would be great to use it with Pipelines, but would likely require both higher upload speed and rsync-style differential upload to be able to update the cache after every build.
Hi, we will be making this feature generally available very soon.
@Chain Singh we are pushing out a fix for uploading empty caches due to permission issues. This should be available in the next few hours (so maybe clear your cache and try again tomorrow). Please let me know if it still uploads empty caches.
@Sergey Parhomenko The S3 buckets and containers aren't currently in the same AZ, but we hope to move them to the same AZ in the future to benefit even more from caching. Thanks for the pointer to the Gradle feature. I'll have a look at it to see how we can make it work with Pipelines.
"For dependencies, you specify where the restored packages are placed during the restore operation using the --packages argument. If not specified, the default NuGet package cache is used, which is found in the .nuget/packages directory in the user's home directory on all operating systems (for example, /home/user1 on Linux or C:\Users\user1 on Windows)."
We are very happy with caching in general! But we have an issue with a custom build for a Maven release. That custom build also uses a maven cache, but it seems to me that the cache is not being used. The output of 'mvn release:prepare' and 'mvn release:perform' shows a lot of downloading. For example:
I'm having trouble getting the node cache working. Composer works perfectly, but I can't seem to get the node cache going.
The message in the build process reads:
Cache "node": Not found
Cache "composer": Downloading
Cache "composer": Extracting
Cache "composer": Loaded
My pipelines file has the following lines:
image: pms72/groundwater-builder

pipelines:
  branches:
    # Pipelines that run automatically on a commit to a branch
    staging:
      - step:
          caches:
            - node
            - composer
          script:
            - echo "Staging - Words build bridges into unexplored regions."
            - composer install
            - cd web/app/themes/$THEME_DIR/
            - npm install
            - bower install --allow-root
@JoshuaT I've tried that, but it still returns Cache "node": Not found. It seems that even before anything happens, node is not present, as if I have to define node before the caches in the first step... I've tried adding a custom cache (with both 'node' and 'customnode' as names), but both cause the same Cache "node": Not found and Cache "customnode": Not found messages...
Also I've tried relative and absolute paths for the custom cache folder, but without any different results.
The .yml file looks like this now:
pipelines:
  branches:
    staging:
      - step:
          caches:
            - node
            - customnode
            - composer
          script:
            - echo "Staging - Words build bridges into unexplored regions."
            - composer install
            - cd web/app/themes/$THEME_DIR/
            - npm install
            - bower install --allow-root

definitions:
  caches:
    customnode: ~/web/app/themes/pms72-wdcd-challenge/node_modules
We've been using this for a while now and overall it works great, it's sped up our builds a lot! Just one minor detail, we stopped using the node cache and defined our own custom npm cache instead (~/.npm). With npm 5 this works great since there is excellent cache validation and as such it makes more sense to cache the cache so to speak, that way we can be sure that the installed dependencies are always the latest ones permitted by the version string and the cache is used when available.
@WebSter GZ - we haven't had any reports of caching not working, so this will need some further investigation. (Note that the cache won't be populated until you have a successful build. This is one common misunderstanding.)