Provide a fairly constant CPU and network resource

Issue #13079 open
Barnabás Gema
created an issue

I'm trying to run my integration tests using Pipelines. When running on my local machine they take a fairly constant 2-3 seconds per test case, but when running them on Pipelines the variance is much higher: they take anywhere from 1 to 10 seconds to complete.

It wouldn't be a problem if the same test cases were slow on every run, but how long a test case takes doesn't really depend on the test case itself. Sometimes a test case needs 1.5 seconds, sometimes it needs 9, which makes choosing a proper timeout for the test cases really challenging. It would be nice if the container got a fairly constant CPU allocation during a Pipelines run, or if it were configurable whether I need constant or best-effort CPU.

Official response

  • Raul Gomis staff

    Hi everyone,

    I'm pleased to announce that we recently shipped (1st August) a platform change that has hugely improved performance for the majority of builds:

    • We have swapped our Kubernetes nodes from EC2 M4 instance types to M5d's. M5d instances use NVMe drives (instead of EBS volumes), which are much faster and are attached directly to the underlying compute hardware, so there is no overhead of transferring data to the drives over a storage network.
    • With that, as well as some other performance improvements / fixes, we have fixed most of the intermittent performance issues of the last couple of months. However, if you are still experiencing any issues, I suggest you submit a support case so that we can investigate.

    Next steps are:

    • Regarding build variance: we might still see some variance due to the fact that we run on shared infrastructure. Now, after the M5d changes, our main bottleneck is CPU (instead of IO), so we will run further experiments to limit CPU in order to test how much it affects build variance and build time. This will help us determine the best balance between predictability and speed on shared infrastructure.

    I'll keep you updated with any other enhancements that might improve build performance and variance.

    Regards, Raul

Comments (40)

  1. Gabriel Kaputa

    Same here, it is very slow. I am running PHPUnit tests and every time, even if I use different tests, after the first two nothing happens for a couple of minutes and then the testing continues (still slowly).

  2. Joshua Tjhin staff
    • changed status to open

    Thanks for the feedback. As you probably know, we run builds on shared infrastructure and therefore CPU is best effort. In the future, we would like to make CPU allocation more constant for steadier and more reliable builds. This, however, is unlikely to be implemented in the next 6 months.

  3. Nick Houghton

    This is a big deal for a CI tool. We are trying to move more of our CI workload into pipelines but are seeing strange processing behaviour where pipelines appear to hang for long periods (minutes), and tests fail non-deterministically because of it (tests fail after timeouts).

    Logging shows clock jumps measured in seconds to minutes, like this:

    2017-06-19 09:49:45:164 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=45s970ms942µs457ns).
    2017-06-19 09:50:16:831 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=45s163ms543µs617ns).
    2017-06-19 09:51:21:838 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=1m5s7ms608µs650ns).
    

    This generally means the process didn't get CPU time for the delta period, which lines up with the very slow processing of steps we are seeing in the pipeline. Currently, our test step in one of our pipelines takes 11 minutes to fail; running the same tests in the "atlassian/default-image:latest" container locally takes around 2 minutes with no failures.

    Shared CPU infrastructure where scheduling is non-deterministic just means that sometimes builds are going to fail and require 1 or more reruns. Especially for a product where you are billing in minutes, it feels very disingenuous to make builds take longer (cost more minutes) and fail more often and require reruns (cost more minutes).

    Is there any more info about how CPU is scheduled, and how to stay underneath the limits? Is it per container, or at a lower infrastructure level?
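
    In the meantime, one way to see what the container is actually given is a short diagnostic step; the cgroup paths below are the standard cgroup v1 locations and may not match whatever the Pipelines hosts actually use:

    pipelines:
      default:
        - step:
            name: Inspect CPU allocation
            script:
              # logical CPUs visible inside the build container
              - nproc
              # best-effort check of any cgroup CPU quota/period (paths may not exist on all hosts)
              - cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us || true
              - cat /sys/fs/cgroup/cpu/cpu.cfs_period_us || true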

  4. Nick Houghton

    I gave up on bitbucket pipelines and moved to Google Container Builder. 120 mins free per day, and no lame shared compute resources. My build and tests ran first time.

  5. James Dengel

    I'd also like to note that pulling the image can make a huge difference to build time for a very simple project; there is also no reporting of the time taken to pull the image.

    For instance logs of the build are as follows:

    build setup: 5 seconds
    tox -e coverage: 10 seconds

    but the billing period was 57 seconds; I'm not sure where my missing 42 seconds went.

    another build

    build setup: 4 seconds
    tox -e coverage: 15 seconds
    total billing period: 22 seconds - missing 7 seconds

    Can we be shown exactly where this time is being used? It somewhat puts me off paying without knowing where a third of our small build time is going.

    The above times are for the same repo with only minor changes between commits.
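
    In the meantime, prefixing commands with the shell's time builtin at least accounts for the scripted portion, which can then be compared against the billed minutes; a minimal sketch using the tox target above:

    pipelines:
      default:
        - step:
            script:
              # prints real/user/sys time for the command, so the scripted part
              # of the build can be compared against the billed duration
              - time tox -e coverage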

  6. Marcus Schumann

    We have the same issues as James Dengel. The build time is not correct and it also differs wildly between runs, even when re-running the exact same pipeline for the exact same commit.

  7. James Dengel

    bitbucket.PNG

    This is a prime example of the issue, time taken to pull the image and the cache.

    build: 2 mins 25 seconds
      - build setup: 1 minute 12s
      - tox: 9s
      - push: 17s

    total of the above: 1:38 - missing time: 47 seconds.

  8. Joshua Tjhin staff

    Hi James,

    The build setup includes cloning the repo and downloading the cache. However, it does not include the time to pull images and start containers. This additional time also varies and might be shorter if some image layers have already been downloaded. I've created a new issue #14484 for this improvement, to provide a better breakdown of the pipeline duration.
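
    For reference, step caches such as the one mentioned above are declared in bitbucket-pipelines.yml; a minimal sketch for a tox-based build, assuming a Python image and the predefined pip cache:

    pipelines:
      default:
        - step:
            caches:
              - pip            # predefined pip cache (downloaded during build setup)
            script:
              - pip install tox
              - tox -e coverage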

    We're always trying to improve build speeds and a few days ago we rolled out an experiment to cache the public docker images.

    Regards,
    Joshua

  9. Joshua Tjhin staff

    @Marcus Schumann yes, by caching build and service images, you might notice logs start streaming a little earlier. It has been rolled out to all customers and you should benefit automatically. We will monitor the impact and make additional improvements.

  10. Hudson Mendes

    Very, VERY slow. Such a slow build infrastructure only contributes to miserable hours of wasted time. Already looking into alternatives - Bitbucket Pipelines has burnt a lot of the very scarce time I have at the startup I'm working on. Nothing fancy necessary: just put more processing power and memory on these machines already...

  11. Nick Boultbee

    I'm seeing custom Docker builds take 17-27 minutes that I can run in under a minute locally (on a reasonable i7 with an SSD).

    So to me, two problems: (a) very slow (big problem), and (b) quite variable (not so much a problem).

  12. Nick Boultbee

    CircleCI (and Travis) are indeed much faster and with similar enough configuration.

    I don't feel Bitbucket is incentivised enough here to help, as they're now charging us extra for being so slow (due to exceeding the included minutes in our plan). Hmmm.

  13. Paul Carter-Brown

    Hi,

    I'm getting really frustrated with the performance of Pipelines, especially considering we are paying based on time. This needs to be dealt with urgently, or billing should be suspended until the issue is fixed. Here is a docker build step in my pipeline. I've added a "RUN date" step as every second step so you can see the progress. Even running the date command takes something like 30s:

    Status: Downloaded newer image for jinigurumauritius/ubuntu_jdk_tomee:latest ---> 05fdc078884d
    Step 2/19 : RUN date ---> Running in abc9e8f1d14a Thu Dec 14 13:01:00 UTC 2017 ---> 942cb7bae99e Removing intermediate container abc9e8f1d14a
    Step 3/19 : RUN mkdir -p /opt/tomee/apps/ && rm /opt/tomee/lib/johnzon- ---> Running in deaf4d47bdc0 ---> f1243138756b Removing intermediate container deaf4d47bdc0
    Step 4/19 : RUN date ---> Running in d0747c048f25 Thu Dec 14 13:01:52 UTC 2017 ---> 5c7b8b24192f Removing intermediate container d0747c048f25
    Step 5/19 : COPY deployable/target/.ear /opt/tomee/apps/ ---> ec07c86b4e10 Removing intermediate container e916471e79b0
    Step 6/19 : RUN date ---> Running in c146d8e77e80 Thu Dec 14 13:02:52 UTC 2017 ---> 17fcdd2da425 Removing intermediate container c146d8e77e80
    Step 7/19 : COPY deployable/target/jars/jg-arch-log-formatter.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/ ---> 04139c43b618 Removing intermediate container 78308a5354b2
    Step 8/19 : RUN date ---> Running in 8188cdadb9af Thu Dec 14 13:04:33 UTC 2017 ---> ab33ca967b57 Removing intermediate container 8188cdadb9af
    Step 9/19 : COPY deployable/target/jars/mysql-connector-java.jar docker/all/hacked-libs/* /opt/tomee/lib/ ---> 10c8b8e8bfec Removing intermediate container b6657221ac64
    Step 10/19 : RUN date ---> Running in d765595cd0ce Thu Dec 14 13:05:33 UTC 2017 ---> 3cc1d0c90ed1 Removing intermediate container d765595cd0ce
    Step 11/19 : COPY docker/all/tomee.xml docker/all/logging.properties docker/all/server.xml /opt/tomee/conf/ ---> 45cc6af66c75 Removing intermediate container 68ffc104468f
    Step 12/19 : RUN date ---> Running in 3b01f8a87b1f Thu Dec 14 13:06:24 UTC 2017 ---> 1edaced52591 Removing intermediate container 3b01f8a87b1f
    Step 13/19 : COPY docker/all/setenv.sh /opt/tomee/bin/ ---> 56861adf785b Removing intermediate container d01af87cd73a
    Step 14/19 : RUN date ---> Running in 163715b5bc4f Thu Dec 14 13:07:10 UTC 2017 ---> 998a1e3d84a3 Removing intermediate container 163715b5bc4f
    Step 15/19 : COPY docker/all/run.sh / ---> e96fe0845d13 Removing intermediate container a1db54d9b34d
    Step 16/19 : RUN date ---> Running in 76b720d624cb Thu Dec 14 13:08:05 UTC 2017 ---> 8a2e88fc79e4 Removing intermediate container 76b720d624cb
    Step 17/19 : RUN chmod +x /run.sh ---> Running in 55956f7e90ce ---> 5e77516ce2f8 Removing intermediate container 55956f7e90ce
    Step 18/19 : RUN date ---> Running in 488fc1c84247 Thu Dec 14 13:08:58 UTC 2017 ---> e0aa09b5f033 Removing intermediate container 488fc1c84247
    Step 19/19 : CMD /run.sh ---> Running in f40389c0e7b2 ---> d271f2bc3f3b Removing intermediate container f40389c0e7b2
    Successfully built d271f2bc3f3b
    Successfully tagged 831776913662.dkr.ecr.eu-west-1.amazonaws.com/ngage:latest

    That's 9 minutes to run a few copy commands.

    Running the same docker build on my laptop takes about 10s, and that's not due to layer caching:

    Step 6/19 : RUN date ---> Running in 305bbd635591 Thu Dec 14 13:20:00 UTC 2017 ---> 9c7bf288a205 Removing intermediate container 305bbd635591
    Step 7/19 : COPY deployable/target/jars/jg-arch-log-formatter.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/ ---> 1d24c7d0b18a Removing intermediate container 6648d4bd22bc
    Step 8/19 : RUN date ---> Running in 6ad534b21909 Thu Dec 14 13:20:01 UTC 2017 ---> 75ca7eb2ddb3 Removing intermediate container 6ad534b21909
    Step 9/19 : COPY deployable/target/jars/mysql-connector-java.jar docker/all/hacked-libs/* /opt/tomee/lib/ ---> 780c4b7652d5 Removing intermediate container 8d303e7996bc
    Step 10/19 : RUN date ---> Running in 286356627c69 Thu Dec 14 13:20:02 UTC 2017 ---> 975c55ca6fa9 Removing intermediate container 286356627c69
    Step 11/19 : COPY docker/all/tomee.xml docker/all/logging.properties docker/all/server.xml /opt/tomee/conf/ ---> 880ca7f0ecef Removing intermediate container d44acfe4b9a6
    Step 12/19 : RUN date ---> Running in b8910ed6bf3b Thu Dec 14 13:20:04 UTC 2017 ---> ba51e68840f4 Removing intermediate container b8910ed6bf3b
    Step 13/19 : COPY docker/all/setenv.sh /opt/tomee/bin/ ---> b9239ae873da Removing intermediate container 3644b2498bc6
    Step 14/19 : RUN date ---> Running in 0c56d060b2f8 Thu Dec 14 13:20:05 UTC 2017 ---> 5720533332f9 Removing intermediate container 0c56d060b2f8
    Step 15/19 : COPY docker/all/run.sh / ---> f3b8bd2d1032 Removing intermediate container d858557ad957
    Step 16/19 : RUN date ---> Running in 90dbd0ea94d1 Thu Dec 14 13:20:06 UTC 2017 ---> fc2c8bcb0643 Removing intermediate container 90dbd0ea94d1
    Step 17/19 : RUN chmod +x /run.sh ---> Running in 6f2534932e4b ---> 28450b4cf99f Removing intermediate container 6f2534932e4b
    Step 18/19 : RUN date ---> Running in e2dee7143445 Thu Dec 14 13:20:07 UTC 2017 ---> ff1b4844f2ba Removing intermediate container e2dee7143445
    Step 19/19 : CMD /run.sh ---> Running in d0922815e090 ---> a53821bb5d32 Removing intermediate container d0922815e090
    Successfully built a53821bb5d32

    So basically the pipeline is 50 times slower than a laptop!

  14. Barnabás Gema reporter

    Actually, I posted this ticket 1 year and 5 months ago, so it is even more disturbing. With this quality of service there is really no point in using the product. After a few trials and errors we have moved over to our self-hosted Jenkins, which has an almost identical and well-functioning Pipeline feature.

  15. Aneita Yang staff

    Hi everyone,

    Thanks for your interest in this issue and for your patience. When we look at our open tickets, the number of votes that a ticket has plays much more of a role than the priority it is assigned. Over the past couple of months, the team has been working on issues with a much higher number of votes.

    In the past few weeks, we've looked at a range of solutions for the variance in build time. Unfortunately, the solution isn't as simple as just restricting the CPU available to each build. While this makes build times more consistent, since every pipeline gets the same amount of resources regardless of when it runs, it also means that builds could be slower overall: a build that previously had no limit on the CPU it could use would now be limited. We will continue to investigate this issue and experiment with different solutions in the new year. I'll keep you updated on our progress via this issue.

    Thanks again for your patience.

    Aneita

  16. Andrew Kao

    Pipelines has been too slow recently and it's driving our team crazy.

    The thing is, we are willing to pay more to get a more powerful environment.

    We are even considering moving to GitHub and doing CI/CD with Travis or Circle if the situation stays the same.

  17. Matt Ryall staff

    @Andrew Kao - sorry to hear that. Our goal is to run your builds as quickly as possible. We'd like to get a bit more info about what exactly is slow in your builds, so could you please open a support case, so we can take a look at your specific case?

    Here are a couple of general suggestions that might help:

    • We recently added support for large builds, which get double the memory and CPU allocation of normal builds for double the cost. This will almost certainly speed up your build if you're not using it yet. (We're also experimenting with ways of enforcing this CPU allocation better in the next week or two -- the project @Aneita Yang mentioned above -- which should further help large builds.)
    • We also recently added Docker layer caching, which should speed up builds and subsequent use of Docker images across multiple steps/builds.
    • When using build images hosted on a private registry, or other assets that need to be downloaded for the build, it may be faster to move them to a location closer to where the Pipelines cluster runs, in the AWS us-east region.
    • We recently added parallel steps, so you can split long testing tasks into separate steps that run in parallel (a sketch combining this with 2x builds appears at the end of this comment).
    • We've considered the idea of even larger builds (3x, 4x), with additional memory or CPU allocation depending on what you need, but don't currently have a feature request open for it. Feel free to open one if this sounds interesting and let us know what you're after.

    We haven't seen a general slowdown in our performance metrics here, but perhaps we're missing something specific to your situation. We can investigate this best through a support ticket.
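
    For illustration, here is a minimal bitbucket-pipelines.yml sketch combining the 2x size and parallel steps options above (step names and scripts are placeholders):

    pipelines:
      default:
        - step:
            name: Build
            size: 2x                  # double memory/CPU allocation, at double the cost
            script:
              - ./build.sh
        - parallel:
            - step:
                name: Tests (shard 1)
                script:
                  - ./run-tests.sh 1
            - step:
                name: Tests (shard 2)
                script:
                  - ./run-tests.sh 2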

  18. Uri Harduf

    Hey @Matt Ryall .

    I got here through a support ticket. For the past few weeks we've been seeing our builds take longer. One build can take 10 minutes and another 20 minutes. When they take longer, all of the steps, including cloning the repo and pulling the docker image, take longer as well. This is crucial for us since we're paying per build minute, which is changing drastically, and we would also rather have faster builds.

    Also, I personally think that 2x resources means a lot when you talk about memory, but I don't understand from the documentation whether 2x means you get 2 cores, or just a better priority when competing with other docker instances and therefore more CPU time on 1 core.

    I would simply expect 1x resources to give me a CPU with a constant speed comparable to an average modern server core (so if I run the tests on my MacBook Pro, they would run about as fast in the build; otherwise 1x would never be good enough for me), and 2x resources to give me more cores, as with other cloud services.

    So parallel steps or a larger build would not help if the CPU given to the docker instance is not constant and at least as fast as an average single server core.

    Thanks

  19. Brad Humphrey

    We have huge variances in build time as well. We have builds that take between 25 and 60 minutes (total time using parallel build steps). It seems like most of the trouble is caused by variations in network capacity. Builds sometimes really slow down when running npm install, downloading assets from Amazon S3, or pushing built docker images. I've seen download speeds that vary between 1 MB/s and 50 MB/s.
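
    Caching dependencies won't fix raw network variance to S3 or a registry, but the predefined node cache at least takes repeated npm downloads off the critical path; a minimal sketch:

    pipelines:
      default:
        - step:
            caches:
              - node             # caches node_modules between builds
            script:
              - npm install
              - npm test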

  20. Kevin Cabrera

    We've been experiencing constant slowdowns for the past week as well. Before, our usual builds varied from 2-4 minutes per deployment, but now they consistently take 30 minutes or more. :-( We have multiple projects that use Bitbucket Pipelines, and every one of them is affected by the slowdown!! Tried increasing the size to 2x, but still no use.

    Performance 2 months ago vs now Screen Shot 2018-06-28 at 2.51.32 PM.png

  21. Raul Gomis staff

    Hey,

    Sorry to hear that your builds are taking longer recently. We are currently investigating a performance degradation in our platform since June that might be the cause of some slow pipelines. We are still not sure whether it is due to build variance or to recent CoreOS and Docker upgrades on the platform, but we are working to isolate the issue and fix it. I suggest you submit a support case so that we can investigate your specific cases.

    Regarding build variance, one of the Pipelines team's priorities is to run builds as fast as possible, as we believe that is best practice for CI/CD. So we are testing different approaches to achieve more predictable build times, but it's not as easy as limiting the CPU, as we don't want to affect build speed (keeping it low remains our main priority).

    On 6th/7th June we ran an experiment to limit CPU in our shared infrastructure in order to test how much it would affect build variance / predictability and build time. The experiment showed that not only was build time affected (builds got slower), but build variance also got worse. This is because our main bottleneck is either networking or disk (as shared resources, these are best effort, just like CPU). We will soon be experimenting with different configurations and limits for disk and/or networking.

  22. Shawn McKnight

    I can echo that we've had pipeline timings with enormous variances. I had a pipeline time out after 30 minutes and then it was re-run immediately after and completed in 7. Another pipeline for a very similar project/build then finished in 73 seconds.

  23. Tanju Erinmez

    @Raul Gomis , I was redirected to this thread from a support case where we "packaged up" info on a hefty variance case for BB to analyse.

    Do you guys have any updates on your further experiments? If it helps, feel free to have a look at the material in BBS-82245 .

    Thanks, Tanju

  24. Raul Gomis staff

    Hi everyone,

    Sorry for any inconvenience caused by this specific issue. After reverting all platform changes (downgrading the CoreOS and Docker versions in our Kubernetes cluster and moving back from M5 to M4 EC2 instance types in AWS), the intermittent variance problem seems to persist, based on our internal tests and analytics.

    Next steps are:

    • Today we identified a potential race condition in the docker image clean-up of our internal caches that might affect build performance. As a result, a small percentage of builds might have had bigger variances when pulling images. The fix is already in review; we'll deploy it on Monday and keep monitoring to see how this improves overall performance.
    • We are improving our internal metrics to find other potential build performance enhancements.
    • We have recently shipped docker-in-docker image caching. For docker-in-docker, we now cache / pull public docker images from our internal docker registry, which means that we don't need to pull from Docker Hub (or another public registry) every time. This enhancement significantly improves build time when using docker run, especially with big docker images (a typical docker-in-docker step is sketched below).
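
    As an illustration, a typical docker-in-docker step that benefits from this (the image is just an example):

    pipelines:
      default:
        - step:
            services:
              - docker           # enables docker-in-docker for this step
            script:
              # public images like this are pulled through the internal registry
              # cache instead of going to Docker Hub on every build
              - docker run --rm alpine echo "hello"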

    I'll keep you updated with any other fixes / enhancements that might improve build performance.

    Regards, Raul

  25. Raul Gomis staff

    Hi @Johnathan Gilday,

    It's enabled by default for everyone, so if you are using docker run, you'll be using it without adding any extra configuration to your bitbucket-pipelines.yml file. With this feature we now cache docker images for docker-in-docker in a similar way to how we were already caching the step's docker images.

    The predefined docker image cache is used for caching / pulling intermediate layers when building a docker image. We decided to make it opt-in, as there are some special cases in which builds might be slower when not following docker best practices. More details here.

  26. Johnathan Gilday

    @Raul Gomis does this mean that if I have a build that does not build images but does pull images using the docker service, the predefined docker image cache is not effective?

    The predefined docker image cache is used for caching / pulling intermediate layers when building a docker image

  27. Raul Gomis staff

    @Johnathan Gilday, docker layer caching also caches intermediate docker layers that are pulled when executing docker run, docker pull, etc. However, the most interesting use case is caching previously built layers when building docker images (bear in mind it's limited to 1GB).

    As we have recently shipped the pull-through image cache for DinD, enabling docker layer caching just for pulling images (docker run or similar) does not make sense for public images, as we now cache the whole image (if public) in our internal registry. Moreover, it would be better not to enable it just for pulling images, as there is some overhead (build time) in downloading and uploading cached layers that you don't need.

    My suggestion is to enable docker layer caching only if you are building images or pulling private images (in those scenarios the 1GB cache will help speed up your build). If not, it's better not to enable it and to rely on our automatic pull-through image cache for DinD.
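
    A minimal sketch of a step where enabling the layer cache makes sense, i.e. one that actually builds an image (the image name and Dockerfile are placeholders):

    pipelines:
      default:
        - step:
            services:
              - docker
            caches:
              - docker           # predefined Docker layer cache (limited to 1GB)
            script:
              - docker build -t my-app .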

    Thanks for your interest! We'll also clarify those specific details in our docs.

    Regards, Raul

  28. Raul Gomis staff

    Hi everyone,

    I'm pleased to announce that we recently shipped (1st August) a platform change that has hugely improved performance for the majority of builds:

    • We have swapped our Kubernetes nodes from EC2 M4 instance types to M5d's. M5d instances use NVMe drives (instead of EBS volumes), which are much faster and are attached directly to the underlying compute hardware, so there is no overhead of transferring data to the drives over a storage network.
    • With that, as well as some other performance improvements / fixes, we have fixed most of the intermittent performance issues of the last couple of months. However, if you are still experiencing any issues, I suggest you submit a support case so that we can investigate.

    Next steps are:

    • Regarding build variance: we might still see some variance due to the fact that we run on shared infrastructure. Now, after the M5d changes, our main bottleneck is CPU (instead of IO), so we will run further experiments to limit CPU in order to test how much it affects build variance and build time. This will help us determine the best balance between predictability and speed on shared infrastructure.

    I'll keep you updated with any other enhancements that might improve build performance and variance.

    Regards, Raul
