Provide fairly constant CPU and network resources

Issue #13079 open
Barnabás Gema created an issue

I'm trying to run my integration tests using Pipelines. On my local machine they take a fairly constant 2-3 seconds per test case, but on Pipelines the variance is much higher: they take anywhere from 1 to 10 seconds to complete.

It wouldn't be a problem if the same test cases were slow on every run, but how long a test case takes doesn't really depend on the test case itself. Sometimes a test case needs 1.5 seconds, sometimes it needs 9, which makes choosing a proper timeout for the test cases really challenging. It would be nice if the container got a fairly constant amount of CPU during a Pipelines run, or if it were configurable whether I want constant or best-effort CPU.

Official response

  • Raul Gomis staff

    Hi all,

    Quick status update:

    On 11th January 2019, we released a configuration change that has hugely improved the build setup time performance for the majority of the builds: we have tuned our Kubernetes autoscaler algorithm to make sure we have enough space to schedule new builds without affecting setup time performance. We can see a huge reduction in scheduling latency and variance in our analytics.

    We'd love to hear your feedback about the performance / variance in Bitbucket Pipelines after the recent performance improvements.

    Regards,
    Raul

Comments (44)

  1. John Doe

    Same here, it is very slow. I am running PHPUnit tests, and every time, even with different tests, nothing happens for a couple of minutes after the first two, and then the testing continues (still slowly).

  2. Joshua Tjhin Account Deactivated
    • changed status to open

    Thanks for the feedback. As you probably know, we run builds on shared infrastructure and therefore CPU is best-effort. In the future, we would like to make CPU more constant for steadier and more reliable builds. This, however, is unlikely to be implemented in the next 6 months.

  3. Nick Houghton

    This is a big deal for a CI tool. We are trying to move more of our CI workload into pipelines but are seeing strange processing behaviour where pipelines appear to hang for long periods (minutes), and tests fail non-deterministically because of it (tests fail after timeouts).

    Logging shows clock jumps measured in seconds to minutes, like this:

    2017-06-19 09:49:45:164 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=45s970ms942µs457ns).
    2017-06-19 09:50:16:831 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=45s163ms543µs617ns).
    2017-06-19 09:51:21:838 WARN  kari.pool.HikariPool - readwrite - Thread starvation or clock leap detected (housekeeper delta=1m5s7ms608µs650ns).
    

    This generally means the process didn't get CPU time for the delta period, which lines up with the very slow processing of pipeline steps we are seeing. Currently the test step in one of our pipelines takes 11 minutes to fail, while running the same tests in the "atlassian/default-image:latest" container locally takes around 2 minutes with no failures.

    Shared CPU infrastructure where scheduling is non-deterministic just means that sometimes builds are going to fail and require 1 or more reruns. Especially for a product where you are billing in minutes, it feels very disingenuous to make builds take longer (cost more minutes) and fail more often and require reruns (cost more minutes).

    Is there any more info about how the CPU is scheduled, and how to stay under the limits? Is it per container, or at a lower infrastructure level?

  4. Nick Houghton

    I gave up on Bitbucket Pipelines and moved to Google Container Builder. 120 mins free per day, and no lame shared compute resources. My build and tests ran the first time.

  5. James Dengel

    I'd also like to note that pulling the image can make a huge difference to build time for a very simple project; there is also no reporting of the time taken to pull the image.

    For instance logs of the build are as follows:

    build setup 5 seconds
    tox -e coverage 10 seconds

    but the billing period was 57 seconds; I'm not sure where my missing 42 seconds are.

    another build

    build setup 4 seconds
    tox -e coverage 15 seconds
    total billing period 22 seconds. - missing 7 seconds

    Can we be shown exactly where this time is being used?
    It somewhat puts me off paying without knowing where 1/3 of our small build time is going.

    The above times are for the same repo with only minor changes between commits.

  6. Marcus Schumann

    We have the same issues as James Dengel. The build time is not correct, and it also differs wildly between runs, even when re-running the exact same pipeline for the exact same commit.

  7. James Dengel

    bitbucket.PNG

    This is a prime example of the issue: time taken to pull the image and the cache.

    build: 2 mins 25 seconds
    - build setup : 1 minute 12s
    - tox: 9s
    - push: 17s

    total 1:38
    Missing time : 47 seconds.

  8. Joshua Tjhin Account Deactivated

    Hi James,

    The build setup includes the cloning of the repo and downloading the cache. However, it does not include the time to pull images and start containers. This additional time also varies and might be faster if some image layers have been already downloaded. I've created a new issue #14484 for this improvement to provide a better breakdown of the pipeline duration.

    We're always trying to improve build speeds and a few days ago we rolled out an experiment to cache the public docker images.

    Regards,
    Joshua

  9. Marcus Schumann

    @xtjhin Does the caching of docker images speed up the currently "invisible" step of pulling images and starting containers? How would one try it out? Only for alpha customers?

  10. Joshua Tjhin Account Deactivated

    @Tanax yes, by caching build and service images, you might notice logs start streaming a little earlier. It has been rolled out to all customers and you should benefit automatically. We will monitor the impact and make additional improvements.

  11. Hudson Mendes

    Very, VERY slow. Such a slow build infrastructure only contributes to miserable hours of wasted time.
    Already looking into alternatives - Bitbucket Pipelines has burnt a lot of the very scarce time I have at the startup I'm working on.
    Nothing fancy necessary: just get more processing power and memory on these machines already...

  12. Jan Kühnlein

    Slow...slow...slow... :-(
    EDIT: My fault this time. But I've had many slow builds before, sometimes 4x longer than usual :-/

  13. Nick Boultbee

    I'm seeing custom Docker builds take 17-27 mins that I can run in under a minute locally (on a reasonable i7 with an SSD).

    So to me, two problems: (a) very slow (big problem), and (b) quite variable (not so much a problem).

  14. Etienne Noel

    Me too: Docker builds are simply taking too long in Pipelines. I don't mind paying, but the execution time should be close to my local machine's.

  15. Nick Boultbee

    CircleCI (and Travis) are indeed much faster and with similar enough configuration.

    I don't feel Bitbucket is incentivised enough here to help, as they're now charging us extra for being so slow (due to exceeding the included minutes in our plan). Hmmm.

  16. Paul Carter-Brown

    Hi,

    I'm getting really frustrated with the performance of Pipelines, especially considering we are paying based on time. This needs to be dealt with urgently, or billing should be suspended until the issue is fixed. Here is a docker build step in my pipeline. I've added a "RUN date" as every second step so you can see the progress. Even running the date command takes something like 30s:

    Status: Downloaded newer image for jinigurumauritius/ubuntu_jdk_tomee:latest
    ---> 05fdc078884d
    Step 2/19 : RUN date
    ---> Running in abc9e8f1d14a
    Thu Dec 14 13:01:00 UTC 2017
    ---> 942cb7bae99e
    Removing intermediate container abc9e8f1d14a
    Step 3/19 : RUN mkdir -p /opt/tomee/apps/ && rm /opt/tomee/lib/johnzon-
    ---> Running in deaf4d47bdc0
    ---> f1243138756b
    Removing intermediate container deaf4d47bdc0
    Step 4/19 : RUN date
    ---> Running in d0747c048f25
    Thu Dec 14 13:01:52 UTC 2017
    ---> 5c7b8b24192f
    Removing intermediate container d0747c048f25
    Step 5/19 : COPY deployable/target/*.ear /opt/tomee/apps/
    ---> ec07c86b4e10
    Removing intermediate container e916471e79b0
    Step 6/19 : RUN date
    ---> Running in c146d8e77e80
    Thu Dec 14 13:02:52 UTC 2017
    ---> 17fcdd2da425
    Removing intermediate container c146d8e77e80
    Step 7/19 : COPY deployable/target/jars/jg-arch-log-formatter.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/
    ---> 04139c43b618
    Removing intermediate container 78308a5354b2
    Step 8/19 : RUN date
    ---> Running in 8188cdadb9af
    Thu Dec 14 13:04:33 UTC 2017
    ---> ab33ca967b57
    Removing intermediate container 8188cdadb9af
    Step 9/19 : COPY deployable/target/jars/mysql-connector-java.jar docker/all/hacked-libs/* /opt/tomee/lib/
    ---> 10c8b8e8bfec
    Removing intermediate container b6657221ac64
    Step 10/19 : RUN date
    ---> Running in d765595cd0ce
    Thu Dec 14 13:05:33 UTC 2017
    ---> 3cc1d0c90ed1
    Removing intermediate container d765595cd0ce
    Step 11/19 : COPY docker/all/tomee.xml docker/all/logging.properties docker/all/server.xml /opt/tomee/conf/
    ---> 45cc6af66c75
    Removing intermediate container 68ffc104468f
    Step 12/19 : RUN date
    ---> Running in 3b01f8a87b1f
    Thu Dec 14 13:06:24 UTC 2017
    ---> 1edaced52591
    Removing intermediate container 3b01f8a87b1f
    Step 13/19 : COPY docker/all/setenv.sh /opt/tomee/bin/
    ---> 56861adf785b
    Removing intermediate container d01af87cd73a
    Step 14/19 : RUN date
    ---> Running in 163715b5bc4f
    Thu Dec 14 13:07:10 UTC 2017
    ---> 998a1e3d84a3
    Removing intermediate container 163715b5bc4f
    Step 15/19 : COPY docker/all/run.sh /
    ---> e96fe0845d13
    Removing intermediate container a1db54d9b34d
    Step 16/19 : RUN date
    ---> Running in 76b720d624cb
    Thu Dec 14 13:08:05 UTC 2017
    ---> 8a2e88fc79e4
    Removing intermediate container 76b720d624cb
    Step 17/19 : RUN chmod +x /run.sh
    ---> Running in 55956f7e90ce
    ---> 5e77516ce2f8
    Removing intermediate container 55956f7e90ce
    Step 18/19 : RUN date
    ---> Running in 488fc1c84247
    Thu Dec 14 13:08:58 UTC 2017
    ---> e0aa09b5f033
    Removing intermediate container 488fc1c84247
    Step 19/19 : CMD /run.sh
    ---> Running in f40389c0e7b2
    ---> d271f2bc3f3b
    Removing intermediate container f40389c0e7b2
    Successfully built d271f2bc3f3b
    Successfully tagged 831776913662.dkr.ecr.eu-west-1.amazonaws.com/ngage:latest

    That's 9 minutes to run a few copy commands.

    Running the same docker build on my laptop takes about 10s, and that's not due to layer caching:

    Step 6/19 : RUN date
    ---> Running in 305bbd635591
    Thu Dec 14 13:20:00 UTC 2017
    ---> 9c7bf288a205
    Removing intermediate container 305bbd635591
    Step 7/19 : COPY deployable/target/jars/jg-arch-log-formatter.jar /usr/lib/jvm/java-8-oracle/jre/lib/ext/
    ---> 1d24c7d0b18a
    Removing intermediate container 6648d4bd22bc
    Step 8/19 : RUN date
    ---> Running in 6ad534b21909
    Thu Dec 14 13:20:01 UTC 2017
    ---> 75ca7eb2ddb3
    Removing intermediate container 6ad534b21909
    Step 9/19 : COPY deployable/target/jars/mysql-connector-java.jar docker/all/hacked-libs/* /opt/tomee/lib/
    ---> 780c4b7652d5
    Removing intermediate container 8d303e7996bc
    Step 10/19 : RUN date
    ---> Running in 286356627c69
    Thu Dec 14 13:20:02 UTC 2017
    ---> 975c55ca6fa9
    Removing intermediate container 286356627c69
    Step 11/19 : COPY docker/all/tomee.xml docker/all/logging.properties docker/all/server.xml /opt/tomee/conf/
    ---> 880ca7f0ecef
    Removing intermediate container d44acfe4b9a6
    Step 12/19 : RUN date
    ---> Running in b8910ed6bf3b
    Thu Dec 14 13:20:04 UTC 2017
    ---> ba51e68840f4
    Removing intermediate container b8910ed6bf3b
    Step 13/19 : COPY docker/all/setenv.sh /opt/tomee/bin/
    ---> b9239ae873da
    Removing intermediate container 3644b2498bc6
    Step 14/19 : RUN date
    ---> Running in 0c56d060b2f8
    Thu Dec 14 13:20:05 UTC 2017
    ---> 5720533332f9
    Removing intermediate container 0c56d060b2f8
    Step 15/19 : COPY docker/all/run.sh /
    ---> f3b8bd2d1032
    Removing intermediate container d858557ad957
    Step 16/19 : RUN date
    ---> Running in 90dbd0ea94d1
    Thu Dec 14 13:20:06 UTC 2017
    ---> fc2c8bcb0643
    Removing intermediate container 90dbd0ea94d1
    Step 17/19 : RUN chmod +x /run.sh
    ---> Running in 6f2534932e4b
    ---> 28450b4cf99f
    Removing intermediate container 6f2534932e4b
    Step 18/19 : RUN date
    ---> Running in e2dee7143445
    Thu Dec 14 13:20:07 UTC 2017
    ---> ff1b4844f2ba
    Removing intermediate container e2dee7143445
    Step 19/19 : CMD /run.sh
    ---> Running in d0922815e090
    ---> a53821bb5d32
    Removing intermediate container d0922815e090
    Successfully built a53821bb5d32


    So basically Pipelines is 50 times slower than a laptop!
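
    For anyone wanting to apply the same wall-clock bracketing at the pipeline level rather than inside the Dockerfile, here is a rough bitbucket-pipelines.yml sketch (the image tag and build command are placeholders, not taken from this thread):

    pipelines:
      default:
        - step:
            name: Timed docker build
            services:
              - docker                                  # docker-in-docker service for docker build
            script:
              - date                                    # wall-clock time before the slow command
              - docker build -t myorg/myapp:latest .    # placeholder tag; any slow command works
              - date                                    # wall-clock time after the slow command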

  17. Nick Boultbee

    Agreed. And at almost [EDIT: 1 year, 5 months] old, how come this ticket is still at Priority: Minor??

  18. Barnabás Gema reporter

    Actually I posted the ticket 1 year and 5 months ago, so it is even more disturbing. With this quality of service there is really no point in using the product. After a few trials and errors we have moved over to our self-hosted Jenkins, which has an almost identical and well-functioning Pipeline feature.

  19. Aneita Yang staff

    Hi everyone,

    Thanks for your interest in this issue and for your patience. When we look at our open tickets, the number of votes that a ticket has plays much more of a role than the priority it is assigned. Over the past couple of months, the team has been working on issues with a much higher number of votes.

    In the past few weeks, we've looked at a range of solutions for the variance in build time. Unfortunately, the solution isn't as simple as just restricting the CPU available to each build. While this makes build times more consistent, since every pipeline is allocated the same amount of resources regardless of when it runs, it also means that builds could be slower overall: a build that previously had no limit on the CPU it could use would now be capped. We will continue to investigate this issue and experiment with different solutions in the new year. I'll keep you updated on our progress via this issue.

    Thanks again for your patience.

    Aneita

  20. Paul Carter-Brown

    Hi Aneita,

    Is this not just a matter of adding more processing power in your hosting environment?

  21. Andrew Kao

    Pipelines has been too slow recently and it's driving our team crazy.

    The thing is, we are willing to pay more to get a more powerful environment.

    We are even considering moving to GitHub and doing CI/CD with Travis or Circle if the situation stays the same.

  22. Matt Ryall

    @andrew_kao - sorry to hear that. Our goal is to run your builds as quickly as possible. We'd like to get a bit more info about what exactly is slow in your builds, so could you please open a support case so we can take a look at your specific situation?

    Here are a couple of general suggestions that might help:

    • We recently added support for large builds, which get double the memory and CPU allocation of normal builds for double the cost. This will almost certainly speed up your build if you're not using it yet. (We're also experimenting with ways of enforcing this CPU allocation better in the next week or two -- the project @aneita mentioned above -- which should further help large builds.)
    • We also recently added Docker layer caching, which should speed up builds and subsequent use of Docker images across multiple steps/builds.
    • When using build images hosted on a private registry, or other assets that need to be downloaded for the build, it may be faster to move them to a location closer to where the Pipelines cluster runs in the AWS us-east region.
    • We recently added parallel steps, so you can split long testing tasks into separate steps that run in parallel.
    • We've considered the idea of even larger builds (3x, 4x), with additional memory or CPU allocation depending on what you need, but don't currently have a feature request open for it. Feel free to open one if this sounds interesting and let us know what you're after.

    We haven't seen a general slowdown in our performance metrics here, but perhaps we're missing something specific to your situation. We can investigate this best through a support ticket.
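
    A rough sketch of how the large build size, Docker layer cache, and parallel steps mentioned above might be combined in a bitbucket-pipelines.yml (the image, step names, and scripts are placeholders, not a verified configuration):

    image: atlassian/default-image:latest   # top-level build image inherited by every step

    options:
      size: 2x                  # large builds: double the memory/CPU allocation, double the cost

    pipelines:
      default:
        - step:
            name: Build Docker image
            services:
              - docker          # docker-in-docker service
            caches:
              - docker          # Docker layer cache
            script:
              - docker build -t myorg/myapp:latest .   # placeholder image tag
        - parallel:             # long test suites split into steps that run at the same time
            - step:
                name: Unit tests
                script:
                  - ./run-unit-tests.sh               # placeholder script
            - step:
                name: Integration tests
                script:
                  - ./run-integration-tests.sh        # placeholder script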

  23. Uri Harduf

    Hey @mryall_atlassian .

    I got here through a support ticket. For the past few weeks we've been seeing our builds take longer: a build can take 10 minutes or it can take 20 minutes. When a build takes longer, all of the steps, including the cloning of the repo and the docker image pull, take longer as well. This is crucial for us, since we're paying per build minute and that cost is changing drastically, and we would also rather have faster builds.

    Also, I personally think 2x resources means a lot when it comes to memory, but I can't tell from the documentation whether 2x means you get 2 cores, or just a higher priority when competing with other Docker instances and therefore more CPU time on 1 core.

    I would simply expect 1x resources to give me a CPU with a constant speed equivalent to an average modern server core (so if I run the tests on my MacBook Pro, they'll run about as fast in the build; otherwise 1x would never be good enough for me), and 2x resources to give me more cores, as at other cloud services.

    So parallel steps or a larger build would not help if the CPU given to the docker instance is not constant and at least as fast as an average server with 1 core.

    Thanks

  24. Brad Humphrey

    We have huge variances in build time as well. We have builds that take between 25 and 60 minutes (total time using parallel build steps). It seems like most of the trouble is caused by variations in network capacity. Builds sometimes slow down dramatically when running npm install, downloading assets from Amazon S3, or pushing built docker images. I've seen download speeds vary between 1 MB/s and 50 MB/s.

  25. Kevin Cabrera

    Experiencing constant slowdowns for the past week as well. Before, our usual build varied from 2-4 minutes per deployment, but now it's consistently taking 30 mins or more. :-( We have multiple projects that use Bitbucket Pipelines, and every one of them is affected by the slowdown!! Tried increasing the size to 2x but still no use.

    Performance 2 months ago vs now
    Screen Shot 2018-06-28 at 2.51.32 PM.png

  26. Raul Gomis staff

    Hey,

    Sorry to hear that your builds are taking longer recently. We are currently investigating a performance degradation in our platform since June that might be the cause of some slow pipelines. We are not yet sure whether it is due to build variance or to the recent CoreOS and Docker upgrades in the platform, but we are working to isolate the issue and fix it. I suggest you submit a support case so that we can investigate your specific cases.

    Regarding build variance, one of the Pipelines team's priorities is to run builds as fast as possible, as we believe that is best practice for CI/CD. So we are testing different approaches to achieve more predictable build times, but it's not as easy as limiting the CPU, as we don't want to hurt build speed (keeping build time low remains our main priority).

    On 6th/7th June we ran an experiment limiting CPU in our shared infrastructure in order to test how much it would affect build variance / predictability and build time. The experiment showed that not only was build time affected (builds got slower), but build variance also got worse. This is because our main bottleneck is either networking or disk (like CPU, these are shared resources handled on a best-effort basis). We will soon be experimenting with different configurations and limits for disk and/or networking.

  27. Shawn McKnight

    I can echo that we've had pipeline timings with enormous variances. I had a pipeline time out after 30 minutes, then complete in 7 when re-run immediately afterwards. Another pipeline for a very similar project/build then finished in 73 seconds.

  28. Tanju Erinmez

    @rgomish, I was redirected to this thread from a support case where we "packaged up" info on a hefty variance case for BB to analyse.

    Do you guys have any updates on your further experiments? If it helps, feel free to have a look at the material in BBS-82245.

    Thanks, Tanju

  29. Raul Gomis staff

    Hi everyone,

    Sorry for any inconvenience caused by this specific issue. After reverting all platform changes (downgrading CoreOS and Docker versions in our Kubernetes cluster and moving back from M5 to M4 EC2 instance types in AWS) the intermittent variance problem seems to persist based on our internal tests and analytics.

    Next steps are:

    • Today, we identified a potential race condition in the docker image clean-up of our internal caches that might affect build performance. As a result, a small percentage of builds might have had bigger variances when pulling images. The fix is already in review. We'll deploy it on Monday and keep monitoring to see how it improves overall performance.
    • We are improving our internal metrics to find other potential build performance enhancements.
    • We have recently shipped docker-in-docker image caching. For docker-in-docker, we now cache / pull public docker images from our internal docker registry, which means that we don't need to pull from Docker Hub (or another public docker registry) every time. This enhancement will significantly improve build time when using docker run, especially with big docker images.

    I'll keep you updated with any other fixes / enhancements that might improve build performance.

    Regards,
    Raul

  30. Raul Gomis staff

    Hi @gilday,

    It's enabled by default for everyone, so if you are using docker run, you'll be using it without adding any extra configuration to your bitbucket-pipelines.yml file. With this feature we now cache docker images for docker-in-docker in a similar way to how we were already caching docker images for the step.

    The predefined docker image cache is used for caching / pulling intermediate layers when building a docker image. We decided to make it opt-in as there are some special cases in which builds might be slower if not following good Docker practices. More details here.

  31. Johnathan Gilday

    @rgomish does this mean that if I have a build that does not build images but does pull images using the docker service, the predefined docker image cache is not effective?

    The predefined docker image cache is used for caching / pulling intermediate layers when building a docker image

  32. Raul Gomis staff

    @jgilday, docker layer caching also caches intermediate docker layers that are pulled when executing docker run, docker pull, etc. However, the most interesting use case is caching previously built layers when building docker images (bear in mind it's limited to 1GB).

    As we have recently shipped the pull-through image cache for DinD, enabling docker layer caching just for pulling images (docker run or similar) does not make sense for public images, as we now cache the whole image (if public) in our internal registry. What's more, it would be better not to enable it just for pulling images, as there is some build-time overhead in downloading and uploading cached layers that you don't need.

    My suggestion is to enable docker layer caching only if you are building images or pulling private images (in those scenarios the 1GB cache would help speed up your build). If not, it's better not to enable it and to rely on our automatic pull-through image cache for DinD.
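
    As a rough illustration of that suggestion (step names, image tags, and commands are placeholders), the opt-in cache would go only on the step that builds an image:

    pipelines:
      default:
        - step:
            name: Build image            # builds an image, so the 1GB layer cache can help
            services:
              - docker
            caches:
              - docker                   # opt in to the predefined Docker layer cache
            script:
              - docker build -t myorg/myapp:latest .    # placeholder tag
        - step:
            name: Run a public image     # only pulls a public image; rely on the automatic
            services:                    # pull-through image cache instead of the layer cache
              - docker
            script:
              - docker run --rm myorg/public-tool:latest    # placeholder public image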

    Thanks for your interest! We'll also clarify those specific details in our docs.

    Regards,
    Raul

  33. Johnathan Gilday

    @rgomish thanks for the detailed explanation - this helps me and my team understand when to use the cache

  34. Raul Gomis staff

    Hi everyone,

    I'm pleased to announce that we have recently shipped (1st August) a platform change that has hugely improved the performance for the majority of the builds:

    • We have swapped our Kubernetes nodes from EC2 M4 instance types to M5d's. M5d instances use NVMe drives (instead of EBS volumes), which are much faster and are located on the underlying compute hardware, so there is no overhead from transferring data to the drives over a storage network.
    • With that, as well as some other performance improvements / fixes, we have fixed most of the intermittent performance issues of the last couple of months. However, if you are still experiencing any issues, I suggest you submit a support case so that we can investigate them.

    Next steps are:

    • Regarding build variance, there may still be some due to the fact that we run on shared infrastructure. Now, after the M5d changes, our main bottleneck is CPU (instead of IO), so we will again run experiments limiting CPU in order to test how much it affects build variance and build time. This will help us determine the best balance between predictability and speed on shared infrastructure.

    I'll keep you updated with any other enhancements that might improve build performance and variance.

    Regards,
    Raul

  35. Remo Meier

    Is there any update on this topic? We also switched to Bitbucket Pipelines two weeks ago, but performance is not consistent. For example, today the build setup alone took 1:30 minutes one time and 15:00 minutes another.

  36. Raul Gomis staff

    Hi @remom,

    Sorry for any inconvenience caused. My team is already working on investigating the performance issue on build setup time: we have released a configuration change that might improve the scheduling / setup time for builds and we are monitoring the performance of the cluster closely.

    I'll keep you updated on how it goes and any other enhancements that might improve build performance and variance.

    Regards,
    Raul

  37. Raul Gomis staff

    Hi all,

    Quick status update:

    On 11th January 2019, we released a configuration change that has hugely improved the build setup time performance for the majority of the builds: we have tuned our Kubernetes autoscaler algorithm to make sure we have enough space to schedule new builds without affecting setup time performance. We can see a huge reduction in scheduling latency and variance in our analytics.

    We'd love to hear your feedback about the performance / variance in Bitbucket Pipelines after the recent performance improvements.

    Regards,
    Raul

  38. Matt Schaub

    I think having an option to use the same image across build steps would help a lot. I want to break my pipeline up into multiple steps but have no need for each step to pull down the same image and start from scratch.
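
    For what it's worth, the image only needs to be declared once at the top level for every step to inherit it, though each step still starts a fresh container (and pulls the image if it isn't already cached on the node), so this doesn't remove the per-step startup cost being asked about. A minimal sketch with placeholder names:

    image: node:10              # declared once; all steps below inherit it

    pipelines:
      default:
        - step:
            name: Install and test
            script:
              - npm ci          # placeholder commands
              - npm test
        - step:
            name: Build
            script:
              - npm run build   # runs in a fresh container of the same image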
