Improve options for managing repository size

Issue #17387 (new), created by John McConnell

Since Bitbucket limits repositories to 2 GB, it is important to manage repository size. Binaries (especially large ones) should be kept out of the repository, and dependencies should come from a package manager.

But mistakes happen. Binaries get checked in by accident. Old projects have dependencies checked in from before a language's package manager became popular, and they can't be easily removed without breaking the project on those commits. Some dependencies aren't available via a package manager, or a custom build with source code modifications was checked in and there isn't time or budget to fix that right now.

Bitbucket's interface makes dealing with these situations difficult. Garbage collection happens entirely behind the scenes; I can't even tell when it last completed. I have no idea what options it ran with. I don't even know which commands were invoked. Even once you've corrected a mistake and deleted some binary from the repository (including historical commits), it can still linger in the internal storage files for some time, maybe forever if the right cleanup commands are never invoked.

Git provides a number of useful commands for cleaning up garbage from the repository:

  • git gc
  • git gc --aggressive
  • git repack (with arguments)
  • git prune

Some of them are discussed in more detail here.
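
For comparison, here is a minimal sketch of how these commands might be combined in a local clone to see how much space a cleanup reclaims. It assumes the unwanted files have already been removed from history. The function names repo_size_kib and cleanup_and_report are hypothetical, chosen only for illustration. This affects only the local clone; it says nothing about what Bitbucket runs on the server, which is the gap this issue is about.

```shell
#!/bin/sh
# Sketch only: measure what local cleanup reclaims in a clone.
# Assumes the unwanted files were already removed from history.

# Read `git count-objects -v` output on stdin and print the total on-disk
# object size in KiB (loose objects plus packs).
repo_size_kib() {
    awk -F': ' '$1 == "size" || $1 == "size-pack" { total += $2 } END { print total + 0 }'
}

# Run the cleanup commands in the repository at $1 and report the object
# size before and after. The function name is hypothetical.
cleanup_and_report() {
    repo=$1
    before=$(git -C "$repo" count-objects -v | repo_size_kib)
    # Expire reflog entries so rewritten commits are no longer kept alive by reflogs.
    git -C "$repo" reflog expire --expire=now --all
    # Repack aggressively and prune unreachable objects immediately.
    git -C "$repo" gc --aggressive --prune=now
    after=$(git -C "$repo" count-objects -v | repo_size_kib)
    echo "objects: ${before} KiB -> ${after} KiB"
}
```

Calling cleanup_and_report with the path to a clone prints the object size before and after. Expiring the reflog and pruning with --prune=now discards any way to recover the old commits locally, so this is best tried on a throwaway clone first.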

Bitbucket provides no useful interface for invoking these or understanding what has been invoked on the repository. Bitbucket does not need to provide direct access to the repository on the server, of course, but the available tooling needs some serious improvement.

  1. Be more transparent. The admin interface should tell me when a garbage collection last completed and which command was run. It doesn't have to name the full command, but some kind of indication would be helpful. It should also tell me when the next one is scheduled to run under normal circumstances.
  2. Ideally, I could tell the system to invoke one immediately when I think it's needed. Having limitations on this to prevent abuse would be fine.
  3. There should be some option to make the system run a more aggressive cleanup, like git gc --aggressive or maybe some form of git repack. This would be especially useful for mistakes that happened recently and can be cleaned up by modifying recent history. Even if I can't invoke one manually, I could at least tell the system to be more aggressive on the next scheduled run.