Unblob the repository

Issue #13 resolved
Marco van Zwetselaar created an issue

Hi Philip,

At my end, pulling the KMA repo takes close to an hour 😲. This is part BitBucket, part Africa, but also … it’s a whopping 200MB!

This is mainly due to this bunch in the history:

        ResFinder.fsa      | b52ec5ed (2.0 MB), 176114ba (2.0 MB)
        all_databases.fsa  | 23a43bee (1.9 MB), 5555003d (1.9 MB)
        beta-lactamase.fsa | 4804d572 (1.1 MB), 6d61d37d (1.2 MB)
        bl2seq             | 3f9309bb (12.0 MB)
        blast_formatter    | 5cb1200a (37.2 MB)
        blastall           | 9419f510 (12.0 MB)
        blastclust         | 92ba65ce (10.8 MB)
        blastdb_aliastool  | 23fc2d58 (23.2 MB)
        blastdbcheck       | cd4b8a28 (26.3 MB)
        blastdbcmd         | 1ea89d02 (33.0 MB)
        blastn             | b10ac9f9 (37.2 MB)
        blastp             | 56cc7479 (37.2 MB)
        blastpgp           | b4ba1852 (11.2 MB)
        blastx             | ba3894e4 (37.2 MB)
        convert2blastmask  | 0433f0c5 (24.9 MB)
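For anyone wanting to produce a similar overview themselves, here is a minimal sketch using plain git. It is not necessarily how the list above was made, and the function name `largest_blobs` is just an illustration. It prints size in bytes, object type, object id and path for the biggest blobs reachable from any ref, largest first.

```shell
# List the largest blobs anywhere in the history of a repository.
# $1 = path to the repository (default: current directory)
# $2 = number of entries to show (default: 15)
# The function name is hypothetical; only standard git, awk, sort and head are used.
largest_blobs() {
    repo="${1:-.}"
    count="${2:-15}"
    # rev-list prints "<id> <path>" for every object reachable from any ref;
    # cat-file adds size and type, and %(rest) carries the path through.
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objectsize) %(objecttype) %(objectname) %(rest)' |
        awk '$2 == "blob"' |
        sort -rn |
        head -n "$count"
}

# Example (run next to a clone): largest_blobs kma.git 15
```

Because it walks all refs, this also finds files that live only on a side branch or were deleted long ago.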

There’s a tool, BFG Repo-Cleaner, which does a good job of the cleaning. It doesn’t kill history, but of course it does have to rewrite every commit from the point where those files first appear. (It adds a line Former-commit-id: 04355e40af35119f06c5675e7a61e1f7fa00629a to each rewritten commit message, so you can still trace every commit back to old repository copies.)

# Download BFG
wget 'https://repo1.maven.org/maven2/com/madgag/bfg/1.13.0/bfg-1.13.0.jar'

# Mirror clone the repository
git clone --mirror git@bitbucket.org:genomicepidemiology/kma.git

# Clean out all blobs over 1MB (the mirror clone is a bare repository named kma.git)
java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 1M kma.git

# Pack the repository
cd kma.git
git reflog expire --expire=now --all && git gc --prune=now --aggressive

# Push back to BitBucket (this will update all branches)
git push

After the push, best to tell people to reclone the repository. But that’s a breeze, because it’s down to … 2.7MB 🙂
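To check the size of a clone before and after a cleanup, something like the following should do. This is only a sketch and the function name `repo_size` is made up for the example; the `size-pack` line in the output is the packed object store, which is roughly what a fresh clone has to transfer.

```shell
# Print object-store statistics for the repository at $1
# (default: current directory), with human-readable sizes.
# The function name is hypothetical; count-objects is standard git.
repo_size() {
    git -C "${1:-.}" count-objects -vH
}

# Example: repo_size kma.git
```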

Comments (2)

  1. ptlcc

    Hi Marco

    A colleague created an extra branch by mistake, containing ResFinder and all its dependencies; those are the files you identified as the main contributors to the exploding size.

    I have just deleted this branch, and the repo with all its history is down to 4.8 MB again.

    Best,
    Philip

  2. Marco van Zwetselaar reporter

    Yay, that fixed it!

    (BTW bitbucket is still horridly slow compared to GitHub, maybe they don't have proxies near here.)

    Thanks, Marco
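For reference, deleting a stray branch as described in the first comment can be sketched from the command line as below. The thread does not say how the branch was actually removed or what it was called, so the function name and its arguments are placeholders. Once the branch is gone, a fresh clone no longer fetches the objects that were reachable only from it.

```shell
# Delete a branch on the remote, then prune stale remote-tracking refs locally.
# $1 = remote name (e.g. origin), $2 = branch name (placeholder, not from the thread)
delete_remote_branch() {
    git push "$1" --delete "$2" &&
        git fetch --prune "$1"
}

# Example: delete_remote_branch origin some-stray-branch
```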
