Currently there's only one regex for each cleanup, which means the cleanups that match multiple sites end up with long regexes that are hard to work with. I'm not sure whether there are any better ideas, but I thought I'd try turning each regex into an array of regexes.
I added a very short "test_all" function, which tests every regex in the array and returns true if any of them match, and changed everything to use that instead. Then I split the regexes where I thought it made sense to, which is all the ones that cover multiple unrelated sites.
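For anyone skimming, here is a rough sketch of the idea rather than the actual code in this change. The argument order and the `exampleMatch` patterns are made up for illustration; only the name "test_all" and its behaviour (true if any regex in the array matches) come from the description above.

```javascript
// Hypothetical sketch of a "test_all" helper; the signature is an assumption.
// Returns true as soon as any regex in the array matches the text.
function test_all(regexes, text) {
    for (var i = 0; i < regexes.length; i++) {
        if (regexes[i].test(text)) {
            return true;
        }
    }
    return false;
}

// Illustrative patterns only (not real cleanup entries): one short regex
// per unrelated site instead of a single long alternation.
var exampleMatch = [
    new RegExp("^(https?://)?([^/]+\\.)?example\\.com/", "i"),
    new RegExp("^(https?://)?([^/]+\\.)?example\\.org/", "i")
];

var matched = test_all(exampleMatch, "http://www.example.org/artist/1");
```

One thing to keep in mind with this shape: a regex created with the `g` flag remembers `lastIndex` between `test` calls, so the regexes in these arrays are best left without that flag.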
I did consider reordering things so that they're more alphabetical, but I figured I'd make as few changes as possible here to make it easier to review.
I mostly tested it using http://nikki.mbsandbox.org/static/scripts/tests/all.html (in Opera 12, Firefox 24, some Chromium 33 build, and Safari 5.1, all on OS X), making sure that the only errors that appeared were ones that were already there before I started. I also did a small amount of testing by typing and pasting various URLs into the add entity pages in Chromium to check that it appeared to be working as expected.