
Bag of Words Collision Example

This repository contains:

  • Perl scripts for scraping the relevant source data from the AUSTLII website

  • Python scripts to:

    • Compare unique word frequencies against unique bi-gram frequencies in the corpus
    • Find collisions between phrases of length n that share the same bag of words in different order
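
The two Python tasks above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the function names (`ngram_counts`, `find_bag_collisions`) and the toy sentence are assumptions made for the example, and each bag of words is represented as a sorted tuple used as a dictionary key.

```python
from collections import Counter, defaultdict


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def find_bag_collisions(tokens, n):
    """Group n-token phrases by bag of words and keep the groups that
    contain more than one distinct word order (hypothetical helper)."""
    groups = defaultdict(set)
    for i in range(len(tokens) - n + 1):
        phrase = tuple(tokens[i:i + n])
        # A bag of words ignores order, so a sorted tuple is a hashable
        # canonical form for it.
        groups[tuple(sorted(phrase))].add(phrase)
    return {bag: phrases for bag, phrases in groups.items() if len(phrases) > 1}


# Toy input, assumed for illustration only.
tokens = "the dog bit the man and the man bit the dog".split()

unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
print(len(unigrams), "unique words;", len(bigrams), "unique bi-grams")

collisions = find_bag_collisions(tokens, 3)
for bag, phrases in collisions.items():
    print(bag, "->", sorted(phrases))
```

Comparing `len(unigrams)` with `len(bigrams)` gives the word versus bi-gram comparison, and each entry of `collisions` is one bag of words that appears in the text in more than one order.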

Note that Python's built-in hash function is used to find "bag-of-words" collisions, but no allowance is made for genuine hash collisions, where two different bags of words happen to hash to the same value.
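
To make the caveat concrete, here is a small sketch of hashing a bag of words with the built-in `hash`, plus one way a script could guard against a genuine hash collision by comparing the bags themselves. The helper names `bag_hash` and `same_bag` are hypothetical and are not taken from the scripts here.

```python
def bag_hash(phrase):
    """Hash a phrase (a list of words) independently of word order."""
    # Sorting first means every reordering of the same words gives the
    # same tuple, and therefore the same hash within one Python process.
    return hash(tuple(sorted(phrase)))


def same_bag(a, b):
    """Equal hashes only suggest a shared bag; comparing the sorted
    words confirms it and rules out an accidental hash collision."""
    return bag_hash(a) == bag_hash(b) and sorted(a) == sorted(b)


print(same_bag(["dog", "bites", "man"], ["man", "bites", "dog"]))
print(same_bag(["dog", "bites", "man"], ["dog", "bites", "cat"]))
```

String hashes are randomised per process by default, so the raw hash values should only be compared within a single run.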

As this was a quick and dirty exercise, the data has been cleaned only lightly, using a basic regex to strip HTML tags. Some non-natural-language text (AUSTLII links, headers, etc.) therefore remains in the corpus and is excluded manually.
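
For readers unfamiliar with this kind of cleaning, a basic tag-stripping regex might look like the sketch below. The pattern and the `strip_tags` name are assumptions for illustration; the regex actually used in this repository may differ.

```python
import re

# Matches anything that looks like an HTML tag: "<", then one or more
# characters that are not ">", then ">". It does not handle comments,
# scripts, or entities such as "&amp;".
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Replace tags with spaces, then collapse runs of whitespace."""
    text = TAG_PATTERN.sub(" ", html)
    return " ".join(text.split())


print(strip_tags("<p>Hello <b>world</b></p>"))
```

A pattern this simple removes the markup but keeps the text inside navigation links and headers, which is why that text still has to be excluded by hand.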

In addition, two phrases are reported as "bag-of-words" collisions when they differ only in the arrangement of punctuation, whitespace, etc. Such phrases should probably be treated as identical (and therefore as non-collisions) for the purposes of the demonstration, but I decided it was not worth the effort to exclude them; the only real downside is extra noise in the output.
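
If that noise ever became a problem, one possible filter is sketched below. It is not implemented in this repository; the tokenising pattern and the names `words_only` and `is_real_collision` are hypothetical.

```python
import re


def words_only(phrase):
    """Lower-case the phrase and keep only word-like tokens, dropping
    punctuation and whitespace."""
    return re.findall(r"[a-z0-9']+", phrase.lower())


def is_real_collision(a, b):
    """True when the two phrases share a bag of words but the words
    themselves appear in a different order."""
    wa, wb = words_only(a), words_only(b)
    return sorted(wa) == sorted(wb) and wa != wb


# Only the comma moves, so this pair is noise rather than a collision.
print(is_real_collision("however , the court", ", however the court"))
# The words are reordered, so this pair is a genuine collision.
print(is_real_collision("the court held", "held the court"))
```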

All code is licensed under the MIT licence: https://opensource.org/licenses/MIT