Bag of Words Collision Example
This repository contains:
- Perl scripts for scraping the relevant source data from the AUSTLII website
- Python scripts to:
  - Compare unique word frequencies against unique bi-gram frequencies in the corpus
  - Find collisions between phrases of length n that share the same bag of words in a different order
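The two Python tasks above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the function names (`ngrams`, `find_bag_collisions`), the whitespace tokenisation, and the sample sentence are all assumptions made for the example. It keys each phrase on its sorted words rather than on a hash value.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_bag_collisions(tokens, n):
    """Group distinct n-token phrases that use the same words in a different order."""
    groups = defaultdict(set)
    for phrase in ngrams(tokens, n):
        # Sorting the words gives an order-independent key for the bag.
        groups[tuple(sorted(phrase))].add(phrase)
    # Keep only bags that are reached by more than one distinct ordering.
    return {bag: phrases for bag, phrases in groups.items() if len(phrases) > 1}


# Hypothetical sample text; the real corpus is the scraped AUSTLII data.
tokens = "the dog bit the man and the man bit the dog".split()

unigram_counts = Counter(ngrams(tokens, 1))  # unique word frequencies
bigram_counts = Counter(ngrams(tokens, 2))   # unique bi-gram frequencies
collisions = find_bag_collisions(tokens, 3)  # phrases of length n = 3

print(len(unigram_counts), "unique words,", len(bigram_counts), "unique bi-grams")
for bag, phrases in collisions.items():
    print(bag, "<-", sorted(phrases))
```

Comparing `len(unigram_counts)` with `len(bigram_counts)` shows how much more specific bi-grams are than single words, while `collisions` lists the phrases a bag-of-words model cannot tell apart.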
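Note that Python's built-in hash function is used to find "bag-of-words" collisions, and the scripts do not account for genuine hash collisions. In principle, two different bags of words that happen to hash to the same value would be treated as the same bag.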
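If genuine hash collisions ever became a concern, one option would be to re-check each hash bucket against the real bag before reporting a collision. The sketch below is an assumption about how that could look; the helper names (`bag_key`, `group_by_hash`, `verified_collisions`) are hypothetical, and the repository's scripts may hash a different representation of the bag.

```python
from collections import defaultdict


def bag_key(phrase):
    """Order-independent representation of a phrase's words (a sorted tuple)."""
    return tuple(sorted(phrase))


def group_by_hash(phrases):
    """Group phrases by hash(bag_key); distinct bags may end up in one bucket."""
    buckets = defaultdict(set)
    for phrase in phrases:
        buckets[hash(bag_key(phrase))].add(tuple(phrase))
    return buckets


def verified_collisions(phrases):
    """Split each hash bucket by the real bag so hash collisions are not reported."""
    result = {}
    for bucket in group_by_hash(phrases).values():
        by_bag = defaultdict(set)
        for phrase in bucket:
            by_bag[bag_key(phrase)].add(phrase)
        for bag, members in by_bag.items():
            # Only report bags reached by more than one distinct ordering.
            if len(members) > 1:
                result[bag] = members
    return result


phrases = [("a", "b"), ("b", "a"), ("c", "d")]
print(verified_collisions(phrases))

# In CPython, hash(-1) == hash(-2), so these two one-item "phrases" share a
# hash bucket even though their bags differ; the verification step drops them.
print(verified_collisions([(-1,), (-2,)]))
```

Using the sorted tuple itself as the dictionary key would avoid the issue entirely, at the cost of keeping every bag in memory.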
As this was a quick and dirty exercise, the data has not been cleaned very well: only a basic regex is used to strip HTML tags. This leaves some non-natural language in the corpus (AUSTLII links, headers, etc.), which is excluded manually.
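For illustration, a basic regex tag stripper of the kind described might look like the sketch below. The pattern and the `strip_tags` name are assumptions, not the repository's actual code, and a regex like this does not remove link text, header text, or script contents, which is why such leftovers remain in the corpus.

```python
import re

# Assumed pattern: anything between "<" and the next ">" is treated as a tag.
TAG_PATTERN = re.compile(r"<[^>]+>")


def strip_tags(html):
    """Replace each HTML tag with a space, then collapse runs of whitespace."""
    text = TAG_PATTERN.sub(" ", html)
    return " ".join(text.split())


print(strip_tags("<p>Hello <b>world</b></p>"))
```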
In addition, two phrases will be considered "bag-of-words" collisions if they differ only in the arrangement of punctuation, whitespace, etc. These should probably be considered identical (and therefore non-collisions) for the purposes of demonstration, but I decided it was not worth the effort to exclude them; the only real downside is additional noise in the output.
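One possible way to filter out this noise, which the scripts here do not do, is to compare only the word tokens of each phrase. The sketch below is hypothetical: the `WORD` pattern, `words_only`, and `is_true_collision` are illustrative names, and the definition of a "word" is an assumption.

```python
import re

# Assumed definition of a word: a run of letters, digits, or apostrophes.
WORD = re.compile(r"[A-Za-z0-9']+")


def words_only(phrase):
    """Return the phrase's words in order, ignoring punctuation and whitespace."""
    return tuple(WORD.findall(phrase))


def is_true_collision(a, b):
    """True if a and b share the same bag of words but order the words differently."""
    wa, wb = words_only(a), words_only(b)
    return wa != wb and sorted(wa) == sorted(wb)


print(is_true_collision("the court, held", "the court held,"))  # punctuation moved only
print(is_true_collision("dog bites man", "man bites dog"))      # words reordered
```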
All code is licensed under the MIT licence: https://opensource.org/licenses/MIT