This code scrapes a vBulletin forum into a MongoDB database and includes a file for accessing that database with the mongoengine Python library.

The code is buggy! (See below.)

I created it for a class project. It could be the basis of something much more robust, so feel free to do with it as you please. Just keep in mind that the code is provided as-is!


"I want to use this code and make it better!" you say?


Let me know, and then I'll feel super motivated to add some inline documentation. After that, I'll add you as a committer.

What's wrong with it

  • Unicode.
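    • I'm using data_encoded.decode("utf-8","ignore"). This silently drops any bytes that aren't valid UTF-8, so in practice non-ASCII characters are simply removed from the text (probably because the pages aren't actually served as UTF-8). We should do something better but I've never wrapped my head around encodings in Python. :(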
  • Lots of other stuff.
    • The forum I scraped using this has some statistics about total numbers of threads, posts, etc. The numbers I got after scraping don't match. Probably my fault?
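
A possible direction for the Unicode problem, as a minimal sketch rather than a tested fix: instead of ignoring undecodable bytes, try the charset the page declares, then strict UTF-8, and only then fall back to a single-byte encoding. The helper name decode_page, the regex, and the cp1252 fallback are all assumptions for illustration. None of them exist in the scraper, and I haven't checked which encoding the forum really uses.

```python
import re

# Hypothetical helper, not part of the scraper. Looks for a declared charset
# such as <meta charset="..."> or content="text/html; charset=...".
_META_CHARSET = re.compile(
    rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def decode_page(data_encoded, fallback="cp1252"):
    """Decode raw page bytes without silently dropping characters.

    Order of attempts (an assumption, adjust for the forum at hand):
    1. the charset declared near the top of the page, if any
    2. strict UTF-8
    3. the single-byte fallback, with undecodable bytes replaced by U+FFFD
    """
    candidates = []
    match = _META_CHARSET.search(data_encoded[:2048])
    if match:
        candidates.append(match.group(1).decode("ascii"))
    candidates.append("utf-8")

    for name in candidates:
        try:
            return data_encoded.decode(name)
        except (UnicodeDecodeError, LookupError):
            # Wrong or unknown encoding name: try the next candidate.
            continue

    # Last resort: keep the text readable and mark bad bytes visibly
    # instead of removing them.
    return data_encoded.decode(fallback, errors="replace")
```

With this, the byte string b"caf\xe9" comes out as "café", where decode("utf-8","ignore") would return "caf". The fallback can still guess wrong for forums in other encodings, so treat the result as best-effort.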


This code is provided under the WTFPL.