wordhound can't spider website
Whenever I try to spider a website to generate a wordlist, I get the following traceback:
    Traceback (most recent call last):
      File "./Main.py", line 333, in <module>
        main()
      File "./Main.py", line 59, in main
        generation()
      File "./Main.py", line 141, in generation
        industrySelected(options[choice])
      File "./Main.py", line 169, in industrySelected
        clientSelected(optionsList[choice-1], industry)
      File "./Main.py", line 228, in clientSelected
        newClientOptions(c)
      File "./Main.py", line 271, in newClientOptions
        client.buildDictionary(int(recursionLevel))
      File "<basedir>/wordhound/client.py", line 151, in buildDictionary
        self.crawl(rLevels)
      File "<basedir>/wordhound/client.py", line 49, in crawl
        self.crawledText = webcrawler.urlreport(b=self.url, d=recursionLevels, t=1)
      File "<basedir>/wordhound/crawler.py", line 867, in urlreport
        if b: self.weburls(b, w, d, t)
      File "<basedir>/wordhound/crawler.py", line 657, in weburls
        newbase, rawurls = self._webopen((base, ''))
    TypeError: 'bool' object is not iterable
Comments (11)
-
repo owner -
reporter I opted to create a new industry, a new client, and then "Generate a Dictionary from a website". I gave the URL in both domain-only format and http://domain/ format and received the same output. However, I also just noticed that if I do not specify a recursion level (which one would assume would cause wordhound to use the default), I get this error:
    Traceback (most recent call last):
      File "./Main.py", line 333, in <module>
        main()
      File "./Main.py", line 59, in main
        generation()
      File "./Main.py", line 141, in generation
        industrySelected(options[choice])
      File "./Main.py", line 169, in industrySelected
        clientSelected(optionsList[choice-1], industry)
      File "./Main.py", line 228, in clientSelected
        newClientOptions(c)
      File "./Main.py", line 271, in newClientOptions
        client.buildDictionary(int(recursionLevel))
    ValueError: invalid literal for int() with base 10: ''
To me this looks like wordhound is trying to parse an int out of a value that is actually null (an empty string). So that is something else you may want to look into.
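A minimal sketch of what the traceback shows: int('') raises exactly this ValueError, so a blank answer at the prompt needs to fall back to a default before conversion. The function name and the default of 2 below are illustrative assumptions, not wordhound code.

```python
def parse_recursion_level(raw, default=2):
    # int('') raises "ValueError: invalid literal for int() with base 10: ''",
    # so fall back to the default when the prompt was left blank.
    raw = raw.strip()
    return int(raw) if raw else default

print(parse_recursion_level("3"))  # user typed 3
print(parse_recursion_level(""))   # user just pressed Enter -> default
```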
-
Pull request #2: still not able to crawl any reasonable number of words from a given website.
DICTIONARY GENERATION FAILED
-
Same issue here, guys. Using Kali Linux 2.0. If I don't specify any recursion level, I get the same error as above. If I do specify a recursion level with a correct URL, I just get the DICTIONARY GENERATION FAILED error. It actually creates the new client, folder structure, etc., but the dictionary is completely empty. Has anybody else managed to get this working? Is it just a Kali Linux issue, or a general issue? Thanks.
-
I too get a similar issue when hitting the dictionary generation step. Any idea if this will be fixed or when wordhound 2.0 will be out?
-
Seeing similar issues here. Pulled from https://bitbucket.org/mattinfosec/wordhound.git
After editing Main.py and Client.py to fix the attempted integer conversion of '', I am still seeing:
DICTIONARY GENERATION FAILED
Somewhere in newClientOptions(client) from these two files:
    recursionLevel = raw_input()
    if not recursionLevel:
        recursionLevel = client.recursionLevels
    client.buildDictionary(int(recursionLevel))
-
And at the very bottom of Client.py
    print "How many tweets would you like to analyse?:(Default = 700) (Max = 700)"
    count = raw_input()
    if not count:
        count = 700
    else:
        count = int(raw_input())
    #count = int(raw_input())
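Note that the else branch in the quoted snippet calls raw_input() a second time, discarding the answer already read into count. A corrected sketch (function name, default, and clamping behaviour are assumptions for illustration) would parse the string it already has:

```python
def parse_count(raw, default=700, maximum=700):
    # Reuse the string already read from the prompt instead of prompting again.
    raw = raw.strip()
    if not raw:
        return default
    return min(int(raw), maximum)  # clamp to the advertised maximum

print(parse_count(""))      # blank input -> default 700
print(parse_count("250"))   # -> 250
print(parse_count("9999"))  # clamped -> 700
```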
-
The "[=] DICTIONARY GENERATION FAILED [=]" problem occurs because crawler.py respects "robots.txt" files. The point is that the FIRST test (in def _webopen, called from def weburls, line 658) is made BEFORE robotparser has read anything (line 687), so self._robot.can_fetch ALWAYS returns False, the process is stopped, and we have no new URLs to check.
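This pre-read behaviour is easy to reproduce with the standard-library robot parser (urllib.robotparser in Python 3; Python 2's robotparser module behaves the same way): until parse() or read() has run, can_fetch() assumes every URL is disallowed.

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# Nothing has been parsed yet, so the parser assumes all URLs are off-limits.
print(rp.can_fetch("*", "http://example.com/page"))  # False

# Feed it an allow-everything robots.txt by hand:
rp.parse(["User-agent: *", "Disallow:"])
print(rp.can_fetch("*", "http://example.com/page"))  # True
```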
My (updated) quick and dirty hack: feed the robotparser with your own robots.txt content (allowing all).
crawler.py, def weburls, line 657:
    # Verify URL and get child URLs
    self._robot.parse(['User-agent: *', 'Disallow:'])
    newbase, rawurls = self._webopen((base, ''))
If you want to skip the subsequent robots.txt check entirely, comment out lines 686 and 687:
    # Get robot limits
    #robot.set_url(''.join([base, 'robots.txt']))
    #robot.read()
Then bs4 may complain about the lack of a parser, so change line 446 in crawler.py, def sanititse, to:

    soup_text = bs4.BeautifulSoup(text, 'html.parser')
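As a standalone sketch of that change (the HTML string below is just an example): naming the parser explicitly silences bs4's "no parser was explicitly specified" warning and keeps results consistent across machines, since bs4 otherwise picks whichever parser (lxml, html5lib, ...) happens to be installed.

```python
from bs4 import BeautifulSoup

html = "<html><body><p>correct horse</p><p>battery staple</p></body></html>"
# Passing 'html.parser' pins bs4 to the stdlib parser, no lxml needed.
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text(" ", strip=True))  # "correct horse battery staple"
```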
Still under investigation. Thanks Matt for the invitation - there is still some work and tests to do :)
-
repo owner Hi Drinx,
Thanks for sending this. Would you add your changes to the project? I'll incorporate this into the master.
Cheers Matt
-
repo owner - changed status to resolved
Hi,
Wordhound has been completely redesigned. Let me know if you still have any issues.
M
-
Thanks Matt, will do.
-
Hi Jethro
Thanks, I'll take a look at this later today. Can you just confirm what you input during the setup of the crawler into wordhound?
M