wordhound can't spider website

Issue #1 resolved
thesle3p created an issue

Whenever I try to spider a website to generate a wordlist, I get the following traceback:

Traceback (most recent call last):
  File "./Main.py", line 333, in <module>
    main()
  File "./Main.py", line 59, in main
    generation()
  File "./Main.py", line 141, in generation
    industrySelected(options[choice])
  File "./Main.py", line 169, in industrySelected
    clientSelected(optionsList[choice-1], industry)
  File "./Main.py", line 228, in clientSelected
    newClientOptions(c)
  File "./Main.py", line 271, in newClientOptions
    client.buildDictionary(int(recursionLevel))
  File "<basedir>/wordhound/client.py", line 151, in buildDictionary
    self.crawl(rLevels)
  File "<basedir>/wordhound/client.py", line 49, in crawl
    self.crawledText = webcrawler.urlreport(b=self.url, d=recursionLevels, t=1)
  File "<basedir>/wordhound/crawler.py", line 867, in urlreport
    if b: self.weburls(b, w, d, t)
  File "<basedir>/wordhound/crawler.py", line 657, in weburls
    newbase, rawurls = self._webopen((base, ''))
TypeError: 'bool' object is not iterable

Comments (11)

  1. Matt Marx repo owner

    Hi Jethro

    Thanks, I'll take a look at this later today. Can you just confirm what you entered when setting up the crawler in wordhound?

    M

  2. thesle3p reporter

    I opted to create a new industry and a new client, and then chose Generate a Dictionary from a website. I gave the URL in both domain-only format and http://domain/ format and received the same output. However, I also just noticed that if I do not specify a recursion level (which one would assume would cause wordhound to use the default), I get this error:

    Traceback (most recent call last):
      File "./Main.py", line 333, in <module>
        main()
      File "./Main.py", line 59, in main
        generation()
      File "./Main.py", line 141, in generation
        industrySelected(options[choice])
      File "./Main.py", line 169, in industrySelected
        clientSelected(optionsList[choice-1], industry)
      File "./Main.py", line 228, in clientSelected
        newClientOptions(c)
      File "./Main.py", line 271, in newClientOptions
        client.buildDictionary(int(recursionLevel))
    ValueError: invalid literal for int() with base 10: ''
    

    To me this looks like wordhound is trying to convert a value to an int when it is actually empty (null). So that is something else you may want to look into.
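
    For illustration, a minimal sketch of the kind of guard that would avoid the int('') crash. The names follow the traceback; the prompt text and the fallback value of 2 are assumptions for the sketch, not taken from wordhound:

        raw = raw_input("Recursion level (blank for default): ")
        try:
            recursionLevel = int(raw)
        except ValueError:
            # blank or non-numeric input: fall back to a default instead of calling int('')
            recursionLevel = 2  # assumed default, not wordhound's actual value
        client.buildDictionary(recursionLevel)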

  3. Gareth Phillips

    Same issue here, guys. Using Kali Linux 2.0. If I don't specify any recursion level then I just get the same error as above. If I do specify a recursion level with the correct URL then I just get the DICTIONARY GENERATION FAILED error. It actually creates the new client and folder structure etc., but the dictionary is completely empty. Has anybody else managed to get this working? Is it just a Kali Linux issue, or a general issue? Thanks.

  4. John 223

    I too get a similar issue when hitting the dictionary generation step. Any idea if this will be fixed or when wordhound 2.0 will be out?

  5. Boo Boo 2 Shoes

    Seeing similar issues here. Pulled from https://bitbucket.org/mattinfosec/wordhound.git

    After editing Main.py and Client.py to fix the attempted integer conversion of '', I am still seeing:

    DICTIONARY GENERATION FAILED

    It happens somewhere in newClientOptions(client); here is the relevant code from those two files:

        recursionLevel = raw_input()

        if not recursionLevel:
            recursionLevel = client.recursionLevels

        client.buildDictionary(int(recursionLevel))
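
    One thing to note about the fallback above: it only helps if client.recursionLevels is guaranteed to hold a numeric (or numeric-string) value; if that attribute can itself be empty, the int() call will still raise the same ValueError.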
    
  6. Boo Boo 2 Shoes

    And at the very bottom of Client.py

                    print "How many tweets would you like to analyse?:(Default = 700) (Max = 700)"
                    count = raw_input()
                    if not count:
                            count = 700
                    else:
                            count = int(raw_input())
                    #count = int(raw_input())
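
    Note that in the else branch above, int(raw_input()) prompts a second time instead of converting the value that was already read. A minimal corrected sketch of the same logic (it still assumes numeric input, as the original does):

        print "How many tweets would you like to analyse?:(Default = 700) (Max = 700)"
        count = raw_input()
        if not count:
            count = 700
        else:
            # convert the value that was already entered instead of prompting again
            count = int(count)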
    
  7. Drinx

    The "[=] DICTIONARY GENERATION FAILED [=]" problem occurs because the crawler.py respects "robots.txt" files. The point is that FIRST test (placed in def _webopen, called in def weburls, line 658) is made BEFORE robotparser reads anything (line 687), so self._robot.can_fetch ALWAYS returns False, process is stopped and we have no new urls to check.

    My (updated) quick and dirty hack: feed the robotparser with your own robots.txt content (allowing all).

    crawler.py, def weburls, line 657:

            # Verify URL and get child URLs
            self._robot.parse(['User-agent: *', 'Disallow:'])
            newbase, rawurls = self._webopen((base, ''))
    

    If you want to skip the subsequent robots.txt check entirely, comment out lines 686 and 687:

            # Get robot limits
            #robot.set_url(''.join([base, 'robots.txt']))
            #robot.read()
    

    Then bs4 may complain about a missing parser, so change line 446 in crawler.py, def sanititse:

            soup_text = bs4.BeautifulSoup(text, 'html.parser')
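
    Specifying 'html.parser' makes bs4 use the HTML parser bundled with the Python standard library, so no extra dependency such as lxml or html5lib is needed and bs4 no longer has to guess (and warn about) which parser to use.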
    

    Still under investigation. Thanks, Matt, for the invitation - there is still some work and testing to do :)

  8. Matt Marx repo owner

    Hi Drinx,

    Thanks for sending this. Would you add your changes to the project? I'll incorporate them into master.

    Cheers Matt
