wordhound can't spider website

Issue #1 resolved
thesle3p created an issue

Whenever I try to spider a website to generate a wordlist, I get the following traceback:

Traceback (most recent call last):
  File "./Main.py", line 333, in <module>
    main()
  File "./Main.py", line 59, in main
    generation()
  File "./Main.py", line 141, in generation
    industrySelected(options[choice])
  File "./Main.py", line 169, in industrySelected
    clientSelected(optionsList[choice-1], industry)
  File "./Main.py", line 228, in clientSelected
    newClientOptions(c)
  File "./Main.py", line 271, in newClientOptions
    client.buildDictionary(int(recursionLevel))
  File "<basedir>/wordhound/client.py", line 151, in buildDictionary
    self.crawl(rLevels)
  File "<basedir>/wordhound/client.py", line 49, in crawl
    self.crawledText = webcrawler.urlreport(b=self.url, d=recursionLevels, t=1)
  File "<basedir>/wordhound/crawler.py", line 867, in urlreport
    if b: self.weburls(b, w, d, t)
  File "<basedir>/wordhound/crawler.py", line 657, in weburls
    newbase, rawurls = self._webopen((base, ''))
TypeError: 'bool' object is not iterable

Comments (11)

  1. Matt Marx repo owner

    Hi Jethro

    Thanks, I'll take a look at this later today. Can you just confirm what you entered when setting up the crawler in wordhound?

    M

  2. thesle3p reporter

    I opted to create a new industry and a new client, and then chose Generate a Dictionary from a website. I gave the URL in both domain-only format and http://domain/ format and received the same output. However, I also just noticed that if I do not specify a recursion level (which one would assume would cause wordhound to use the default), I get this error:

    Traceback (most recent call last):
      File "./Main.py", line 333, in <module>
        main()
      File "./Main.py", line 59, in main
        generation()
      File "./Main.py", line 141, in generation
        industrySelected(options[choice])
      File "./Main.py", line 169, in industrySelected
        clientSelected(optionsList[choice-1], industry)
      File "./Main.py", line 228, in clientSelected
        newClientOptions(c)
      File "./Main.py", line 271, in newClientOptions
        client.buildDictionary(int(recursionLevel))
    ValueError: invalid literal for int() with base 10: ''
    

    To me this looks like wordhound is trying to convert a value to an int when it is actually empty (null). So that is something else you may want to look into.
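
    For illustration, a minimal sketch of the kind of guard that would avoid the int('') crash. The names follow the traceback; the prompt text and the fallback value of 2 are assumptions for the sketch, not taken from wordhound:

        raw = raw_input("Recursion level (blank for default): ")
        try:
            recursionLevel = int(raw)
        except ValueError:
            # blank or non-numeric input: fall back to a default instead of calling int('')
            recursionLevel = 2  # assumed default, not wordhound's actual value
        client.buildDictionary(recursionLevel)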

  3. Gareth Phillips

    Same issue here, guys. Using Kali Linux 2.0. If I don't specify any recursion level then I just get the same error as above. If I do specify a recursion level with the correct URL then I just get the DICTIONARY GENERATION FAILED error. It actually creates the new client and folder structure etc., but the dictionary is completely empty. Has anybody else managed to get this working? Is it just a Kali Linux issue, or a general issue? Thanks.

  4. John 223

    I too get a similar issue when hitting the dictionary generation step. Any idea if this will be fixed or when wordhound 2.0 will be out?

  5. Boo Boo 2 Shoes

    Seeing similar issues here. Pulled from https://bitbucket.org/mattinfosec/wordhound.git

    After editing Main.py and Client.py to fix the attempted integer conversion of '', I am still seeing:

    DICTIONARY GENERATION FAILED

    It happens somewhere in newClientOptions(client); here is the relevant code from those two files:

        recursionLevel = raw_input()

        if not recursionLevel:
            recursionLevel = client.recursionLevels

        client.buildDictionary(int(recursionLevel))
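
    One thing to note about the fallback above: it only helps if client.recursionLevels is guaranteed to hold a numeric (or numeric-string) value; if that attribute can itself be empty, the int() call will still raise the same ValueError.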
    
  6. Boo Boo 2 Shoes

    And at the very bottom of Client.py

                    print "How many tweets would you like to analyse?:(Default = 700) (Max = 700)"
                    count = raw_input()
                    if not count:
                            count = 700
                    else:
                            count = int(raw_input())
                    #count = int(raw_input())
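
    Note that in the else branch above, int(raw_input()) prompts a second time instead of converting the value that was already read. A minimal corrected sketch of the same logic (it still assumes numeric input, as the original does):

        print "How many tweets would you like to analyse?:(Default = 700) (Max = 700)"
        count = raw_input()
        if not count:
            count = 700
        else:
            # convert the value that was already entered instead of prompting again
            count = int(count)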
    
  7. Drinx

    The "[=] DICTIONARY GENERATION FAILED [=]" problem occurs because the crawler.py respects "robots.txt" files. The point is that FIRST test (placed in def _webopen, called in def weburls, line 658) is made BEFORE robotparser reads anything (line 687), so self._robot.can_fetch ALWAYS returns False, process is stopped and we have no new urls to check.

    My (updated) quick and dirty hack: feed the robotparser with your own robots.txt content (allowing all).

    crawler.py, def weburls, line 657:

            # Verify URL and get child URLs
            self._robot.parse(['User-agent: *', 'Disallow:'])
            newbase, rawurls = self._webopen((base, ''))
    

    If you want to skip the subsequent robots.txt check entirely, comment out lines 686 and 687:

            # Get robot limits
            #robot.set_url(''.join([base, 'robots.txt']))
            #robot.read()
    

    Then bs4 may complain about a missing parser, so change line 446 in crawler.py, def sanititse:

            soup_text = bs4.BeautifulSoup(text, 'html.parser')
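
    Specifying 'html.parser' makes bs4 use the HTML parser bundled with the Python standard library, so no extra dependency such as lxml or html5lib is needed and bs4 no longer has to guess (and warn about) which parser to use.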
    

    Still under investigation. Thanks, Matt, for the invitation - there is still some work and testing to do :)

  8. Matt Marx repo owner

    Hi Drinx,

    Thanks for sending this. Would you add your changes to the project? I'll incorporate them into master.

    Cheers Matt
