CLF Probes blocks Googlebot

Issue #130 resolved
Christos Chatzaras created an issue

“CLF Probes” blocks Googlebot:

example.com-access_log:66.249.70.10 - - [16/Aug/2020:15:57:36 +0300] "GET /index.php?option=com_content&view=section&layout=blog&id=12&Itemid=81 HTTP/1.1" 404 794 "-" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.92 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" (0.047)

Here is how we can verify an IP is from Googlebot:

https://support.google.com/webmasters/answer/80553
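
The procedure on that page boils down to three steps: a reverse DNS lookup on the IP, a check that the returned name is under googlebot.com or google.com, and a forward lookup on that name that must return the original IP. A minimal sketch, assuming the `host` utility is installed; the function names `reverse_lookup`, `forward_lookup` and `is_googlebot` are made up for illustration and are not part of sshguard:

```shell
#!/bin/sh
# Sketch only: function names are illustrative, not part of sshguard.

# Print the PTR name for an IP (trailing dot stripped), or nothing on failure.
reverse_lookup() {
    host -W 1 "$1" 2>/dev/null |
        awk '/domain name pointer/ { sub(/\.$/, "", $NF); print $NF; exit }'
}

# Print the IPv4 addresses a hostname resolves to, one per line.
forward_lookup() {
    host -W 1 -t A "$1" 2>/dev/null | awk '/has address/ { print $NF }'
}

# Succeed only if the IP reverse-resolves to a name under googlebot.com or
# google.com and that name resolves back to the same IP.
is_googlebot() {
    ip=$1
    name=$(reverse_lookup "$ip")
    case $name in
        *.googlebot.com|*.google.com) ;;
        *) return 1 ;;
    esac
    forward_lookup "$name" | grep -qxF -- "$ip"
}
```

Usage would be something like `is_googlebot 66.249.70.10 && echo "verified crawler"`. The `case` patterns require a dot before the domain, so a look-alike name such as `fakegooglebot.com` is rejected.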

I can modify /usr/local/libexec/sshg-fw-ipfw so that it does not block Googlebot, but I will still see entries like this in my system log:

Blocking "66.249.70.10/32" for 240 secs (3 attacks in 1712 secs, after 2 abuses over 2785 secs.)

Also this file will be replaced in the next sshguard upgrade.

I think the best solution is to add a feature to sshguard: an option in sshguard.conf ( dynamicwhitelist = /usr/local/etc/sshguard.whitelist.sh ) that makes sshguard execute the following before each block:

/usr/local/etc/sshguard.whitelist.sh IP

The shell script receives the IP and does its checks. If it returns 0, sshguard blocks the IP; if it returns 1, sshguard does not block it.
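
To make the proposed contract concrete, here is a minimal sketch of what such a hook could look like. The `dynamicwhitelist` option does not exist in sshguard; it is only the proposal above. The function name `should_block` and the list file `/usr/local/etc/sshguard.whitelist` are hypothetical. Exit status 0 means "go ahead and block", 1 means "do not block":

```shell
#!/bin/sh
# Hypothetical hook: sshguard has no dynamicwhitelist option today.
# Contract from the proposal: status 0 = block the IP, status 1 = do not block.

# Assumed file with one trusted IP per line; lines starting with '#' are comments.
WHITELIST_FILE=${WHITELIST_FILE:-/usr/local/etc/sshguard.whitelist}

should_block() {
    ip=$1
    # No readable whitelist file means nothing is exempt.
    [ -r "$WHITELIST_FILE" ] || return 0
    # Exact, literal match against the non-comment lines.
    if grep -v '^#' "$WHITELIST_FILE" | grep -qxF -- "$ip"; then
        return 1
    fi
    return 0
}

# As a standalone script, the last line would be:  should_block "$1"
```

A static list is only the simplest case; the same exit-status contract would work for a script that does DNS verification instead.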

Comments (14)

  1. Kevin Zheng
    • changed status to open

    What about inserting your own filter on the second to last line of the sshguard driver script?

    eval $tailcmd | $libexec/sshg-parser | <YOUR FILTER HERE> | \
        $libexec/sshg-blocker $flags | $BACKEND &
    
  2. Christos Chatzaras reporter

    I found the file at /usr/local/sbin/sshguard, but if I make changes there I will have to redo them after every sshguard upgrade. Do you think it would be useful to others as well to add an option to sshguard.conf upstream, so that /usr/local/sbin/sshguard uses the filter when that option is enabled?

  3. Kevin Zheng

    Yes, that’s the right file to edit. Yes, if nothing is done, you’ll have to make those changes every time you upgrade.

    I understand the desire to keep local changes outside of /usr/local/sbin/sshguard and in the configuration file, but the sshguard script was also written in shell to make it easy to adapt or change if the need arises; for example, you can even replace sshg-parser with your own custom parser.

    One possibility would be to make sshguard.conf executable and put everything that’s currently in sshguard into sshguard.conf. That would break existing configurations but make pretty much every part of SSHGuard “configurable”.

    Another possibility could be to add a configuration option, something along the lines of POST_PARSER_FILTER that optionally adds that to the second to last line, so that it would look like:

    eval $tailcmd | $libexec/sshg-parser | ${POST_PARSER_FILTER} | \
        $libexec/sshg-blocker $flags | $BACKEND &
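
One detail such an option would need to handle: if `POST_PARSER_FILTER` is unset or empty, the middle stage of the pipeline expands to an empty command and nothing reaches sshg-blocker. A sketch of one way around that, falling back to `cat` as a pass-through. The functions `parser`, `blocker` and `run_pipeline` are stand-ins invented so the example runs without sshguard installed, and the sample line format is an assumption:

```shell
#!/bin/sh
# Sketch only: parser/blocker stand in for sshg-parser and sshg-blocker.
# Assumed line format: service code, address, address type, score.

parser()  { printf '%s\n' '100 66.249.70.10 4 10' '100 203.0.113.5 4 10'; }
blocker() { while IFS= read -r line; do echo "would block: $line"; done; }

run_pipeline() {
    # Fall back to cat (a pass-through) when no filter is configured.
    filter=${POST_PARSER_FILTER:-cat}
    # eval lets the setting carry arguments, not just a bare command name.
    parser | eval "$filter" | blocker
}
```

With the variable unset, both sample lines reach `blocker`; with `POST_PARSER_FILTER="grep -v ' 66.249.70.10 '"` only the second one does.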
    

    What do you think?

  4. Christos Chatzaras reporter

    In the end I wrote a shell script that runs every minute from cron. It reads the previous minute's entries from /var/log/auth.log, finds the IPs that sshguard blocked, checks (with the method Google recommends) whether they belong to Googlebot or Bing, removes those from the firewall and from /var/db/sshguard/blacklist.db, and then sends an e-mail to the admin.

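    #!/usr/local/bin/bash
    # Run from cron every minute: unblock verified search-engine crawlers
    # that sshguard blocked during the previous minute.
    
    HOSTNAME=$(/bin/hostname -s)
    
    # Timestamp prefix of the previous minute (FreeBSD date syntax).
    # %e instead of %d: classic syslog pads single-digit days with a space.
    previousMinute=$(date -v -1M '+%b %e %H:%M:')
    
    # Reverse-DNS names accepted as crawlers. The leading \. keeps
    # look-alike domains such as "notgoogle.com" from matching.
    REGEX='\.(googlebot\.com|google\.com|search\.msn\.com)$'
    
    for ip in $(grep "${previousMinute}" /var/log/auth.log | grep "sshguard" | grep ": Blocking " | awk '{ print $7 }' | sed 's|"||g' | awk -F '/' '{print $1}' | sort -u)
    do
    
      # Only act on IPs still present in the ipfw table (exact, literal match).
      if ipfw table 22 list | awk '{print $1}' | awk -F '/' '{print $1}' | grep -qxF -- "${ip}"; then
    
        # Reverse lookup: the fifth field of the host output is the PTR name.
        HOSTRESULT="$(host -W 1 "${ip}")"
        HOSTRESULT="$(echo $HOSTRESULT | awk '{print $5}' | sed 's/\.$//')"
    
        if [[ "$HOSTRESULT" =~ $REGEX ]]; then
    
          # Forward lookup: the fourth field of the host output is the address.
          IPRESULT="$(host -W 1 "${HOSTRESULT}")"
          IPRESULT="$(echo $IPRESULT | awk '{print $4}')"
    
          if [[ $IPRESULT = "$ip" ]]; then
            ipfw -q table 22 delete "${ip}"
            # Escape the dots so sed matches the address literally.
            sed -i "" -e "/|4|${ip//./\\.}$/d" /var/db/sshguard/blacklist.db
            printf '%s (%s)' "${IPRESULT}" "${HOSTRESULT}" | /usr/bin/mail -s "[SSHGUARD - ${HOSTNAME}] ${IPRESULT}" root
          fi
    
        fi
    
      fi
    
    done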
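
The IP extraction in the `for` line above relies on the address being the seventh whitespace-separated field, which changes if the syslog prefix does. A sketch of a position-independent alternative that keys on the quoted address in the `Blocking "..."` message; the function name `blocked_ips` is made up for illustration:

```shell
#!/bin/sh
# Sketch only: blocked_ips is an illustrative helper, not part of sshguard.

# Read log lines on stdin; print each blocked address once, without the /prefix.
blocked_ips() {
    sed -n 's|.*Blocking "\([^"/]*\)/[0-9]*" for .*|\1|p' | sort -u
}
```

It could replace the grep/awk/sed chain like this: `grep "${previousMinute}" /var/log/auth.log | blocked_ips`.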
    

  5. Kevin Zheng

    That’s certainly a way to do it. I would imagine it would have been easier just to filter out the Googlebot IPs in the filter pipeline?

  6. Christos Chatzaras reporter

    Yes, but I want to avoid making custom changes to sshguard files so that I don't forget to redo them after an upgrade. I am OK with a search engine IP being blocked temporarily for up to 1 minute, and it doesn’t happen frequently anyway.

  7. Christos Chatzaras reporter

    I am not sure I understand this solution.

    So I would set POST_PARSER_FILTER in sshguard.conf to the path of a shell script that does the check, and sshguard would then automatically run the following command:

    eval $tailcmd | $libexec/sshg-parser | ${POST_PARSER_FILTER} | \
        $libexec/sshg-blocker $flags | $BACKEND &

    instead of the following command?

    eval $tailcmd | $libexec/sshg-parser | \
        $libexec/sshg-blocker $flags | $BACKEND &

  8. Kevin Zheng

    Instead of the current pipeline, without ${POST_PARSER_FILTER}. This way, the filter is optional and configurable from sshguard.conf.

  9. Christos Chatzaras reporter

    And the post parser script to exclude the search engines:

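    #!/usr/local/bin/bash
    # Post-parser filter: pass attack lines on to sshg-blocker unless the
    # source IP is a verified search-engine crawler.
    
    HOSTNAME=$(/bin/hostname -s)
    
    # Reverse-DNS names accepted as crawlers. The leading \. keeps
    # look-alike domains such as "notgoogle.com" from matching.
    REGEX='\.(googlebot\.com|google\.com|search\.msn\.com|yandex\.ru|yandex\.net|yandex\.com|crawl\.baidu\.com|crawl\.yahoo\.net)$'
    
    while IFS= read -r SSHGUARD; do
    
      # Take the second field of the incoming line as the IP.
      IP=$(echo "${SSHGUARD}" | awk '{print $2}')
    
      # Reverse lookup: the fifth field of the host output is the PTR name.
      HOSTRESULT="$(host -W 1 "${IP}")"
      HOSTRESULT="$(echo $HOSTRESULT | awk '{print $5}' | sed 's/\.$//')"
    
      if [[ "$HOSTRESULT" =~ $REGEX ]]; then
    
        # Forward lookup: the fourth field of the host output is the address.
        IPRESULT="$(host -W 1 "${HOSTRESULT}")"
        IPRESULT="$(echo $IPRESULT | awk '{print $4}')"
    
        if [[ "$IPRESULT" != "$IP" ]]; then
          # The name does not resolve back to the IP: pass the line on.
          echo "${SSHGUARD}"
        else
          printf 'We did not block: %s (%s)' "${IPRESULT}" "${HOSTRESULT}" | /usr/bin/mail -s "[SSHGUARD - ${HOSTNAME}] ${IPRESULT}" root
        fi
    
      else
    
        echo "${SSHGUARD}"
    
      fi
    
    done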
    
