GroupReferrer not working correctly

Issue #17 closed
KS created an issue

I think I have found a bug that I also found in awffull and that may reach back to old webalizer times. The GroupReferrer directive doesn't work as advertised.

GroupReferrer google.com Google should group any of these ("real URL"), right?

http://google.com/
https://google.com/
http://google.com/whatever
https://google.com/whatever
https://www.google.com/whatever

But it groups only this one: http://google.com/. Without any * it should match any part of the URL (according to your older sample.conf). I also tried a leading or a trailing * (two * are forbidden), without success. There is no way to even "access" any of the other ones. For instance, GroupReferrer https://google.com Google doesn't work at all.

Is this a bug or is there some intention behind this that I don't understand? (After all, it has survived for years.)

Comments (8)

  1. StoneSteps repo owner

    One caveat in using a substring is that without an asterisk this syntax may pick up something like this:

    https://microsoft.com/ref=https://google.com
    

    Other than that it should work. I will check it out tonight and will post an update.

  2. StoneSteps repo owner

    I tried all combinations and all work fine, whether with a sub-word pattern or with a leading and trailing asterisk. Is it possible that some broader pattern in the group referrer list matches first? For example,

    GroupReferrer   goo                 GOO
    GroupReferrer   *google.com/whatever    Google Whatever
    GroupReferrer   http://google.com/* Google
    GroupReferrer   https://google.com/*    Google
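
    To illustrate why the order of such a list matters, here is a small Python model of the matching discussed in this thread. It is only a sketch, not SSW's actual code, and it rests on two assumptions: a pattern without an asterisk matches anywhere in the referrer, a leading asterisk anchors the pattern at the end and a trailing one at the start; and the first matching rule in the list wins. The function names are made up for this example.

```python
from typing import List, Optional, Tuple


def matches(pattern: str, value: str) -> bool:
    """Assumed semantics: '*x' matches the end, 'x*' the start, plain 'x' anywhere."""
    if pattern.startswith("*"):
        return value.endswith(pattern[1:])
    if pattern.endswith("*"):
        return value.startswith(pattern[:-1])
    return pattern in value


def group_for(rules: List[Tuple[str, str]], referrer: str) -> Optional[str]:
    """Return the group name of the first matching rule (assumed first-match-wins)."""
    for pattern, name in rules:
        if matches(pattern, referrer):
            return name
    return None


# The same rules as in the list above, with the broad pattern first.
BROAD_FIRST = [
    ("goo", "GOO"),
    ("*google.com/whatever", "Google Whatever"),
    ("http://google.com/*", "Google"),
    ("https://google.com/*", "Google"),
]

# The same rules with the broad pattern moved to the end.
SPECIFIC_FIRST = BROAD_FIRST[1:] + BROAD_FIRST[:1]

print(group_for(BROAD_FIRST, "https://google.com/whatever"))
print(group_for(SPECIFIC_FIRST, "https://google.com/whatever"))
```

    Under these assumptions the broad "goo" rule captures every Google referrer when it comes first, and the more specific rules only get a chance once it is moved to the end. The same model also shows the substring caveat from the first comment: a bare google.com pattern matches https://microsoft.com/ref=https://google.com.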
    
  3. KS reporter

    Thanks for testing this out. I made an embarrassing mistake. There was no HideReferrer, so the referrer was grouped but still listed individually. Because Google and Bing only use https:// nowadays, it looked as if GroupReferrer grouped only http://google.com, because that one was missing from the Referrers.
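
    The effect can be modelled in a few lines of Python. This is a simplified sketch, not how ssw builds its reports: it assumes plain substring matching, and the function name and sample referrers are invented for illustration. Grouping only adds a group row; the individual referrers stay in the list unless a hide pattern removes them.

```python
from collections import Counter


def referrer_report(referrers, group_rules, hide_patterns):
    """Return (group hits, individually listed hits) for a list of referrers."""
    groups = Counter()
    singles = Counter()
    for ref in referrers:
        # Grouping: the first matching rule adds a hit to its group row.
        for pattern, name in group_rules:
            if pattern in ref:
                groups[name] += 1
                break
        # Hiding is a separate step: only hidden referrers leave the list.
        if not any(pattern in ref for pattern in hide_patterns):
            singles[ref] += 1
    return dict(groups), dict(singles)


refs = ["https://www.google.com/", "https://www.google.com/", "https://www.bing.com/"]
rules = [("google.com", "Google")]

# Grouped only: the Google group appears, and so does the individual referrer.
print(referrer_report(refs, rules, hide_patterns=[]))

# Grouped and hidden: only the group row is left for Google.
print(referrer_report(refs, rules, hide_patterns=["google.com"]))
```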

    I have now compared the two outputs from awffull and ssw, ran quite a few tests with ssw, and have a few questions. I ask them here so as not to submit bug reports that aren't bug reports. If you think I should file one or the other as an enhancement, just say so and I'll do so.

    1. The URL reports don't remove parameters. Why? I would have some 560 URLs in one of the larger reports, but due to parameters I get nearly 100.000 URLs. This makes the table completely useless (I don't get the top pages and the "all" table is just too big to display it). I looked through the Readme, but cannot find an option to change this. Is there one? GroupURL won't help, I think. If there is no option yet I would add it as a separate enhancement request. The same happens with the Exit and Entry Pages tables.

    2. It seems I cannot group something like GroupAgent "SM-" Android Samsung (I guess it matches against the "" as well?). I had this in the config file because awffull couldn't group a bare SM-, maybe because of the - in the string (it seems to ignore the - everywhere, including where it indicates a missing entry in the log file, so it doesn't have any UserAgent - in the output). I tried with ssw without the quotation marks and that worked. So I have to maintain differences in the config file here. But there is still one problem if I cannot use "" to include phrases with spaces, for instance: GroupAgent "Mobile Safari/" Browser: Mobile Safari. There's the EnablePhraseValues option, which should help here. What I don't understand: can I then use tabs only inside values and no longer to separate values from each other? I mean, I have a few GroupAgent AgentString AgentName lines, but I also have some GroupAgent<tab>AgentString<tab>AgentName lines, or any combination thereof. If I switch EnablePhraseValues on, do I have to remove all tabs except for those that are in "match strings"? That would be quite a nuisance.

    3. The Errors table lists only the error URL. That's good for aggregation, but bad for finding bugs. awffull lists also the Referrer. That's handy, because you know immediately where to look for your own site bugs and can correct the code.

    4. I'm not sure if I understand the Robot option. It's only to mark robots in URL/Hostname reports? It doesn't have anything to do with the Agents table unless I use GroupRobots/HideRobots? If I don't use these I can GroupAgent as normal?

    5. There is no way to change sort order of some tables, right? I would like to change order of some tables, for instance for Referrer or Agents. These are ordered by hits. I think that pages or visits would be more useful here. (awffull orders by pages, it doesn't know visits).

    6. Shouldn't the "Total URLs" table list only pages? (And maybe an extra report listing all files? analog does it this way.)

    7. The Time column of the URLs tables is always "0.000". I'm not sure what it should display, but I guess not 0.000?

    8. There is a Duration column on the Hosts table. What does it show? Min. and max. duration between last and first request on a "visit" in seconds?

    Btw, to get "fast" results I disabled GeoIP and DNS resolution for ssw completely. Nevertheless, it takes half an hour to process all the log files and domains and reports I provide for this test. The same run and configuration files take 70 or 80 seconds (!) with awffull. That is with GeoIP! So, even with GeoIP it's a lot faster. (It can use only the old .dat format, unfortunately.)

    Thanks a lot for answers!

  4. StoneSteps repo owner

    Thanks for the update.

    1. The URL reports don't remove parameters. Why?

    Because it is very useful for many sites, especially for those who run a single-page site, so the only way for these folks to see what is going on is to report on the entire URL. If you don't want to report query strings, add this in the configuration file:

    ExcludeSearchArg    *
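
    To see why query strings inflate the URL report so much, here is a minimal Python sketch of the aggregation effect. It only models what happens to the counts when query strings are dropped and says nothing about how ExcludeSearchArg is implemented; the request list and the strip_query helper are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def strip_query(url: str) -> str:
    """Keep only the path part of a request URL, dropping the query string."""
    return urlsplit(url).path


# Hypothetical requests: three of them hit the same page with different parameters.
requests = [
    "/article.php?id=1",
    "/article.php?id=2",
    "/article.php?id=2&ref=rss",
    "/index.html",
]

with_args = Counter(requests)
without_args = Counter(strip_query(url) for url in requests)

print(len(with_args), "distinct URLs with query strings")
print(len(without_args), "distinct URLs without query strings")
```

    Every distinct parameter combination is a separate row as long as the query string is part of the URL, which is how a few hundred pages can turn into many thousands of report entries.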
    
    2. It seems I cannot group something like GroupAgent "SM-" Android Samsung (I guess it matches against the "" as well?)

    The entire value is used. If you want the value to have spaces, add EnablePhraseValues yes at the top of your configuration file. I don't remember off the top of my head whether the dash is treated specially and will check it out, but if anything would work, that would be EnablePhraseValues.

    3. The Errors table lists only the error URL. That's good for aggregation, but bad for finding bugs. awffull lists also the Referrer.

    That would only make sense for 404 errors. For any other error type that would make error reports too granular. Even for 404 errors, a webmaster would only care about what came from their own site. I will think about how to approach those, but in general I don't think referrers belong in the error report.

    4. I'm not sure if I understand the Robot option.

    Robots show different site access patterns. They may skew visit counts for smaller sites if they hit the site from a lot of different IP addresses, and they may create visits that never end. The country and city reports in v5 exclude spammer and robot activity, and they are also highlighted in green in the user agent report for visibility.

    5. There is no way to change sort order of some tables, right?

    No, there is not. I dumped all obsolete HTML, but never got around to reworking the overall report structure. At some point I introduced XSL reports that allowed people to move things around, but that didn't catch on.

    6. Shouldn't the "Total URLs" table list only pages?

    Assuming you mean Top URLs. The only way to do this is to hide non-pages or ignore them if you don't care about total transfer amounts. Otherwise it lists all URL types.

    7. The Time column of the URLs tables is always "0.000". I'm not sure what it should display, but I guess not 0.000?

    The CLF log format doesn't have request processing time. You would need to use %D in the Apache configuration for that. I suppose I should have made this column configurable, but most people seem to use it, so I never got around to it.
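
    As a rough illustration of what %D adds, the Python sketch below parses a log line in which %D (the time taken to serve the request, in microseconds) is appended as the last field of a combined-format line. The sample line, the field order and the regular expression are assumptions made for this example; how the analyzer itself has to be configured to read the extra field is not covered here.

```python
import re

# Hypothetical combined-format line with %D appended as the last field.
LINE = ('192.0.2.1 - - [10/Oct/2020:13:55:36 +0000] '
        '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0" 15234')

# Request line, status, bytes, referrer, user agent, then the %D value.
PATTERN = re.compile(
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<bytes>\S+) '
    r'"[^"]*" "[^"]*" (?P<usec>\d+)$'
)


def request_seconds(line):
    """Return the request processing time in seconds, or None if the field is missing."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return int(match.group("usec")) / 1_000_000


print(request_seconds(LINE))
```

    A line without the trailing field, as written by a plain common or combined format, yields None in this sketch, which corresponds to a report that has no processing time to show.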

    8. There is a Duration column on the Hosts table. What does it show? Min. and max. duration between last and first request on a "visit" in seconds?

    That's average and maximum visit duration, in minutes.
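
    As a worked example of those two numbers, the Python sketch below computes them from a list of visits. It assumes that a visit's duration is the time between its first and its last request and that the visits have already been separated; the sample timestamps and the function name are invented for illustration.

```python
from datetime import datetime


def visit_durations_minutes(visits):
    """Return (average, maximum) duration in minutes for (first, last) request pairs."""
    minutes = [(last - first).total_seconds() / 60 for first, last in visits]
    return sum(minutes) / len(minutes), max(minutes)


# Three hypothetical visits from one host: 5 minutes, 20 minutes and a single request.
visits = [
    (datetime(2020, 10, 10, 10, 0), datetime(2020, 10, 10, 10, 5)),
    (datetime(2020, 10, 10, 11, 0), datetime(2020, 10, 10, 11, 20)),
    (datetime(2020, 10, 10, 12, 0), datetime(2020, 10, 10, 12, 0)),
]

average, maximum = visit_durations_minutes(visits)
print(round(average, 2), maximum)
```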

    SSW also counts all requests for visits, not only pages, which produces different results from other Webalizer forks and the original one. Have a look here.

    http://www.stonesteps.ca/projects/webalizer/faq.asp?qid=q20051224-01&topic=webalizer

    Btw, to get "fast" results I disabled GeoIP and DNS resolution for ssw completely. Nevertheless, it takes half an hour to process all the log files and domains and reports I provide for this test. The same run and configuration files take 70 or 80 seconds (!) with awffull. That is with GeoIP! So, even with GeoIP it's a lot faster. (It can use only the old .dat format, unfortunately.)

    I think you have your answer in your question:

    I would have some 560 URLs in one of the larger reports, but due to parameters I get nearly 100.000 URLs.

    It takes longer to process 100K URLs than 560. Also keep in mind that longer group/hide/ignore lists will also slow things down. Once you disable query string processing, compare again, and if SSW still processes fewer records per second than the other forks, I would be interested to hear more.
