Problems with IgnoreURL

Issue #2 closed
Former user created an issue

Hello,

Some URLs on my web server include the string /?eID=shariff.

When I use IgnoreURL /?eID=shariff in my configuration file, it doesn't ignore these URLs.

What can I do?

Thank you

Comments (25)

  1. StoneSteps repo owner

    IgnoreURL is only applied to the URL stem and not to URL queries. The only thing you can do now is filter out the query strings you don't want to see, but this will not exclude the URLs themselves from reports.

    For example, if you use ExcludeSearchArg x and these URLs are requested:

    /a/?x=1&ref=/
    /b/?x=2&ref=/c/
    

    , then these URLs will be reported:

    /a/
    /b/
    

    , and these query strings will be reported:

    ref=/
    ref=/c/
    

    This URL handling is somewhat traditional, as the original Webalizer didn't report URL query strings, and doesn't work for sites that use a single script to handle the entire site, like /?page=/abc/, which seems to be the case here.

    I will give it some thought going forward, but for now you can only filter out query strings (called search arguments in README).
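
The filtering described in comment 1 can be sketched in a few lines of Python. This is only an illustration of the reported result, not Stone Steps Webalizer's code; the function name exclude_search_args is hypothetical, and it assumes ExcludeSearchArg is given a plain argument name with no wildcards.

```python
def exclude_search_args(url, excluded_names):
    """Return (stem, remaining_query) after dropping excluded search arguments."""
    # Everything before the first "?" is the URL stem, the rest is the query.
    stem, _, query = url.partition("?")
    kept = [
        arg for arg in query.split("&")
        if arg and arg.partition("=")[0] not in excluded_names
    ]
    return stem, "&".join(kept)


# The two example requests from the comment, with x excluded.
for url in ["/a/?x=1&ref=/", "/b/?x=2&ref=/c/"]:
    print(exclude_search_args(url, {"x"}))
```

Both URLs keep their stems (/a/ and /b/) and only the ref argument survives, so the URL itself still shows up in the report.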

  2. StoneSteps repo owner

    The syntax for IgnoreURL will be extended in the next release to include search argument names and values. For example, this configuration:

    IgnoreURL    /*   eID=shariff
    

    , will ignore a log record containing this URL:

    /?abc=123&eID=shariff&xyz=456
    

    , but will process log records with different eID values, like this:

    /?abc=123&eID=home&xyz=456
    

    See README in the main branch for more information.
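
A rough Python model of the extended syntax in comment 2, for readers who want to check a pattern by hand. The helper names are hypothetical, and the sketch assumes that a trailing asterisk in the path pattern means "starts with"; the README remains the authority on the actual matching rules.

```python
def path_matches(pattern, path):
    # Assumption: a trailing "*" turns the pattern into a prefix match.
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return pattern == path


def is_ignored(url, path_pattern, search_arg):
    """True if the path matches and one search argument equals search_arg exactly."""
    path, _, query = url.partition("?")
    return path_matches(path_pattern, path) and search_arg in query.split("&")


print(is_ignored("/?abc=123&eID=shariff&xyz=456", "/*", "eID=shariff"))
print(is_ignored("/?abc=123&eID=home&xyz=456", "/*", "eID=shariff"))
```

The first URL is ignored and the second is not, because the eID value is compared as a whole.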

  3. Marcus

    Thank you for including my IgnoreURL request in the new version.

    I know it's a beta, but IgnoreURL doesn't work, or I am making mistakes in the configuration file.

    My configuration file for version 4.2 contains this:

    IgnoreURL *.xml

    This worked perfectly.

    With the new version, the IgnoreURL configuration line above doesn't work.

    I tried different patterns:

    IgnoreURL / rss.xml*

    IgnoreURL .xml*

    IgnoreURL rss.xml

    IgnoreURL "completePath"/rss.xml

    None of the IgnoreURL settings from my old configuration file (4.2...) work.

    Is this a bug or my mistake with the new parameters?

    Thank you Marcus

  4. StoneSteps repo owner

    Yes, it's a bug. Let me look into this tonight and I will post a fix in the next couple of days. Thanks for the heads up.

  5. Marcus

    Thank you for the new version. It works partially.

    IgnoreURL /rss.xml is working*

    IgnoreURL /404/ doesn't work

    IgnoreURL *.png doesn't work

    IgnoreURL *.css doesn't work

    IgnoreURL /processed/ * doesn't work

    IgnoreURL * /processed/ doesn't work

    IgnoreURL /processed/ doesn't work

    IgnoreURL /* eID=shariff doesn't work

    The spaces are tabs. For processed I tried a few patterns, but none gave the desired result.

  6. StoneSteps repo owner

    Can you clarify how exactly it doesn't work in each case:

    IgnoreURL /404/ doesn't work

    This will look for the /404/ text anywhere in the URL and should match any of these:

    /abc/404/xyz/
    /404/
    /abc/404/
    

    When you say that it doesn't work, do you mean that any of these URLs are still in the report?

    IgnoreURL *.png doesn't work

    IgnoreURL *.css doesn't work

    I tested these again and I don't see any URLs ending in .png or .css in my reports. Check if the configuration file where you listed these lines is picked up. You should see its name on the command line after Processed configuration file.

    IgnoreURL /processed/ * doesn't work

    IgnoreURL * /processed/ doesn't work

    The first one will not work because it's looking for a URL with a search argument with an asterisk, like this: /processed/?*, which is probably not something you want to look for.

    The second looks for any URL with a search argument /processed/, as in /abc/?/processed/ or /?/processed/. Again, not sure if that is what you are looking for.

    IgnoreURL /* eID=shariff doesn't work

    This should match any URL with this search argument, as in /?eID=shariff or /abc/?eID=shariff.

    Can you also describe your OS, whether it's 32- or 64-bit, and the type of logs you are processing?

  7. Marcus

    This will look for the /404/ text anywhere in the URL and should match any of these:

    But with my configuration, /404/ still shows up in the results.

    My OS is Win 7 and I use the 32-bit version. I also tested the 64-bit version, with no improvement. The logs are in Apache CLF format.

    Here is part of my configuration and the resulting URL report.

    It still shows 404, typo3conf and typo3temp. I also tested *.png, without success.

  8. StoneSteps repo owner

    Can you put the log file and the configuration file somewhere I can pick them up? Email the location to support@stonesteps.ca. Don't post it here.

    Thanks!

  9. StoneSteps repo owner

    I posted a new version that fixes the issue. Thank you very much for raising it and helping me to track down the bug. I recommend deleting the screenshots above, now that I have your sample log. Give it a try and let me know if I missed anything.

    One thing to mention is that if you try IgnoreURL with search arguments, add those patterns after regular IgnoreURL configuration values. This is because they may match on the URL path and not on the query string and will stop the search. I will either describe this in README or will change it to work differently. Haven't decided yet.

    I would also recommend removing HTMLPre, HTMLHead, HTMLBody, HTMLPost and HTMLEnd from configuration. They only interfere with the standard HTML constructs. I will probably remove them in future releases.

    You can add webalizer.css and webalizer.js into your report directory to make reports look better. You can also reuse the same ones in multiple reports by using HTMLCssPath and HTMLJsPath.

  10. Marcus

    Thank you, it works, but I don't understand your note about IgnoreURL with search arguments.

    I tried these, with both tab and space as the separator, without success:

    IgnoreURL /test/* ?eID=shariff&url

    IgnoreURL /test/ eID=shariff

    (test is an example URL)

    Or is this a mistake in my syntax? Sorry for asking again.

  11. StoneSteps repo owner

    What happens is that once a URL path matches, it will only check search arguments for the same pattern, but no further. It's easier to explain with an example. Consider these filters:

    IgnoreURL   /abc/*
    IgnoreURL   /xyz/*   eID=1
    IgnoreURL   /xyz/*   eID=2
    IgnoreURL   /xyz/*   eID=3
    IgnoreURL   /def/*
    

    Notice that there should be no question mark in the search argument and only one search argument should be in one IgnoreURL configuration variable.

    Let's say that the current URL is /xyz/?x=1&eID=2&y=2. The URL path, /xyz/, is first checked against /abc/*, which does not match. It is then checked against /xyz/*, which matches, and the search goes into a special mode that stops after all /xyz/* patterns are checked. In this case, the line with eID=2 matches. Because the URL path matched, the next entry, /def/*, will not even be tried. This is done so that a long list of ignore filters doesn't slow down processing too much.

    Now, consider a slightly different list of filters:

    IgnoreURL   /abc/*
    IgnoreURL   /*   eID=1
    IgnoreURL   /*   eID=2
    IgnoreURL   /*   eID=3
    IgnoreURL   /def/*
    

    Now the pattern /* will match any URL. If the URL in the log line is /def/p.html, /* will match it, but the URL doesn't have any search arguments, so none of the three eID= lines will match. However, because /* matched, /def/* will not even be tested.

    One way to work around this is to have those broader filters at the very end, so all other patterns are matched first. However, if the URL pattern is distinct enough (i.e. a longer path or a specific and distinct page name), then this is a moot point because it usually matches only specific pages.

    I hope this clarifies it. Let me know if you have any additional questions.
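
The ordering effect explained in comment 11 can be reproduced with a small Python model. It is a sketch of the behavior as described in this thread, not the program's actual code: the function names are hypothetical, and the pattern rules (leading asterisk means "ends with", trailing asterisk means "starts with", otherwise a substring match) and the name-only search argument match are assumptions drawn from the comments here.

```python
def path_matches(pattern, path):
    # Assumed rules: "*x" = ends with x, "x*" = starts with x, else substring.
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return pattern in path


def arg_matches(wanted, args):
    # Exact comparison against "name=value" or, for a bare name, against the name.
    return any(wanted == a or wanted == a.partition("=")[0] for a in args)


def is_ignored(url, filters):
    """filters is an ordered list of (path_pattern, search_arg_or_None)."""
    path, _, query = url.partition("?")
    args = [a for a in query.split("&") if a]
    for i, (pattern, search_arg) in enumerate(filters):
        if not path_matches(pattern, path):
            continue
        if search_arg is None:
            return True  # plain path filter: ignore the URL
        # Special search mode: only filters sharing this path pattern are
        # checked, and later patterns such as /def/* are never reached.
        same = [arg for p, arg in filters[i:] if p == pattern and arg is not None]
        return any(arg_matches(arg, args) for arg in same)
    return False


broad_in_middle = [("/abc/*", None), ("/*", "eID=1"), ("/*", "eID=2"),
                   ("/*", "eID=3"), ("/def/*", None)]
broad_at_end = [("/abc/*", None), ("/def/*", None), ("/*", "eID=1"),
                ("/*", "eID=2"), ("/*", "eID=3")]

print(is_ignored("/def/p.html", broad_in_middle))  # /* stops the search early
print(is_ignored("/def/p.html", broad_at_end))     # /def/* is reached first
```

With the broad /* filters in the middle, /def/p.html is not ignored; moving them to the end lets /def/* match first.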

  12. Marcus

    OK, I think I have it :). Thank you.

    Now I have tested again. I'm a little bit confused:

    IgnoreURL    /*    eID=shariff          = works (all URLs with the parameter eID=shariff are ignored)
    

    When I use the following for the same parameter:

    IgnoreURL    /*    eID=s                  = doesn't work (URLs with the parameter eID=shariff are not ignored; what needs to be done?)
    

    When I use

    IgnoreURL    /*    eID=shariff      = works (all URLs with the parameter eID=shariff are ignored)
    IgnoreURL   /*     tx_news_          = doesn't work (URLs with the parameter tx_news are not ignored)
    

    When I use IgnoreURL /* tx_news_ alone, it also doesn't work.

    The IgnoreURL lines with broader filters are at the end of my IgnoreURL list.

  13. StoneSteps repo owner

    Search argument names and values must match exactly, not partially. In other words, s won't match shariff.

    If the name of your search argument contains URL-encoded characters, use actual characters in IgnoreURL, so for tx_news_pi1%5BeventDate%5D it would look like this:

    IgnoreURL   /url-path*   tx_news_pi1[eventDate]
    

    I would also advise using actual URL paths for each filter rather than the /* shortcut. This way you won't filter out URLs that have nothing to do with that search argument (e.g. the argument was submitted to a page by mistake and is not processed by that page).

  14. Marcus

    Sorry for another question. I hope I'm not annoying you.

    In my report I have this, for example:

    /nc/your-site/events/?tx_news_pi1%5BeventDate%5D=1205&type=9829 
    

    In my configuration I tested these lines:

    IgnoreURL   /nc/your-site/events/   tx_news_pi1[eventDate]=1205&type=9829
    
    IgnoreURL   /nc/your-site/events/   tx_news_pi1[eventDate]=1205
    
    IgnoreURL   /nc/your-site/events/   tx_news_pi1[eventDate]
    

    But it doesn't ignore this URL or, in the case of the last example, any of the URLs with that parameter.

  15. StoneSteps repo owner

    My apologies. I forgot that [ and ] are considered as special URL characters and confused you here. Such characters, which are listed in the second bullet point in URL Normalization, will remain URL encoded, so you would use tx_news_pi1%5BeventDate%5D, like this:

    IgnoreURL   /url-path*   tx_news_pi1%5BeventDate%5D
    

    Other characters will be decoded. For example, a character Ä will appear in URLs as %C3%84, but you would use the actual character in IgnoreURL, like this:

    IgnoreURL   /url-path*   n=xÄy
    

    , which would match a URL like this /url-path?n=x%C3%84y.

    IgnoreURL /nc/your-site/events/ tx_news_pi1[eventDate]=1205&type=9829

    Only one search argument should appear on each IgnoreURL line, so it would look like this:

    IgnoreURL   /nc/your-site/events/   tx_news_pi1%5BeventDate%5D=1205
    IgnoreURL   /nc/your-site/events/   type=9829
    

    I appreciate your feedback very much, Marcus. Thank you.
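
To work out what to write in an IgnoreURL search argument for a given logged URL, the rule from comment 15 can be approximated in Python. This is a sketch under a stated assumption: only [ and ] are treated as characters that stay URL-encoded, because they are the only ones named in this thread; the full list is in the URL Normalization section of README. The function name is hypothetical.

```python
from urllib.parse import quote, unquote

# Assumption: only the two characters named in this thread stay encoded.
KEPT_ENCODED = "[]"


def normalize_search_arg(raw):
    """Decode %XX sequences (as UTF-8) except for characters that stay encoded."""
    decoded = unquote(raw)
    return "".join(
        quote(ch, safe="") if ch in KEPT_ENCODED else ch
        for ch in decoded
    )


print(normalize_search_arg("tx_news_pi1%5BeventDate%5D=1205"))
print(normalize_search_arg("n=x%C3%84y"))
```

The first argument comes back unchanged, with %5B and %5D intact, while the second is decoded to the actual character.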

  16. Marcus

    Thank you for your help, your support and your patience. :)

    I think you can close this issue. It works very well.
