Problems with IgnoreURL
Hello,
Some URLs on my web server include the string /?eID=shariff.
When I use IgnoreURL /?eID=shariff in my configuration file, it doesn't ignore these URLs.
What can I do?
Thank you
Comments (25)
-
repo owner -
repo owner - changed status to open
-
repo owner The syntax for IgnoreURL will be extended in the next release to include search argument names and values. For example, this configuration:
IgnoreURL /* eID=shariff
will ignore a log record containing this URL:
/?abc=123&eID=shariff&xyz=456
but will process log records with different eID values, like this:
/?abc=123&eID=home&xyz=456
See README in the main branch for more information.
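A minimal Python sketch of the matching described above, for illustration only. It is not the actual Stone Steps Webalizer code: the function names are hypothetical, and the wildcard rules (leading asterisk matches the end of the path, trailing asterisk matches the start, no asterisk matches anywhere) and the name-only search argument form are assumptions.

```python
from urllib.parse import urlsplit


def path_matches(pattern: str, path: str) -> bool:
    """Assumed wildcard rules: '*x' = ends with x, 'x*' = starts with x, 'x' = contains x."""
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return pattern in path


def is_ignored(path_pattern: str, search_arg: str, url: str) -> bool:
    """Return True if the URL path matches and the query contains the search argument."""
    parts = urlsplit(url)
    if not path_matches(path_pattern, parts.path):
        return False
    for arg in parts.query.split("&"):
        if "=" in search_arg:
            # name=value form: the whole argument must be identical
            if arg == search_arg:
                return True
        elif arg.split("=", 1)[0] == search_arg:
            # name-only form (assumption): any value is accepted
            return True
    return False


print(is_ignored("/*", "eID=shariff", "/?abc=123&eID=shariff&xyz=456"))  # True
print(is_ignored("/*", "eID=shariff", "/?abc=123&eID=home&xyz=456"))     # False
```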
-
repo owner - marked as enhancement
-
repo owner - changed status to resolved
Will be available in 4.3.0
-
Thank you for including my IgnoreURL request in the new version.
I know it's a beta, but IgnoreURL doesn't work, or I am making mistakes in the configuration file.
My configuration file for version 4.2 contains this:
IgnoreURL *.xml
This worked perfectly.
With the new version, the IgnoreURL line above doesn't work.
I tried different patterns:
IgnoreURL / rss.xml*
IgnoreURL .xml*
IgnoreURL rss.xml
IgnoreURL "completePath"/rss.xml
None of my IgnoreURL lines from the old (4.2) configuration file work.
Is this a bug, or my mistake with the new parameters?
Thank you Marcus
-
repo owner Yes, it's a bug. Let me look into this tonight and I will post a fix in the next couple of days. Thanks for the heads up.
-
repo owner - changed status to open
-
repo owner I posted a new version. See if this works in your setup.
-
Thank you for the new Version. It works partially.
IgnoreURL /rss.xml is working*
IgnoreURL /404/ doesn't work
IgnoreURL *.png doesn't work
IgnoreURL *.css doesn't work
IgnoreURL /processed/ * doesn't work
IgnoreURL * /processed/ doesn't work
IgnoreURL /processed/ doesn't work
IgnoreURL /* eID=shariff doesn't work
The spaces are tabs. For /processed/ I tried a few variants, but none gave the desired result.
-
repo owner Can you clarify how exactly it doesn't work in each case:
IgnoreURL /404/ doesn't work
This will look for the /404/ text anywhere in the URL and should match any of these:
/abc/404/xyz/
/404/
/abc/404/
When you say that it doesn't work, do you mean that any of these URLs are still in the report?
IgnoreURL *.png doesn't work
IgnoreURL *.css doesn't work
I tested these again and I don't see any URLs ending in .png or .css in my reports. Check if the configuration file where you listed these lines is picked up. You should see its name on the command line after "Processed configuration file".
IgnoreURL /processed/ * doesn't work
IgnoreURL * /processed/ doesn't work
The first one will not work because it's looking for a URL with a search argument with an asterisk, like /processed/?*, which is probably not something you want to look for. The second looks for any URL with a search argument /processed/, as in /abc/?/processed/ or /?/processed/. Again, not sure if that is what you are looking for.
IgnoreURL /* eID=shariff doesn't work
This should match any URL with this search argument, as in /?eID=shariff or /abc/?eID=shariff.
Can you also describe your OS, whether it's 32- or 64-bit, and the type of logs you are processing?
-
"This will look for the /404/ text anywhere in the URL and should match any of these:" But with my configuration, the /404/ URLs still show up in the results.
My OS is Windows 7 and I use the 32-bit version. I also tested the 64-bit version, with no improvement. The logs are in Apache CLF format.
Here is part of my configuration and the resulting URL report.
It shows 404, typo3conf and typo3temp. I also tested with *.png, without effect.
-
repo owner Can you put the log file and the configuration file somewhere I can pick them up? Email the location to support@stonesteps.ca. Don't post it here.
Thanks!
-
You have Mail.
Thank You
-
repo owner I posted a new version that fixes the issue. Thank you very much for raising it and helping me to track down the bug. I recommend deleting the screenshots above now that I have your sample log. Give it a try, let me know if I missed anything.
One thing to mention is that if you try IgnoreURL with search arguments, add those patterns after regular IgnoreURL configuration values. This is because they may match on the URL path and not on the query string and will stop the search. I will either describe this in README or will change it to work differently. Haven't decided yet.
I would also recommend removing HTMLPre, HTMLHead, HTMLBody, HTMLPost and HTMLEnd from configuration. They only interfere with the standard HTML constructs. I will probably remove them in future releases.
You can add webalizer.css and webalizer.js into your report directory to make reports look better. You can also reuse the same ones in multiple reports by using HTMLCssPath and HTMLJsPath.
-
Thank you, it works, but I don't understand your note about IgnoreURL with search arguments.
I tried these, with both tab and space as the separator, without success:
IgnoreURL /test/* ?eID=shariff&url
IgnoreURL /test/ eID=shariff
(test is an example URL)
Or is this a mistake in my syntax? Sorry for asking again.
-
repo owner What happens is that once a URL path matches, it will only check search arguments for the same pattern, but no further. It's easier to explain with an example. Consider these filters:
IgnoreURL /abc/*
IgnoreURL /xyz/* eID=1
IgnoreURL /xyz/* eID=2
IgnoreURL /xyz/* eID=3
IgnoreURL /def/*
Notice that there should be no question mark in the search argument and only one search argument should be in one IgnoreURL configuration variable.
Let's say that the current URL is /xyz/?x=1&eID=2&y=2. It will first check the URL path, which is /xyz/, against /abc/*, which will not match. Then it will check /xyz/ against /xyz/*, which will match, and it will go into the special search mode that will stop searching after all /xyz/* patterns are checked. In this case, the match is the line with eID=2. Because the URL path matched, the next entry, /def/*, will not even be tried. This is done so a long list of ignore filters wouldn't slow down processing too much.
Now, consider a slightly different list of filters:
IgnoreURL /abc/*
IgnoreURL /* eID=1
IgnoreURL /* eID=2
IgnoreURL /* eID=3
IgnoreURL /def/*
Now the pattern /* will match any URL, so if a URL in the log line is /def/p.html, the pattern /* will match this URL, but the URL doesn't have any search arguments, so all three lines with eID= will not match. However, because /* matched, /def/* will not even be tested.
One way to work around this is to have those broader filters at the very end, so all other patterns are matched first. However, if the URL pattern is distinct enough (i.e. a longer path or a specific and distinct page name), then this is a moot point because it usually matches only specific pages.
I hope this clarifies it. Let me know if you have any additional questions.
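The lookup order described above can be sketched in Python roughly as follows. This is only an illustration of the explanation in this comment, not the actual implementation: the function names are hypothetical, and it assumes that lines sharing a path pattern are adjacent and that a trailing asterisk means "path starts with".

```python
from urllib.parse import urlsplit


def path_matches(pattern, path):
    """Assumed wildcard rules: '*x' = ends with x, 'x*' = starts with x, 'x' = contains x."""
    if pattern.startswith("*"):
        return path.endswith(pattern[1:])
    if pattern.endswith("*"):
        return path.startswith(pattern[:-1])
    return pattern in path


def ignored_by_list(filters, url):
    """filters is an ordered list of (path_pattern, search_arg or None) pairs."""
    parts = urlsplit(url)
    args = parts.query.split("&") if parts.query else []
    matched_pattern = None
    for pattern, search_arg in filters:
        # Once a path pattern has matched, only lines with that same pattern are tried.
        if matched_pattern is not None and pattern != matched_pattern:
            break
        if not path_matches(pattern, parts.path):
            continue
        if search_arg is None or search_arg in args:
            return True
        matched_pattern = pattern
    return False


broad_in_middle = [("/abc/*", None), ("/*", "eID=1"), ("/*", "eID=2"),
                   ("/*", "eID=3"), ("/def/*", None)]
broad_at_end = [("/abc/*", None), ("/def/*", None), ("/*", "eID=1"),
                ("/*", "eID=2"), ("/*", "eID=3")]

print(ignored_by_list(broad_in_middle, "/def/p.html"))  # False: /* matched first, /def/* never tried
print(ignored_by_list(broad_at_end, "/def/p.html"))     # True: /def/* is tried before /*
```

Moving the /* lines to the end of the list is the workaround suggested above; the sketch shows why the same URL is ignored only in the second ordering.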
-
OK, I think I have it :). Thank you.
Now I have tested again and I'm a little confused:
IgnoreURL /* eID=shariff = works (all URLs with the parameter eID=shariff are ignored)
When I use the following for the same parameter:
IgnoreURL /* eID=s = doesn't work (URLs with the parameter eID=shariff are not ignored; what would have to be done?)
When I use
IgnoreURL /* eID=shariff = works (all URLs with the parameter eID=shariff are ignored)
IgnoreURL /* tx_news_ = doesn't work (URLs with the tx_news parameter are not ignored)
When I use IgnoreURL /* tx_news_ alone, it also doesn't work.
The IgnoreURL lines with broader filters are at the end of my IgnoreURL list.
-
repo owner Search argument names and values must match exactly, not partially. In other words, s won't match shariff.
If the name of your search argument contains URL-encoded characters, use actual characters in IgnoreURL, so for tx_news_pi1%5BeventDate%5D it would look like this:
IgnoreURL /url-path* tx_news_pi1[eventDate]
I would also advise using actual URL paths for each filter rather than shortcutting with /*. This way you wouldn't filter out some URLs that may have nothing to do with that search argument (e.g. submitted by mistake to that page and not processed by the page).
-
Sorry for yet another question. I hope I'm not annoying you.
In my report I have this, for example:
/nc/your-site/events/?tx_news_pi1%5BeventDate%5D=1205&type=9829
In my configuration I tested these lines:
IgnoreURL /nc/your-site/events/ tx_news_pi1[eventDate]=1205&type=9829
IgnoreURL /nc/your-site/events/ tx_news_pi1[eventDate]=1205
IgnoreURL /nc/your-site/events/ tx_news_pi1[eventDate]
But it doesn't ignore this URL or, with the last line, all URLs with this parameter.
-
repo owner My apologies. I forgot that [ and ] are considered special URL characters and confused you here. Such characters, which are listed in the second bullet point in URL Normalization, will remain URL encoded, so you would use tx_news_pi1%5BeventDate%5D, like this:
IgnoreURL /url-path* tx_news_pi1%5BeventDate%5D
Other characters will be decoded. For example, a character Ä will appear in URLs as %C3%84, but you would use the actual character in IgnoreURL, like this:
IgnoreURL /url-path* n=xÄy
which would match a URL like this:
/url-path?n=x%C3%84y
IgnoreURL /nc/your-site/events/ tx_news_pi1[eventDate]=1205&type=9829
Only one search argument should appear on each IgnoreURL line, so it would look like this:
IgnoreURL /nc/your-site/events/ tx_news_pi1%5BeventDate%5D=1205
IgnoreURL /nc/your-site/events/ type=9829
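To illustrate the encoding rule, here is a small Python sketch of how a query string might be normalized before it is compared with an IgnoreURL search argument. It is an assumption-laden illustration, not the tool's code: the function name is hypothetical, and only %5B and %5D are kept encoded because those are the only special characters named in this thread (the full list is in the README's URL Normalization section).

```python
import re
from urllib.parse import unquote

# Escapes that stay encoded after normalization. Only [ and ] are known from
# this thread; the real list of special characters is longer (assumption).
_KEEP_ENCODED = re.compile(r"(%5B|%5D)", re.IGNORECASE)


def normalize_query(query: str) -> str:
    """Decode percent-escapes except the ones listed in _KEEP_ENCODED."""
    pieces = _KEEP_ENCODED.split(query)
    # re.split with a capturing group returns the kept escapes at odd indexes.
    return "".join(p if i % 2 else unquote(p) for i, p in enumerate(pieces))


def has_search_arg(query: str, search_arg: str) -> bool:
    """True if the normalized query contains exactly this name=value pair."""
    return search_arg in normalize_query(query).split("&")


print(normalize_query("n=x%C3%84y"))  # n=xÄy
print(has_search_arg("tx_news_pi1%5BeventDate%5D=1205&type=9829",
                     "tx_news_pi1%5BeventDate%5D=1205"))  # True
print(has_search_arg("tx_news_pi1%5BeventDate%5D=1205&type=9829",
                     "tx_news_pi1[eventDate]=1205"))      # False
```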
I appreciate your feedback very much, Marcus. Thank you.
-
Thank You for your help, your support and for your patience. :)
I think you can close this issue. It works very well.
-
repo owner Great. Thanks for confirming and your help in troubleshooting!
-
repo owner - changed status to resolved
-
repo owner - changed status to closed
IgnoreURL is only applied to the URL stem and not against URL queries. The only thing you can do now is to filter out query strings you don't want to see, but this will not exclude URLs from reports.
For example, if you use ExcludeSearchArg x and these URLs are requested:, then these URLs will be reported:, and these query strings will be reported.
This URL handling is somewhat traditional, as the original Webalizer didn't report URL query strings, and doesn't work for sites that use a single script to handle the entire site, like /?page=/abc/, which seems to be the case here.
I will give it some thought going forward, but for now you can only filter out query strings (called search arguments in README).