Problem with multiple DNSChildren

Issue #16 closed
KS created an issue

After I sorted out all config problems I made a first run, with "DNSLookups no" and "DNSChildren 2". So, no DNS lookup, but GeoIP lookup. It worked for some smaller log files, but failed for bigger ones. I have only very small or really big ones, so I cannot say which size might be the tipping point. The smallest of the "problem files" is a bit over 200 MB, the biggest 1.5 GB with over 5 million records. This large one has about 100,000 hosts according to the stats output.

The run hangs at the "Using logfile" stage indefinitely until I kill it. I tried it several times. Same result. (There were also problems because of a corrupted dns.db afterwards, obviously because of killing the processes. Expected. I removed the db for the next runs.)

Then I tried with "DNSChildren 0". That worked. So I figured there might be some clash (thread safety?) between the 2 DNS workers. And, indeed, after I set "DNSChildren 1" it ran through. The run with the large file takes between 70 and 80 seconds. That's ok, I don't need a second worker for that. Nevertheless I wanted to report it. I haven't tried with DNS resolution yet.

I also have a question about this: I noticed that with the "DNSChildren 1" run (not with 0) the memory went down to "free 0", according to the output of the free command. That's because it filled up buffers/cache as much as it could. It also used 3/4 of the swap file. I still have more than 50% of RAM "available", so it's not a "real" problem. However, my monitoring is set to watch the swap file because I have one or two programs that may seldom have a problem, indicated by them filling up the swap file really quickly. Of course, the monitoring cannot distinguish between those and ssw filling up the swap. I see that there are some options in the "Database Configuration Options" that might help here, but I am not sure which to try: MemoryMode, DbTrickleRate, DbDSync, DbCacheSize. Which could I try to remedy this situation? The first objective would be to avoid swapping at all, the second to not fill buffers/cache that much. Thanks!
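For reference, this is the kind of thing I would try first. The values are only guesses to experiment with, and the comments reflect my understanding of each directive, not the documentation:

    MemoryMode    no     # database mode: keep intermediate data in Berkeley DB instead of RAM
    DbCacheSize   256M   # cap the Berkeley DB cache so it cannot grow without bound
    DbTrickleRate 50     # percentage of cache pages kept clean by trickling writes to disk
    #DbDSync     yes     # synchronous database writes; safest but slowest, try last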

Comments (47)

  1. StoneSteps repo owner

    Most likely you are experiencing issue #10. Since you are building from the source, you can download a patch for the commit referenced in the bug and see if it fixes it in your environment.

  2. KS reporter

    I did not rebuild yet. I thought I should be "immune" if I set DNSChildren 1. As I understood the patch allows for only one write thread. And DNSChildren 1 should limit to 1. Or do I misinterpret this? I just hit the issue with a different set of logfiles right with the first one. Could this issue also occur with only 1 DNSChild?

  3. StoneSteps repo owner

    If you didn't configure any spam filters, the DNS database isn't useful. Just comment out DNSCache and keep DNSChildren at 2, and the DNS Berkeley DB will not be used.

    The main Berkeley DB used for aggregated log file data is only accessed by a single thread and it shouldn't exhibit this problem, even though the bug is still there.

  4. KS reporter

    Thanks for the quick answer! I thought I needed the DNS database for caching DNS lookups anyway. Now I get "Error: Cannot initialize DNS resolver". Maybe that error is unrelated?

    DNSLookups yes
    #DNSCache   /etc/analog/ssw/work/dnscache.db
    DNSChildren 2
    GeoIPDBPath /usr/share/GeoIP/GeoLite2-Country.mmdb
    GeoIPCity no
    

    I also tried to patch with the diff of #10, but it fails, there is at least one more patch I would have to apply first. Is there a way to download the complete source at a given patchlevel via the web interface?

  5. StoneSteps repo owner

    Comment out DNSLookups as well. DNSCache and DNSLookups work together in that if the former is specified, but the latter is set to no, the DNS database will still be maintained to track additional information for IP addresses, such as spammer status. When you comment out both, only GeoIP look-ups will be done against the native GeoIP database and the DNS cache database will not be used.

    Just to be clear, I would advise against the actual DNS resolution. It's extremely slow, it will saturate your network with DNS traffic and it's not very informative because you mostly get ISP addresses and country names derived from DNS names are just bad and unreliable.

  6. KS reporter

    Ok, I thought I had understood you to mean that I should use DNS resolution without the database. So, that doesn't work? The point of DNS resolution in our case would be to aggregate the hostnames with GroupDomains 1 (second-level domain, right?), so one could identify the top internet provider sources. DNS resolution time wasn't that bad when I still used analog. Is there a way to aggregate the IP numbers as well? I mean, without knowing the "base"? Can I set it to aggregate only on the first two octets, for instance? Or, if I set GeoIPCity to yes, how would this be used? Would I get a table/chart of cities as well? But many are locations only, so if I enable resolution of location this would slow it down again? For providers, using the ASNum db from MaxMind would probably be the best way, if it were supported.

  7. StoneSteps repo owner

    If you want actual DNS resolution, use DNSCache and comment out DNSLookups.

    One thread should work, but will be slower.

    GeoIPCity in SSW v4 just adds a city column to the Hosts report. In v5 there is an additional report that aggregates on cities. v5 should be available some time in February.

    If you had DNSCache enabled without GeoIP, those addresses will not have country information.

    I need to look into additional ways to aggregate IP addresses. Country and city is all that is available now.

  8. KS reporter

    I don't quite get it. I figured "comment out" would mean "DNSLookups no", but your README says commented out means yes. So, uncommenting should do DNS lookups. But without a cache I get the "cannot initialize" error. So, for me it's either:

    DNSLookups no
    DNSCache /somepath
    

    -> no lookups, but geoip

    DNSLookups yes
    #DNSCache /somepath
    

    -> "cannot initialize" error

    DNSLookups yes
    DNSCache /somepath
    DNSChildren 1
    

    -> hangs sooner or later, with this new set of smaller logfiles it hangs immediately

    So, in my understanding: DNSLookups no -> no DNS lookups. DNSLookups yes with DNSCache disabled -> results in an error. GeoIP works fine if a database is enabled. Maybe you misunderstood me? I wanted to try out the DNS lookup, just to see how fast it is and if it would help in aggregating the hostnames, or if it really doesn't provide much information and I can avoid it completely. If I get it right, at the moment this isn't possible unless I patch. But I would have to apply several patches, because just this one diff will fail. Or I try to apply the diffed lines manually. I'll wait until the 5.0.0 source is available. Or is there another way to retrieve the current complete source other than Mercurial (I don't want to install Mercurial)?

    Btw, I tried "GeoIPCity yes" and pointed to the GeoLite2-City.mmdb. This resulted in skipping all records. I have seen this before for logs, but only for a few, probably corrupted records. But now I got it for all log files. As the log files work fine with "GeoIPCity no" I was wondering if the "skipped record" warning could also be issued when processing the GeoIP database and thus it thinks the records in there are corrupted (they aren't)?

  9. StoneSteps repo owner

    Both DNS and GeoIP look-ups use the DNS resolver component, so to do either GeoIP or DNS you need DNSChildren, which is the number of threads that do the work.

    When DNS resolution is performed, look-up results are cached in the DNS cache database, so you need DNSCache for this database.

    When GeoIP look-ups are performed, you need GeoIPDBPath, which points to MaxMind GeoIP v2 database. This is a self-contained database and there is no need to cache look-up results, so DNSCache may be omitted.

    When one would like to carry spammer information from month to month, the only place to store these results is the DNS cache database. However, configuring the DNS cache database and DNS threads enables DNS look-ups, so if you want to use the DNS cache database for storing additional IP address information, such as spammers, but not to do expensive DNS look-ups, you would configure DNSLookups no. That's the only time you would need to use it.

    So, if you want just GeoIP:

    DNSChildren   2
    GeoIPDBPath   /path/to/mmdb
    

    If you want DNS and GeoIP:

    DNSChildren   10
    DNSCache   /path/to/database
    GeoIPDBPath   /path/to/mmdb
    

    If you want GeoIP, spammer caching, but no DNS:

    DNSChildren   10
    DNSCache   /path/to/database
    GeoIPDBPath   /path/to/mmdb
    DNSLookups   no
    

    For the lock-up, I will push a source change for this bug into the v4.3 source tree. I don't intend to do a full release, but you will be able to download the source and build again, so we can see if the lock-ups you are experiencing are caused by this issue. Will post here when it is available.

    Btw, I tried "GeoIPCity yes" and pointed to the GeoLite2-City.mmdb. This resulted in skipping all records.

    GeoIPCity has no relation to log processing. I'm not sure what you mean by this. It merely does an additional look-up in the GeoIP city database and adds a column in the reports when it is enabled. If you can create or describe a small test case, I will look into this.

  10. StoneSteps repo owner

    Turns out I had already pushed changes for the lock-up in the Berkeley DB for that other bug, but the person who reported the bug never reported back whether it fixed the issue for them. I also added the $(CC)-to-$(CXX), etc. changes there to make it easier to rebuild. Go to the tags tab on the Downloads page and download the source at v4-3-4-1:

    https://bitbucket.org/StoneStepsInc/stonestepswebalizer/downloads/?tab=tags
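    In case the web UI is inconvenient: Bitbucket can also serve a source snapshot for any tag as a tarball, so no Mercurial client is needed. A sketch (the tag name is the one above; adjust as needed):

    ```shell
    # Build the download URL for a source snapshot at a given tag.
    # Bitbucket's "get" endpoint serves a tarball of the repository.
    REPO=https://bitbucket.org/StoneStepsInc/stonestepswebalizer
    TAG=v4-3-4-1

    URL="$REPO/get/$TAG.tar.gz"
    echo "$URL"
    # then: curl -L -o "ssw-$TAG.tar.gz" "$URL" && tar xzf "ssw-$TAG.tar.gz"
    ```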

    Let me know if it locks up while resolving DNS in this build. A word of caution if it works: specifying more threads will make it go faster, but your local network may be saturated with DNS traffic so badly that other services may starve. Better to keep it under 10 and wait for it to finish.

  11. KS reporter

    Thanks! It compiles. But I have problems with it.

    DNSLookups yes
    DNSCache    /etc/analog/ssw/work/dnscache.db
    DNSChildren 2
    #GeoIPDBPath /usr/share/GeoIP/GeoLite2-Country.mmdb
    GeoIPDBPath /usr/share/GeoIP/GeoLite2-City.mmdb
    GeoIPCity yes
    

    I get GeoIP data. But no cities. I get a city column. It's quite possible that there are no cities for the 20 hosts I reduced TopHosts and TopKHosts to, so let's forget that. But I'm not getting DNS resolution. All hosts remain IPs. I checked three of them just in case those 20 might not have a hostname, but two of them do. Do I have to remove the DNS database?

  12. StoneSteps repo owner

    The city database contains both, countries and cities. You need to point only to the city database.

  13. StoneSteps repo owner

    Also, if you ran this configuration without the city database before, IP addresses were saved without cities and they will be found in the DNS cache database, so you will not see cities in the report. You would have to point to a different DNS cache database to get a fresh start and populate cities. Unfortunately, this means DNS resolution would have to be repeated.
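    For example (the path is just an illustration, based on the one used earlier in this thread):

    DNSCache   /etc/analog/ssw/work/dnscache-city.db   # a new file forces a fresh cache so cities get populated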

  14. KS reporter

    It's working now on one server which didn't have a dnscache.db. I'll remove that on the other one as well. Yes, it's much slower. I guess I won't use it ;-) The Country db was commented out; I just realized that I didn't use the quote marker for that snippet, so that bit got distorted by the "#". I'll change to the Country db if I do without City support. I figure that might be faster as it uses a smaller db?

  15. KS reporter

    On the server where it seems to run it got stuck with one of the last log files. I eventually killed it once I had the impression that the host couldn't have that many IPs. Before that I got this:

    Generating report for November 2018 (HTML)
    Saving current run data... [12/31/2018 22:14:06]
    Saving history information...
    Generating report for December 2018 (HTML)
    Bad character buffer
    

    There should have been one for January, but I guess ssw got killed at this point. So, the script continued with the next host/files, where it seemed to start normally but then got stuck at the "Using logfile" stage. I don't know if there is a connection. So, the first successful run didn't continue. I'll try on the other host now after removing all dbs.

    Update: the second run shows exactly the same buffer problem with that file, but it goes on with the next host/file and did not stop at the "Using ..." stage. Anyway, it's so slow (I upped it now to 10 children) that I will hardly use it. On the second server it's still walking through the domains starting with "b...", most of them with small log files. I can't imagine how long it would take to process the big ones I used during the last days for testing (fortunately, I had excluded them from today's tests :-). I agree, it's much too slow for regular use.

  16. StoneSteps repo owner

    On the server where it seems to run it got stuck with one of the last log files.

    Keep in mind that DNS resolution takes a really long time and it may appear to be locked up, but it may actually be running DNS resolution. The best way to verify this is to set DNSLookups to no and rerun on a fresh DNS cache database. This way all the threading machinery will work and it will use only local resources.

    I eventually killed it once I had the impression that the host couldn't have that many IPs.

    Just to be clear, once you kill the process, you have to delete all Berkeley DB databases because they are in an unstable state at that point and may cause all sorts of issues if used.
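    Something along these lines, with all names being stand-ins for your actual DNSCache and state database paths:

    ```shell
    # Simulate cleaning up after a killed run: the Berkeley DB files are
    # left in an unstable state and must be removed before the next run.
    WORKDIR=$(mktemp -d)                                  # stand-in for the real work directory
    touch "$WORKDIR/dnscache.db" "$WORKDIR/webalizer.db"  # stand-ins for the real state files

    rm -f "$WORKDIR"/*.db                                 # remove all Berkeley DB state
    ls -A "$WORKDIR"                                      # prints nothing: directory is clean
    ```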

    Bad character buffer

    This seems to indicate that one of the fields in the logs is longer than some internal buffers allow. Given that it is during HTML generation, it might be that the field was just below the workable size, but when it was HTML-encoded, it grew over that buffer size. Without having log files that cause this, it would be hard to troubleshoot this issue.

    it's much too slow for regular use.

    I am thinking of removing DNS resolution in the future. It's just too painful. I will add ASN information to the hosts table and as a separate report, and will keep an eye on other ways to enhance IP reporting, but DNS resolution only works for very small logs.

  17. KS reporter

    Yes, I normally remove all the databases, as I usually refresh config files with each run at the moment. The second run eventually ran through. But the log files for that specific host took an extreme amount of time, half an hour!

    Processed 52165 records (117 ignored) in 0.81 seconds
    DNS cache hit ratio: 15% (1227:6504)
    DNS wait time is 1759.28 seconds
    Generated reports in 0.12 seconds
    

    Actually, the processing time after "Using" wasn't that long. It hung for a very long time right before each "Saving current run data." this time. It looks like it had a hard time writing 8,000 hosts to the dnscache because of the single writer thread? That would mean only 4 hosts per second. I can't believe that.

    Anyway, about the buffer overrun: it happened only when I had DNS resolution switched on. How large is the hostname buffer? I looked at the hosts report for December and found this string "ec2-54-201-135-23.us-west-2.compute.amazonaws.com". It's 49 characters long. Could it be that?

    Yes, ASN reporting would be a really good addition, I think. It would allow reporting on the access providers and also grouping the various IP ranges of an access provider into one entry. I'll be happy to test that once it's available.

  18. KS reporter

    Looks like I was wrong about the buffer overrun happening only with dns lookup. I just missed it before. It happens with dns and geoip disabled as well (and all dbs removed). I can provide you with the logfile, it's <200k gzipped. The last time I got an output for January (it crashes on December) with ssw on that file was three days ago. I did some tests in between, so it must be something else. Maybe some grouping. You may need the config file as well to reproduce it.

  19. StoneSteps repo owner

    The processing time is all in the DNS - you can see it in the 1759 seconds of DNS wait time.

    Are you processing with query strings or just URLs?

    It is not a buffer overrun, which would be a security issue. The error prevents a buffer overrun. As a general rule I try not to truncate anything, but maybe I should reconsider some cases.

    If you can post the log file somewhere, send a link to support@stonesteps.ca. Don't post it here. Once I pick it up, I will reply so you can take it down. Thanks.

  20. StoneSteps repo owner

    I pushed a change for the Bad character buffer bug. Should process those logs now. Grab the source with the tag v4-3-5-1.

    One thing I noticed is that if you process these log files as CLF logs, some entries will be incorrectly reported because these are actually Apache logs and their user agent and referrer fields are wrapped in double quotes and not space-separated, as a CLF log would have it. This means that if there is a space in the referrer, the part after the space will be interpreted as a user agent. In order to process these entries correctly, process them as Apache logs and configure ApacheLogFormat.
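    For the common Apache "combined" format, the configuration would look roughly like this (I'm assuming ApacheLogFormat takes a standard Apache LogFormat string; check the README for the exact syntax):

    LogType          apache
    ApacheLogFormat  %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"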

    Once these logs are processed, you will see malicious referrer entries that are designed to exploit vulnerabilities in websites. For example, one of them starts with 554fca.

    Lastly, just for a reference, on an Ubuntu VM running on my laptop SSW generated HTML and TSV reports for 16129 records in about 4.5 seconds. GeoIP with country and city was on, DNS off, no group/hide lists (those will also slow things down) and with query strings.

  21. KS reporter

    Oh, ok, thanks for mentioning that LogFormat thing. Yes, I was using clf. I had definitely read that part of the README, and it says that most people should be fine using clf against Apache logs. So I did. I got a few "bad record" warnings, but the ratio wasn't bad. For the really big logs I got maybe a few hundred, but that's not much compared with the several million records that are in there. So I thought it was normal and that you are scrutinizing the lines more than webalizer does. And you do - I just didn't know that was a knob I could turn. I reran the tests without the fix (and with an ApacheLogFormat set) and the buffer warning was already gone. So, it was triggered by using the wrong log format. I've now built 4.3.5.1 and ran again. It's still gone.

    I don't see any params in the Referrer. But, as you suggested earlier I use ExcludeArgs * and that is probably used for Referrer as well? I see one referrer that starts with "45ea" and seems to have been truncated. It looks like a session identifier plus a PHP serialized array and may have actually been a goof on the server and not an attack. I didn't detect any other similar strings elsewhere.

    Thanks for the reference ;-) My whole run time for those 16129 records is about 1-2 seconds. That looks fast. This is with lots of grouped referrers and agents, but without Args, and GeoIP from the cache. It still means a calculated 500 s for the big file with over 5 million records. Actually it takes a bit longer, maybe 20 minutes.

    Update, just made the run on the big file. Not a single bad record. :-) Stats:

    Processed 5205809 records (16193 ignored) in 108.88 seconds, 47812/sec
    DNS cache hit ratio: 100% (164921:0)
    DNS wait time is 0.16 seconds
    Generated reports in 1.81 seconds
    Maintenance time is 467.02 seconds
    Total run time is 577.87 seconds
    
  22. StoneSteps repo owner

    and it says that most people should be fine with using clf against Apache logs

    The original CLF format doesn't have user agents and referrers and I cannot remember now what governs the extended/combined log format - most people just use Apache format. I will revisit the format and update README at some point.

    So, it was triggered by using the wrong log format. I've now built 4.3.5.1 and ran again. It's still gone.

    It was a fairly intricate bug. This line by itself would work. It took a memory allocation of a certain size before this referrer was processed and a double quote at the end of the line, which only happens with this log line when it is processed as a CLF format.

    But, as you suggested earlier I use ExcludeArgs * and that is probably used for Referrer as well?

    No, only for your own URLs.

    It looks like a session identifier plus a PHP serialized array and may have actually been a goof on the server and not an attack.

    Seemed like SQL in the logs, but I didn't look closer.

    For processing speed, it appears that most of it was spent on saving data. How many URLs and hosts are in that file? Also, how many group/hide entries do you have in your lists?

  23. KS reporter

    I have now processed another log incremental to it, so I give data from one month, which was some 3.5 million lines of the 5 million. "Unique" hosts are 100,000, URLs are 670, referrers are 2,000, agents are 10,000. There are 110 GroupAgent entries (making 15 groups of the 10,000), 102 HideAgent, 24 GroupReferrer, 12 HideURL (hiding *.gif etc.), and only a few others. There are also 40 SearchEngine directives for search strings.

  24. StoneSteps repo owner

    I don't have logs this size and it's hard for me to gauge where things may slow down. Looking at the speed above (47812 rec/sec), 5M records should be processed in 105 seconds, so something is getting slower as more data piles up. Is there any chance you could share this log with me?

  25. KS reporter

    If you really want to try it I can do that, like with the smaller one. The gzipped file is only 80 MB, but ungzipped it's 1.4 GB. I can also give you the config file. As I understand it, the "processing" value is only the time it takes to parse the log? And only then starts the hard work ... Note, I'm using these directives:

    MemoryMode no
    DbTrickleRate 50
    DbCacheSize 256M
    

    but it doesn't slow the processing down very much (on an SSD). I have the feeling that it's even a bit faster with smaller files. With the same file and almost the same config (plus GroupURL directives that I don't use for ssw, so I have one set of stats with grouped filetypes and one without), awffull is in general about ten times faster; e.g. that same file takes about a minute. If you want to compare, I can give you the config for awffull too; it's available from the repo on Ubuntu. It's not that much different, and it gives results that are within 1% of the ones from ssw. With one exception: the Referrers are 40% less with awffull. (Maybe it can't find them with "LogType auto". I'll check again if I can find a better LogType for it.) It also shows "visits" for all stats, and for Entry/Exit it counts some stuff called Stickiness and Popularity; I'm not sure if these are useful. It doesn't have Robot, but I don't use that at the moment anyway. And it doesn't have the fancy JS stuff, I haven't tried that either.

  26. StoneSteps repo owner

    If you really want to try it I can do that. Like with the smaller one.

    That would be great. At least to know where the slow-down is.

    As I understand the "processing" value is only the time it takes to parse the log?

    It's a bit more. It parses the log file line by line and aggregates on various items, such as hosts, URLs, visits, cities, countries, etc. It also includes some intermediate look-ups in Berkeley DB, but excludes the bulk of the write done at the end of processing, which is what we see at the end. Once all is done, reports are generated from Berkeley DB, so data doesn't have to be in memory.

    MemoryMode no

    In v5 I removed memory mode completely, so it can process log files of any size with available memory. It keeps only about 2-3 visit timeout time worth of data in memory and everything else goes to Berkeley DB. I did some optimizations and for my 100 MB logs it handles it at 70-90K rec/sec, but I don't have logs as large to test this new mechanism. Looking forward to running your log file.

    And it doesn't have the fancy JS stuff, I haven't tried that either.

    Keep in mind that Highcharts is not Open Source. I tried a couple of Open Source graphing packages and they weren't advanced enough to render charts similar to the PNG images. I may revisit it in the future, but for now if you want a world map and bigger charts, that's the only way.

  27. KS reporter

    I didn't check any of the JS stuff yet. I saw there are two js files. I wondered if this means there is a very fancy way (with Highcharts) and a bit less fancy one? Or is that "webalizer.js" just a wrapper around webalizer_highcharts.js? I'll send you the link once I've moved the file to a server.

  28. KS reporter

    So, you think it wastes the time mostly with creating the report and/or writing the stuff to incr and hist dbs? The processing including the grouping and such is already done within the 109 seconds? Aha. Once you have something ready for v5 I can test it. Next month ;-)

  29. StoneSteps repo owner

    Sorry, I misread your post about MemoryMode. Given that it's set to no, the maintenance time includes intermediate writes to disk every SwapFrequency * 1000 records, which reduces the amount of memory required, but increases the number of database look-ups if some data that was recently requested was swapped out to disk. If you set MemoryMode yes, which is the default, the speed should be comparable with other Webalizer forks, given the same configuration.

    SSW v5 will do it more selectively and will keep track of what was not requested recently and the current size of data in memory and swap out only when it reaches the threshold. I will test it against that large file when I have a chance in the next few days to see where it stands.

    I wondered if this means there is a very fancy (with highcharts) way and the other a bit less fancy? Or is that "webalizer.js" just a wrapper around webalizer_highcharts.js?

    No, webalizer.js is for all JavaScript things. At some point I implemented a help system for XSL reports and it used JavaScript to bring up a context-sensitive help window for column headings, etc. I have since removed it because nobody seemed to be interested in maintaining and translating that. It also has some stuff to move through reports with Ctrl-Alt-Up/Down keys, but primarily it implements the basis for pluggable JavaScript charts, so one could write their own JavaScript charts or I could switch to some other package more easily.

    So, you think it wastes the time mostly with creating the report and/or writing the stuff to incr and hist dbs?

    Reading data for reports is quick and, given that it works only with aggregated data, is never the bottleneck. It's writing intermediate data during processing, as well as swapping out data that was just used and is then referenced again, so it has to be looked up in Berkeley DB, which is slower than in-memory look-up structures.

    Group/hide are ordered lists to enforce rules that match first and there is no way to traverse them any quicker, so the more of those, the slower it will be, quite linearly. Some of them that take more than one entry, such as SearchEngine, will not restart from the beginning, but in general it's still pretty slow. It doesn't matter for small lists, but I saw some people use thousands of those and that will definitely make it visibly slower.
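    The first-match semantics can be sketched like this (just an illustration of the behavior, not SSW's actual code):

    ```shell
    # Ordered group lists: the first pattern that matches wins and the
    # rest of the list is never consulted for that item.
    echo "Mozilla/5.0 (compatible; Googlebot/2.1)" | awk '
      /Googlebot/ { print "Google"; next }    # matches first, so this group wins
      /Mozilla/   { print "Browsers"; next }  # would match too, but is never reached
    '
    ```

    So a list with N entries costs up to N comparisons per item, which is why thousands of entries slow processing down quite linearly.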

  30. StoneSteps repo owner

    I had a chance to do some limited testing yesterday and I could reproduce it on my Ubuntu VM. It took about 16 minutes to process one of the logs. This included all query strings, no DNS, GeoIP country/city and nothing ignored, grouped or hidden. The same log file on the Windows machine hosting that VM processed the same configuration in 3 minutes, and without query strings in just over a minute.

    The default GCC on Ubuntu v16 cannot build the latest version of SSW, so I switched to Fedora, just to test, and the result was the same - about 15 minutes. While looking at the machine resources during a test run, I noticed that the VM's virtual disk was written to at a much higher rate than I would expect, and then I realized that my output was configured to be on a shared directory located on my host machine. I reran the same test with everything being on the VM and the results were similar to those on Windows.

    It will take me a few days to do more thorough tests for SSW v4.3.x, but in the meantime, if you are running on a VM, check if output goes to any mount points. If it does, as a test, try writing to a local directory.

  31. KS reporter

    Thanks for the update. So, you think in your case it was slowed down by the driver? The server I'm testing it on is an OpenVZ container hosted by a major hosting provider. I have no idea where the storage is located, but whatever I'm writing there goes to the main disk of the VM. It doesn't have access to anything else beyond the swapfile, which I have no control over. All other writes are really fast, for instance gzipping and such. I didn't test without Ignore, Group and Hide, but according to our earlier conversation one wouldn't expect a heavy impact from fewer than 200 entries, right? (As I would consider that still "small"?) (And anyway, removing those options so that it runs faster is not an option ;-) Do you think it makes sense to put the path to the two dbs on a tmpfs and see how that compares? (Or run with Incremental no as a test? This should not create the hist and incr dbs?) (Or point the HTML output to tmpfs, too? But these files are only 41 MB in total for two months.)

    About MemoryMode: at first I understood that you would be removing MemoryMode and ssw 5.0 would be running in "database mode". But now, after your correction, I think it means that you are going to remove the MemoryMode config option and it is going to run in a "mixed" mode in case there isn't that much RAM? And no way to switch between the modes by configuration, because there's basically only one mode now? Did I get it more or less right?

    Ubuntu 16 cannot compile ssw 5.0? Hm, problem here :-(

  32. StoneSteps repo owner

    No need to change anything in the configuration. I just wanted to make sure there are no fancy mount points involved. I will run tests with the right version in various modes and report back. It will take me a few days.

    Regarding MemoryMode, all other Webalizer forks/versions run as if MemoryMode yes is used. It will use more RAM this way because it doesn't dump intermediate data to disk. I cannot say definitively how much more RAM it will take for those large logs, but will be able to tell after the tests described above. If that feels like too much, you can filter out query strings - those create 400K URLs, which is much more to write to the state database. It should also run faster with memory mode set to yes, because in SSW v4 database mode dumps records every X log records and may load some of those again, which requires more saving. How much of a difference it will be, I cannot tell now. You can either give it a try or just wait while I finish the tests.

    Ubuntu 16 cannot compile ssw 5.0? Hm, Problem here

    It's just that the version of GCC there is older. From the look of it, it should be possible to install a newer g++ package, but I will need to test it better. I'm not cutting anything off at this point - just stating how things look now.

  33. StoneSteps repo owner

    I'm still running my tests, but found a bug in the database mode (i.e. MemoryMode no) that might explain the poor performance on your machine. The bug is that SwapFirstRecord and SwapFrequency are supposed to be in 1000s, as in the value 50 is interpreted as 50000, but their default values are not multiplied by 1000, so it swaps out URLs and hosts every 100 log records, which makes it run really slow. If you want to continue using database mode, add this to your configuration file (use tabs if you enabled phrase values in configuration):

    SwapFirstRecord  100
    SwapFrequency    100
    
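    Since the configured values are interpreted in thousands, the settings above can be sanity-checked with a quick shell snippet (the scaling factor comes from the description above; the sample values in the loop are arbitrary):

```shell
# The configured SwapFrequency value is multiplied by 1000 internally,
# so a configured value of 100 means "swap out in-memory items every
# 100,000 log records".
for v in 10 50 100; do
  printf 'SwapFrequency %3d -> every %d log records\n' "$v" $((v * 1000))
done
```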

    From my preliminary tests, for one of your larger files, it will require about 300 MB of RAM when running in the database mode and will spike to about 400-500 MB in the default memory mode. I used DbCacheSize set to 256 MB for most of my tests, but it could be omitted and Berkeley DB will use its own configuration optimal for the local machine.

    One more thing I wanted to mention is that those 16K ignored records are all shuffled time stamps. Just by a couple of seconds in the cases I checked, but the log processor nevertheless ignores older records.

  34. KS reporter

    Oops, that makes a "tiny" difference ;-)

    Processed 5205809 records (16193 ignored) in 91.38 seconds, 56968/sec
    DNS cache hit ratio: 100% (164921:0)
    DNS wait time is 0.19 seconds
    Generated reports in 1.99 seconds
    Maintenance time is 3.02 seconds
    Total run time is 96.58 seconds
    

    And there was no real change to the memory situation this time: free memory, cache and swap remained at about the state they were in before. Great! But I think MemoryMode yes would take more RAM than you think. The first few times, I ran it in default mode and found that it was maxing out the cache and the so-called "free" memory (not the available memory, there was still plenty of that, more than 3 GB), and once the value for free had hit zero it maxed out the swap file (only 1 GB) until that was at zero as well. That's why I switched to database mode, to avoid this heavy use of caches and swap. Swap... settings are only relevant for database mode, right? I hope the "new" mode in ssw 5 doesn't make too liberal use of available free RAM for a file this large.

    Thanks for the explanation of the ignored records. I didn't pay much attention to that, as it is not much compared to the number of records. But good to know the reason. There's obviously sometimes so much traffic that some Apache workers get queued for logging and log a time stamp that's already in the file.

  35. StoneSteps repo owner

    Thanks for the details. Memory mode does consume more RAM, but I wouldn't expect it in gigabytes. So far I have tested mostly on Windows because of the tools I have, and it would spike 100-200 MB more in memory mode (on top of about 300 MB for that large log file), but shouldn't be anywhere near gigabytes. I will keep an eye on it when testing on Linux later.

    Yes, swap settings are ignored in memory mode. In SSW v4.3.x you control how often in-memory items are flushed to Berkeley DB with the swap first record/frequency settings. They only affect SSW's internal memory; Berkeley DB memory is controlled by the database cache size. In my tests, a frequency below 50 (i.e. 50,000 log records) would start taking a toll on performance, and going below 1000 is where it really starts running slow.

    SSW v5 will use RAM until it hits a threshold, and then it will swap out items based on how far back they are on the log time line, starting with a couple of configured visit lengths. I'm still testing it, though, so it may change a bit before the release.

  36. StoneSteps repo owner

    Sorry for the delay. This large log file helped me find some things to improve in how SSW v5 handles memory and I spent some time gathering data I need.

    With regards to the memory tests, the table below shows various processing times for the log files with 5,205,809 records. Nothing was filtered or grouped, to focus just on memory. URLs included all query strings, which reported 464,999 distinct URLs for one of the months. DNS look-ups were disabled; GeoIP with city look-ups was enabled. There were 98,511 hosts reported for the same month. In database mode, data was swapped out of memory every 100K records. All tests were run on Windows, just because it is easier for me to capture and analyze data this way. The results should be similar on other platforms, but I don't have the capacity to run these tests elsewhere.

    configuration                              max memory (MB)  speed (rec/s)  processing time (s)  report time (s)  maintenance time (s)  total time (s)
    v4.3, DB mode, default cache size                      109          40989               127.00            48.38                106.31         281.99
    v4.3, memory mode, default cache size                  247          45218               115.13            51.98                 58.22         225.55
    v4.3, DB mode, 128 MB cache, freq 100/100              305          68162                76.37            26.16                  9.10         111.78
    v4.3, memory mode, 128 MB cache                        420          83439                62.39            21.89                  5.30          89.72

    In memory mode, all items, such as those 500K URLs, are kept in memory while log files are being processed and then saved in a Berkeley DB to produce reports. In database mode, URLs, hosts, etc. are saved into a Berkeley DB file for the current month every SwapFrequency log records, which puts a cap on maximum memory at the expense of processing speed.

    Processing time refers to log record processing, not including saving data in Berkeley DB; report time refers to how long it took to generate reports from the data in the Berkeley DB; and maintenance time refers to the time it took SSW to save data in a Berkeley DB.

    If DbCacheSize is not specified, the default Berkeley DB cache size is used, which turns out to be very small in current terms - just a few hundred KBs.

    These are memory and CPU charts for each of the rows above. The blue line indicates memory (10 means 100 MB), the dark purple line shows peak memory usage on the same scale, the red line shows combined CPU usage for the SSW process (100 means one CPU core) and the bright purple line shows disk writes (10 means 10 MB/s). The time line spans 5 minutes in all charts. That log contained data for two partial months, so two sets of reports were generated, as can be seen from the change in memory usage in the middle.

    This is the chart for SSW running in database mode with the default Berkeley DB cache size. The maximum memory in this case is capped at about 110 MB and disk writes occur throughout the entire run, as data is saved to the database.

    v4-3-db-mode-0mb.png

    In memory mode with the same default Berkeley DB cache size SSW runs a bit faster. Memory usage goes up to about 247 MB and disk writes start later on, when it is saving data and about to generate reports:

    v4-3-memory-mode-0mb.png

    Things improve quite a bit when Berkeley DB cache is increased to 128 MB. This is the chart for SSW running in database mode. Memory usage goes to about 305 MB.

    v4-3-db-mode-128mb-100-100freq.png

    Running in memory mode with a 128 MB cache, memory usage jumps to about 420 MB when data is saved to Berkeley DB. That's the fastest mode.

    v4-3-memory-mode-128mb.png

    Based on that, the Berkeley DB cache size matters the most, and I will change the SSW default to some larger value rather than rely on Berkeley DB's own default cache size.

    I am not seeing that memory jump to gigabytes you mentioned. What did you use to observe that value?

  37. KS reporter

    Hello, thanks for the detailed checks and data! Remember, I made my observations with the default SwapFrequency, which was accidentally set to a very low value in the source. I'm not sure if the SwapFrequency applies to memory mode as well. I guess it does? That's why I switched to database mode, still not knowing about the small bug with SwapFrequency. That seemed to have helped, but I was still seeing the swap file going down heavily. After I set the SwapFrequency (and maybe also upped the DbCacheSize) this seems to have completely stopped. The memory values before and after are basically the same. I use what you can see in the config file I provided:

    MemoryMode no
    DbTrickleRate 50
    DbCacheSize 256M
    SwapFirstRecord   100
    SwapFrequency    100
    

    Don't know if DbTrickleRate makes any difference or is of use at all. I stayed with what seemed to work.

    I used the free command on Linux to observe the memory values. The normal memory level looks somewhat like this:

                   total        used        free      shared  buff/cache   available
    Mem:         6291456     3039288      955708      649708     2296460     3013494
    Swap:        1048576       72320      976256
    

    with "Swap used" often being 0. When I first ran in memory mode (and with none of those config values set) with the big log file, the "Swap used" value quickly went up to the maximum (1048576) and "Mem free" went to 0. A "Mem free" of 0 is not particularly bad if you still have enough "Mem available" - which I had. But it indicates there's something unusual going on which might become a problem later. Most of the memory in use while running ssw went into buffers/cache, which went up by maybe 1 GB or more. The "Mem used" value did not change considerably. So, what happens is that buffers (I think it's not the cache) are getting filled, and at the same time either these buffers or older buffers get flushed to the swap file. I didn't check for the pattern of reads/writes to the swap file; I just noticed that it got "drained" quickly. I found this somewhat disturbing, as it can also decrease performance. I don't know the actual cause. Maybe it's a memory balancing mechanism of the kernel, or some behavior induced by the Berkeley DB.

    Yesterday I ran the latest statistics on the files accumulated since then (which are big, but not as big as the last one; the biggest is 600 MB now), with the config values shown above, and there is no noticeable impact on the free output. With these settings and with the latest source I downloaded from the repo it's also much faster, nearly as fast as AWFFull. It's fast enough that I didn't bother to write the exact values down yesterday. So, from my pov it runs quite fine if I adjust it a bit and run in database mode. I never went back to memory mode after the first bad experiences and am not inclined to do that now, as it runs fine with the current settings ;-) (I provided the "new" values for the big file above on 2019-01-23, where I already made the same observation about speed and memory use; it's ok now with the current settings. I hope it doesn't change much in the future, where some of the values may not be configurable anymore.)
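    For watching this while a run is in progress, something like the sketch below can sample free once per second until the process exits. The function name is made up, the awk field positions assume the usual two-row free layout shown above, and the webalizer command line in the usage comment is a placeholder:

```shell
# Sample memory and swap once per second while the process with the
# given PID is still alive.
sample_while_running() {
  pid=$1
  while kill -0 "$pid" 2>/dev/null; do
    # Row 2 of `free -m` is Mem:, row 3 is Swap:; $3 is the "used" column.
    free -m | awk 'NR==2 { printf "mem used=%sM free=%sM buff/cache=%sM  ", $3, $4, $6 }
                   NR==3 { print "swap used=" $3 "M" }'
    sleep 1
  done
}

# hypothetical usage:
#   webalizer -c webalizer.conf big-access.log & sample_while_running $!
```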

  38. KS reporter

    With "latest source" I mean 4.3.5.1. I just looked and see there's a newer source 4.3.6.3 or so. I didn't test that yet!

  39. StoneSteps repo owner

    Thank you for the detailed post. Most people seem to stick with the tried-and-true memory mode, and I don't get much feedback like this, especially for large amounts of data. It is much appreciated.

    It seems that Berkeley DB behaves differently on Linux, as I didn't see memory going to 1 GB on Windows. It looks like I need to do more tests later on. I need to wrap up some ongoing stuff first, though.

    I'm not sure if the SwapFrequency applies to memory mode as well. I guess it does?

    Swap frequency is ignored in memory mode because everything is kept in memory until it's time to generate a report.

    I just looked and see there's a newer source 4.3.6.3 or so. I didn't test that yet!

    I fixed the defaults for swap values in v4.3.6 and also the Ctrl-C handler that turned out to be broken for a while. No other changes that would be visible on Linux.

    Don't know if DbTrickleRate makes any difference or is of use at all.

    Trickle sets up a background thread that keeps the specified percentage of the Berkeley DB cache free for data it reads from disk, so it doesn't have to wait for dirty pages to be written out before reading data from disk. Its usefulness depends on many factors, such as whether there are repeat visitors, whether URLs are requested in certain patterns, etc. It's hard to say. I would keep it between 10% and 20%. It has no effect in memory mode.
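    As a concrete, hypothetical example of the 10-20% suggestion, the relevant webalizer.conf lines for database mode would look like this (the 256M cache size matches the value used earlier in this thread; the 15 is just a sample value within the suggested range):

```
# Database mode only: keep roughly 15% of the Berkeley DB cache clean
# so reads don't wait for dirty pages to be flushed first.
MemoryMode    no
DbCacheSize   256M
DbTrickleRate 15
```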

    Most of the memory in use when running ssw went in buffers/cache which went up by maybe 1 GB or more

    There are two settings I removed in v5 that instruct Berkeley DB not to use system buffers where it is supported. DbDirect and DbDSync. Makes me wonder if I should reconsider. I will experiment later on with them on Linux.

  40. KS reporter

    There are two settings I removed in v5 that instruct Berkeley DB not to use system buffers where it is supported.

    Do you mean that using or not using them may get around the "1 GB issue"? If using those settings would help I would certainly like to use them unless they are removed ... If you have any testing instructions for 4.3.5.1 I can do a test with and without.

  41. StoneSteps repo owner

    Do you mean that using or not using them may get around the "1 GB issue"?

    It was just a guess that maybe the OS uses the same amount of memory for buffering that Berkeley DB itself uses for caching, so the memory would double. Setting DbDirect to yes should eliminate system buffering, but I haven't tested this configuration in quite a while, and certainly not on Ubuntu. Maybe it's better if you leave it alone until I have had a chance to do more testing. I've been busy reworking how v5 handles memory and, once I'm done, will revisit these settings.

  42. StoneSteps repo owner

    I ran a few tests on Ubuntu v16.04 using this command:

    top -b -p `{ webalizer -o reports -f apache -c webalizer.conf 1gb-logfile.log > wb.out & } && echo $!`
    

    All values in these charts are as reported by top. The test VM has 1 CPU/core, 512 MB of RAM. SSW was configured with DbCacheSize set to 50 MB and swap frequency of 10000 records (i.e. value 10 in webalizer.conf).

    The red line in all charts is the CPU usage and is shown on the right axis. All other values are memory and are shown against the left axis. The RES memory is the actual RAM and is shown in light blue.

    All tests took about a couple of minutes, but I ran them on battery on my laptop, so the machine wasn't running at full speed and the timing is not important. Every dot on the X axis is about a couple of seconds, give or take.

    This test is for SSW v4.3.6 running in memory mode:

    ubuntu-16-04-1gb-log-50mb-buf-mem-mode-ssw-v4-3-6.png

    This is for SSW v4.3.6 running in the database mode.

    ubuntu-16-04-1gb-log-50mb-buf-db-mode-ssw-v4-3-6.png

    This is for SSW v5.0.0, just for comparison, to show how it behaves without the database/memory mode switch.

    ubuntu-16-04-1gb-log-50mb-buf-ssw-v5-0-0.png

    With regards to DbDirect, it is not supported on Ubuntu. At least not on v16.04. I didn't try any other version.

    Unless memory jumps between those 2-second marks and then goes back, I don't see memory going to the maximum in any lasting way. I'm not quite sure what you observed on your system.

    You can try running top on your system to see what gets captured. You can also change the refresh rate for top output with -d 1 for one second between updates.
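    To make that concrete, a batch-mode capture along the lines of the earlier command might look like this sketch. A background sleep stands in for the webalizer process, and the output file name is arbitrary:

```shell
# Two snapshots, one second apart, of a single process in batch mode;
# replace the background sleep with the actual webalizer command line.
sleep 3 &
pid=$!
top -b -d 1 -n 2 -p "$pid" > top.out

# top.out now holds two point-in-time views of that process's
# VIRT/RES/SHR memory and CPU columns.
```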

  43. StoneSteps repo owner

    After I sorted all config problems I made a first run. With "DNSLookups no" and "DNSChildren 2". So, no DNS lookup, but GeoIP lookup. Worked for some smaller log files, but failed for bigger ones. I have only very small or really big ones, so I cannot say which size might be a tipping point. The smallest of the "problem files" is a bit over 200 MB, the biggest 1,5 GB and over 5 million records. This large one has about 100.000 hosts according to the stats output.

    The run hangs at the "Using logfile" stats indefinitely until I kill it. I tried it several times. Same result. (There were also problems because of corrupted dns.db afterwards, obviously because of killing the processes. Expected. I removed the db for the next runs.)

    Then I tried with "DNSChildren 0". That worked. So I figured there might be some clash (thread safety?) between the 2 DNS workers. And, indeed, after I set to "DNSChildren 1" it ran thru. The run with the large file takes between 70 and 80 seconds. That's ok, I don't need a second worker for that. Nevertheless I wanted to report. I haven't tried with DNS resolution yet.

    I have a question about this also: I noticed that with the "DNSChildren 1" run (not with 0) the memory went down to "free 0", according to the output of the free command. That's because it filled buffers/cache up as much as it could. It also used 3/4 of the swap file. I still have more than 50% of RAM "available", so it's not a "real" problem. However, my monitoring is set to monitor the swap file because I have one or two programs that seldomly may have a problem indicated by them filling up the swap file really quickly. Of course, the monitoring cannot distiguish between these and ssw filling up the swap. I see that there are some options in the "Database Configuration Options" that might help here. But am not sure which to try. Like MemoryMode, DbTrickleRate, DbDSync, DbCacheSize. Which could I try to remedy this situation? First objective would be to avoid swapping at all, second objective would be to not fill buffers/cache that much. Thanks!

  44. StoneSteps repo owner

    There’s a Bitbucket bug that mixes up user names in comments, like the one above with the question about SSW locking up. Because of this bug, I didn’t realize that there was one more question here. My apologies for closing and not responding.

    It sounds like it’s issue #10, which has been addressed in v5.0.0. Let me know if you are running this version and it’s not working for you.
