Solr Fails When Starting Before Zookeeper in Cloud Mode

Issue #26 resolved
Full name created an issue

In our setup, we run Solr in cloud mode (3 nodes), with Zookeeper installed on each server that Solr is installed on. We have observed that, seemingly at random, Solr on one of the 3 nodes (usually exactly one, and usually the same one) will start, but visiting the admin panel or making any kind of request returns an HTTP 404 error code:

2021-03-10 00:00:00.000 ERROR (qtp2051853139-16) [   ] o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down.
2021-03-10 00:00:00.000 WARN  (qtp2051853139-16) [   ] o.e.j.s.HttpChannel /solr/custom-collection/select
javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:146) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.rewrite.handler.RewriteHandler.handle(RewriteHandler.java:335) ~[jetty-rewrite-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.Server.handle(Server.java:502) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:364) [jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:260) [jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.io.AbstractConnection$ReadCallback.succeeded(AbstractConnection.java:305) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.io.FillInterest.fillable(FillInterest.java:103) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.io.ChannelEndPoint$2.run(ChannelEndPoint.java:118) [jetty-io-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.runTask(EatWhatYouKill.java:333) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.doProduce(EatWhatYouKill.java:310) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.tryProduce(EatWhatYouKill.java:168) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.strategy.EatWhatYouKill.run(EatWhatYouKill.java:126) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.ReservedThreadExecutor$ReservedThread.run(ReservedThreadExecutor.java:366) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:765) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:683) [jetty-util-9.4.14.v20181114.jar:9.4.14.v20181114]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
Caused by: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359) ~[?:?]
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:341) ~[?:?]
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1602) ~[jetty-servlet-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:540) ~[jetty-servlet-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:146) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:548) ~[jetty-security-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:132) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:257) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:1588) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextHandle(ScopedHandler.java:255) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1345) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:203) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:480) ~[jetty-servlet-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:1557) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.nextScope(ScopedHandler.java:201) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1247) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:144) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:220) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:126) ~[jetty-server-9.4.14.v20181114.jar:9.4.14.v20181114]
        ... 17 more

A quick search suggests this is typically related to a missing or corrupt solr.xml file. In our case the file does exist, and Solr works if I simply restart the Solr service. The larger issue is that the entire cluster appears to stop working as a result of this one node returning a 404 error code.

Some interesting points to note: we saw this issue many months ago on another server in a completely different cluster, but it disappeared after some time and has now appeared in this cluster. We are using Solr version 7.7.2 (moving data after an upgrade is not yet handled by this module, but that is a separate issue) and Zookeeper version 3.4.5.

I tried to reproduce this issue on a standalone VM and was able to do so between 2 reboots. I then added zookeeper-server to the After= directive of the solr.service file, which stopped the issue from recurring. Because the failure is intermittent, I cannot say with absolute confidence that it is completely resolved, but I am relatively confident. If someone else could try to reproduce it or give their input, that would be appreciated. If the issue is solved in a later version of Solr, that would be good to know as well, since I could not find it mentioned anywhere else.

In the meantime, here is a patch that adds a parameter for the name of the Zookeeper service. Its only use is to add that name to the After= directive in the solr.service file.

diff --git a/manifests/config.pp b/manifests/config.pp
index 3e0a380..841499a 100644
--- a/manifests/config.pp
+++ b/manifests/config.pp
@@ -105,6 +105,7 @@ class solr::config {
         solr_port    => $solr::solr_port,
         solr_bin     => $solr::solr_bin,
         solr_env     => $solr::solr_env,
+        zk_service   => $solr::zk_service,
       }),
       require => File[$::solr::solr_env],
     }
diff --git a/manifests/init.pp b/manifests/init.pp
index 3fb8124..5f02928 100644
--- a/manifests/init.pp
+++ b/manifests/init.pp
@@ -79,6 +79,10 @@
 # @param zk_hosts
 #   For configuring ZooKeeper ensemble.
 #
+# @param zk_service
+#   If Zookeeper is running on this node, ensure the Solr service starts
+#   after zk_service.
+#
 # @param log4j_maxfilesize
 #   Maximum allowed log file size (in bytes) before rolling over.
 #   Suffixes "KB", "MB" and "GB" are allowed.
@@ -163,6 +167,7 @@ class solr (
   Hash              $cores                           = {},
   Array             $required_packages               = $solr::params::required_packages,
   Optional[Array]   $zk_hosts                        = undef,
+  Optional[String]  $zk_service                      = undef,
   String            $log4j_maxfilesize               = '4MB',
   String            $log4j_maxbackupindex            = '9',
   Variant[
diff --git a/templates/solr.service.epp b/templates/solr.service.epp
index 8ccd4c6..29ac060 100644
--- a/templates/solr.service.epp
+++ b/templates/solr.service.epp
@@ -3,6 +3,7 @@
   String $solr_port,
   String $solr_bin,
   String $solr_env,
+  Optional[String] $zk_service,
 |-%>
 ####################################################################
 #### NOTE: THIS FILE IS PUPPET CONTROLLED - ANY CHAGES WILL BE LOST
@@ -10,7 +11,7 @@

 [Unit]
 Description=Apache SOLR
-After=syslog.target network.target remote-fs.target nss-lookup.target
+After=syslog.target network.target remote-fs.target nss-lookup.target <%- unless $zk_service == undef { %> <%= $zk_service %><% } %>

 [Service]
 PIDFile=<%= $solr_pid_dir %>/solr-<%= $solr_port %>.pid

Note that this patch will likely conflict with my previous patch for #24. I have pushed the code with both patches attached to my repository's dev branch.
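
For anyone who wants to try the patch, a minimal usage sketch follows. The host names and the unit name `zookeeper-server.service` are assumptions for illustration, not values taken from this module; use whatever your Zookeeper package actually installs. As far as I know, systemd expects the full unit name, including the `.service` suffix, in After=.

```puppet
# Hypothetical node declaration: host names and unit name are assumptions.
class { 'solr':
  # Zookeeper ensemble used by Solr in cloud mode (example hosts only).
  zk_hosts   => ['zk1.example.com:2181', 'zk2.example.com:2181', 'zk3.example.com:2181'],
  # Local Zookeeper unit that solr.service should be ordered after.
  zk_service => 'zookeeper-server.service',
}
```

With this value, the patched template should render the unit's ordering line roughly as `After=syslog.target network.target remote-fs.target nss-lookup.target zookeeper-server.service`. When `zk_service` is left undef, the line should stay as it was before the patch. Note that After= only orders startup; it does not by itself pull in or require the Zookeeper unit.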

Comments (4)

  1. Michael Speth

    Thank you for your submission. I will add this change, as Zookeeper is a common enough tool that others might encounter this problem.
