Site/device health monitoring

Issue #100 new
Alan Noble created an issue

There is a need to automatically monitor the health of a site and its devices. Each NetReceiver/VidGrind device has a corresponding uptime variable named <SiteKey>_<DeviceID>.uptime, which is updated with each /poll request from the device. The proposed approach is to periodically read the device uptime variables and check that their update times are current. This is how NetReceiver and VidGrind already calculate device status, the difference being that they calculate it on demand, whereas we need a task that runs automatically in the background.
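
As a rough illustration of the proposed check, the following Go sketch treats a device as healthy when its uptime variable was updated recently enough. The Variable type, the string site key and device ID, and the two-minute threshold are all assumptions made for the example; none of them are taken from the NetReceiver or VidGrind code.

```go
package main

import (
	"fmt"
	"time"
)

// Variable is a hypothetical stand-in for a stored uptime variable;
// the real NetReceiver/VidGrind type is not shown in this issue.
type Variable struct {
	Name    string    // e.g. "<SiteKey>_<DeviceID>.uptime"
	Updated time.Time // time of the last write triggered by a /poll request
}

// UptimeName builds the variable name in the form described in the issue.
func UptimeName(siteKey, deviceID string) string {
	return siteKey + "_" + deviceID + ".uptime"
}

// IsCurrent reports whether v was updated within maxAge of now.
func IsCurrent(v Variable, now time.Time, maxAge time.Duration) bool {
	return now.Sub(v.Updated) <= maxAge
}

// demoReport checks two made-up devices against a fixed reference time
// so that the result does not depend on when the program is run.
func demoReport() map[string]bool {
	now := time.Date(2020, time.January, 1, 12, 0, 0, 0, time.UTC)
	maxAge := 2 * time.Minute // assumed threshold, not from the issue
	vars := []Variable{
		{Name: UptimeName("site1", "deviceA"), Updated: now.Add(-30 * time.Second)},
		{Name: UptimeName("site1", "deviceB"), Updated: now.Add(-10 * time.Minute)},
	}
	report := make(map[string]bool, len(vars))
	for _, v := range vars {
		report[v.Name] = IsCurrent(v, now, maxAge)
	}
	return report
}

func main() {
	for name, ok := range demoReport() {
		fmt.Printf("%s healthy=%t\n", name, ok)
	}
}
```

Passing the current time in as a parameter, rather than calling time.Now inside IsCurrent, keeps the check easy to test.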

Previously, App Engine had a Task Queue API for this kind of thing, but this has been deprecated in favor of Cloud Tasks. In general, the web service implements a method that is invoked periodically by the task.

It is proposed to extend VidGrind’s API. The first step would be to implement a VidGrind method, such as /api/health, that reports the health of a site and/or device. The second step would be to invoke the method from the cloud task and report the results, e.g., via an email notification.

  • /api/health/SiteKey would report the health of all devices at a site.
  • /api/health/SiteKey/DeviceID would report the health of a specific device.
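
The two URL forms above could be routed along these lines. This is only a sketch using the Go standard library: parseHealthPath and healthHandler are hypothetical names, the JSON response shape is an assumption, and the actual uptime lookup is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// parseHealthPath splits "/api/health/SiteKey[/DeviceID]" into its parts.
// deviceID is empty when the request covers a whole site.
func parseHealthPath(path string) (siteKey, deviceID string, ok bool) {
	const prefix = "/api/health/"
	if !strings.HasPrefix(path, prefix) {
		return "", "", false
	}
	parts := strings.Split(strings.Trim(strings.TrimPrefix(path, prefix), "/"), "/")
	switch {
	case len(parts) == 1 && parts[0] != "":
		return parts[0], "", true
	case len(parts) == 2 && parts[0] != "" && parts[1] != "":
		return parts[0], parts[1], true
	}
	return "", "", false
}

// healthHandler is a hypothetical handler. A real one would read the
// uptime variables; this sketch only echoes the scope of the request.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	siteKey, deviceID, ok := parseHealthPath(r.URL.Path)
	if !ok {
		http.Error(w, "expected /api/health/SiteKey[/DeviceID]", http.StatusBadRequest)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]string{"site": siteKey, "device": deviceID})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/api/health/", healthHandler)
	// Exercise both URL forms in-process, without opening a network port.
	for _, p := range []string{"/api/health/site1", "/api/health/site1/deviceA"} {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, p, nil))
		fmt.Printf("%s -> %d %s", p, rec.Code, rec.Body.String())
	}
}
```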

Comments (10)

  1. kortschak

    Below is text copied/paraphrased/extended from email discussing approaches to this. Bringing it here for more general consideration.

    After reading the documentation for Cloud Tasks, and both reading the documentation for Google Scheduler and writing a basic reimplementation of it, I'm not sure about the relative merits of Google Scheduler and Cloud Tasks in this context. It looks to me like the App Engine tasks were more powerful than the Cloud tasks (at least I was unable to find a cron-like feature in the latter), and the Scheduler does nothing other than send a message (the behaviour of that service is pretty much exactly what the emulator, kortschak/scheduler, does, but with some more resilience, since the emulator does not attempt re-sends and so on).

    So the options seem to be:

    1. having a persistently running subscriber that waits for cron messages from the scheduler and then either does the health check or queues a health check task (the latter seems like an unnecessary indirection given the minimal time a health check takes);

    2. having a persistently running cron-aware service that does a health check (doable, but this shifts the cronspec configuration burden onto the application and away from Google Scheduler); or

    3. using Cloud Tasks to do both the health check and re-queue a health check task for the future (basically a re-entrant task). The last has all the additional configuration complexity with some additional brittleness spice.

    So the question here is: what is the downside of having a persistent subscriber? What are the costs of a service with close to zero CPU usage running 24/7/365? I imagine that GCP can provide good guarantees that the subscriber will be running at any given time, so this is not really a concern.

  2. Alan Noble reporter

    Given that we have a persistently running service in the form of VidGrind with a health check method, option #1 seems natural. That is, of course, what we’ve implemented using GAE cron jobs.

  3. kortschak

    The simplest approach then would be to add pubsub subscription logic to VidGrind that subscribes to a health cron topic and calls the back end of the health check endpoint. This adds more weight to VidGrind, but it's probably going to be simpler than having a separate service that hits VidGrind via its web API. This would require factoring some of the logic in the endpoint out into a more self-contained function.
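
A minimal sketch of that factoring, using only the Go standard library: one self-contained function holds the health check logic, and both the web handler and the subscriber call it. Everything here is hypothetical. cronMessage and the channel stand in for a real Pub/Sub subscription, carrying the site key in the message body is an assumption, and checkHealth returns a placeholder string instead of reading uptime variables.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// checkHealth is the factored-out core shared by both front ends.
// The real uptime lookup is omitted; this returns a placeholder result.
func checkHealth(siteKey string) string {
	return "health check ran for site " + siteKey
}

// healthHandler is the web API front end.
func healthHandler(w http.ResponseWriter, r *http.Request) {
	siteKey := strings.TrimPrefix(r.URL.Path, "/api/health/")
	fmt.Fprintln(w, checkHealth(siteKey))
}

// cronMessage is a hypothetical stand-in for a message received on the
// health cron topic. It is not a type from any Pub/Sub client library.
type cronMessage struct {
	Data []byte // assumed to carry the site key
}

// onCronMessage is the subscriber front end. It calls the same core
// function directly rather than going through the web API.
func onCronMessage(m cronMessage) string {
	return checkHealth(string(m.Data))
}

// subscribe handles messages until the channel is closed, passing each
// result to report (for example, a function that sends a notification).
func subscribe(msgs <-chan cronMessage, report func(string)) {
	for m := range msgs {
		report(onCronMessage(m))
	}
}

func main() {
	// Web path, exercised in-process.
	rec := httptest.NewRecorder()
	healthHandler(rec, httptest.NewRequest(http.MethodGet, "/api/health/site1", nil))
	fmt.Print("web: ", rec.Body.String())

	// Subscriber path, with a channel simulating the cron topic.
	msgs := make(chan cronMessage, 1)
	msgs <- cronMessage{Data: []byte("site1")}
	close(msgs)
	subscribe(msgs, func(result string) { fmt.Println("cron:", result) })
}
```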

  4. kortschak

    OK, that's what I thought.

    So the GAE cron jobs and Google Scheduler do different work then; GAE cron sends requests to a URL while Google Scheduler publishes topic messages. I can start on refactoring the health check endpoint so that we have a web API and a subscriber if that is accepted as an appropriate way forward.
