Celery 4.2.1 didn't work for me on Windows: it caused DICOM node queries to run multiple times. I've reverted to 4.0.0 and the problem has gone away. I think it is this bug: https://github.com/celery/celery/issues/3430.
I think we should pin Celery to version 4.0.0, otherwise users will run into this problem.
Rolling back to Celery 4.0.0 wasn't as easy as I thought, either. After downgrading I received an error when trying to run Celery:
ImportError: No module named async.timer
The error occurred because the installed kombu package version (4.2.1) didn't match that of Celery (4.0.0). I had to manually revert kombu to a point release that matched Celery (kombu 4.0.2). Everything then worked again.
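If we do pin, the requirements would need to hold both packages together, something like this (the exact pins are just my suggestion, based on the combination that worked for me):
celery==4.0.0
kombu==4.0.2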
This doesn't work for me. I'm running 0.9.0b5 on my live system. Clicking the "Abort" button on an export has no effect: the export continues. Clicking "Terminate task" against the active task in the Celery tasks view also has no effect.
I'm running Celery with the following command:
celery worker -n default -P solo -Ofair -A openremproject -c 1 -Q default --pidfile=%celeryPidFile% --logfile=%celeryLogFile%
I assume that the updated code is included in 0.9.0b5?
I've tested the ability to terminate export jobs with each of the available execution pools (see the command template below). Termination only works when using the default -P prefork; it does not work with -P solo, -P gevent or -P eventlet.
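For reference, each test was along these lines, i.e. the command from my earlier comment with only the pool switch changed:
celery worker -n default -P <pool> -Ofair -A openremproject -c 1 -Q default --pidfile=%celeryPidFile% --logfile=%celeryLogFile%
with <pool> replaced by prefork, solo, gevent or eventlet in turn.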
The RabbitMQ interface shows nothing under processing tasks (in the default queue)
The RabbitMQ interface reports the number of tasks waiting to be passed to Celery as 'tasks being processed'
RabbitMQ has no new UID-style queues
The Flower interface shows the tasks, and terminating them works fine
The Exports interface shows the tasks, and terminating them works fine
On keeping a results record
We don't currently make use of the results we are hanging on to
We never get() or forget() the results, so they build up unnecessarily
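For illustration, this is the sort of call we never make against the AsyncResult that delay() returns (some_task is a placeholder, not one of our real task names):
result = some_task.delay()        # returns an AsyncResult tied to the backend
value = result.get(timeout=300)   # we never collect the result...
result.forget()                   # ...and never discard it, so it just sits there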
On using acks_late
This means the acknowledgement is not sent to RabbitMQ until the task is complete.
If the acknowledgement never comes, in some circumstances the message is redelivered, starting the task again.
If the task takes too long, in some circumstances the message is redelivered before the task finishes (with some Celery versions only?)
If we can't think of a reason to keep the results backend, I propose we stop using it, and then combine the status table and the number of tasks waiting to be started from the current RabbitMQ page with the Flower page of current and completed tasks.
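If we do drop it, a minimal sketch, assuming we keep the old-style setting names, would be something like this in the project settings:
CELERY_IGNORE_RESULT = True        # stop storing results we never read
# CELERY_RESULT_BACKEND = 'rpc://' # removed/commented rather than pointed at rpc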
I've been trying to find good docs on solo etc., without much luck.
My current thinking is that solo is blocking, and as a result it won't process the request to abort until it has finished the current task, which rather defeats the point.
So we either need to find an alternative way of killing tasks on Windows, or reconsider whether using solo is necessary. I can't remember what the original problem was, but most of the discussion on other projects revolves around not getting results, which we don't make use of and might drop entirely...
I think that setting acks_late was to prevent long-running tasks such as PACS queries from being run multiple times.
I'll test what you've suggested above on my Windows system.
As an aside, I've just downgraded Celery on my live system from 4.0.0 to 3.1.25, the last version that "supported" Windows. This has enabled me to remove "-P solo" and increase the concurrency back to 4 ("-c 4").
Cancelling exports now works.
I am running Celery 3.1.25 with the same command as above, but without "-P solo" and with the concurrency raised to 4:
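celery worker -n default -Ofair -A openremproject -c 4 -Q default --pidfile=%celeryPidFile% --logfile=%celeryLogFile%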
Being able to drop "-P solo" with Celery 3.1.25 on Windows lets me run two things at once. For example, a PACS query can be going on at the same time as an export. Using "-P solo" meant that if a PACS query was taking place, the system became unresponsive to exports etc. until the query was complete. This is much better. So far...
Oh. My reading of acks_late was that if the task ran too long, the acknowledgement (sent when the task finished) would not arrive before the timeout, and that would cause the task to be run multiple times!
I'm not worried about the error message in the Celery log. I get the same in my log. I guess we should document it so no-one else worries.
Are you able to remind me why with Celery 4+ you had to use solo and lose the concurrency? (Other than it not being supported anymore.)
I am not sure we have a way of specifying in our requirements a different version of Celery depending on whether we are using Windows or Linux, and it would be a shame to hold the Linux users back to old versions of Celery. We may also run into trouble as we move to Python 3.5+ if the Celery version is too old (I'm not sure when that becomes a problem).
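For what it's worth, pip requirements files do support environment markers, so something along these lines might let us pin per platform (the versions are just for illustration):
celery==3.1.25 ; sys_platform == 'win32'
celery>=4.0 ; sys_platform != 'win32'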
Trying to run Celery >=4.0 on Windows using the default pre-fork pool results in an error, because the pre-fork pool's use of multiprocessing is incompatible with Windows. It simply won't run on Windows (https://github.com/celery/celery/issues/3196).
The workaround is to use "-P solo". However, "solo" is a blocking pool, preventing Celery tasks from being run in parallel: they run one after the other. If you have a DICOM query that takes an hour then any other Celery task has to wait for the query to finish first.
There's also a suggestion that you could use the gevent or eventlet pools. However, I've not had any joy in getting these to work on Windows.
The environment variable didn't work for me. Something must have changed; let's drop that one.
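For anyone else testing, I'm assuming the variable meant here is FORKED_BY_MULTIPROCESSING from the distributedpython write-up, set before starting the worker:
set FORKED_BY_MULTIPROCESSING=1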
Using the simple demonstration on the distributedpython website, solo and eventlet worked fine. gevent worked once I had pip-installed it.
I can see indications that eventlet shouldn't be used for CPU-bound, long-running tasks (I think), but I can't find anything about gevent. I think I did see something somewhere, but I can't find it now.
I can't test them with OpenREM on my laptop because of the problems running RabbitMQ on a computer where Active Directory redirects the default directory variables. Maybe I should try it again; it has been a long time since I last tried!
I did try using gevent on a Windows system, but found that it blocked Celery tasks, much like the solo pool. When using gevent, cancelling exports did not work, probably as a result of this blocking behaviour. Perhaps I'll try again to double-check.
I commented out CELERY_ACKS_LATE = True and CELERY_RESULT_BACKEND = 'rpc://' on my live system a couple of days ago. I haven't encountered any problems with this: PACS queries have all completed, I can cancel export tasks, and skin dose maps are calculated in the background.
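That is, in the project settings the two lines are now simply:
# CELERY_ACKS_LATE = True
# CELERY_RESULT_BACKEND = 'rpc://'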
Windows Server 2012
RabbitMQ 3.6.9 with Erlang 19.1