Exports don't seem to die when aborted

David Platten

I've just noticed this for fluoroscopy xlsx exports. The abort button doesn't stop the export.

2018-12-13T08:56:33+00:00

Ed McDonagh reporter

Not sure what this was, or what you are experiencing now, but it's all getting revamped with my current branch.

Which version of celery are you running?

2018-12-13T08:59:07+00:00

David Platten

Celery version 4.0.0

2018-12-13T11:46:30+00:00

David Platten

I've just upgraded celery on my live Windows system to 4.2.1. Celery works (wasn't sure it would). However, I still can't cancel fluoroscopy export jobs, nor CT export jobs.

2018-12-13T12:32:59+00:00

Ed McDonagh reporter

The method currently in use has been removed for Celery 4. It still works on my systems, and I think I'm on 4. Might be different for Windows?

My branch has the new flower method of killing them (though doesn't currently delete the job in the database), and also changes to the new method for killing if you click on the existing button.

Not sure if that will work on celery 3 installs...

2018-12-13T12:45:04+00:00

David Platten

Celery 4.2.1 didn't work for me on Windows. It caused DICOM node queries to run multiple times. I've reverted back to 4.0.0 and the problem has gone away. I think it is this bug: https://github.com/celery/celery/issues/3430.

I think we should pin celery to version 4.0.0 otherwise users will run into this problem.

Rolling back to celery version 4.0.0 wasn't as easy as I thought either. After downgrading to celery 4.0.0 I received an error when trying to run celery:

ImportError: No module called async.timer

The above error was because the installed kombu package version (4.2.1) didn't match that of celery (4.0.0). I had to manually revert kombu to a point release that matched celery (kombu 4.0.2). Everything then worked again.

2018-12-13T15:34:06+00:00

Ed McDonagh reporter

changed milestone to 0.9.0

Should be resolved with issue ref ~~#705~~ changes.

2018-12-13T15:51:39+00:00

Ed McDonagh reporter

Hi @dplatten - did you check if killing tasks now works on Windows?

2019-01-25T08:57:10+00:00

David Platten

This doesn't work for me. I'm running 0.9.0b5 on my live system. Clicking the "Abort" button on an export has no effect - it continues. Clicking on "Terminate task" in the celery active task has no effect.

I'm running Celery with the following command:

celery worker -n default -P solo -Ofair -A openremproject -c 1 -Q default --pidfile=%celeryPidFile% --logfile=%celeryLogFile%

I assume that the updated code is included in 0.9.05b5?

2019-01-25T09:45:00+00:00

Ed McDonagh reporter

Ok. I'll have to get my head back into this and all you to test some things. Can you kill them from the flower interface?

I also need to do this to work out if we really need to ask for results or not, which would remove all those RabbitMQ tasks and might help with some other problems.

2019-01-25T09:54:54+00:00

David Platten

Trying to terminate from the Flower interface has no effect either. You get a "Success! Revoked ..." message at the top of the screen, but the export continues regardless.

2019-01-25T10:59:54+00:00

David Platten

The issue may be related to running Celery using the -P solo switch. See the last comment on this post: https://github.com/mher/flower/issues/169

2019-01-25T11:11:21+00:00

David Platten

I've just done a little testing of this on my Ubuntu system. Aborting exports doesn't work if "-P solo" is used in the Celery command; it works perfectly without this option.

2019-01-25T11:47:36+00:00

David Platten

The issue of -P solo tying things up is mentioned in this celery issue: https://github.com/celery/celery/issues/3768

May be fixed in Celery version 4.3...

2019-01-25T15:16:30+00:00

David Platten

I've tested the ability to terminate export jobs with each of the available execution pools. Termination only works when using the default -P prefork; termination does not work for -P solo, -P gevent or -P eventlet.

Tested on Ubuntu with Celery 4.2.1

2019-01-25T15:21:48+00:00

Ed McDonagh reporter

This may be more appropriate as a new issue, but I am trying to establish if we should be using a results backend, and if we should be using acks_late.

Initial investigations on Linux with Celery 3.1.19, default forking

With: Acks_late = True Results_backend=rpc

Current default

RabbitMQ interface shows number of task waiting and number of tasks being processed
RabbitMQ UID style queue appears immediately task is requested, with no messages waiting or tasks being processed. When task completes '1 message waiting'.
Flower interface shows tasks, terminate fine
Exports interface shows tasks, terminate fine

With: Acks_late = disabled Results_backend=rpc

RabbitMQ interface shows nothing for processing tasks (in default queue)
RabbitMQ shows number of tasks waiting to be passed to Celery as 'tasks being processed'
RabbitMQ UID style queue appears immediately task is requested, with no messages waiting or tasks being processed. When task completes '1 message waiting'.
Flower interface shows tasks, terminate fine
Exports interface shows tasks, terminate fine

With: Acks_late = disabled Results_backend=disabled

RabbitMQ interface shows nothing for processing tasks (in default queue)
RabbitMQ shows number of tasks waiting to be passed to Celery as 'tasks being processed'
RabbitMQ has no new UID style queues
Flower interface shows tasks, terminate fine
Exports interface shows tasks, terminate fine

Discussion

On keeping a results record

We don't currently make use of the results we are hanging on to
We never get() or forget() the results, so they build up unnecessarily

On using acks_late

This means the acknowledgement is not sent to RabbitMQ until the task is complete.
If the result never comes, in some circumstances it is resent, starting the task again.
If the task takes too long, in some circumstances it is resent (some Celery versions only?)

Proposal

If we can't think of a reason to keep the results backend, I propose we stop using it, then combine the status table and number of tasks waiting to be started from the current RabbitMQ page with the Flower page of current and completed tasks.

Thoughts?

2019-01-28T22:01:24+00:00

Ed McDonagh reporter

I've been trying to find good docs on solo etc, without much luck.

My current thinking is that solo is blocking, and as a result won't process the request for it to abort until it has finished the task, which kind of defeats the point.

So we either need to find an alternative way of killing things in Windows, or we need to reconsider whether it is necessary to use solo. I can't remember what the original problem was, but most of the discussions for other projects revolve around not getting results, which we don't make use of and might drop entirely...

Thoughts?

2019-01-29T09:15:57+00:00

David Platten

Hi Ed,

I think that setting acks_late was to avoid long-running tasks such as PACS queries from being run multiple times.

I'll test what you've suggested above on my Windows system.

As an aside, I've just down-graded Celery on my live system from 4.0.0 to 3.1.25, the last version that "supported" Windows. This has enabled me to remove the "-P solo", and increase the concurrency back to 4 ("-c 4").

Cancelling exports now works.

I am running Celery 3.1.25 with the following command:

celery worker -n default -A openremproject -c 4 -Q default --pidfile=%celeryPidFile% --logfile=%celeryLogFile%

Each time I cancel an export the following appears in my Celery default log file:

[2019-01-29 09:53:16,710: ERROR/MainProcess] Task remapp.exports.dx_export.exportDX2excel[2bffb1ff-7a13-424b-a752-3ed6cd2d95f0] raised unexpected: Terminated(-15,) Traceback (most recent call last): File "d:\server_apps\python27\lib\site-packages\billiard\pool.py", line 1678, in _set_terminated raise Terminated(-(signum or 0)) Terminated: -15

2019-01-29T09:54:18+00:00

David Platten

Being able to drop the -P solo with Celery 3.1.25 on Windows enables me to be running two things at once. For example, a PACS query can be going on at the same time as an export. Using "-p Solo" meant that if a PACS query was taking place then the system became unresponsive to exports etc until the query was complete. This is much better. So far...

2019-01-29T10:10:29+00:00

Ed McDonagh reporter

Oh. My reading of acks_late was that if the task ran too long, the acknowledgement (when the task finished) would not come before the timeout and that would cause the task to be run multiple times!

I'm not worried about the error message in the Celery log. I get the same in my log. I guess we should document it so no-one else worries.

Are you able to remind me why with Celery 4+ you had to use solo and lose the concurrency? (Other than it not being supported anymore.)

I am not sure we have a way of specifying in our requirements a different version of Celery depending on whether we were using Windows or Linux, and it would be a shame to hold the Linux users back to old versions of Celery. And we may run into trouble as we move to Python 3.5+ if the Celery version is too old (not sure when that becomes a problem).

2019-01-29T10:11:58+00:00

David Platten

Trying to run Celery >=4.0 on Windows using the default pre-fork pool results in an error because the multi-threading is incompatible with Windows. It simply won't run on Windows (https://github.com/celery/celery/issues/3196).

The workaround is to use "-P solo". However, "solo" is a blocking pool, preventing Celery tasks from being run in parallel: they run one after the other. If you have a DICOM query that takes an hour then any other Celery task has to wait for the query to finish first.

There's also a suggestion that you could use the gevent or eventlet pools. However, I've not had any joy in getting these to work on Windows.

there is a third suggestion (strategy 2 in the link below), which is to set an environment variable on Windows to make the prefork work. However, this did not work for me when I tried it (https://www.distributedpython.com/2018/08/21/celery-4-windows/)

2019-01-29T10:58:36+00:00

Ed McDonagh reporter

Ok. Awkward. I'll try that environment variable later.

2019-01-29T11:12:53+00:00

Ed McDonagh reporter

Environment variable didn't work for me. Something must have changed; let's drop that one.

Using the simple demonstration on the distributedpython website, solo and eventlet worked fine. gevent worked once I had pip installed it.

I can see indications that eventlet shouldn't be used for CPU bound long-running tasks (I think). But I can't find anything about gevent. I think I did see something somewhere, but I can't find it now.

I can't test them with OpenREM on my laptop because of the problems running RabbitMQ on a computer with the AD stealing the default directory variables. Maybe I should try it again, it has been a long time since I last tried!

2019-01-29T22:32:52+00:00

David Platten

I did try using gevent on a Windows system, but found that it blocked Celery tasks, much like using the solo switch. When using gevent the cancellation of exports did not work, probably as a result of this blocking behaviour. Perhaps I'll try again to double check.

2019-01-30T11:01:02+00:00

Ed McDonagh reporter

I am trying to get this going on my Windows laptop, currently battling with Powershell and RabbitMQ.

However, in the meantime I started Celery with -P solo and the status message that came up informed me that task events were off, and I should use "-E" to monitor events in this worker.

Is this something you have tried @dplatten to see if it makes a difference to killing tasks?

2019-01-30T22:33:05+00:00

David Platten

With Celery 4.2.1 on Windows, my live system:

I've just tried setting the "-E" switch with the "-P solo"; I am still unable to kill tasks.

I also tried setting concurrency to 4 ("-c 4") with "-E"; this didn't work either.

I've just switched back to Celery 3.1.25, which works well.

2019-01-31T08:56:35+00:00

David Platten

I commented out the CELERY_ACKS_LATE = True and CELERY_RESULTS_BACKEND = 'rpc://' on my live system a couple of days ago. I haven't encountered any problems with this. PACS queries have all completed; I can cancel export tasks; skin dose maps are calculated in the background.

I'm running:

OpenREM 0.9.0b5. Windows Server 2012 Celery 3.1.25 RabbitMQ 3.6.9 with Erlang 19.1

2019-01-31T09:01:37+00:00

Ed McDonagh reporter

My findings agree with yours @dplatten.

With: celery 4.2.1, billiard 3.5.0.5, kombu 4.2.1, amqp 2.3.2

solo works, but killing tasks is impossible and only one task can happen at a time
gevent works temporarily, but soon fails, only runs one task at a time (not sure why!), and killing tasks doesn't work.

With: celery-3.1.25, billiard-3.3.0.23, kombu-3.0.37, amqp-1.4.9