Multiple background tasks cause memory and PostgreSQL issues

Issue #963 resolved
David Platten created an issue

When using a Windows server, each new background task spawns new Python and PostgreSQL instances, increasing memory usage. These new instances are spawned at the point that the task is queued, not just when the task is due to run. When a large number of tasks arrive in OpenREM, such as when querying a PACS for radiographic images, the queue becomes large and server memory use becomes very high (> 90% on my 8 GB server). On my test system this also causes failure of the PostgreSQL server because it runs out of connections. The OpenREM interface becomes unresponsive and I have to connect to the server remotely to restart the PostgreSQL service and then terminate each task in the web interface.

Comments (124)

  1. David Platten reporter

    I have uncommented the following four lines of my postgresql.conf file and modified the values to reflect the number of processors available on my server (4).

    max_worker_processes = 4        # (change requires restart)
    max_parallel_workers_per_gather = 2 # taken from max_parallel_workers
    max_parallel_maintenance_workers = 2    # taken from max_parallel_workers
    max_parallel_workers = 4        # maximum number of max_worker_processes that
                                    # can be used in parallel operations
    

    I’ve just run a radiographic PACS query that retrieved 233 radiographic studies without an issue.

    I think that the auto-refresh of the tables on the task_admin page may be adding to the load on the server. As it stands, the active tasks table is updated every 2 seconds, the recent tasks table every 5 seconds and the older tasks table every 11 seconds. I think each of these updates requires queries of the database; the time between updates could be increased to ease the database load.

  2. David Platten reporter

    Removed the atomic transaction from the dx extractor as I think this may be responsible for the issues I have had with my live test system. Refs issue #957 and refs issue #963

    → <<cset f6f477da9d82>>

  3. David Platten reporter

    I have reverted my postgresql.conf file back to the original state, commenting out the four lines I activated on the 26th January. It is now using the default configuration file.

  4. Ed McDonagh

    Removed “Windows” from the title.

    Attempted to import a lot of CT/DX/MG/RF/NM studies at once, via Orthanc and a QR, and managed to effectively crash my linux server.

    I am getting this error in /var/log/syslog:

    django.db.utils.OperationalError: connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: FATAL: remaining connection slots are reserved for non-replication superuser connections

    This needs some work, and is almost certainly related to the comment Jannis left in the background module. I wonder if @Kevin Schärer might be able to look at this?

  5. Kevin Schärer

    @Ed McDonagh - the main problem I see is what David mentioned before: all processes are instantiated even though they cannot all be processed at the same time, so a lot of memory is used and many DB connections are active, which leads to your issue. I think we need something like a task queue / manager which only instantiates and runs a couple of processes at a time. One could probably create a simple task queue for this project, or use a module such as Huey, which is lightweight and looks quite promising.

    I assume that this approach would solve both problems and also allow more stable scaling with many concurrent tasks.
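    The bounded-pool idea can be illustrated with a stdlib-only sketch (an illustration of the pattern only, not OpenREM's or Huey's code): a fixed number of worker threads drain a shared queue, so at most `max_workers` tasks are ever live at once, however many are queued.

```python
import queue
import threading


def run_pool(tasks, max_workers=2):
    """Run queued callables with at most max_workers concurrent workers,
    instead of spawning one process per queued task."""
    q = queue.Queue()
    results = []
    lock = threading.Lock()

    for t in tasks:
        q.put(t)

    def worker():
        while True:
            try:
                task = q.get_nowait()
            except queue.Empty:
                return  # queue drained, worker exits
            value = task()  # only max_workers tasks run at any moment
            with lock:
                results.append(value)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return results
</imports>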

  6. Ed McDonagh

    Do you know if it works on Windows? The approach @Jannis Widmer implemented was to allow us to remove Celery (and RabbitMQ) from the stack to enable continued support on Windows. I don’t think Redis works on Windows, and I don’t know about the other options for Huey.

  7. Ed McDonagh

    I’ve just tried the very simple adding demo on Windows with SQLite3, and it works to that level at least!

    There are advantages to being able to restrict the system to only processing one import of any type at a time - it solves a lot of race conditions that otherwise can occur when two objects from a single study are being imported. So for that reason (and the fact that we lost Celery and RabbitMQ and Erlang) I was very pleased with what Jannis did for us. And of course it works on Windows.

    I wonder if it makes sense to collect incoming import requests and maybe some other tasks with Huey, then feed the background module from that queue?

  8. Kevin Schärer

    @Ed McDonagh - I am quite interested in this topic. I will work on a proposal following your last comment.

  9. Kevin Schärer

    @Ed McDonagh - I figured out that if we declare the run_as_task method as a (DB) Huey task, then we can implement the task queue easily without losing the atomic access functionality when importing studies.

  10. Ed McDonagh

    Sounds good, thank you. Can you implement this in a branch from develop please? We need to have a satisfactory solution to this before we release 1.0.

  11. Kevin Schärer
    • Adds Huey (lightweight task queue)
    • Declares run_as_task as Huey DB task

    • Currently known limitations: Study imports from scripts are not possible, since Django is not initialized

    (Refs #963)

    → <<cset b3b5e9741c57>>

  12. Ed McDonagh

    Tried using this with DICOM query-retrieve to import from a PACS, and after lots of file permission issues with the database files (easily fixed) the web interface tells me that it can’t start the job, and the command line interface seems to hang.

    I haven’t had the time to investigate further at this stage.

    I had merged the latest commit a888f73 with my 812 branch.

  13. Kevin Schärer

    @Ed McDonagh - I’ve built the Docker image on the issue963TaskQueue branch and run a simple study import, which was successful. Currently, one needs to start the consumer manually, which is not really persistent. I suggest that the consumer will run in a separate container, where both containers (consumer and django-openrem-app) will have access to a shared SQLite file.

    For the native installs, one still needs to start the consumer manually - or presumably via a script…

  14. Ed McDonagh

    For Docker, there doesn’t appear to be a huey image - I guess having a docker container is a little heavy for a lightweight task manager!

    So I think we’d need to create as lightweight a container as possible, beginning with a Python 3 container?

  15. Kevin Schärer

    My idea would have been to spin up a second container using the same image we use for the Django app: the consumer needs the same config and environment as the app itself, since it has to execute the function we have passed in. The only difference would be the entrypoint. I’ve found an example (see docker-compose.yml) which does more or less the same.
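    A minimal sketch of that layout (service names, image name and paths here are assumptions for illustration, not OpenREM's actual compose file):

```yaml
# docker-compose.yml sketch: two services share one image and one volume;
# only the command/entrypoint differs.
services:
  openrem:
    image: openrem/openrem
    command: gunicorn openremproject.wsgi
    volumes:
      - huey_db:/huey            # shared SQLite location for the queue
  huey-consumer:
    image: openrem/openrem       # same image, config and environment
    command: python manage.py run_huey
    volumes:
      - huey_db:/huey
volumes:
  huey_db:
```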

  16. Kevin Schärer

    Actually, stopping a task which has already started executing is not supported via the Huey API. But since the PID is stored in the BackgroundTask object, we can send a signal to kill this task when needed.
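    The kill-by-PID approach can be sketched with the stdlib (a stand-in child process here, not OpenREM's code; in the real case the PID would come from the BackgroundTask object):

```python
import os
import signal
import subprocess
import sys

# Spawn a long-running child as a stand-in for a background task whose
# PID has been recorded (e.g. in a BackgroundTask row).
proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# To abort the task later, send a termination signal to the stored PID.
# (On Windows, os.kill supports only a limited set of signals.)
os.kill(proc.pid, signal.SIGTERM)
proc.wait()
```

    Note this only works when each task runs in its own process; threads within one consumer share the consumer's PID, which becomes relevant later in this thread.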

  17. Kevin Schärer

    Ok, tasks can now be purged when the worker_type is set to process. That will work on native installs, and when the consumer is in the same container as the Django app itself, but not when we split the two up; we would then need additional overhead to handle this, which I think is pretty bad…

  18. Kevin Schärer

    The latest commit (73a9a7e) solves this problem, since Django and the Huey consumer now run in the same container. Furthermore, the Huey consumer will be started automatically. Since the supervisor will start Django, the docker-compose.yml file needs an adjustment, i.e. deleting the command entry.

  19. Ed McDonagh

    @Kevin Schärer - does Huey make any attempt to wait for background_task to return before throwing the next one in? If I start a QR task with lots of imports to be done, I am still, with Huey, getting errors from the postgres database running out of slots:

    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]: Traceback (most recent call last):
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:   File "/var/dose/veopenrem3/bin/openrem_rdsr.py", line 19, in <module>
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:     default_import.default_import(
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:   File "/var/dose/veopenrem3/lib/python3.10/site-packages/openrem/remapp/tools/default_import.py", line 46, in default_import
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:     wait_task(t)
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:   File "/var/dose/veopenrem3/lib/python3.10/site-packages/openrem/remapp/tools/background.py", line 232, in wait_task
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:     if task.get():
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:   File "/var/dose/veopenrem3/lib/python3.10/site-packages/huey/api.py", line 969, in get
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]:     raise TaskException(result.metadata)
    Feb  8 14:34:28 frp-openrem-load Orthanc[1547356]: huey.exceptions.TaskException: OperationalError('connection to server on socket "/var/run/postgresql/.s.PGSQL.5432" failed: 
    FATAL:  remaining connection slots are reserved for non-replication superuser connections\n')
    

    If I attempt to run a query from the web interface, I get a message saying the job could not be started - but the job runs anyway!

  20. Kevin Schärer

    @Ed McDonagh - I think the issue is that each task queues the import and then waits in a while loop until it has finished. Before commit 88073d2, a database query was still happening in that loop, thus requiring a database connection. I have replaced this.
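    The pattern being described here - waiting on an in-memory result handle instead of polling the database - can be illustrated with the stdlib (this uses concurrent.futures for illustration, not Huey's API; the function name is made up):

```python
import time
from concurrent.futures import ThreadPoolExecutor


def import_study(name):
    time.sleep(0.1)  # stand-in for the actual import work
    return f"{name}: done"


# Instead of looping over a database query to see whether the task has
# finished (which holds a connection per waiter), block on the future.
with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(import_study, "study-1")
    result = future.result()  # waits in memory, no database round trips
```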

  21. Ed McDonagh

    Thanks Kevin. With the update, there were no error messages regarding the postgres database.

    The web interface is still not working though - it says there was a problem starting the job. Thanks to the DICOM Query Summary Jannis added, you can still find the job, which is running, and execute the move when the find is complete.

  22. Kevin Schärer

    @Ed McDonagh - there was a problem where the backend tried to access a uuid member on the Result object (a Huey class), which does not have such a member; it has an id member instead.

  23. Ed McDonagh

    Thanks Kevin - web interface QR now starts as expected.

    Just looking through some of my errors - can I check that the num_of_task_type is still respected with this method? I’m assuming it is, but just want to rule it out as a source of an error I’m seeing.

    Also, would it be possible for the Huey info message to have the task_type information in?

  24. Kevin Schärer

    @Ed McDonagh - Sorry for the delay in getting back to you.

    I have not changed the code which checks num_of_task_type, so currently when two tasks with the same task_type are being executed, one of them (if the variables are set accordingly) will wait in a while loop, blocking its worker. I don’t know whether there is a way the Huey consumer can take over this conditional execution so that the worker could execute another task instead of waiting.

    From the Huey API you can get the arguments passed to the function being executed, so the task_type will be included as well. I’ll add this to the info message.
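    The per-type limit described above can be sketched with a semaphore per task type (names and limits here are illustrative, not OpenREM's actual code; note that a waiting worker is blocked, exactly as described):

```python
import threading

# Hypothetical per-type concurrency limits, mirroring the idea behind
# num_of_task_type.
TASK_TYPE_LIMITS = {"export": 1, "import": 2}
_slots = {t: threading.Semaphore(n) for t, n in TASK_TYPE_LIMITS.items()}


def run_limited(task_type, func, *args):
    """Block until a slot for this task_type is free, then run func.

    While blocked on the semaphore, the calling worker cannot pick up
    any other task - the limitation discussed in this comment.
    """
    with _slots[task_type]:
        return func(*args)
```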

  25. Ed McDonagh

    Thanks for adding this in Kevin. I’ll test it later (I need to reconfigure a remote system first).

    I think you have sorted out running Huey in the docker container. Presumably for Ubuntu we’d add a systemd service, same as gunicorn?

    In Windows, I think Django is managed by IIS running the wfastcgi application - how would we manage huey on that platform?

  26. Ed McDonagh

    Hi @Kevin Schärer - initial tests indicate that the activity is now missing from the QR page - the query starts but there is no evidence on that page. And if you go looking for it on the DICOM Query Summary page it is there, but when the query finishes no Move button appears.

    I haven’t ruled out javascript caching - might I need to purge my cache?

    On the Huey info messages, I can’t see any change:

  27. Kevin Schärer

    Yeah, I think for Linux we may create a systemd service, which should work.

    For Windows I have my concerns. I have tested a native install (of the task queue branch) on a Windows Server 2022 machine. Everything works except the Huey consumer when the worker type is set to process. Fortunately the worker type thread works as expected. The only downside with this setup is that threads cannot be terminated, so aborting tasks will not work - or worse, will kill the consumer completely, since all threads have the same PID as the consumer itself.

  28. Kevin Schärer

    Concerning the QR page, I’ll look into it tomorrow. I don’t think that the cache should play a role here, since I have not changed JS code for this page.

  29. Kevin Schärer

    For Windows native installs, I’ve found a way which allows us to terminate a task, although it is a bit hacky.

    The solution is to spawn each worker as a service which will execute the queued tasks - this would also be persistent across system restarts. To have enough permission to kill those processes from the Django application, we would need to create a new local user account (under Windows) and set both the service user and the IIS Application Pool Identity to use this account. A local non-admin account would also be more secure than setting the identity to Local System, which has higher privileges. Currently the IIS Application Pool Identity is set to ApplicationPoolIdentity, which is more secure but prevents the Django app from killing a huey process at all.

    What do you think @Ed McDonagh ?

  30. Kevin Schärer

    As for the QR page, it should be working again. I now also realise what you actually meant about the huey info message... 😅 Unfortunately, these messages cannot be adjusted, as they are completely defined within the huey module.

  31. Ed McDonagh

    Why does Windows have to give me such headaches!

    I think it is not unreasonable for a Windows server to have service accounts running things. Whether they are local service accounts or domain service accounts created for this server would be for local implementation to choose. It does make the installation (even) more involved, but I think it isn’t unreasonable.

    What do you think @David Platten ?

  32. Kevin Schärer

    I’ve forgotten to reference the issue in the latest commit (ee976b2)

    I've adjusted the settings for Windows to allow better control of tasks, as suggested earlier. I also started to update the documentation for native Windows installations to reflect the latest changes.

  33. Ed McDonagh

    It seems to start by importing background and falls over on the new get_queued_tasks function:

    WARNING: autodoc: failed to import module 'background' from module 'remapp.tools'; the following exception was raised:
    Traceback (most recent call last):
      File "/home/docs/checkouts/readthedocs.org/user_builds/openrem/envs/issue963taskqueue/lib/python3.8/site-packages/sphinx/ext/autodoc/importer.py", line 58, in import_module
        return importlib.import_module(modname)
      File "/home/docs/checkouts/readthedocs.org/user_builds/openrem/envs/issue963taskqueue/lib/python3.8/importlib/__init__.py", line 127, in import_module
        return _bootstrap._gcd_import(name[level:], package, level)
      File "<frozen importlib._bootstrap>", line 1014, in _gcd_import
      File "<frozen importlib._bootstrap>", line 991, in _find_and_load
      File "<frozen importlib._bootstrap>", line 975, in _find_and_load_unlocked
      File "<frozen importlib._bootstrap>", line 671, in _load_unlocked
      File "<frozen importlib._bootstrap_external>", line 783, in exec_module
      File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
      File "/home/docs/checkouts/readthedocs.org/user_builds/openrem/checkouts/issue963taskqueue/openrem/remapp/tools/background.py", line 330, in <module>
        def get_queued_tasks(task_type=None) -> list[QueuedTask]:
    TypeError: 'type' object is not subscriptable
    

    I don’t know if it is the typing, or something else? I haven’t looked at it properly.

    Anything jump out to you Kevin?
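    It does look like the typing: subscripting the built-in `list` in annotations (PEP 585) is only valid at runtime from Python 3.9, so evaluating the annotation under the 3.8 docs build raises exactly this TypeError. Two 3.8-compatible spellings exist: add `from __future__ import annotations` at the top of the module, or use `typing.List`, as in this sketch (with a stand-in `QueuedTask` class):

```python
from typing import List, Optional


class QueuedTask:  # stand-in for the real class
    pass


# Runs on Python 3.8+: typing.List instead of the built-in generic.
def get_queued_tasks(task_type: Optional[str] = None) -> List[QueuedTask]:
    return []
```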

  34. Ed McDonagh

    Great. That makes sense, and means we can have the type hinting back in! I think I’ve made the decision to use Python 3.8 to 3.10 for this release, so we should assume it could be any of these. I have separate Python 3.8 and 3.9 pipelines to repeat the tests in those environments, but they don’t run unless we kick them off - presumably if we had, we’d have seen the same issue. In fact, I might just do that!

  35. Ed McDonagh

    Bother, I can’t unless I faff around creating a branch from before you reverted the type hinting, and I’ve got studying to do!

  36. Kevin Schärer

    While updating the native Linux installation instructions and installing OpenREM on an Ubuntu 22.04 machine, I discovered a potential problem with the SQLite database created by huey. Due to a bug in SQLite, there is no way to set the group write flag when creating the database file. So when the service creates this file, the www-data user can use the SQLite database normally, while a logged-in non-root user trying to import a study locally via a script will get a permission error, as the permissions of the database files are all set to 0o644.

    The simple solution would be to first create an empty database file and set the permissions accordingly. However, SQLite will create and remove some helper files on the fly, whose permissions will still be set to 0o644 on creation, eventually causing the same error.

    Windows is not affected by this issue, which makes me wonder whether we should switch to Redis for Linux installations.

    Docker is also not affected, but maybe it would make sense to switch to redis there as well, which should be as simple as spinning up a container with the pre-built redis image?

    (@Ed McDonagh )

  37. Ed McDonagh

    I had a lot of trouble getting SQLite3 working on my test system - all permission based. I was assuming I had just missed something and a good set of instructions would sort it out. Maybe it is more complicated than that.

    How simple is the redis setup on Linux? I think the divergence between the systems is ok on this front.

  38. Kevin Schärer

    I’ve already got it working, so it is quite simple. One only needs to install redis via apt - the server starts without any further configuration.

    I’ll look into securing the redis instance such that it won’t be accessible from outside the server - either via firewall or with a simple configuration.

    All in all, the additional installation overhead should not be too bad.
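    For reference, locking Redis down to the local machine is a matter of a couple of redis.conf directives (on Ubuntu these are the shipped defaults; shown here as a sketch of what to check rather than changes that are necessarily required):

```conf
# /etc/redis/redis.conf - keep Redis reachable from localhost only
bind 127.0.0.1 -::1
protected-mode yes
port 6379
```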

  39. Kevin Schärer

    Ok, I’ve now adapted the docs for native Linux installations and pushed a new branch on the docker repository, which both include Redis as backend instead of SQLite3.

    I've done a rudimentary test of both methods, which worked straight away.

  40. Ed McDonagh

    Hi @Kevin Schärer - does adding redis to the requirements file cause an issue for Windows installs? Does it just install and then get ignored?

  41. Kevin Schärer

    Hi @Ed McDonagh - Before I added redis as a dependency on the list, I tried installing it on Windows, and it installed seamlessly. It is also stated on PyPI that the package is OS-independent.

    After installation, however, it is not used on Windows, where we still use SQLite.

  42. Kevin Schärer

    @Ed McDonagh - All in all, I think this problem is on its way to being resolved - or is there anything still missing or not working as intended?

  43. Ed McDonagh

    I agree Kevin, thank you. I need to do some more testing on Linux, I’m hoping that @David Platten is going to be able to test on Windows, and I should use it with Docker too.

    Can you make sure black is happy, then create a PR so that Codacy can go over it and we can review it once more.

  44. Ed McDonagh

    Kevin, can we get any indication of how many tasks Huey/Redis (on Linux/Docker) or Huey/SQLite3 are piling up?

  45. Kevin Schärer

    @Ed McDonagh - at openrem/tasks/tasks/task_admin/ there is a new panel called Queued Tasks where all tasks awaiting execution are displayed. There you can also remove a task from the queue.

  46. Ed McDonagh

    How did you get those lined up, Kevin? I have tried and tried, but they all keep going into Active Tasks. 🤔

  47. Kevin Schärer

    @Ed McDonagh - Depending on how many CPU cores you have on your system, it will run that many tasks in parallel. For my testing setup, I am using the following Bash script:

    #!/bin/bash
    for i in {1..100}
    do
       curl -X POST  http://localhost:8000/openrem/export/ctxlsx1/0/0/? -b 'csrftoken=<YOUR-CSRFTOKEN>' -b 'sessionid=<YOUR-SESSIONID>'
    done
    

    (replace both CSRF-token and session id accordingly; adjust the upper boundary if you happen to have something like a Threadripper CPU…)

    I also have several hundred study entries, so the export actually takes some time.

  48. Ed McDonagh

    That worked for me, once I’d limited the CPU to 2 rather than 32. I hadn’t limited the container’s resources till now.

    I think now it just needs a bit more testing as I suggested before then we can merge it in.

    Thanks Kevin.

  49. David Platten reporter

    How would I get this branch to work (for testing purposes) on my Windows 10 laptop, particularly the huey task queue bit?

  50. Kevin Schärer

    @David Platten - If you do not need a persistent setup, which I assume, then you can start a single huey consumer by running:

    python manage.py run_huey
    

    This will spawn a worker which executes one task after another.

  51. Ed McDonagh

    David, are you thinking about running OpenREM on a Windows 10/11 desktop for production or for ‘test’ purposes?

    For production use, aside from the warning at the top of the instructions that this isn’t recommended, and a few minor differences with IIS etc, are there any other differences? Are there differences for service accounts?

    For ‘test’ use, ie using manage.py runserver, there would presumably be a few places we’d need to give alternative instructions? If that is the aim, I’m not sure if it is better to have a second document saying - follow the instructions, but skip these bits, or to have a load of note boxes in the main document?

  52. Kevin Schärer

    @David Platten - I agree. I've added the description. To spin up multiple workers, we need to call run_huey in a separate console for each. This is because Windows only allows threads as a worker type, not processes. Since threads borrow the PID from their parent process, it is not possible to stop them individually when one huey consumer process has multiple workers. Therefore, each worker gets its own consumer process, which allows stopping them individually.

  53. Ed McDonagh

    I think you should move the admonition box to just under the Task Queue title, because the sentence before and after the box flow from each other, so they work better if they are together (after the box).

    Also, can you use Task queue rather than Task Queue - I made the stylistic decision not to use capital initials in titles (except for proper nouns etc).

  54. David Platten reporter

    @Kevin Schärer I used python manage.py run_huey -w 2 to run two workers in parallel - easier than running another instance in a separate console.

  55. Kevin Schärer

    @David Platten - Have you tried killing one task when there are two tasks executing at the same time? In my tests, I killed both tasks when running the worker this way.

  56. Kevin Schärer

    @David Platten - Actually, you cannot kill a task at all because the console where the run_huey command is called is somehow blocking it.

    So, I think we need to use the script on Windows 10 & 11 too, to spawn the workers in the background. This also allows a stopped worker to be restarted automatically when a task is aborted in the frontend.

  57. David Platten reporter

    For my local testing the issue I have with the script approach is that I don’t have permission to create new users on my work laptop, and don’t have permission to create new services. manage.py run_huey works for me as a way to test things, but I wouldn’t want to rely on it outside a testing environment.

  58. Ed McDonagh

    Hi @Kevin Schärer , for the docker install I have now got it working using your openrem/docker compose file etc (and following my own initial steps in the instructions which I forgot last time 🤦‍♂️ )

    Testing with 2 CPU allocated, and throwing lots of DICOM stores to the server in a short time, showed them all piling up in ‘active tasks’ not ‘queued tasks’. I tried the same with firing off lots of CT exports in quick succession, and again they piled up in ‘active tasks’. In both cases it appeared that two were being processed whilst the others were ‘not started’. In the import experiment, of the two started ones, I think it was only importing one at a time which was good to see.

    Have you been able to test this yourself with Docker?

  59. Kevin Schärer

    Hi @Ed McDonagh - I’ve just tested it and I only see 8 active tasks (because there are 8 threads available) and the rest is queued for execution. I don’t really know what is causing this at your end. Where have you set the limitation of 2 CPU cores? In the docker-compose.yml file or in the settings.py file?

    As for the slow Docker performance - I am also not able to reproduce that: pulling down all containers averages roughly 4 seconds and starting them up takes only 2 seconds.

  60. Ed McDonagh

    Thanks for looking Kevin. The CPU restriction is on the (LXC/D) container. I wonder if docker is pretending it has more? Maybe I’ll look at restricting within the docker config?

    2 seconds to start? Blimey! I wonder what is going on with my setup!

  61. Kevin Schärer

    I’ve pushed these changes to docker; thus you’ll need to fetch the newest image, and set the variable in the .env.prod file. Then the number of active tasks should be limited to whatever you’ve configured.

  62. Ed McDonagh

    Thanks, I’ll try this later. If the value is unset does it work like before? (I haven’t looked at the commit yet)

    I’m wondering if there is either some sort of timeout for a web call that is being blocked at my institution, or some complication with LXC that is causing the start to be slow.

  63. Kevin Schärer

    That’s quite weird. I’ll look around to see if I can find anything that could cause slow start-ups.

  64. Ed McDonagh

    Don’t worry Kevin - now I’ve actually looked it seems that the default of LXC/D on ZFS is known to be slow with Docker, which I hadn’t appreciated. Recommendation is to set up a btrfs volume for containers that are running Docker, which is annoying!

  65. Ed McDonagh

    Once configured, your branch on a btrfs volume for /var/lib/docker takes 13 seconds real time to start.

    The zfs volume version timed out after 1 minute 27 seconds real time 😥

    I’ve tested the 2 worker setting in Huey, works as you described.

  66. Ed McDonagh

    @Kevin Schärer can you take a look at https://bitbucket.org/openrem/openrem/pipelines/results/4726 - I ran the Python 3.8 pipeline and it fell on this error:

    File "/opt/atlassian/pipelines/agent/build/openrem/remapp/views_admin.py", line 2266, in <module>
        def tasks(request, stage: str | None = None):
    TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
    

    Is it fixable? I know we are standardising on Python 3.10 but I was hoping it would work with 3.8 and 3.9 too.

  67. Kevin Schärer

    @Ed McDonagh - According to PEP 604, | can be used to write union types from Python 3.10 onwards. For compatibility with Python 3.9 and earlier, one needs to use typing.Union instead. I'll change that.
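    The two equivalent spellings (function bodies here are placeholders to keep the sketch self-contained, not the real view code):

```python
from typing import Optional, Union

# `str | None` (PEP 604) needs Python 3.10+; these two annotations are
# equivalent and also run on Python 3.8 and 3.9.
def tasks(stage: Union[str, None] = None) -> str:
    return stage or "all"


def tasks_opt(stage: Optional[str] = None) -> str:  # Optional[X] == Union[X, None]
    return stage or "all"
```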

  68. Ed McDonagh

    Thanks for fixing that. 3.8 and 3.9 will be slower as all the tests are run, including the slow one. But 26 minutes seems too long. I don’t know if there has been a change, or if Bitbucket is on a go-slow.

    I’ve also set off a 3.8 for my local settings branch so that should give us a clue.
