trouble in the tutorial server

Issue #2234 closed
Bill Gabella created an issue

I had several issues with the tutorial server at https://etkhub.ndslabs.org/hub/login .

  • I had frequent disconnects and reconnects. I do not think this actually interrupted my clicking through the Jupyter notebook much.
  • In the notebook I would see the (*) indicator on simple cells for a long time, likely while I had lost the connection.
  • I finally finished the build; lots of warnings, but I think that is usual. When I run the cell with create-submit helloworld, there is a warning: "Warning: Total number of threads and number of cores per node are inconsistent: procs=1, ppn-used=4 (procs must be an integer multiple of ppn-used)." However, running the next cell I do eventually see the Active (Finished).
  • Running the next cell, show-output, I do not see "INFO (helloworld): Hello World!"; instead I see messages about there being no Formaline output.
  • Running the smaller static_tov example gives a more serious error: "The value of the MCA parameter "plm_rsh_agent" was set to a path that could not be found: plm_rsh_agent: ssh : rsh Please either unset the parameter, or check that the path is correct". It did not run, and indeed in the plotting cell the data file was not found.

Comments (41)

  1. Roland Haas

    Hello Bill,

    thank you for the quick and thorough testing.

    It seems as if the tutorial server is not as ready as we'd have hoped.

    Yours, Roland

  2. Roland Haas

    I am trying this myself now.

    I do see that it does not update itself for long-running operations like the ET compile, where the state is not updated, i.e. the (*) sticks around. Maybe this is some sort of webserver timeout. It is also possible that the link stage uses more memory than allowed (the containers are limited to 8GB, and at least on clusters [Comet, specifically] I have seen Cactus use that much to compile).

    What browser/OS were you using? Mine is Mozilla/5.0 (X11; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0

    Initially "Hello, World!" did not run properly (skeleton simulation directory was created but no output files), deleting it and trying again did work.

    I do see the (fatal) "The value of the MCA parameter "plm_rsh_agent" was set to a path that could not be found: plm_rsh_agent: ssh : rsh Please either unset the parameter, or check that the path is correct" error, which is odd given that this used to work before.

    Other things of note:

    • the tutorial builds BLAS which could be avoided by installing the proper Ubuntu package in the docker container
    • the tutorial builds LAPACK which could be avoided by installing the proper Ubuntu package in the docker container
    • it may be good to build a smaller configuration for "Hello, World!" which would finish building more quickly
    • similarly it may be useful to build only a reduced thornlist for the static_tov example. There are even tools in the ET that should determine the minimal list required. They may even still work (or work again given that there was a recent re-implementation).
    • the sed command tends (or at least tended) to confuse people. Possibly replace this with a %%writefile cell containing the full parfile content
    • the generated plot is missing x and y labels, and a legend. If one presented such a plot in an assignment, one would lose marks.
  3. Roland Haas

    This seems to be an OpenMPI issue, in that it checks for rsh or ssh in its $PATH pretty much before it does anything else even when no remote execution is needed.

    For example these:

    OMPI_MCA_plm_rsh_agent=sh mpirun -n 2 whoami
    mpirun -mca plm_rsh_agent sh -n 2 whoami
    

    both work.

    A fix may be to set the plm_rsh_agent option in /etc/openmpi-mca-params.conf or in $HOME/.openmpi/mca-params.conf, after which the simple

    mpirun -n 2 whoami
    

    works.
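
    For concreteness, a minimal sketch of what such a config-file entry might look like (whether to use the system-wide or the per-user file is an open choice):

    # in /etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf:
    # use plain "sh" as the launch agent so OpenMPI stops looking for ssh/rsh,
    # which are not needed for single-node runs inside the container
    plm_rsh_agent = sh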

  4. Bill Gabella reporter

    My browser is Google Chrome Version 73.0.3683.75 (Official Build) (64-bit), and running on Fedora Linux 29 64-bit.

  5. Roland Haas

    Thank you. I have not been able to find out if / where there is a timeout on the server side that would stop updating the browser in long running cell executions.

  6. Steven R. Brandt

    I'm having trouble understanding how it compiles with mpich and runs using openmpi. The configure script will grab mpic++ out of the environment, and that should be in the same location as mpirun. Are you sure this is happening, Roland? Regardless, we should eliminate one of the MPIs.

  7. Roland Haas

    It seems like a broken package selection. The compile script picked mpic++ to find out the libraries (and mpic++ is from MPICH; see the output of readlink $(readlink $(which mpic++)), which is /usr/bin/mpicxx.mpich), but for some reason (no real idea why, maybe just bad luck in the order in which packages were installed) the equivalent readlink $(readlink $(which mpirun)) gives /usr/bin/mpiexec.openmpi.

    I have verified that this is indeed what is happening, though there were other things wrong as well, as shown by the earlier comments by Bill and me. You can try this out yourself by opening up a notebook on the server and just playing around with a hello-world MPI code or the readlink commands shown above.
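
    For reference, these are the checks (they just repeat the readlink commands mentioned above; the paths in the comments are what the current image gives):

    readlink $(readlink $(which mpic++))   # /usr/bin/mpicxx.mpich, i.e. MPICH
    readlink $(readlink $(which mpirun))   # /usr/bin/mpiexec.openmpi, i.e. OpenMPI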

    I still do not know what causes the (*) to freeze or why the original Hello World Cactus run failed to start.

  8. Steven R. Brandt

    So when we first set this up, the tutorial server worked. I ran it from beginning to end repeatedly. I suspect the base image (jupyter/scipy-notebook) changed on us, and this caused many of these problems. In particular, I think openssh-client was removed and openmpi was added. It's hard to say for sure.

    My suggestion is that we add: openssh-client, vim, liblapack-dev and take out the mpich stuff.

    Alternatively, we could consider basing the image on ubuntu rather than jupyter/scipy-notebook.

  9. Roland Haas

    I agree. Basing it on Ubuntu (a small [headless] variant without GUI and Office tools, if possible, so that the containers do not each use up tens of GB of disk space given that they all have to live in the same VM) may be the nicer of the two options you list, in that the container would then look closer to someone's Linux laptop / VM / Ubuntu GNU/Linux subsystem.

  10. Steven R. Brandt

    The following Dockerfile runs the notebook on my laptop (provided I compile with a single process):

    FROM ubuntu:16.04
    
    USER root
    
    RUN apt-get -qq update && \
        apt-get -qq install \
            build-essential python python-pip gfortran git mpich \
            subversion curl gnuplot gnuplot-x11 time libmpich-dev \
            libnuma-dev numactl hwloc libhwloc-dev libssl-dev \
            hdf5-tools libhdf5-dev gdb gsl-bin libgsl0-dev \
            ffmpeg libgsl-dev libopenblas-dev libpapi-dev fftw3-dev \
            liblapack-dev vim openssh-client pkg-config && \
        apt-get -qq clean all && \
        apt-get -qq autoclean && \
        apt-get -qq autoremove && \
        rm -rf /var/lib/apt/lists/*
    
    RUN pip install --upgrade pip
    RUN pip install matplotlib numpy jupyter
    ENV NB_USER jovyan
    RUN useradd -m $NB_USER
    USER $NB_USER
    ENV USER $NB_USER
    COPY start-notebook.sh /usr/local/bin/
    COPY CactusTutorial.ipynb /tutorial/
    ENV PKG_CONFIG_PATH /usr/share/pkgconfig:/usr/lib/x86_64-linux-gnu/pkgconfig:/usr/lib/pkgconfig
    
    CMD ["start-notebook.sh", "--NotebookApp.token=''"]
    
  11. Roland Haas

    Very good.

    Would it be possible to change the installed packages to closer follow https://nbviewer.jupyter.org/github/nds-org/jupyter-et/blob/master/CactusTutorial.ipynb ? The "vim" package (while I approve of your choice of editor) is probably not required, is it?

    I believe that the

    RUN pip install --upgrade pip
    RUN pip install matplotlib numpy jupyter
    

    might benefit from being followed by

    rm -rf .cache/pip .pip
    

    or similar to get rid of any caches that pip created (there's also a --no-cache-dir option which may help).
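
    For example, just as a sketch of the idea (package list unchanged):

    RUN pip install --no-cache-dir --upgrade pip && \
        pip install --no-cache-dir matplotlib numpy jupyter && \
        rm -rf ~/.cache/pip ~/.pip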

    Note: updating the jupyter-et repo triggers a build of the ndslabs/jupyter-et docker image, but (at least right now) one has to manually run docker pull on etkhub.ndslabs.org

  12. Roland Haas

    So you do want to use mpich on the tutorial server but openmpi in the instructions we give to students?

    vim is nice, but at least right now there is no terminal for them to get to on the tutorial server (they could get one if they ran docker on their laptop, though in that case I would expect "apt-get install vim" to be within their skill set).

  13. Steven R. Brandt

    You have a point about the mpich vs. openmpi thing, though the users should not be able to tell the difference. As for getting a terminal, the notebook server lets you do that. When you first login, click on "New" on the right hand side of the screen and select Terminal. Poof. You have a terminal.

  14. Roland Haas

    I had a look at the container. It seems to have installed a couple of unexpected packages such as gromacs (https://packages.ubuntu.com/xenial/gromacs). My guess is (just trying this right now) that this is because gromacs is somehow recommended by one of our packages. One can (and should) use the --no-install-recommends option to avoid this.

    You are mixing things by having Ubuntu install the base Python and then using pip for the rest. Would it not be more consistent to use Ubuntu's ipython3-notebook package for jupyter?
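
    Concretely, the --no-install-recommends change would mean adjusting the install command in the Dockerfile above roughly like this (only a few packages shown for brevity; the full list stays as it is):

    RUN apt-get -qq update && \
        apt-get -qq install --no-install-recommends \
            build-essential python python-pip gfortran git && \
        rm -rf /var/lib/apt/lists/*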

  15. Ian Hinder

    Is nano available? It would be nice to have vim, emacs and nano. Nano is used in most Ubuntu examples because it is discoverable (visible menu shortcuts) and easy for new users to learn.

  16. Steven R. Brandt

    @rhaas80 it is necessary to mix things up. You can't install jupyter on Ubuntu without using pip, AFAIK. Having said that, matplotlib suddenly no longer works with pip install (unless you specify an older version) because the newest version is python3 only.

    @ianhinder I can certainly add nano and emacs.

    I'll be pushing a change to the dockerfile shortly.

  17. Roland Haas

    Thank you. I should have just tried myself rather than putting the burden on you :-). When I tried I noticed that Ubuntu 16.04, while it does have ipython3-notebook (which is the Python part of the notebook), does not offer a web terminal yet. Since it is also quite old, I then tried the next LTS release (18.04), which does have the more modern jupyter-notebook package available.

    Unfortunately I also had to realize that the BLAS and LAPACK ExternalLibraries find neither the openblas nor the lapack/blas system packages. The issue is that neither of the two ExternalLibraries is multi-arch aware, so they only look for /usr/lib/liblapack.so and not /usr/lib/x86_64-linux-gnu/liblapack.so.

    I also tried what would happen if I pass --no-install-recommends to apt-get, but had to realize that while it prevents GROMACS (a molecular dynamics code) from being installed, our list of packages relies on recommended ones being installed as well (e.g. openmpi-bin along with libopenmpi-dev). The Docker image size changes from ~1.5GB for the "standard Ubuntu" way to ~700MB if one does not install the recommended packages.

    This should be fixable by replacing the hand-coded search with Frank's bash_utils.sh and its find_lib function.
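
    As an illustration only (find_lib is the proper solution; this just shows the multi-arch idea):

    # check the multi-arch directory as well as the classic locations
    # (x86_64-linux-gnu is the triplet in our containers; other arches differ)
    for dir in /usr/lib /usr/lib64 /usr/lib/x86_64-linux-gnu; do
        if [ -e "$dir/liblapack.so" ]; then
            LAPACK_DIR="$dir"
            break
        fi
    done
    echo "LAPACK found in: ${LAPACK_DIR:-nowhere}"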

  18. Roland Haas

    More issues. I am now getting error messages like this

    [ekohaes8:26785] Read -1, expected 313632, errno = 1
    

    by the hundreds (this is using Ubuntu 18.04 rather than 16.04, so it may only happen for newer versions of OpenMPI; the OpenMPI ticket referenced below mentions it happening on at least OpenMPI 4.0 and 3.1.3). This is apparently known: https://github.com/open-mpi/ompi/issues/4948 with the workaround being to set an environment variable (or a setting in the .conf file in $HOME):

    export OMPI_MCA_btl_vader_single_copy_mechanism=none
    

    It would seem this can also depend on the Docker version used, as docker run --cap-add=SYS_PTRACE ... is offered as a host-side workaround.
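
    If we want to bake the workaround into the image rather than into the environment, the same setting should also go into the MCA params file (a sketch, same caveat as for plm_rsh_agent above):

    # in /etc/openmpi-mca-params.conf or $HOME/.openmpi/mca-params.conf:
    # disable the vader single-copy mechanism (CMA), which needs ptrace
    # permissions that the container does not have
    btl_vader_single_copy_mechanism = none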

  19. Steven R. Brandt

    OK, I've pushed a new version of the Dockerfile. It uses 16.04 because if I use a later version, then I can't run the image in singularity on older machines. I've put this and other justifications for my choices of packages in the Dockerfile.

  20. Roland Haas

    I approved and merged the Dockerfile changes and have pulled the updated docker image to the etkhub server.

    I have also updated the login screen etc. and made CILogon the login provider. For those who already registered via github I tried to add their github email address so that you can use GitHub through CILogon to log in.

    @bgabella I had to guess your likely github email address.

  21. Bill Gabella reporter

    Just tried the login, https://etkhub.ndslabs.org/ , using CILogon and then GitHub. That seems to be working, in that it puts me in a Jupyter notebook session with messages about trying to start up a server. After a bit (300 seconds) it fails. I will attach the failure. Screenshot from 2019-03-21 09-39-08.png

  22. Roland Haas

    Hmm, that is no good. It seems as if it ran out of CPUs to allocate to your Jupyter notebook. Right now there is 1 notebook running (mine), plus the hub and some infrastructure.

    We limit each notebook to 2 CPUs (there are 4 CPUs on the VM that this runs in), and do not set an explicit guarantee. I will try and see if a limit without an explicit guarantee causes the guarantee to default to the limit (and also check how many CPUs are guaranteed to the other running containers). The docs https://zero-to-jupyterhub.readthedocs.io/en/latest/user-resources.html#set-user-memory-and-cpu-guarantees-limits do not seem to indicate an implied guarantee.
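
    For reference, the knobs in question in the zero-to-jupyterhub config are of this form (the values here are just an example of what we might set, not what is currently deployed):

    singleuser:
      cpu:
        limit: 2
        guarantee: 0.5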

  23. Roland Haas

    It does seem as if a non-existent guarantee setting took the limit to be the guarantee. I have updated the settings and was able to start 2 notebooks at the same time.

    The server will probably support ~6 concurrent notebooks (0.5 CPU for each plus 1 CPU for administrative tasks). It has enough memory for that (8GB total, no swap allowed). To scale larger we would have to find out how to have etkhub spin up new VMs as needed (which it probably can somehow do).

    @bgabella please try again (I have terminated one of my instances just to be sure).

  24. Bill Gabella reporter

    I can log in using CILogon and my GitHub credentials. I can pull up the Jupyter notebook, run a new terminal, and step through the notebook. My only issue was that the little Jupyter hourglass made me think it was still compiling the code in the sim build step, but I think it had been done for some time (over 2 hours?)... the next cell, sim create-submit, ran immediately.

    It seems it is working, though not very speedily.

  25. Roland Haas

    Compilation can likely be sped up a bit (by a factor of 2 or so) by restoring the -j2 make option that was removed somewhere along the way. While we do not have enough cores on the host to give 2 cores to many users (it is a 4-core VM), the notebooks are allowed to use (the equivalent of) 2 cores, so compilation could go faster.

    The timeout that you observe is a bit more worrisome as you had seen the same behaviour before, but it does not show up for me from my workstation.

    I will give this a try on my laptop at home as well.

  26. Ian Hinder

    Parallel compilation can be faster than serial compilation even on a machine with a single core, because the parallelisation hides filesystem latency. I remember noticing this on a machine with a particularly sluggish NFS filesystem. I don't know if this is relevant on this system.

  27. Roland Haas

    @bgabella I think I see what you mean. In my case, though, the tab icon does eventually switch back from the hourglass to the regular notebook icon. On the other hand, I am left with some cells still showing the "*" busy marker even after everything has finished running. We seem not to be the only ones affected: https://github.com/jupyter/notebook/issues/2748 and I verified that our jupyter code does indeed not have the lines proposed as a fix. I will check whether the proposed fix works for us and, if so, propose a change to the Dockerfile to patch jupyter in the container.

  28. Roland Haas

    The fix proposed in https://github.com/jupyter/notebook/issues/2748 does indeed work for me in a test installation and I have applied the same fix to the actual docker image on the etkhub server.

    @bgabella if you could re-test once more to see whether things are working now, then I would say we can close the ticket.

    I would make new tickets for improvements to the notebook rather than tagging them on to this one.

  29. Roland Haas

    @bgabella did you (or anyone else other than me, who cannot review this since I am the author) have time to test this once more?

  30. Steven R. Brandt

    I’ve recently tested the tutorial on Melete05 and discovered that one must set --ppn-used for each run (since that machine has 40 cores). Other than that, everything worked for me.
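
    For example, on a 40-core node something along these lines keeps procs a multiple of ppn-used (the parfile path below is only a placeholder for whatever the tutorial uses at that point):

    # --procs and --ppn-used as discussed above; the parfile name is an example
    ./simfactory/bin/sim create-submit helloworld \
        --parfile=par/helloworld.par --procs=1 --ppn-used=1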

  31. Roland Haas

    I just fixed an issue related to culling idle notebooks (defined as no longer being shown in a browser): the culler was using the default timeout of 1 hour rather than 4 days.
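
    For the record, the relevant culler settings in the hub config take roughly this form (4 days expressed in seconds; the exact values as deployed may differ):

    cull:
      enabled: true
      timeout: 345600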
