Cannot build pytorch after changes to magma in last 1-2 days

Issue #36 resolved
Daniel Bonner created an issue

Hi Magma developers and users,

Update: Thanks. I now realise I need to replace NVCCFLAGS with DEVCCFLAGS in my make.inc. However, one of the many recent changes to magma means I am unable to build Pytorch from its latest source. If I build magma from a snapshot of the source that is 1-2 days old, then I can build Pytorch.

This problem occurs on all my machines (laptop with Nvidia 2080 max q, desktop with Nvidia 2080 TI, server with 8 x Nvidia A100). Further details of my machines are below. Nb: The software and installation scripts I use are identical on all machines.

Ubuntu 20.10 (x86_64), gcc-10.2.0, pyenv, python 3.8.7, Nb: This is a virtual environment created with 'python -m venv', not conda
CUDA 11.2.1
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
CMake version: version 3.18.4

Python version: 3.8 (64-bit runtime)
Nvidia driver version: 460.32.03
cuDNN version: 8.1.0

All the best,

Dan

Comments (18)

  1. Cade Brown

    Daniel,

    What error are you getting? Looking through my-make-log-file.txt, it seems to compile successfully (albeit with some warnings)

    Additionally, are you using the git repo, or a release of MAGMA?

  2. Mark Gates

    It appears to be compiling only the static *.a libs, not the dynamic *.so libs. The default DEVCCFLAGS is missing -Xcompiler "$(FPIC)", which is required for dynamic libraries. DEVCCFLAGS replaced NVCCFLAGS, but all the make.inc-examples still have NVCCFLAGS. Does hipcc have the same -Xcompiler flag as nvcc?

  3. Stanimire Tomov

    Indeed, this must be related to NVCCFLAGS. We are in the process of merging the HIP and CUDA developments so NVCCFLAGS is obsolete; now replaced by DEVCCFLAGS.

    You can change the name (NVCCFLAGS to DEVCCFLAGS) in make.inc. (The make.inc examples are updated; it’s a slight inconvenience that this major change requires users to modify their existing make.inc.)

  4. Daniel Bonner reporter

    Thanks for the tip about DEVCCFLAGS. I was able to build magma.so and magma_sparse.so once I corrected my make.inc file. However, something in magma has changed. After building magma (see new attached files), pytorch cannot be built from source. My pytorch build (see new attached log file) ends in this error:

    [5116/5994] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_ReduceMinMaxKernel.cu.o
    ninja: build stopped: subcommand failed.
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-95q72bm2/setup.py", line 818, in <module>
        build_deps()
      File "/tmp/pip-req-build-95q72bm2/setup.py", line 315, in build_deps
        build_caffe2(version=version,
      File "/tmp/pip-req-build-95q72bm2/tools/build_pytorch_libs.py", line 58, in build_caffe2
        cmake.build(my_env)
      File "/tmp/pip-req-build-95q72bm2/tools/setup_helpers/cmake.py", line 345, in build
        self.run(build_args, my_env)
      File "/tmp/pip-req-build-95q72bm2/tools/setup_helpers/cmake.py", line 140, in run
        check_call(command, cwd=self.build_dir, env=env)
      File "/home/daniel/.pyenv/versions/3.8.7/lib/python3.8/subprocess.py", line 364, in check_call
        raise CalledProcessError(retcode, cmd)
    subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '256']' returned non-zero exit status 1.

    I know this is related to magma, because if I revert magma back to a version from 48 hours ago, then pytorch builds successfully. I have also tested this out by going back to Cuda 11.2.0 (not 11.2.1) so it is not related to a new version of Cuda.

    Your help is much appreciated 🙂

    All the best,

    Daniel

  5. Daniel Bonner reporter

    Pytorch (latest git source) does not build (same error above) with the latest Magma libraries built from source.

    Pytorch (latest git source) builds OK with Magma libraries built from a previous snapshot of Magma source (from git pull done roughly 2 days ago).

    I used this command to find out the most recent commit in this snapshot of the Magma source:

    git log HEAD~2..HEAD

    The most recent commit in this snapshot is: 00b6803a77e7d212d354a301d9d5b3ff49ced2b3 (HEAD -> master, origin/master, origin/HEAD)
    Author: Stan Tomov tomov@eecs.utk.edu
    Date: Wed Feb 3 01:56:06 2021 -0500, Comment: move back to cublas

    Let me know if you think I should raise this as a separate issue. I kinda thought it might be related.

    Also: I’m getting a message from the Bitbucket website (https://bitbucket.org/icl/magma/src/master/) that they are having problems.

    Stanimire: I noticed you said that the make.inc-examples have been updated to use DEVCCFLAGS. However, on Bitbucket (or git pull from Bitbucket), the make.inc-examples files still use NVCCFLAGS

    All the best,

    Daniel

  6. Stan Tomov

    I am not very familiar with pytorch, but from the log files it looks like magma_types.h is included and HAVE_CUBLAS is not defined. I see from the logs that when you build magma you pass -DHAVE_CUBLAS to the compiler and everything builds. But maybe there is some other compilation within pytorch where the file is included and HAVE_CUBLAS is not defined?

    If it is that, you can just comment out line 24 and try to rebuild.

    That’s one difference that I see from the previous version of magma. Before, there was no #error directive if HAVE_CUBLAS is not defined, and we would silently define it.
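A minimal sketch of the difference described above, written as a stand-in rather than the actual contents of magma_types.h. HAVE_CUBLAS and the #error text come from this thread; HAVE_HIP, selected_backend and check_default_backend are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the backend check near the top of magma_types.h.
   HAVE_HIP is an assumed name for a second backend macro. */
#if !defined(HAVE_CUBLAS) && !defined(HAVE_HIP)
    /* Newer behaviour: stop any compilation that sets no backend macro:
       #error No 'HAVE_*' macros were set! (defaulting to CUBLAS) */

    /* Older behaviour, restored by commenting out the #error:
       silently fall back to cuBLAS. */
    #define HAVE_CUBLAS
#endif

/* Name of the backend selected by the guard above. */
static const char *selected_backend(void)
{
#if defined(HAVE_CUBLAS)
    return "cublas";
#else
    return "hip";
#endif
}

/* Self-check: a compilation that passes no backend flag ends up on cuBLAS. */
static void check_default_backend(void)
{
    assert(strcmp(selected_backend(), "cublas") == 0);
}
```

With the #error active, any translation unit that includes the header without a backend flag fails at the preprocessor stage. With the silent default, the same translation unit compiles as a cuBLAS build.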

  7. Stan Tomov

    Daniel,

    Indeed, most of the make.inc examples are not updated yet (only the ones for hip and a make.inc.openblas for CUDA are). We will update them.

    The commit that you mentioned is not essential for the overall library. There were a lot of changes in a merge of merge-master-hipMAGMA into master, and I am trying to figure out exactly how it was done and why I cannot see the changes.

  8. Daniel Bonner reporter

    Thanks for helping out with this, Stan. I tried your suggestion and commented out line 24 in magma_types.h. Unfortunately, pytorch still will not build.

    I had a look at the differences between my older magma snapshot (my “magma” directory) and the latest magma (my “magma-2” directory) using the meld GUI and also with the command:

    diff magma magma-2 (I’ve attached the result ‘magma-diff.txt’)

    It looks like Magma has changed a lot in the last 1-2 days. So I’m guessing one of these changes has meant that pytorch will not build.

    Pytorch builds OK with Magma libraries installed to /usr/local/magma and built from the old “magma” directory.

    Pytorch errors/fails to build with Magma libraries installed in the same place and built from the current “magma-2” directory.

    I made sure the libraries were available by creating the file:

    /etc/ld.so.conf.d/magma.conf

    Contents of file:
    /usr/local/magma/lib

    All the best,

    Daniel

  9. Daniel Bonner reporter

    This version of magma_types.h allows Pytorch to be built from source with Magma support

    There are 2 changes. Both are needed. The line numbers below refer to the line numbers in the original file (not the changed file):

    1) Line 24 is commented out:

    // #error No 'HAVE_*' macros were set! (defaulting to CUBLAS)

    2) Line 62 is replaced as follows, i.e. it needs to test HAVE_CUBLAS, not HAVE_CUDA:

    // PREVIOUS LINE:

    // #if defined(HAVE_CUDA)

    is replaced by:

    // NEW LINE:

    #if defined(HAVE_CUBLAS)

    All the best, Daniel
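One possible reading of why the second change is needed, sketched under the assumption that the compilation defines HAVE_CUBLAS (as the MAGMA build in this thread does with -DHAVE_CUBLAS) and that nothing defines HAVE_CUDA. What line 62 actually guards is not shown in this thread, so the function names below are hypothetical.

```c
#include <assert.h>

/* Assumption: HAVE_CUBLAS is the only backend macro set in this compilation. */
#ifndef HAVE_CUBLAS
#define HAVE_CUBLAS
#endif

/* Guard as on original line 62: tests a macro that is not set here,
   so whatever it protects is skipped. */
static int section_enabled_with_have_cuda(void)
{
#if defined(HAVE_CUDA)
    return 1;
#else
    return 0;
#endif
}

/* Guard after the change: tests the macro that is set,
   so the protected section is compiled. */
static int section_enabled_with_have_cublas(void)
{
#if defined(HAVE_CUBLAS)
    return 1;
#else
    return 0;
#endif
}

/* Self-check for the two guards under the assumption above. */
static void check_guards(void)
{
    assert(section_enabled_with_have_cuda() == 0);
    assert(section_enabled_with_have_cublas() == 1);
}
```

If this reading is right, it would also explain why commenting out line 24 alone was not enough: the compilation then proceeds, but the section behind the HAVE_CUDA guard is still left out.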

  10. Daniel Bonner reporter

    Hi Stan, Cade and Mark,

    I found out the 2 changes in magma_types.h needed to allow Pytorch to build successfully:

    Line 24 - commented out.

    Line 62 - replace ‘HAVE_CUDA’ with ‘HAVE_CUBLAS’.

    Is this worth changing in your source code?

    Would you like me to make a pull request?
    All the best,

    Daniel

  11. Stan Tomov

    Hi Daniel,

    It’s great that the problem has been found and the issue is fixed now!
    This is a very important case for us, when magma is integrated with other large packages.

    Thank you for the pull request.

    We will accept it and keep it in mind when organizing this for the upcoming release, now that we know what the problem is.

    We must improve the control over these defines in some config file, because for now we fix the problem for CUDA (by defaulting to the CUDA solution) but not for the case where you want to install pytorch + magma for, say, AMD GPUs.

    Stan

  12. Daniel Bonner reporter

    Thanks Stan, Hopefully, it won't be too hard to cover the AMD option too. I understand that is what hipMAGMA is for, and now it has been merged with MAGMA. All the best, Daniel
