Cannot build pytorch after changes to magma in last 1-2 days

Issue #36 resolved

Daniel Bonner created an issue 2021-02-10

Hi Magma developers and users,

Update: Thanks. I now realise I need to replace NVCCFLAGS with DEVCCFLAGS in my make.inc. However, one of the many changes to magma means I am unable to build Pytorch from its latest source. If I build magma from a snapshot of source that is 1-2 days old, then I can build Pytorch.

This problem occurs on all my machines (laptop with Nvidia 2080 max q, desktop with Nvidia 2080 TI, server with 8 x Nvidia A100). Further details of my machines are below. Nb: The software and installation scripts I use are identical on all machines.

Ubuntu 20.10 (x86_64), gcc-10.2.0, pyenv, python 3.8.7, Nb: This is a virtual environment created with 'python -m venv', not conda
CUDA 11.2.1
GCC version: (Ubuntu 10.2.0-13ubuntu1) 10.2.0
CMake version: version 3.18.4

Python version: 3.8 (64-bit runtime)
Nvidia driver version: 460.32.03
cuDNN version: 8.1.0

All the best,

Dan

Comments (18)

Daniel Bonner reporter
- edited description
- 2021-02-10T14:59:28+00:00
Cade Brown
Daniel,

What error are you getting? Looking through my-make-log-file.txt, it seems to compile successfully (albeit with some warnings)

Additionally, are you using the git repo, or a release of MAGMA?

- 2021-02-10T15:33:52+00:00
Mark Gates
It appears to be compiling only the static *.a libs, not the dynamic *.so libs. The default DEVCCFLAGS is missing -Xcompiler "$(FPIC)", which is required for dynamic libraries. DEVCCFLAGS replaced NVCCFLAGS, but all the make.inc-examples still have NVCCFLAGS. Does hipcc have the same -Xcompiler flag as nvcc?
- 2021-02-10T15:38:26+00:00
Stanimire Tomov
Indeed, this must be related to NVCCFLAGS. We are in the process of merging the HIP and CUDA developments so NVCCFLAGS is obsolete; now replaced by DEVCCFLAGS.

You can change the name (NVCCFLAGS to DEVCCFLAGS) in make.inc. (The make.inc examples are updated; it’s a slight inconvenience that this major change requires users to modify their existing make.inc.)
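
As a rough illustration of the rename described above, here is a minimal Python sketch. It assumes make.inc refers to the variable as the whole word NVCCFLAGS; the function name `rename_nvccflags` and the sample fragment (including its flags) are hypothetical and not taken from any shipped make.inc example.

```python
import re


def rename_nvccflags(make_inc_text: str) -> str:
    """Rename the make variable NVCCFLAGS to DEVCCFLAGS (whole-word matches only)."""
    return re.sub(r"\bNVCCFLAGS\b", "DEVCCFLAGS", make_inc_text)


# Hypothetical make.inc fragment, for illustration only.
sample = (
    'NVCCFLAGS = -O3 -DNDEBUG -Xcompiler "$(FPIC)"\n'
    "LIB       = -lcublas -lcudart\n"
)

print(rename_nvccflags(sample), end="")
```

The whole-word match leaves longer identifiers that merely contain NVCCFLAGS untouched. Whether your own make.inc also needs -Xcompiler "$(FPIC)" added for the dynamic libraries is a separate check.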
- 2021-02-10T19:10:35+00:00
Daniel Bonner reporter
- attached pytorch-my-log-file.txt
- attached magma-with-DEVCCFLAGS-my-make-log-file.txt
- attached magma-with-DEVCCFLAGS-my-sudo-s-make-install-log-file.txt
- 2021-02-11T02:57:06+00:00
Daniel Bonner reporter
Thanks for the tip about DEVCCFLAGS. I was able to build magma.so and magma_sparse.so once I corrected my make.inc file. However something in magma has changed. After building magma (see new attached files), pytorch can not be built from source. My pytorch build (see new attached log file) ends in this error:
```
[5116/5994] Building NVCC (Device) object caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/torch_cuda_generated_ReduceMinMaxKernel.cu.o
ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-req-build-95q72bm2/setup.py", line 818, in <module>
    build_deps()
  File "/tmp/pip-req-build-95q72bm2/setup.py", line 315, in build_deps
    build_caffe2(version=version,
  File "/tmp/pip-req-build-95q72bm2/tools/build_pytorch_libs.py", line 58, in build_caffe2
    cmake.build(my_env)
  File "/tmp/pip-req-build-95q72bm2/tools/setup_helpers/cmake.py", line 345, in build
    self.run(build_args, my_env)
  File "/tmp/pip-req-build-95q72bm2/tools/setup_helpers/cmake.py", line 140, in run
    check_call(command, cwd=self.build_dir, env=env)
  File "/home/daniel/.pyenv/versions/3.8.7/lib/python3.8/subprocess.py", line 364, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['cmake', '--build', '.', '--target', 'install', '--config', 'Release', '--', '-j', '256']' returned non-zero exit status 1.
```

I know this is related to magma, because if I revert magma back to a version from 48 hours ago, then pytorch builds successfully. I have also tested this out by going back to Cuda 11.2.0 (not 11.2.1) so it is not related to a new version of Cuda.

Your help is much appreciated.

All the best,

Daniel
- 2021-02-11T03:01:41+00:00
Daniel Bonner reporter
Pytorch (latest git source) does not build (same error above) with the latest Magma libraries built from source.

Pytorch (latest git source) builds OK with Magma libraries built from a previous snapshot of Magma source (from git pull done roughly 2 days ago).

I used this command to find out the most recent commit in this snapshot of the Magma source:

git log HEAD~2..HEAD

This most recent commit in this snapshot is: 00b6803a77e7d212d354a301d9d5b3ff49ced2b3 (HEAD -> master, origin/master, origin/HEAD)
Author: Stan Tomov tomov@eecs.utk.edu
Date: Wed Feb 3 01:56:06 2021 -0500, Comment: move back to cublas

Let me know if you think I should raise this as a separate issue. I kinda thought it might be related.

Also: I’m getting a message from the Bitbucket website (https://bitbucket.org/icl/magma/src/master/) that they are having problems.

Stanimire: I noticed you said that the make.inc-examples have been updated to use DEVCCFLAGS. However, on Bitbucket (or git pull from Bitbucket), the make.inc-examples files still use NVCCFLAGS

All the best,

Daniel

- 2021-02-11T04:47:55+00:00
Stan Tomov
I am not very familiar with pytorch, but from the log files it looks like magma_types.h is included and HAVE_CUBLAS is not defined. I see from the logs that when you build magma you pass -DHAVE_CUBLAS to the compiler and everything builds. But maybe there is some other compilation with pytorch where the file is included and HAVE_CUBLAS is not defined?

If it is that, you can just comment out line 24 of magma_types.h and try to rebuild.

That’s one difference that I see from the previous version of magma. Before, there was no #error directive if HAVE_CUBLAS is not defined, and we would silently define it.
- 2021-02-11T04:49:45+00:00
Stan Tomov
Daniel,

Indeed, most of the make.inc examples are not updated yet (only the ones for hip and a make.inc.openblas for CUDA are). We will update them.

The commit that you mentioned is not essential for the overall library. There were a lot of changes in a merge of merge-master-hipMAGMA into master, and I am trying to figure out exactly how it was done and why I cannot see the changes.
- 2021-02-11T05:57:48+00:00
Daniel Bonner reporter
- attached magma-diff.txt
- 2021-02-11T06:05:36+00:00
Daniel Bonner reporter
Thanks for helping out with this Stan. I tried your suggestion and commented out line 24 in magma_types.h. Unfortunately pytorch still will not build.

I had a look at the differences between my older magma snapshot (my “magma” directory) and the latest magma (my “magma-2” directory) using the meld GUI and also with the command:

diff magma magma-2 (I’ve attached the result ‘magma-diff.txt’)

It looks like Magma has changed a lot in the last 1-2 days. So I’m guessing one of these changes has meant that pytorch will not build.

Pytorch builds OK with Magma libraries installed to /usr/local/magma and built from the old “magma” directory.

Pytorch errors/fails to build with Magma libraries installed in the same place and built from the current “magma-2” directory.

I made sure the libraries were available by creating the file:

/etc/ld.so.conf.d/magma.conf

Contents of file:
/usr/local/magma/lib

All the best,

Daniel
- 2021-02-11T06:14:27+00:00
Daniel Bonner reporter
- edited description
- changed title to Cannot build pytorch after changes to magma in last 1-2 days
- 2021-02-11T06:47:08+00:00
Daniel Bonner reporter
- attached magma_types.h
This version of magma_types.h allows Pytorch to be built from source with Magma support

There are 2 changes. Both are needed. The line numbers below refer to the line numbers in the original file (not the changed file):

1) Line 24 is commented out:

// #error No 'HAVE_*' macros were set! (defaulting to CUBLAS)

2) Line 62 is replaced as follows, i.e. it needs to be '#if defined(HAVE_CUBLAS)', not HAVE_CUDA:

// PREVIOUS LINE:

// #if defined(HAVE_CUDA)

is replaced by:

// NEW LINE:

#if defined(HAVE_CUBLAS)
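
For anyone wanting to script the two edits above, here is a minimal Python sketch. It assumes the original line numbers quoted in this comment (24 and 62) and refuses to touch a line that does not look like the expected one. The function name `patch_magma_types` and the four-line sample are hypothetical stand-ins, not the real contents of magma_types.h.

```python
def patch_magma_types(text: str, error_line: int = 24, guard_line: int = 62) -> str:
    """Comment out the #error line and switch the HAVE_CUDA guard to HAVE_CUBLAS.

    Line numbers are 1-based and refer to the original, unpatched file.
    """
    lines = text.splitlines(keepends=True)

    err = lines[error_line - 1]
    if not err.lstrip().startswith("#error"):
        raise ValueError(f"line {error_line} is not an #error directive")
    lines[error_line - 1] = "// " + err

    guard = lines[guard_line - 1]
    if "defined(HAVE_CUDA)" not in guard:
        raise ValueError(f"line {guard_line} does not test HAVE_CUDA")
    lines[guard_line - 1] = guard.replace("defined(HAVE_CUDA)", "defined(HAVE_CUBLAS)")

    return "".join(lines)


# Hypothetical, heavily shortened stand-in for the header (not the real file).
sample = (
    "#if !defined(HAVE_CUBLAS)\n"
    "#error No 'HAVE_*' macros were set! (defaulting to CUBLAS)\n"
    "#endif\n"
    "#if defined(HAVE_CUDA)\n"
)

print(patch_magma_types(sample, error_line=2, guard_line=4), end="")
```

With the default arguments the function targets lines 24 and 62, so it should only be run against the unmodified header from the same snapshot.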

All the best, Daniel
- 2021-02-13T05:51:14+00:00
Daniel Bonner reporter
Hi Stan, Cade and Mark,

I found out the 2 changes in magma_types.h needed to allow Pytorch to build successfully:

Line 24 - commented out.

Line 62 - replace ‘HAVE_CUDA’ with ‘HAVE_CUBLAS’.

Is this worth changing in your source code?

Would you like me to make a pull request?
All the best,

Daniel
- 2021-02-13T05:56:33+00:00
Daniel Bonner reporter
Hi,

I created this pull request:

https://bitbucket.org/icl/magma/pull-requests/7

All the best,

Daniel
- 2021-02-13T07:36:55+00:00
Stan Tomov
Hi Daniel,

That’s great that the problem is discovered and the issue is fixed now!
This is a very important case for us, where magma is integrated with other large packages.

Thank you for the pull request.

We will accept it and keep it in mind when organizing this for the upcoming release, now that we know what the problem is.

We must improve the control over these defines in some config file because now we fix the problem for CUDA (by defaulting to CUDA solution) but not if you want to install pytorch + magma for say AMD GPUs.

Stan

- 2021-02-13T19:15:58+00:00
Daniel Bonner reporter
Thanks Stan, Hopefully, it won't be too hard to cover the AMD option too. I understand that is what hipMAGMA is for, and now it has been merged with MAGMA. All the best, Daniel
- 2021-02-14T12:25:11+00:00
Stan Tomov
- changed status to resolved
This issue has been resolved.
- 2021-02-18T18:40:46+00:00

Assignee: –

Type: bug

Priority: major

Status: resolved

Votes: 0

Watchers: 1