Example not working

Issue #2 resolved
Peter Steinbach created an issue

I am working under

$ uname -a
Linux islay 4.4.0-kfd-compute-rocm-rel-1.1.1-10 #1 SMP Mon May 30 15:27:31 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.4 LTS
Release:        14.04
Codename:       trusty

and compiled the example hcfft_1D_R2C.cpp with

/opt/rocm/hcc/bin/clang++  -hc -std=c++amp -stdlib=libc++ -I/opt/rocm/hcc-lc/include -I/home/steinbac/software/hcfft/master/include  -hc -std=c++amp -L/opt/rocm/hcc-lc/lib -Wl,--rpath=/opt/rocm/hcc-lc/lib -lc++ -lc++abi -ldl -lpthread -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive  -L/home/steinbac/software/hcfft/master/lib -g -lhc_am -lhcfft -o hcfft_1D_R2C  hcfft_1D_R2C.cpp

I receive a segfault and the rocm-gdb stack trace of it looks like this:

#0  0x00002aaaac558a0f in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00002aaaab8a985b in FFTPlan::hcfftEnqueueTransformInternal(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
#2  0x00002aaaab89e261 in FFTPlan::hcfftEnqueueTransform(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
#3  0x00002aaaab858182 in hcfftExecR2C(unsigned long, float*, hc::short_vector::float_2*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
#4  0x00000000004050eb in main (argc=1, argv=0x7fffffffdf48) at hcfft_1D_R2C.cpp:32

which occurs when hcfftExecR2C is called. Is a 1024-sized signal not supported yet? What am I missing?

Comments (31)

  1. Neelakandan Ramachandran

    I have created a Jira ticket for this. I will try to reproduce it and get back to you.

    One thing to note: we are using ROCm 1.0 and the hcc-hsail backend rather than the default hcc (LC backend).

  2. Peter Steinbach reporter

    Alright, thanks! I just installed ROCm 2 weeks ago and upgraded this week. By the way, my system contains a package called hcc_hsail. I am still new to the ROCm software stack, so could you give me some guidance on how to run with the hcc-hsail backend?

  3. Neelakandan Ramachandran

    The easiest way is to change the symlink /opt/rocm/hcc to point to /opt/rocm/hcc-hsail.

    The issue reported here was reproduced. The developer @mkarunanidhi has rebased the changes and cleaned the master branch of all such issues, so you should be able to run this example with the master branch. I am currently looking at integrating fftw and will keep you posted.
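
    Before rebuilding, it can help to confirm which backend the /opt/rocm/hcc symlink currently resolves to. A minimal C++ sketch, assuming a POSIX system; the helper name `symlink_target` is hypothetical and not part of ROCm or hcfft:

    ```cpp
    #include <unistd.h>
    #include <string>

    // Hypothetical helper: returns the target a symlink points to,
    // or an empty string if `path` is not a symlink or cannot be read.
    std::string symlink_target(const std::string& path) {
      char buf[4096];
      // readlink() does not append a terminating NUL; it returns the length.
      ssize_t len = readlink(path.c_str(), buf, sizeof(buf));
      if (len < 0) {
        return std::string();
      }
      return std::string(buf, static_cast<size_t>(len));
    }
    ```

    Calling `symlink_target("/opt/rocm/hcc")` should then return a path ending in hcc-hsail once the symlink has been switched, assuming the install layout described in this thread.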

  4. Neelakandan Ramachandran

    The rebasing is done. The master branch should be clean now, and the optim branch captures the effort to optimize the Exec APIs.

    Currently we are considering a use case where the execution APIs are invoked many times (> 70 K), and we are trying to optimize for this particular use case.

  5. Peter Steinbach reporter

    I pulled the recent master, but my tests still do not compile with master on ROCm 1.1. Any update on this, or do I need to downgrade (however that would work)?

  6. Neelakandan Ramachandran

    @psteinb I assume you are using the example https://bitbucket.org/multicoreware/hcfft/src/11adcb9aed10475f0b61ad685d91eaf20102b8ec/test/examples/hcfft_1D_R2C.cpp?at=master&fileviewer=file-view-default

    I don't see any segfault on my end. I am also using ROCm 1.1, but it seems your kernel driver is more recent. Below is what I get with uname -a:

    Linux CirrascaleAMD 4.4.0-kfd-compute-rocm-rel-1.1-15 #1 SMP Fri May 6 15:32:45 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux

  7. Peter Steinbach reporter

    I have GNU/Linux 4.4.0-kfd-compute-rocm-rel-1.1.1-10 running with an R9 Fiji Nano. I am not too familiar with the numbering, so I wonder whether my kernel is ahead of yours or the other way around. I saw that there are updates available to rocm-dev, but they apparently pull in a new kernel. Checking the repo source, I see two 4.4.0 kernels available there, yours AND mine: http://packages.amd.com/rocm/apt/debian/pool/main/l/ I am a bit confused now. If I find the time today, I'll try it out.

  8. Neelakandan Ramachandran

    But the gdb stack you have printed here points to hcfft APIs. I'm not sure if we are on the same page.

  9. Peter Steinbach reporter

    I replied to your post by mail, but apparently that didn't go through. Here is my answer: Neelakandan, I just reran the examples from above with 4.4.0-kfd-compute-rocm-rel-1.1.1-10:

    $ /opt/rocm/hcc/bin/clang++  -hc -std=c++amp -stdlib=libc++ -I/opt/rocm/hcc-lc/include -I/home/steinbac/software/hcfft/master/include  -hc -std=c++amp -L/opt/rocm/hcc-lc/lib -Wl,--rpath=/opt/rocm/hcc-lc/lib -lc++ -lc++abi -ldl -lpthread -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive -L/home/steinbac/software/hcfft/master/lib -g -lhc_am -lhcfft -o hcfft_1D_R2C  hcfft_1D_R2C.cpp
    $ rocm-gdb --args ./hcfft_1D_R2C
    (ROCm-gdb) r
    Starting program: /home/steinbac/development/cpp_sandbox/hcfft_example/hcfft_1D_R2C
    [Thread debugging using libthread_db enabled]
    Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
    [New Thread 0x2aaaad80a700 (LWP 10418)]
    [New Thread 0x2aaaaea46700 (LWP 10419)]
    [ROCm-gdb: <Handling a SIGALRM - Used for GPU Debugging>]
    [ROCm-gdb: <Handling a SIGALRM - Used for GPU Debugging>]
    [ROCm-gdb: GPU Debugging has been successfully initialized]
    
    Program received signal SIGILL, Illegal instruction.
    0x00002aaaac559a10 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    (ROCm-gdb) bt
    #0  0x00002aaaac559a10 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x0000000000788f78 in ?? ()
    #2  0x00002aaaab8a993b in FFTPlan::hcfftEnqueueTransformInternal(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/stein
    #3  0x00002aaaab89e341 in FFTPlan::hcfftEnqueueTransform(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/steinbac/soft
    #4  0x00002aaaab858222 in hcfftExecR2C(unsigned long, float*, hc::short_vector::float_2*) () from /home/steinbac/software/hcfft/master/lib/l
    #5  0x000000000040523b in main (argc=1, argv=0x7fffffffe478) at hcfft_1D_R2C.cpp:32
    

    IIRC, that is the wrong backend. Doing the same with HSAIL gives:

    /opt/rocm/hcc/bin/clang++  -hc -std=c++amp -stdlib=libc++ -I/opt/rocm/hcc-hsail/include -I/home/steinbac/software/hcfft/master/include  -hc -std=c++amp -L/opt/rocm/hcc-hsail/lib -Wl,--rpath=/opt/rocm/hcc-hsail/lib -lc++ -lc++abi -ldl -lpthread -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive  -L/home/steinbac/software/hcfft/master/lib -g -lhc_am -lhcfft -o hcfft_1D_R2C_hsail  hcfft_1D_R2C.cpp
    ld: warning: type and size of dynamic symbol `_binary_kernel_isa_end' are not defined
    ld: warning: type and size of dynamic symbol `_binary_kernel_isa_start' are not defined
    $ echo $?
    0
    

    But, when I call the binary, I also stumble upon the same SIGILL:

    $ rocm-gdb --args ./hcfft_1D_R2C_hsail
    #...
    [New Thread 0x2aaaad81b700 (LWP 10677)]
    [New Thread 0x2aaaaea57700 (LWP 10678)]
    [ROCm-gdb: <Handling a SIGALRM - Used for GPU Debugging>]
    [ROCm-gdb: <Handling a SIGALRM - Used for GPU Debugging>]
    [ROCm-gdb: GPU Debugging has been successfully initialized]
    
    Program received signal SIGILL, Illegal instruction.
    0x00002aaaac56aa10 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    (ROCm-gdb) bt
    #0  0x00002aaaac56aa10 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x0000000000788f78 in ?? ()
    #2  0x00002aaaab8ba93b in FFTPlan::hcfftEnqueueTransformInternal(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
    #3  0x00002aaaab8af341 in FFTPlan::hcfftEnqueueTransform(unsigned long, hcfftDirection_, float*, float*, float*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
    #4  0x00002aaaab869222 in hcfftExecR2C(unsigned long, float*, hc::short_vector::float_2*) () from /home/steinbac/software/hcfft/master/lib/libhcfft.so
    #5  0x000000000040517b in main (argc=1, argv=0x7fffffffe448) at hcfft_1D_R2C.cpp:32
    

    The hcfft build is from last night.

  10. Peter Steinbach reporter

    I don't need to:

    int main(int argc, char *argv[])
    {
      int N = argc > 1 ? atoi(argv[1]) : 1024;
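
    The snippet above falls back to N = 1024 when no argument is given, but `atoi` returns 0 for non-numeric input, so a bad argument would yield N = 0. A slightly more defensive variant is sketched below; the helper name `parse_signal_length` and the fallback behaviour are assumptions for illustration, not code from the hcfft example:

    ```cpp
    #include <cerrno>
    #include <climits>
    #include <cstdlib>

    // Hypothetical helper: parse the signal length from argv[1],
    // falling back to `fallback` when the argument is missing or invalid.
    int parse_signal_length(int argc, char* argv[], int fallback = 1024) {
      if (argc < 2) {
        return fallback;
      }
      errno = 0;
      char* end = nullptr;
      long value = std::strtol(argv[1], &end, 10);
      // Reject empty input, trailing garbage, overflow and non-positive sizes.
      if (end == argv[1] || *end != '\0' || errno == ERANGE ||
          value <= 0 || value > INT_MAX) {
        return fallback;
      }
      return static_cast<int>(value);
    }
    ```

    In `main`, `int N = parse_signal_length(argc, argv);` would then replace the `atoi` line.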
    
  11. Peter Steinbach reporter

    I built and ran the gpudebugsdk/samples/MatrixMul example:

    Initializing HSA runtime...
    HSA device attributes:
            name: CPU Device
            type: CPU
            chip ID: 0x0
            HSA profile: Full
    HSA device attributes:
            name: CPU Device
            type: CPU
            chip ID: 0x0
            HSA profile: Full
    HSA device attributes:
            name: Fiji
            type: GPU
            chip ID: 0x7300
            HSA profile: Base
    Waiting for completion...
    Kernel dispatch executed in 0.240315 milliseconds.
    Complete.
    

    looks healthy to me.

  12. Peter Steinbach reporter

    That is, 11adcb9aed10475f0b61ad685d91eaf20102b8ec. A minor update: after rebuilding with hsail again, I now get this:

    $ /opt/rocm/hcc/bin/clang++  -hc -std=c++amp -stdlib=libc++ -I/opt/rocm/hcc-hsail/include -I/home/steinbac/software/hcfft/master/include  -hc -std=c++amp -L/opt/rocm/hcc-hsail/lib -Wl,--rpath=/opt/rocm/hcc-hsail/lib -lc++ -lc++abi -ldl -lpthread -Wl,--whole-archive -lmcwamp -Wl,--no-whole-archive  -L/home/steinbac/software/hcfft/master/lib -g -lhc_am -lhcfft -o hcfft_1D_R2C_hsail  hcfft_1D_R2C.cpp
    $ rocm-gdb --args ./hcfft_1D_R2C_hsail
    ### Error: HSA_STATUS_ERROR_INVALID_SYMBOL_NAME (4115) at line:1835
    [Thread 0x2aaaaea59700 (LWP 12089) exited]
    [Thread 0x2aaaad81d700 (LWP 12088) exited]
    [Inferior 1 (process 12084) exited with code 0377]
    (ROCm-gdb) bt
    No stack.
    

    I must not have rebuilt hsail with the hcfft version of June 9. :( My apologies.

  13. Peter Steinbach reporter

    Any idea about this? I'd like to get going benchmarking hcfft! I see a rocm-dev update waiting upstream. Should I install it? Might that help?

  14. Neelakandan Ramachandran

    I don't know, Peter. I am able to run the example test on my machine. Could you point me to the commit of hcfft that you are using?

  15. Peter Steinbach reporter

    Interesting, I did a fresh install on a Dell R730 running Ubuntu 14.04.4 and the same kernel 4.4.0-kfd-compute-rocm-rel-1.1.1-10. The machine hosts an S9300x2 and the examples work! So I suspect some problem with the Fiji R9 Nano. :(

  16. Neelakandan Ramachandran

    I am using Linux CirrascaleAMD 4.4.0-kfd-compute-rocm-rel-1.1-15 #1 SMP Fri May 6 15:32:45 CDT 2016 x86_64 x86_64 x86_64 GNU/Linux

    The R2C example runs fine with it.

  17. Peter Steinbach reporter

    Then I am lost. It's the same ROCm stack I am using, just a different motherboard (the Nano is inside a workstation, the S9300x2 inside a server). OK, the test samples work on the S9300x2, so I am closing this issue.

  18. Peter Steinbach reporter

    I am not sure how to proceed on this, and as I have a working solution on different hardware, I am closing this for now ... I don't have the tools or knowledge to dig further.

  19. Neelakandan Ramachandran

    @psteinb I can get you access to my machine if you would like to try a working ROCm + Nano setup.

  20. Gregory Stoner

    The S9300x2 may have a large BAR VBIOS loaded, and your R9 Nano most likely does not have the correct VBIOS for large BAR support.

    Greg
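
    One way to check this from Linux is to look at the BAR sizes the GPU exposes in sysfs: each line of /sys/bus/pci/devices/<address>/resource holds a start address, an end address and flags in hex, and the region size is end - start + 1. A small-BAR card typically exposes a 256 MiB framebuffer aperture, while a large-BAR VBIOS exposes a much larger one. A minimal sketch of the size calculation; the helper name `bar_size_bytes` is hypothetical, and the PCI address has to be filled in for the actual card:

    ```cpp
    #include <cstdio>

    // Hypothetical helper: size in bytes of one PCI region, given one line of
    // /sys/bus/pci/devices/<address>/resource ("start end flags", all hex).
    // Returns 0 for an unused region or a line that does not parse.
    unsigned long long bar_size_bytes(const char* resource_line) {
      unsigned long long start = 0, end = 0, flags = 0;
      if (std::sscanf(resource_line, "%llx %llx %llx", &start, &end, &flags) != 3) {
        return 0;
      }
      if (start == 0 && end == 0) {
        return 0;  // unused BAR slots are reported as all zeros
      }
      return end - start + 1;
    }
    ```

    Feeding each line of the resource file to `bar_size_bytes` and comparing the largest result against the card's memory size gives a rough indication of whether large BAR is active.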

  21. vrajesh mistry

    @psteinb @PRN @angstroms I am trying to execute hcfft_1D_R2C.cpp. Could you please help me with how to execute it? My setup is as follows:

    # uname -a
    Linux localhost 4.6.0-kfd-compute-rocm-rel-1.4-16 #1 SMP Tue Dec 13 13:14:21 EST 2016 x86_64 x86_64 x86_64 GNU/Linux
    

    Hardware details: AMD APU A10-7850K. OS: Ubuntu 16.04 64-bit edition. No discrete GPU present in the system.

    GitHub issue reference: https://github.com/RadeonOpenCompute/hcc/issues/246

    Compilation:

     /opt/rocm/hcc/bin/clang++ `/opt/rocm/hcc/bin/hcc-config --cxxflags --ldflags` --amdgpu-target=AMD:AMDGPU:7:0:0   -lhc_am -L/opt/rocm/hcfft/lib -lhcfft hcfft_1D_R2C.cpp -o hcfft_1D_R2C.out
    

    Execution:

    # ./hcfft_1D_R2C.out 32 32
    No suitable runtime detected. Fall back to CPU!
    Segmentation fault (core dumped)
    
  22. vrajesh mistry

    @angstroms, as per the ROCm description, the APU A10-7850K is not supported in a setup with a discrete GPU, hence I don't have any discrete GPU in my system. I have installed ROCm and it is able to execute the vectorCopy sample program successfully. Here are the full execution details:

    xyz@localhost:/opt/rocm/hsa/sample# make
    gcc -c -I/opt/rocm/include -o vector_copy.o vector_copy.c -std=c99
    gcc -Wl,--unresolved-symbols=ignore-in-shared-libs vector_copy.o -L/opt/rocm/lib -lhsa-runtime64 -o vector_copy
    xyz@localhost:/opt/rocm/hsa/sample# ./vector_copy
    Initializing the hsa runtime succeeded.
    Checking finalizer 1.0 extension support succeeded.
    Generating function table for finalizer succeeded.
    Getting a gpu agent succeeded.
    Querying the agent name succeeded.
    The agent name is gfx700.
    Querying the agent maximum queue size succeeded.
    The maximum queue size is 131072.
    Creating the queue succeeded.
    "Obtaining machine model" succeeded.
    "Getting agent profile" succeeded.
    Create the program succeeded.
    Adding the brig module to the program succeeded.
    Query the agents isa succeeded.
    Finalizing the program succeeded.
    Destroying the program succeeded.
    Create the executable succeeded.
    Loading the code object succeeded.
    Freeze the executable succeeded.
    Extract the symbol from the executable succeeded.
    Extracting the symbol from the executable succeeded.
    Extracting the kernarg segment size from the executable succeeded.
    Extracting the group segment size from the executable succeeded.
    Extracting the private segment from the executable succeeded.
    Creating a HSA signal succeeded.
    Finding a fine grained memory region succeeded.
    Allocating argument memory for input parameter succeeded.
    Allocating argument memory for output parameter succeeded.
    Finding a kernarg memory region succeeded.
    Allocating kernel argument memory buffer succeeded.
    Dispatching the kernel succeeded.
    Passed validation.
    Freeing kernel argument memory buffer succeeded.
    Destroying the signal succeeded.
    Destroying the executable succeeded.
    Destroying the code object succeeded.
    Destroying the queue succeeded.
    Freeing in argument memory buffer succeeded.
    Freeing out argument memory buffer succeeded.
    Shutting down the runtime succeeded.
    
  23. vrajesh mistry

    @psteinb @PRN @angstroms after some experiments on my APU A10-7850K (without any dGPU) I figured out a solution to the error "No suitable runtime detected. Fall back to CPU!". But now I am getting a segmentation fault while executing simple hcfft library code. The following is the code:

      #include "hcfft.h"
    
      int main()
      {
             int N = 4;
             // HCFFT work flow
             hcfftHandle *plan = NULL;
             hcfftResult status  = hcfftPlan1d(plan, N, HCFFT_R2C);
       }
    

    and the following is the rocm-gdb log:

    Thread 1 "1d_r2c_test" received signal SIGSEGV, Segmentation fault.
    0x00007ffff70950ab in FFTRepo::createPlan(unsigned long*, FFTPlan*&) () from /opt/rocm/hcfft/lib/libhcfft.so
    (ROCm-gdb) backtrace
    #0  0x00007ffff70950ab in FFTRepo::createPlan(unsigned long*, FFTPlan*&) () from /opt/rocm/hcfft/lib/libhcfft.so
    #1  0x00007ffff7092082 in hcfftCreateDefaultPlanInternal(unsigned long*, hcfftDim_, unsigned long const*) () from /
    #2  0x00007ffff70955b6 in FFTPlan::hcfftCreateDefaultPlan(unsigned long*, hcfftDim_, unsigned long const*, hcfftDir
    #3  0x00007ffff704cd61 in hcfftPlan1d () from /opt/rocm/hcfft/lib/libhcfft.so
    #4  0x0000000000406b6f in main ()
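
    One possible cause, going only by the snippet above: `plan` is a NULL `hcfftHandle*`, and the backtrace ends in `FFTRepo::createPlan(unsigned long*, FFTPlan*&)`, which suggests the library writes the new handle through that pointer. If so, the caller would need to pass the address of a real handle (`hcfftHandle plan; hcfftPlan1d(&plan, N, HCFFT_R2C);`). The sketch below illustrates the out-parameter pattern with stand-ins only: `FakeHandle`, `FakeResult` and `fakePlan1d` are hypothetical and are not the hcfft API, and unlike the library in the backtrace the stand-in checks for NULL instead of crashing:

    ```cpp
    #include <cstddef>

    // Stand-in types, NOT the hcfft API. The backtrace suggests the real
    // handle is an unsigned long written through a pointer.
    typedef unsigned long FakeHandle;
    enum FakeResult { FAKE_SUCCESS = 0, FAKE_INVALID_PLAN = 1 };

    // Stand-in for a plan-creation call that returns the handle via `plan`.
    FakeResult fakePlan1d(FakeHandle* plan, int nx) {
      static FakeHandle next_id = 1;
      if (plan == NULL || nx <= 0) {
        return FAKE_INVALID_PLAN;  // a library without this check would crash here
      }
      *plan = next_id++;  // the callee writes into storage owned by the caller
      return FAKE_SUCCESS;
    }

    // Correct usage: the caller owns the handle and passes its address.
    bool create_plan_correctly(int nx) {
      FakeHandle plan = 0;
      FakeResult status = fakePlan1d(&plan, nx);
      return status == FAKE_SUCCESS && plan != 0;
    }
    ```

    Checking the returned status before calling any Exec function would also turn a silent failure into a visible error.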
    