Dijitso compilation failed on worker nodes

Issue #29 new
Artur Krupa created an issue

I have the following code (updated together with my supervisor to the 2017.2 version):

#!/usr/bin/env python


import os
os.environ['HOME']="/tmp"

from dolfin import *
from mshr import *
import sys
import random

set_log_level(40)

x0=random.uniform(0.2, 0.8)
y0=random.uniform(0.2, 0.8)

resolution = 300

domain =   Rectangle(Point(0.0, 0.0), Point(1.0, 1.0)) - Circle( Point(x0, y0), 0.15)

mesh = generate_mesh(domain, resolution)
domains = dolfin.MeshFunction("size_t", mesh, 2, mesh.domains())


# Create classes for defining parts of the boundaries and the interior
# of the domain
class Left(SubDomain):
    def inside(self, x, on_boundary):
        return (near(x[0], 0.0) and on_boundary)

class Right(SubDomain):
    def inside(self, x, on_boundary):
        return (near(x[0], 1.0) and on_boundary)

left = Left()
right = Right()

# Initialize mesh function for boundary domains
boundaries = MeshFunction("size_t", mesh, 1)
boundaries.set_all(0)
left.mark(boundaries, 1)
right.mark(boundaries, 2)

# Define input data
a0 = Constant(1.0)
voltage = 1.0

# Define function space and basis functions
V = FunctionSpace(mesh, "CG", 1)
u = TrialFunction(V)
v = TestFunction(V)

# Define Dirichlet boundary conditions at left and right boundaries
bcs = [DirichletBC(V, voltage, boundaries, 2),
       DirichletBC(V, 0.0, boundaries, 1)]

# Define new measures associated with the interior domains and
# exterior boundaries
dx = Measure("dx")(subdomain_data=domains)
ds = Measure("ds")(subdomain_data=boundaries)
f = Constant(0.0)



# Define variational form
a = inner(a0*grad(u), grad(v))*dx
L = f*v*dx()

# Solve problem
u = Function(V)

solve(a == L, u, bcs)

if(0):
    n = FacetNormal(mesh)
    m1 = a0*dot(grad(u), n)*ds(1)
    I1 = assemble(m1)
    print I1
    n = FacetNormal(mesh)
    m2 = a0*dot(grad(u), n)*ds(2)
    I2 = assemble(m2)
    print I2
    current = (I2+(-I1))/2
    print current
    R = voltage/current

if(1):
    energy = a0*inner(grad(u), grad(u))*dx
    power = assemble(energy)
    #print "Power total:", power
    R = voltage*voltage/power 

print x0, y0, R

I am using a set of virtual machines as HTCondor workers to compute many FEniCS solutions on many nodes at the same time. The code above was previously used on a Hadoop cluster and worked fine (2 years ago). We decided to move to HTCondor as our cluster solution, and as a cluster solution it works very well (on Microsoft Azure).

The role of the cluster is simple: push Python code to the workers and run it, then collect the output and save it into one single file on the main node (master) for further use.
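
For context, the collection step on the master can be sketched roughly as below. This is only an illustration, not the exact script in use: the function name `merge_outputs` and the file pattern are hypothetical, and it assumes each job writes the `x0 y0 R` line printed at the end of the script into its own output file.

```python
import glob


def merge_outputs(pattern, merged_path):
    """Collect 'x0 y0 R' lines from all files matching `pattern` into one file.

    `merged_path` must not itself match `pattern`.
    Returns the number of result lines written.
    """
    count = 0
    with open(merged_path, "w") as merged:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if len(fields) != 3:
                        continue  # skip warnings or other noise
                    try:
                        x0, y0, R = (float(v) for v in fields)
                    except ValueError:
                        continue  # three words, but not three numbers
                    merged.write("%r %r %r\n" % (x0, y0, R))
                    count += 1
    return count
```

A call such as `merge_outputs("job_*.out", "results.txt")` (hypothetical file names) would then produce the single results file.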

We still encounter two problems:

1.) When I just run the code above directly on a machine (any single node or any other machine), it works. But I get a message that I have no idea how to remove:

--------------------------------------------------------------------------
[[62043,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: htcmanager

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
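
One thing that might silence the warning above is to tell Open MPI not to try the `openib` transport at all. Open MPI reads MCA parameters from environment variables named `OMPI_MCA_<param>`, and the value `^openib` for `btl` means "every transport except openib". The sketch below assumes the variable must be set before `dolfin` (and therefore MPI) is imported. The helper name `disable_openib` is made up for illustration, and I have not confirmed this on the cluster.

```python
import os


def disable_openib(env=None):
    """Ask Open MPI to skip the OpenFabrics (openib) transport.

    Keeps an existing OMPI_MCA_btl setting untouched.
    """
    env = os.environ if env is None else env
    # "^openib" = use every BTL component except openib
    env.setdefault("OMPI_MCA_btl", "^openib")
    return env["OMPI_MCA_btl"]


disable_openib()
# from dolfin import *   # import only after the variable has been set
```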

2.) When I run it through HTCondor, I get a Dijitso error:

------------------- Start compiler output ------------------------
c++: error trying to exec 'cc1plus': execvp: No such file or directory

-------------------  End compiler output  ------------------------
Compilation failed! Sources, command, and errors have been written to: /var/lib/condor/execute/dir_26440/jitfailure-ffc_element_96b054dc61643dc89765c403a3b0fac357e5ae3a
Traceback (most recent call last):
  File "/var/lib/condor/execute/dir_26440/condor_exec.exe", line 57, in <module>
    V = FunctionSpace(mesh, "CG", 1)
  File "/usr/lib/python2.7/dist-packages/dolfin/functions/functionspace.py", line 199, in __init__
    self._init_convenience(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dolfin/functions/functionspace.py", line 249, in _init_convenience
    constrained_domain=constrained_domain)
  File "/usr/lib/python2.7/dist-packages/dolfin/functions/functionspace.py", line 218, in _init_from_ufl
    dolfin_element, dolfin_dofmap = _compile_dolfin_element(element, mesh, constrained_domain=constrained_domain)
  File "/usr/lib/python2.7/dist-packages/dolfin/functions/functionspace.py", line 82, in _compile_dolfin_element
    ufc_element, ufc_dofmap = jit(element, mpi_comm=mesh.mpi_comm())
  File "/usr/lib/python2.7/dist-packages/dolfin/compilemodules/jit.py", line 70, in mpi_jit
    return local_jit(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/dolfin/compilemodules/jit.py", line 147, in jit
    "ffc.jit failed with message:\n%s" % (tb_text,))
  File "/usr/lib/python2.7/dist-packages/dolfin/cpp/common.py", line 2808, in dolfin_error
    return _common.dolfin_error(location, task, reason)
RuntimeError: 

*** -------------------------------------------------------------------------
*** DOLFIN encountered an error. If you are not able to resolve this issue
*** using the information listed below, you can ask for help at
***
***     fenics-support@googlegroups.com
***
*** Remember to include the error message listed below and, if possible,
*** include a *minimal* running example to reproduce the error.
***
*** -------------------------------------------------------------------------
*** Error:   Unable to perform just-in-time compilation of form.
*** Reason:  ffc.jit failed with message:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/dolfin/compilemodules/jit.py", line 142, in jit
    result = ffc.jit(ufl_object, parameters=p)
  File "/usr/lib/python2.7/dist-packages/ffc/jitcompiler.py", line 218, in jit
    module = jit_build(ufl_object, module_name, parameters)
  File "/usr/lib/python2.7/dist-packages/ffc/jitcompiler.py", line 134, in jit_build
    generate=jit_generate)
  File "/usr/lib/python2.7/dist-packages/dijitso/jit.py", line 219, in jit
    % err_info['fail_dir'], err_info)
DijitsoError: Dijitso JIT compilation failed, see '/var/lib/condor/execute/dir_26440/jitfailure-ffc_element_96b054dc61643dc89765c403a3b0fac357e5ae3a' for details
.
*** Where:   This error was encountered inside jit.py.
*** Process: 0
*** 
*** DOLFIN version: 2017.2.0
*** Git changeset:  unknown
*** -------------------------------------------------------------------------
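
The first line of the compiler output says that the `c++` driver was found but could not execute `cc1plus`, the actual C++ compiler of GCC. To see what the job environment looks like on a worker, a small diagnostic like the following could be submitted through HTCondor in place of the real script. It is a sketch that assumes GCC is the compiler and relies on GCC's `-print-prog-name` option. The function name `find_cc1plus` is only illustrative.

```python
from __future__ import print_function

import os
import subprocess


def find_cc1plus(compiler="c++"):
    """Return the absolute path of cc1plus as seen by the driver, or None."""
    try:
        out = subprocess.check_output([compiler, "-print-prog-name=cc1plus"])
    except (OSError, subprocess.CalledProcessError):
        return None  # the driver itself is missing or failed
    path = out.decode().strip()
    # GCC echoes the bare program name back when it cannot locate it
    if os.path.isabs(path) and os.path.exists(path):
        return path
    return None


print("PATH:", os.environ.get("PATH"))
print("cc1plus:", find_cc1plus())
```

If this prints a path when run by hand but `None` under HTCondor, the difference is in the job environment (for example `PATH`) or in the packages installed on the worker image rather than in FEniCS itself.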

I know that on a worker node the Python code is executed by nouser/nogroup. It creates a .cache folder under /tmp, which may be the main problem, because Dijitso tries to perform some operations there but cannot.

However, when the .cache folder has already been created by nouser/nogroup, running this code as another user (without privileges to write into this folder) still works.

I would be thankful for help and a solution. I am losing my hair (almost bald already) over this problem.
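
One way to test the .cache hypothesis would be to stop hard-coding `HOME` to `/tmp` and instead pick a directory that the current user can certainly write to. The sketch below uses only the standard library. The helper name `writable_home` is hypothetical, and the assumption that the JIT cache lives under `$HOME/.cache` comes from my observation above, not from the Dijitso documentation.

```python
import os
import tempfile


def writable_home(home):
    """Return `home` if its .cache directory can be written (or created),
    otherwise a fresh private temporary directory."""
    cache = os.path.join(home, ".cache")
    # If .cache exists, it must be writable; if not, `home` must be,
    # so that .cache can be created.
    target = cache if os.path.isdir(cache) else home
    if os.access(target, os.W_OK):
        return home
    return tempfile.mkdtemp(prefix="fenics-home-")


# Must run before dolfin is imported
os.environ["HOME"] = writable_home("/tmp")
```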
