hwloc is segfaulting
My efforts to run the ET testsuite fail when system_topology.cc attempts to bind threads to CPUs because hwloc_get_obj_by_depth(topology, pu_depth, pu_num) returns NULL. In the attached patch I check for a null return value and print a warning, then skip setting the affinity.
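For readers without the attachment, the guard described above might look roughly like the following. This is only a sketch, not the attached patch: `PuObject` is a hypothetical stand-in for `hwloc_obj_t` so that the example compiles without hwloc, and `lookup` and `bind` are hypothetical callables standing in for `hwloc_get_obj_by_depth` and whatever call system_topology.cc uses to set the affinity.

```cpp
#include <cstdio>

// Hypothetical stand-in for hwloc_obj_t; not the real hwloc type.
struct PuObject {
  int os_index;
};

// Looks up the PU object and binds to it, or warns and skips the binding
// when the lookup returns a null pointer.
// Returns true if bind() was called and succeeded, false otherwise.
// `lookup` and `bind` are assumptions made for this sketch only.
template <typename Lookup, typename Bind>
bool bind_or_skip(Lookup lookup, Bind bind, int pu_depth, int pu_num) {
  const PuObject* pu_obj = lookup(pu_depth, pu_num);
  if (pu_obj == nullptr) {
    // Do not dereference a null object: warn and leave the affinity unset.
    std::fprintf(stderr,
                 "Warning: no PU object at depth %d, index %d; "
                 "not setting thread affinity\n",
                 pu_depth, pu_num);
    return false;
  }
  return bind(*pu_obj);
}
```

With this shape the run continues with unbound threads instead of crashing, at the cost of hiding whichever argument was wrong.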
Comments (11)
-
-
reporter - removed comment
This is my IBM w520 laptop. I used the built-in Cactus hwloc. The NULL return is a documented error value indicating that either zero, or two or more, objects of the requested type exist. I confess I didn't delve into which case it was.
-
- removed comment
I don't see NULL as a documented error value... I find "Returns the topology object at logical index idx from depth depth." as the documentation for hwloc_get_obj_by_depth.
What are the values of "pu_depth" and "pu_num"? Can you post the output of "lstopo" for your system? I assume you are running this under Linux? Is this a virtual machine?
-
- removed comment
http://linux.die.net/man/3/hwloc_get_obj_by_depth says: "If no object for that type exists, NULL is returned. If there are several levels with objects of that type, NULL is returned and ther caller may fallback to hwloc_get_obj_by_depth(). "
-
- removed comment
Replying to [comment:4 knarf]:
http://linux.die.net/man/3/hwloc_get_obj_by_depth says: "and ther caller may fallback to hwloc_get_obj_by_depth(). "
Which is odd - because that is the man page of that very same function.
-
reporter - removed comment
When I said documented error value, I meant the condition that Frank just quoted.
-
- removed comment
This page documents two functions, as many man pages do. The error code is for the second function only.
-
- removed comment
I just looked at this error again. This error occurs when hwloc_get_obj_by_depth is called for the second time. I assume that one of its arguments is wrong. Can you output its arguments pu_depth and pu_num? If pu_num<0 or pu_num>=8, can you also output core_num, smt_multiplier, and pu_offset? thread_offset, thread_num_in_proc, and thread_num would also be interesting.
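A minimal sketch of such a diagnostic, assuming the index is only valid in the range 0 <= pu_num < number of PUs. The helper names and the `num_pus` parameter are hypothetical and appear in neither Cactus nor hwloc. How pu_num is derived from core_num, smt_multiplier, and pu_offset is not shown in this ticket, so those values are left out here.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if pu_num is a valid logical index for a
// level that holds num_pus objects.
bool pu_index_in_range(int pu_num, int num_pus) {
  return pu_num >= 0 && pu_num < num_pus;
}

// Hypothetical helper: builds a one-line report of the lookup arguments,
// flagging an index that falls outside the valid range.
std::string describe_pu_lookup(int pu_depth, int pu_num, int num_pus) {
  std::ostringstream msg;
  msg << "pu_depth=" << pu_depth << " pu_num=" << pu_num
      << " num_pus=" << num_pus
      << (pu_index_in_range(pu_num, num_pus) ? " (in range)"
                                             : " (OUT OF RANGE)");
  return msg.str();
}
```

Printing this line just before the lookup would show whether the NULL comes from an out-of-range index or from something else.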
-
reporter - removed comment
So I modified my code to look for these conditions, print a more useful error message, then abort. Since doing that, the error has not recurred. Not sure why. Maybe we can close the ticket.
-
reporter - changed status to resolved
- removed comment
-
- edited description
- changed status to closed
What system is this? What version of hwloc is this?
Is a NULL return value documented as an error return, or did we find an error in hwloc here? (I'm asking because the hwloc developers are very keen on receiving bug reports.)