MemSpeed stencil performance test fails with assert() failure on Stampede2 KNL

Issue #2480 (new), created by Roland Haas

The MemSpeed test fails with:

INFO (hwloc): library version 1.11.8, API version 0x10b00
INFO (SystemTopology): MPI process-to-host mapping:
This is MPI process 0 of 2
MPI hosts:
  0: c405-073.stampede2.tacc.utexas.edu
This MPI process runs on host 0 of 1
On this host, this is MPI process 0 of 2
INFO (SystemTopology): Topology support:
Discovery support:
  discovery->pu                            : yes
CPU binding support:
  cpubind->set_thisproc_cpubind            : yes
  cpubind->get_thisproc_cpubind            : yes
  cpubind->set_proc_cpubind                : yes
  cpubind->get_proc_cpubind                : yes
  cpubind->set_thisthread_cpubind          : yes
  cpubind->get_thisthread_cpubind          : yes
  cpubind->set_thread_cpubind              : yes
  cpubind->get_thread_cpubind              : yes
  cpubind->get_thisproc_last_cpu_location  : yes
  cpubind->get_proc_last_cpu_location      : yes
  cpubind->get_thisthread_last_cpu_location: yes
Memory binding support:
  membind->set_thisproc_membind            : no
  membind->get_thisproc_membind            : no
  membind->set_proc_membind                : no
  membind->get_proc_membind                : no
  membind->set_thisthread_membind          : yes
  membind->get_thisthread_membind          : yes
  membind->set_area_membind                : yes
  membind->get_area_membind                : yes
  membind->alloc_membind                   : yes
  membind->firsttouch_membind              : yes
  membind->bind_membind                    : yes
  membind->interleave_membind              : yes
  membind->replicate_membind               : no
  membind->nexttouch_membind               : no
  membind->migrate_membind                 : yes
INFO (SystemTopology): Hardware objects in this node:
Machine L#0: (P#0, total=100556096KB, DMIProductName="PowerEdge C6320p", DMIProductVersion=, DMIBoardVendor="Dell Inc.", DMIBoardName=0D2RR8, DMIBoardVersion=A00, DMIBoardAssetTag=, DMIChassisVendor="Dell Inc.", DMIChassisType=23, DMIChassisVersion=, DMIChassisAssetTag=, DMIBIOSVendor="Dell Inc.", DMIBIOSVersion=2.3.0, DMIBIOSDate=01/07/2020, DMISysVendor="Dell Inc.", Backend=Linux, LinuxCgroup=/, OSName=Linux, OSRelease=3.10.0-957.5.1.el7.x86_64, OSVersion="#1 SMP Fri Feb 1 14:54:57 UTC 2019", HostName=c405-073.stampede2.tacc.utexas.edu, Architecture=x86_64, hwlocVersion=1.11.8, ProcessName=cactus_sim)
  NUMANode L#0: (P#0, local=100556096KB, total=100556096KB)
    Package L#0: (P#0, CPUVendor=GenuineIntel, CPUFamilyNumber=6, CPUModelNumber=87, CPUModel="Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz", CPUStepping=1)
      L2Cache L#0: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#0: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#0: (P#0)
            PU L#0: (P#0)
            PU L#1: (P#68)
            PU L#2: (P#136)
            PU L#3: (P#204)
        L1dCache L#1: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#1: (P#1)
            PU L#4: (P#1)
            PU L#5: (P#69)
            PU L#6: (P#137)
            PU L#7: (P#205)
      L2Cache L#1: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#2: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#2: (P#2)
            PU L#8: (P#2)
            PU L#9: (P#70)
            PU L#10: (P#138)
            PU L#11: (P#206)
        L1dCache L#3: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#3: (P#3)
            PU L#12: (P#3)
            PU L#13: (P#71)
            PU L#14: (P#139)
            PU L#15: (P#207)
      L2Cache L#2: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#4: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#4: (P#6)
            PU L#16: (P#4)
            PU L#17: (P#72)
            PU L#18: (P#140)
            PU L#19: (P#208)
        L1dCache L#5: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#5: (P#7)
            PU L#20: (P#5)
            PU L#21: (P#73)
            PU L#22: (P#141)
            PU L#23: (P#209)
      L2Cache L#3: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#6: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#6: (P#8)
            PU L#24: (P#6)
            PU L#25: (P#74)
            PU L#26: (P#142)
            PU L#27: (P#210)
        L1dCache L#7: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#7: (P#9)
            PU L#28: (P#7)
            PU L#29: (P#75)
            PU L#30: (P#143)
            PU L#31: (P#211)
      L2Cache L#4: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#8: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#8: (P#10)
            PU L#32: (P#8)
            PU L#33: (P#76)
            PU L#34: (P#144)
            PU L#35: (P#212)
        L1dCache L#9: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#9: (P#11)
            PU L#36: (P#9)
            PU L#37: (P#77)
            PU L#38: (P#145)
            PU L#39: (P#213)
      L2Cache L#5: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#10: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#10: (P#12)
            PU L#40: (P#10)
            PU L#41: (P#78)
            PU L#42: (P#146)
            PU L#43: (P#214)
        L1dCache L#11: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#11: (P#13)
            PU L#44: (P#11)
            PU L#45: (P#79)
            PU L#46: (P#147)
            PU L#47: (P#215)
      L2Cache L#6: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#12: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#12: (P#14)
            PU L#48: (P#12)
            PU L#49: (P#80)
            PU L#50: (P#148)
            PU L#51: (P#216)
        L1dCache L#13: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#13: (P#15)
            PU L#52: (P#13)
            PU L#53: (P#81)
            PU L#54: (P#149)
            PU L#55: (P#217)
      L2Cache L#7: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#14: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#14: (P#16)
            PU L#56: (P#14)
            PU L#57: (P#82)
            PU L#58: (P#150)
            PU L#59: (P#218)
        L1dCache L#15: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#15: (P#17)
            PU L#60: (P#15)
            PU L#61: (P#83)
            PU L#62: (P#151)
            PU L#63: (P#219)
      L2Cache L#8: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#16: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#16: (P#18)
            PU L#64: (P#16)
            PU L#65: (P#84)
            PU L#66: (P#152)
            PU L#67: (P#220)
        L1dCache L#17: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#17: (P#19)
            PU L#68: (P#17)
            PU L#69: (P#85)
            PU L#70: (P#153)
            PU L#71: (P#221)
      L2Cache L#9: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#18: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#18: (P#20)
            PU L#72: (P#18)
            PU L#73: (P#86)
            PU L#74: (P#154)
            PU L#75: (P#222)
        L1dCache L#19: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#19: (P#21)
            PU L#76: (P#19)
            PU L#77: (P#87)
            PU L#78: (P#155)
            PU L#79: (P#223)
      L2Cache L#10: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#20: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#20: (P#22)
            PU L#80: (P#20)
            PU L#81: (P#88)
            PU L#82: (P#156)
            PU L#83: (P#224)
        L1dCache L#21: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#21: (P#23)
            PU L#84: (P#21)
            PU L#85: (P#89)
            PU L#86: (P#157)
            PU L#87: (P#225)
      L2Cache L#11: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#22: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#22: (P#24)
            PU L#88: (P#22)
            PU L#89: (P#90)
            PU L#90: (P#158)
            PU L#91: (P#226)
        L1dCache L#23: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#23: (P#25)
            PU L#92: (P#23)
            PU L#93: (P#91)
            PU L#94: (P#159)
            PU L#95: (P#227)
      L2Cache L#12: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#24: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#24: (P#26)
            PU L#96: (P#24)
            PU L#97: (P#92)
            PU L#98: (P#160)
            PU L#99: (P#228)
        L1dCache L#25: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#25: (P#27)
            PU L#100: (P#25)
            PU L#101: (P#93)
            PU L#102: (P#161)
            PU L#103: (P#229)
      L2Cache L#13: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#26: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#26: (P#28)
            PU L#104: (P#26)
            PU L#105: (P#94)
            PU L#106: (P#162)
            PU L#107: (P#230)
        L1dCache L#27: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#27: (P#29)
            PU L#108: (P#27)
            PU L#109: (P#95)
            PU L#110: (P#163)
            PU L#111: (P#231)
      L2Cache L#14: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#28: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#28: (P#30)
            PU L#112: (P#28)
            PU L#113: (P#96)
            PU L#114: (P#164)
            PU L#115: (P#232)
        L1dCache L#29: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#29: (P#31)
            PU L#116: (P#29)
            PU L#117: (P#97)
            PU L#118: (P#165)
            PU L#119: (P#233)
      L2Cache L#15: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#30: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#30: (P#32)
            PU L#120: (P#30)
            PU L#121: (P#98)
            PU L#122: (P#166)
            PU L#123: (P#234)
        L1dCache L#31: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#31: (P#33)
            PU L#124: (P#31)
            PU L#125: (P#99)
            PU L#126: (P#167)
            PU L#127: (P#235)
      L2Cache L#16: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#32: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#32: (P#34)
            PU L#128: (P#32)
            PU L#129: (P#100)
            PU L#130: (P#168)
            PU L#131: (P#236)
        L1dCache L#33: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#33: (P#35)
            PU L#132: (P#33)
            PU L#133: (P#101)
            PU L#134: (P#169)
            PU L#135: (P#237)
      L2Cache L#17: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#34: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#34: (P#36)
            PU L#136: (P#34)
            PU L#137: (P#102)
            PU L#138: (P#170)
            PU L#139: (P#238)
        L1dCache L#35: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#35: (P#37)
            PU L#140: (P#35)
            PU L#141: (P#103)
            PU L#142: (P#171)
            PU L#143: (P#239)
      L2Cache L#18: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#36: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#36: (P#38)
            PU L#144: (P#36)
            PU L#145: (P#104)
            PU L#146: (P#172)
            PU L#147: (P#240)
        L1dCache L#37: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#37: (P#39)
            PU L#148: (P#37)
            PU L#149: (P#105)
            PU L#150: (P#173)
            PU L#151: (P#241)
      L2Cache L#19: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#38: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#38: (P#40)
            PU L#152: (P#38)
            PU L#153: (P#106)
            PU L#154: (P#174)
            PU L#155: (P#242)
        L1dCache L#39: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#39: (P#41)
            PU L#156: (P#39)
            PU L#157: (P#107)
            PU L#158: (P#175)
            PU L#159: (P#243)
      L2Cache L#20: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#40: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#40: (P#42)
            PU L#160: (P#40)
            PU L#161: (P#108)
            PU L#162: (P#176)
            PU L#163: (P#244)
        L1dCache L#41: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#41: (P#43)
            PU L#164: (P#41)
            PU L#165: (P#109)
            PU L#166: (P#177)
            PU L#167: (P#245)
      L2Cache L#21: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#42: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#42: (P#44)
            PU L#168: (P#42)
            PU L#169: (P#110)
            PU L#170: (P#178)
            PU L#171: (P#246)
        L1dCache L#43: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#43: (P#45)
            PU L#172: (P#43)
            PU L#173: (P#111)
            PU L#174: (P#179)
            PU L#175: (P#247)
      L2Cache L#22: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#44: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#44: (P#46)
            PU L#176: (P#44)
            PU L#177: (P#112)
            PU L#178: (P#180)
            PU L#179: (P#248)
        L1dCache L#45: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#45: (P#47)
            PU L#180: (P#45)
            PU L#181: (P#113)
            PU L#182: (P#181)
            PU L#183: (P#249)
      L2Cache L#23: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#46: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#46: (P#48)
            PU L#184: (P#46)
            PU L#185: (P#114)
            PU L#186: (P#182)
            PU L#187: (P#250)
        L1dCache L#47: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#47: (P#49)
            PU L#188: (P#47)
            PU L#189: (P#115)
            PU L#190: (P#183)
            PU L#191: (P#251)
      L2Cache L#24: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#48: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#48: (P#50)
            PU L#192: (P#48)
            PU L#193: (P#116)
            PU L#194: (P#184)
            PU L#195: (P#252)
        L1dCache L#49: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#49: (P#51)
            PU L#196: (P#49)
            PU L#197: (P#117)
            PU L#198: (P#185)
            PU L#199: (P#253)
      L2Cache L#25: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#50: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#50: (P#52)
            PU L#200: (P#50)
            PU L#201: (P#118)
            PU L#202: (P#186)
            PU L#203: (P#254)
        L1dCache L#51: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#51: (P#53)
            PU L#204: (P#51)
            PU L#205: (P#119)
            PU L#206: (P#187)
            PU L#207: (P#255)
      L2Cache L#26: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#52: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#52: (P#58)
            PU L#208: (P#52)
            PU L#209: (P#120)
            PU L#210: (P#188)
            PU L#211: (P#256)
        L1dCache L#53: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#53: (P#59)
            PU L#212: (P#53)
            PU L#213: (P#121)
            PU L#214: (P#189)
            PU L#215: (P#257)
      L2Cache L#27: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#54: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#54: (P#60)
            PU L#216: (P#54)
            PU L#217: (P#122)
            PU L#218: (P#190)
            PU L#219: (P#258)
        L1dCache L#55: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#55: (P#61)
            PU L#220: (P#55)
            PU L#221: (P#123)
            PU L#222: (P#191)
            PU L#223: (P#259)
      L2Cache L#28: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#56: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#56: (P#62)
            PU L#224: (P#56)
            PU L#225: (P#124)
            PU L#226: (P#192)
            PU L#227: (P#260)
        L1dCache L#57: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#57: (P#63)
            PU L#228: (P#57)
            PU L#229: (P#125)
            PU L#230: (P#193)
            PU L#231: (P#261)
      L2Cache L#29: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#58: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#58: (P#64)
            PU L#232: (P#58)
            PU L#233: (P#126)
            PU L#234: (P#194)
            PU L#235: (P#262)
        L1dCache L#59: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#59: (P#65)
            PU L#236: (P#59)
            PU L#237: (P#127)
            PU L#238: (P#195)
            PU L#239: (P#263)
      L2Cache L#30: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#60: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#60: (P#66)
            PU L#240: (P#60)
            PU L#241: (P#128)
            PU L#242: (P#196)
            PU L#243: (P#264)
        L1dCache L#61: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#61: (P#67)
            PU L#244: (P#61)
            PU L#245: (P#129)
            PU L#246: (P#197)
            PU L#247: (P#265)
      L2Cache L#31: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#62: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#62: (P#68)
            PU L#248: (P#62)
            PU L#249: (P#130)
            PU L#250: (P#198)
            PU L#251: (P#266)
        L1dCache L#63: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#63: (P#69)
            PU L#252: (P#63)
            PU L#253: (P#131)
            PU L#254: (P#199)
            PU L#255: (P#267)
      L2Cache L#32: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#64: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#64: (P#70)
            PU L#256: (P#64)
            PU L#257: (P#132)
            PU L#258: (P#200)
            PU L#259: (P#268)
        L1dCache L#65: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#65: (P#71)
            PU L#260: (P#65)
            PU L#261: (P#133)
            PU L#262: (P#201)
            PU L#263: (P#269)
      L2Cache L#33: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
        L1dCache L#66: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#66: (P#72)
            PU L#264: (P#66)
            PU L#265: (P#134)
            PU L#266: (P#202)
            PU L#267: (P#270)
        L1dCache L#67: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
          Core L#67: (P#73)
            PU L#268: (P#67)
            PU L#269: (P#135)
            PU L#270: (P#203)
            PU L#271: (P#271)
INFO (SystemTopology): Thread CPU bindings:
  MPI process 0 on host 0 (process 0 of 2 on this host)
    OpenMP thread 0: PU set L#{0} P#{0}
    OpenMP thread 1: PU set L#{4} P#{1}
    OpenMP thread 2: PU set L#{8} P#{2}
    OpenMP thread 3: PU set L#{12} P#{3}
    OpenMP thread 4: PU set L#{16} P#{4}
    OpenMP thread 5: PU set L#{20} P#{5}
    OpenMP thread 6: PU set L#{24} P#{6}
    OpenMP thread 7: PU set L#{28} P#{7}
    OpenMP thread 8: PU set L#{32} P#{8}
    OpenMP thread 9: PU set L#{36} P#{9}
    OpenMP thread 10: PU set L#{40} P#{10}
    OpenMP thread 11: PU set L#{44} P#{11}
    OpenMP thread 12: PU set L#{48} P#{12}
    OpenMP thread 13: PU set L#{52} P#{13}
    OpenMP thread 14: PU set L#{56} P#{14}
    OpenMP thread 15: PU set L#{60} P#{15}
    OpenMP thread 16: PU set L#{64} P#{16}
  MPI process 1 on host 0 (process 1 of 2 on this host)
    OpenMP thread 0: PU set L#{136} P#{34}
    OpenMP thread 1: PU set L#{140} P#{35}
    OpenMP thread 2: PU set L#{144} P#{36}
    OpenMP thread 3: PU set L#{148} P#{37}
    OpenMP thread 4: PU set L#{152} P#{38}
    OpenMP thread 5: PU set L#{156} P#{39}
    OpenMP thread 6: PU set L#{160} P#{40}
    OpenMP thread 7: PU set L#{164} P#{41}
    OpenMP thread 8: PU set L#{168} P#{42}
    OpenMP thread 9: PU set L#{172} P#{43}
    OpenMP thread 10: PU set L#{176} P#{44}
    OpenMP thread 11: PU set L#{180} P#{45}
    OpenMP thread 12: PU set L#{184} P#{46}
    OpenMP thread 13: PU set L#{188} P#{47}
    OpenMP thread 14: PU set L#{192} P#{48}
    OpenMP thread 15: PU set L#{196} P#{49}
    OpenMP thread 16: PU set L#{200} P#{50}
INFO (SystemTopology): Setting thread CPU bindings:
INFO (SystemTopology): Thread CPU bindings:
  MPI process 0 on host 0 (process 0 of 2 on this host)
    OpenMP thread 0: PU set L#{0} P#{0}
    OpenMP thread 1: PU set L#{4} P#{1}
    OpenMP thread 2: PU set L#{8} P#{2}
    OpenMP thread 3: PU set L#{12} P#{3}
    OpenMP thread 4: PU set L#{16} P#{4}
    OpenMP thread 5: PU set L#{20} P#{5}
    OpenMP thread 6: PU set L#{24} P#{6}
    OpenMP thread 7: PU set L#{28} P#{7}
    OpenMP thread 8: PU set L#{32} P#{8}
    OpenMP thread 9: PU set L#{36} P#{9}
    OpenMP thread 10: PU set L#{40} P#{10}
    OpenMP thread 11: PU set L#{44} P#{11}
    OpenMP thread 12: PU set L#{48} P#{12}
    OpenMP thread 13: PU set L#{52} P#{13}
    OpenMP thread 14: PU set L#{56} P#{14}
    OpenMP thread 15: PU set L#{60} P#{15}
    OpenMP thread 16: PU set L#{64} P#{16}
  MPI process 1 on host 0 (process 1 of 2 on this host)
    OpenMP thread 0: PU set L#{68} P#{17}
    OpenMP thread 1: PU set L#{72} P#{18}
    OpenMP thread 2: PU set L#{76} P#{19}
    OpenMP thread 3: PU set L#{80} P#{20}
    OpenMP thread 4: PU set L#{84} P#{21}
    OpenMP thread 5: PU set L#{88} P#{22}
    OpenMP thread 6: PU set L#{92} P#{23}
    OpenMP thread 7: PU set L#{96} P#{24}
    OpenMP thread 8: PU set L#{100} P#{25}
    OpenMP thread 9: PU set L#{104} P#{26}
    OpenMP thread 10: PU set L#{108} P#{27}
    OpenMP thread 11: PU set L#{112} P#{28}
    OpenMP thread 12: PU set L#{116} P#{29}
    OpenMP thread 13: PU set L#{120} P#{30}
    OpenMP thread 14: PU set L#{124} P#{31}
    OpenMP thread 15: PU set L#{128} P#{32}
    OpenMP thread 16: PU set L#{132} P#{33}
INFO (SystemTopology): Extracting CPU/cache/memory properties:
  There are 4 PUs per core (aka hardware SMT threads)
  There are 1 threads per core (aka SMT threads used)
  Cache (unknown name) has type "data" depth 1
    size 32768 linesize 64 associativity 8 stride 4096, for 4 PUs
  Cache (unknown name) has type "unified" depth 2
    size 1048576 linesize 64 associativity 16 stride 65536, for 8 PUs
  Memory has type "local" depth 1
    size 102969442304 pagesize 4096, for 272 PUs
INFO (PUGH): Using physical to logical mappings: direct
INFO (PUGH): Using topology generator: automatic
INFO (Vectors): Using vector size 8 for architecture AVX512+AVX512ER (64-bit precision)
--------------------------------------------------------------------------------
Driver provided by PUGH
--------------------------------------------------------------------------------

INFO (PUGH): Not setting up a topology for 1 dimensions
INFO (PUGH): Not setting up a topology for 2 dimensions
INFO (PUGH): Setting up a topology for 3 dimensions
INFO (IOASCII): I/O Method 'IOASCII_1D' registered: output of 1D lines of grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 1D output every 1 iterations
INFO (IOASCII): Periodic 1D output requested for 'GRID::r'
INFO (IOASCII): I/O Method 'IOASCII_2D' registered: output of 2D planes of grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 2D output turned off
INFO (IOASCII): I/O Method 'IOASCII_3D' registered: output of 3D grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 3D output turned off
INFO (MemSpeed): Measuring CPU, cache, memory, and communication speeds:
  Single-core measurements (using 1 MPI processes with 1 OpenMP threads each):
    CPU frequency:
      iterations=1000000... time=0.00407195 sec
      iterations=10000000... time=0.0405741 sec
      iterations=100000000... time=0.40577 sec
      iterations=300000000... time=1.21312 sec
      iterations=300000000... time=1.2129 sec
      result: 2703.1 GHz
    CPU floating point performance:
      iterations=1000000... time=0.00408101 sec
      iterations=10000000... time=0.0404639 sec
      iterations=100000000... time=0.404346 sec
      iterations=300000000... time=1.213 sec
      result: 31.6571 Gflop/sec
    CPU integer performance:
      iterations=1000000... time=0.00811601 sec
      iterations=10000000... time=0.0809209 sec
      iterations=100000000... time=0.808586 sec
      iterations=200000000... time=1.61725 sec
      result: 1.97867 Giop/sec
    Memory allocation performance for local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1... time=0.328679 sec
      iterations=3... time=0.938123 sec
      iterations=6... time=1.87499 sec
      result: 3.43599 GByte/sec
    Read latency of D1 cache (for 4 PUs) (using 1*24576 bytes):
      iterations=1000... time=0.00032711 sec
      iterations=10000... time=0.00273204 sec
      iterations=100000... time=0.0271342 sec
      iterations=1000000... time=0.270517 sec
      iterations=4000000... time=1.08053 sec
      result: 2.70133 nsec
    Read latency of L2 cache (for 8 PUs) (using 1*786432 bytes):
      iterations=1000... time=0.00294495 sec
      iterations=10000... time=0.0140121 sec
      iterations=100000... time=0.136882 sec
      iterations=800000... time=1.02385 sec
      result: 12.7981 nsec
    Read latency of local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1000... time=0.017643 sec
      iterations=10000... time=0.176322 sec
      iterations=60000... time=1.05638 sec
      result: 176.063 nsec
    Read bandwidth of D1 cache (for 4 PUs) (using 1*24576 bytes):
      iterations=1... time=2.14577e-06 sec
      iterations=10... time=4.91142e-05 sec
      iterations=100... time=3.60012e-05 sec
      iterations=1000... time=0.000230074 sec
      iterations=10000... time=0.00199389 sec
      iterations=100000... time=0.019356 sec
      iterations=1000000... time=0.192889 sec
      iterations=6000000... time=1.15808 sec
      result: 127.328 GByte/sec
    Read bandwidth of L2 cache (for 8 PUs) (using 1*786432 bytes):
      iterations=1... time=0.00011301 sec
      iterations=10... time=0.000781059 sec
      iterations=100... time=0.00484395 sec
      iterations=1000... time=0.0321031 sec
      iterations=10000... time=0.291711 sec
      iterations=40000... time=1.07934 sec
      result: 29.1449 GByte/sec
    Read bandwidth of local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1... time=0.139124 sec
      iterations=8... time=1.11338 sec
      result: 7.71522 GByte/sec
    Write latency of D1 cache (for 4 PUs) (using 1*24576 bytes):
      iterations=1000... time=1.97887e-05 sec
      iterations=10000... time=0.000127792 sec
      iterations=100000... time=0.00105596 sec
      iterations=1000000... time=0.010298 sec
      iterations=10000000... time=0.103325 sec
      iterations=100000000... time=1.03489 sec
      result: 1.29361 nsec
    Write latency of L2 cache (for 8 PUs) (using 1*786432 bytes):
      iterations=1000... time=9.10759e-05 sec
      iterations=10000... time=0.000571966 sec
      iterations=100000... time=0.00375795 sec
      iterations=1000000... time=0.0348961 sec
      iterations=10000000... time=0.320716 sec
      iterations=30000000... time=0.948481 sec
      iterations=60000000... time=1.89937 sec
      result: 3.95703 nsec
    Write latency of local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1000... time=0.000288963 sec
      iterations=10000... time=0.00225806 sec
      iterations=100000... time=0.022434 sec
      iterations=1000000... time=0.222748 sec
      iterations=5000000... time=1.11351 sec
      result: 27.8377 nsec
    Write bandwidth of D1 cache (for 4 PUs) (using 1*24576 bytes):
      iterations=1... time=4.05312e-06 sec
      iterations=10... time=1.4782e-05 sec
      iterations=100... time=8.39233e-05 sec
      iterations=1000... time=0.000637054 sec
      iterations=10000... time=0.00622296 sec
      iterations=100000... time=0.0612509 sec
      iterations=1000000... time=0.61436 sec
      iterations=2000000... time=1.22711 sec
      result: 40.0552 GByte/sec
    Write bandwidth of L2 cache (for 8 PUs) (using 1*786432 bytes):
      iterations=1... time=0.000114918 sec
      iterations=10... time=0.00090003 sec
      iterations=100... time=0.00708699 sec
      iterations=1000... time=0.0582938 sec
      iterations=10000... time=0.526602 sec
      iterations=20000... time=1.02799 sec
      result: 15.3003 GByte/sec
    Write bandwidth of local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1... time=0.133248 sec
      iterations=8... time=1.0641 sec
      result: 8.07247 GByte/sec
    Write bandwidth via cache-bypassing stores for local memory (for 272 PUs) (using 1*1073741824 bytes):
      iterations=1... time=0.132953 sec
      iterations=8... time=1.06518 sec
      result: 8.0643 GByte/sec
    Stencil code performance of D1 cache (for 4 PUs) (using 1*12^3 grid points, 1*27648 bytes):
      iterations=1... time=7.48634e-05 sec
      iterations=10... time=0.000219107 sec
      iterations=100... time=0.00188708 sec
      iterations=1000... time=0.0183041 sec
      iterations=10000... time=0.182543 sec
      iterations=60000... time=1.09559 sec
      result: 0.094634 Gupdates/sec
    Stencil code performance of L2 cache (for 8 PUs) (using 1*37^3 grid points, 1*810448 bytes):
      iterations=1... time=0.000602961 sec
      iterations=10... time=0.00587201 sec
      iterations=100... time=0.0577371 sec
      iterations=1000... time=0.575167 sec
      iterations=2000... time=1.14345 sec
      result: 0.0885966 Gupdates/sec
    Stencil code performance of local memory (for 272 PUs) (using 1*406^3 grid points, 1*1070774656 bytes):
      iterations=1... time=0.813491 sec
      iterations=2... time=1.62696 sec
      result: 0.0822682 Gupdates/sec
  Single-node measurements (using 2 MPI processes with 17 OpenMP threads each):
    CPU frequency:
      iterations=1000000... time=0.00417674 sec
      iterations=10000000... time=0.0405164 sec
      iterations=100000000... time=0.404582 sec
      iterations=300000000... time=1.2133 sec
      iterations=300000000... time=1.21315 sec
      result: 3941.4 GHz
    CPU floating point performance:
      iterations=1000000... time=0.00405846 sec
      iterations=10000000... time=0.0404528 sec
      iterations=100000000... time=0.404897 sec
      iterations=300000000... time=1.21318 sec
      result: 31.6523 Gflop/sec
    CPU integer performance:
      iterations=1000000... time=0.00815167 sec
      iterations=10000000... time=0.0808692 sec
      iterations=100000000... time=0.80888 sec
      iterations=200000000... time=1.61778 sec
      result: 1.97802 Giop/sec
    Memory allocation performance for local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1... time=0.0189642 sec
      iterations=10... time=0.180968 sec
      iterations=60... time=1.08605 sec
      result: 59.3198 GByte/sec
    Read latency of D1 cache (for 4 PUs) (using 34*24576 bytes):
      iterations=1000... time=0.000294622 sec
      iterations=10000... time=0.00271222 sec
      iterations=100000... time=0.0270828 sec
      iterations=1000000... time=0.270048 sec
      iterations=4000000... time=1.08007 sec
      result: 2.70017 nsec
    Read latency of L2 cache (for 8 PUs) (using 18*786432 bytes):
      iterations=1000... time=0.00887088 sec
      iterations=10000... time=0.0865954 sec
      iterations=100000... time=0.839289 sec
      iterations=200000... time=1.67403 sec
      result: 83.7014 nsec
    Read latency of local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1000... time=0.0175604 sec
      iterations=10000... time=0.180401 sec
      iterations=60000... time=1.05829 sec
      result: 176.382 nsec
    Read bandwidth of D1 cache (for 4 PUs) (using 34*24576 bytes):
      iterations=1... time=1.64088e-06 sec
      iterations=10... time=3.12749e-06 sec
      iterations=100... time=2.02445e-05 sec
      iterations=1000... time=0.000192306 sec
      iterations=10000... time=0.00193771 sec
      iterations=100000... time=0.019337 sec
      iterations=1000000... time=0.193137 sec
      iterations=6000000... time=1.15795 sec
      result: 127.342 GByte/sec
    Read bandwidth of L2 cache (for 8 PUs) (using 18*786432 bytes):
      iterations=1... time=0.000100276 sec
      iterations=10... time=0.000923627 sec
      iterations=100... time=0.00846056 sec
      iterations=1000... time=0.0819019 sec
      iterations=10000... time=0.819235 sec
      iterations=20000... time=1.63804 sec
      result: 9.60213 GByte/sec
    Read bandwidth of local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1... time=0.261908 sec
      iterations=4... time=1.05583 sec
      result: 4.06788 GByte/sec
    Write latency of D1 cache (for 4 PUs) (using 34*24576 bytes):
      iterations=1000... time=2.83367e-05 sec
      iterations=10000... time=0.000103172 sec
      iterations=100000... time=0.00106769 sec
      iterations=1000000... time=0.0110052 sec
      iterations=10000000... time=0.108229 sec
      iterations=100000000... time=1.07058 sec
      result: 1.33823 nsec
    Write latency of L2 cache (for 8 PUs) (using 18*786432 bytes):
      iterations=1000... time=0.000101798 sec
      iterations=10000... time=0.000947525 sec
      iterations=100000... time=0.00854621 sec
      iterations=1000000... time=0.0806903 sec
      iterations=10000000... time=0.791932 sec
      iterations=20000000... time=1.58361 sec
      result: 9.89757 nsec
    Write latency of local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1000... time=0.000648618 sec
      iterations=10000... time=0.00287849 sec
      iterations=100000... time=0.0226344 sec
      iterations=1000000... time=0.219735 sec
      iterations=5000000... time=1.10478 sec
      result: 27.6194 nsec
    Write bandwidth of D1 cache (for 4 PUs) (using 34*24576 bytes):
      iterations=1... time=1.99851e-06 sec
      iterations=10... time=7.82574e-06 sec
      iterations=100... time=6.25218e-05 sec
      iterations=1000... time=0.000607806 sec
      iterations=10000... time=0.00614851 sec
      iterations=100000... time=0.0613123 sec
      iterations=1000000... time=0.613361 sec
      iterations=2000000... time=1.22768 sec
      result: 40.0365 GByte/sec
    Write bandwidth of L2 cache (for 8 PUs) (using 18*786432 bytes):
      iterations=1... time=0.000110276 sec
      iterations=10... time=0.00112304 sec
      iterations=100... time=0.0102063 sec
      iterations=1000... time=0.0973017 sec
      iterations=10000... time=0.977225 sec
      iterations=20000... time=1.94673 sec
      result: 8.07953 GByte/sec
    Write bandwidth of local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1... time=0.192148 sec
      iterations=6... time=1.17048 sec
      result: 5.50413 GByte/sec
    Write bandwidth via cache-bypassing stores for local memory (for 272 PUs) (using 2*1073741824 bytes):
      iterations=1... time=0.192612 sec
      iterations=6... time=1.17092 sec
      result: 5.50206 GByte/sec
cactus_sim: /work/00945/rhaas/stampede2/ET_Next/arrangements/CactusUtils/MemSpeed/src/memspeed.cc:1355: void <unnamed>::measure_stencil_performance(int, int): Assertion `num_allocs % num_active_procs == 0' failed.
cactus_sim: /work/00945/rhaas/stampede2/ET_Next/arrangements/CactusUtils/MemSpeed/src/memspeed.cc:1355: void <unnamed>::measure_stencil_performance(int, int): Assertion `num_allocs % num_active_procs == 0' failed.

and the code that fails is:

 assert(num_allocs % num_active_procs == 0);

where num_allocs is defined as const int num_allocs = (total_pus + mem_pus - 1) / mem_pus and num_active_procs as const int num_active_procs = small_cache ? comm_size : 1. Since the assert failed, num_active_procs must equal comm_size here: with num_active_procs == 1 the remainder would always be 0.

This happens in the test that uses 2 MPI ranks with 17 threads each (so an odd number) on the KNL section of Stampede2.
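
The arithmetic behind the failed assertion can be sketched as follows. This is only an illustration: the values of total_pus, mem_pus and comm_size are assumptions chosen to produce an odd num_allocs, not numbers taken from the run above, and round_up_to_multiple is a hypothetical helper that does not exist in MemSpeed. It shows one possible way to make the count divisible, not necessarily the right fix for the thorn.

```cpp
#include <cassert>

// Ceiling division, same formula as quoted for num_allocs in the issue:
//   (total_pus + mem_pus - 1) / mem_pus
constexpr int ceil_div(int total_pus, int mem_pus) {
  return (total_pus + mem_pus - 1) / mem_pus;
}

// Hypothetical helper (NOT part of MemSpeed): round n up to the next
// multiple of m, so that n % m == 0 holds afterwards.
constexpr int round_up_to_multiple(int n, int m) {
  return (n + m - 1) / m * m;
}

// Illustrative values only (assumptions, not read from the log).
constexpr int total_pus = 34; // e.g. 2 processes with 17 threads each
constexpr int mem_pus = 2;    // assumed number of PUs sharing one cache
constexpr int comm_size = 2;  // MPI processes on the node

constexpr int num_allocs = ceil_div(total_pus, mem_pus);

// An odd num_allocs cannot be split evenly over 2 processes, which is the
// condition the assert in measure_stencil_performance rejects.
static_assert(num_allocs % comm_size != 0,
              "illustrative values give a non-divisible allocation count");

// After rounding up, the divisibility condition holds.
static_assert(round_up_to_multiple(num_allocs, comm_size) % comm_size == 0,
              "rounded count is divisible by the number of processes");
```

Whether rounding up (rather than, say, skipping the measurement) is appropriate depends on how the allocations are distributed over the processes, which the snippet above does not model.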

I also noticed that there is some code later on that reads:

    if (comm_size > num_active_procs ||
        (skip_memsize + memsize) * comm_size > node_memory * 3 / 4) {
      printf("      [skipped -- too many MPI processes]\n");
      continue;
    }

which triggers whenever num_active_procs is 1 and more than one MPI process is used (i.e. not the case that hits the assert).
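
To make the behaviour of that condition explicit, here is a minimal sketch that wraps it in a function. The function name is_skipped and the parameter types are assumptions made for illustration (the issue does not show the declarations in memspeed.cc), and the memory sizes in the checks are arbitrary unit-free numbers.

```cpp
#include <cassert>
#include <cstddef>

// Mirrors the skip condition quoted above. The name and the parameter
// types are assumptions; only the boolean expression is from the issue.
constexpr bool is_skipped(int comm_size, int num_active_procs,
                          std::ptrdiff_t skip_memsize, std::ptrdiff_t memsize,
                          std::ptrdiff_t node_memory) {
  return comm_size > num_active_procs ||
         (skip_memsize + memsize) * comm_size > node_memory * 3 / 4;
}

// num_active_procs == 1 with 2 processes: skipped regardless of memory.
static_assert(is_skipped(2, 1, 0, 1, 1000), "skipped: too many processes");

// num_active_procs == comm_size and little memory used: not skipped.
static_assert(!is_skipped(2, 2, 0, 1, 1000), "measurement runs");

// num_active_procs == comm_size but more than 3/4 of node memory needed.
static_assert(is_skipped(2, 2, 0, 400, 1000), "skipped: memory limit");
```

So with more than one MPI process the measurement only runs when num_active_procs equals comm_size, which is exactly the branch where the assert can fail.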

This particular test already failed in the Turing release.
