MemSpeed stencil performance test fails with assert() failure on Stampede2 KNL
Issue #2480
new
The MemSpeed test fails with:
INFO (hwloc): library version 1.11.8, API version 0x10b00
INFO (SystemTopology): MPI process-to-host mapping:
This is MPI process 0 of 2
MPI hosts:
0: c405-073.stampede2.tacc.utexas.edu
This MPI process runs on host 0 of 1
On this host, this is MPI process 0 of 2
INFO (SystemTopology): Topology support:
Discovery support:
discovery->pu : yes
CPU binding support:
cpubind->set_thisproc_cpubind : yes
cpubind->get_thisproc_cpubind : yes
cpubind->set_proc_cpubind : yes
cpubind->get_proc_cpubind : yes
cpubind->set_thisthread_cpubind : yes
cpubind->get_thisthread_cpubind : yes
cpubind->set_thread_cpubind : yes
cpubind->get_thread_cpubind : yes
cpubind->get_thisproc_last_cpu_location : yes
cpubind->get_proc_last_cpu_location : yes
cpubind->get_thisthread_last_cpu_location: yes
Memory binding support:
membind->set_thisproc_membind : no
membind->get_thisproc_membind : no
membind->set_proc_membind : no
membind->get_proc_membind : no
membind->set_thisthread_membind : yes
membind->get_thisthread_membind : yes
membind->set_area_membind : yes
membind->get_area_membind : yes
membind->alloc_membind : yes
membind->firsttouch_membind : yes
membind->bind_membind : yes
membind->interleave_membind : yes
membind->replicate_membind : no
membind->nexttouch_membind : no
membind->migrate_membind : yes
INFO (SystemTopology): Hardware objects in this node:
Machine L#0: (P#0, total=100556096KB, DMIProductName="PowerEdge C6320p", DMIProductVersion=, DMIBoardVendor="Dell Inc.", DMIBoardName=0D2RR8, DMIBoardVersion=A00, DMIBoardAssetTag=, DMIChassisVendor="Dell Inc.", DMIChassisType=23, DMIChassisVersion=, DMIChassisAssetTag=, DMIBIOSVendor="Dell Inc.", DMIBIOSVersion=2.3.0, DMIBIOSDate=01/07/2020, DMISysVendor="Dell Inc.", Backend=Linux, LinuxCgroup=/, OSName=Linux, OSRelease=3.10.0-957.5.1.el7.x86_64, OSVersion="#1 SMP Fri Feb 1 14:54:57 UTC 2019", HostName=c405-073.stampede2.tacc.utexas.edu, Architecture=x86_64, hwlocVersion=1.11.8, ProcessName=cactus_sim)
NUMANode L#0: (P#0, local=100556096KB, total=100556096KB)
Package L#0: (P#0, CPUVendor=GenuineIntel, CPUFamilyNumber=6, CPUModelNumber=87, CPUModel="Intel(R) Xeon Phi(TM) CPU 7250 @ 1.40GHz", CPUStepping=1)
L2Cache L#0: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#0: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#0: (P#0)
PU L#0: (P#0)
PU L#1: (P#68)
PU L#2: (P#136)
PU L#3: (P#204)
L1dCache L#1: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#1: (P#1)
PU L#4: (P#1)
PU L#5: (P#69)
PU L#6: (P#137)
PU L#7: (P#205)
L2Cache L#1: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#2: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#2: (P#2)
PU L#8: (P#2)
PU L#9: (P#70)
PU L#10: (P#138)
PU L#11: (P#206)
L1dCache L#3: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#3: (P#3)
PU L#12: (P#3)
PU L#13: (P#71)
PU L#14: (P#139)
PU L#15: (P#207)
L2Cache L#2: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#4: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#4: (P#6)
PU L#16: (P#4)
PU L#17: (P#72)
PU L#18: (P#140)
PU L#19: (P#208)
L1dCache L#5: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#5: (P#7)
PU L#20: (P#5)
PU L#21: (P#73)
PU L#22: (P#141)
PU L#23: (P#209)
L2Cache L#3: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#6: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#6: (P#8)
PU L#24: (P#6)
PU L#25: (P#74)
PU L#26: (P#142)
PU L#27: (P#210)
L1dCache L#7: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#7: (P#9)
PU L#28: (P#7)
PU L#29: (P#75)
PU L#30: (P#143)
PU L#31: (P#211)
L2Cache L#4: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#8: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#8: (P#10)
PU L#32: (P#8)
PU L#33: (P#76)
PU L#34: (P#144)
PU L#35: (P#212)
L1dCache L#9: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#9: (P#11)
PU L#36: (P#9)
PU L#37: (P#77)
PU L#38: (P#145)
PU L#39: (P#213)
L2Cache L#5: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#10: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#10: (P#12)
PU L#40: (P#10)
PU L#41: (P#78)
PU L#42: (P#146)
PU L#43: (P#214)
L1dCache L#11: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#11: (P#13)
PU L#44: (P#11)
PU L#45: (P#79)
PU L#46: (P#147)
PU L#47: (P#215)
L2Cache L#6: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#12: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#12: (P#14)
PU L#48: (P#12)
PU L#49: (P#80)
PU L#50: (P#148)
PU L#51: (P#216)
L1dCache L#13: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#13: (P#15)
PU L#52: (P#13)
PU L#53: (P#81)
PU L#54: (P#149)
PU L#55: (P#217)
L2Cache L#7: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#14: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#14: (P#16)
PU L#56: (P#14)
PU L#57: (P#82)
PU L#58: (P#150)
PU L#59: (P#218)
L1dCache L#15: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#15: (P#17)
PU L#60: (P#15)
PU L#61: (P#83)
PU L#62: (P#151)
PU L#63: (P#219)
L2Cache L#8: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#16: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#16: (P#18)
PU L#64: (P#16)
PU L#65: (P#84)
PU L#66: (P#152)
PU L#67: (P#220)
L1dCache L#17: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#17: (P#19)
PU L#68: (P#17)
PU L#69: (P#85)
PU L#70: (P#153)
PU L#71: (P#221)
L2Cache L#9: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#18: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#18: (P#20)
PU L#72: (P#18)
PU L#73: (P#86)
PU L#74: (P#154)
PU L#75: (P#222)
L1dCache L#19: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#19: (P#21)
PU L#76: (P#19)
PU L#77: (P#87)
PU L#78: (P#155)
PU L#79: (P#223)
L2Cache L#10: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#20: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#20: (P#22)
PU L#80: (P#20)
PU L#81: (P#88)
PU L#82: (P#156)
PU L#83: (P#224)
L1dCache L#21: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#21: (P#23)
PU L#84: (P#21)
PU L#85: (P#89)
PU L#86: (P#157)
PU L#87: (P#225)
L2Cache L#11: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#22: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#22: (P#24)
PU L#88: (P#22)
PU L#89: (P#90)
PU L#90: (P#158)
PU L#91: (P#226)
L1dCache L#23: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#23: (P#25)
PU L#92: (P#23)
PU L#93: (P#91)
PU L#94: (P#159)
PU L#95: (P#227)
L2Cache L#12: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#24: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#24: (P#26)
PU L#96: (P#24)
PU L#97: (P#92)
PU L#98: (P#160)
PU L#99: (P#228)
L1dCache L#25: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#25: (P#27)
PU L#100: (P#25)
PU L#101: (P#93)
PU L#102: (P#161)
PU L#103: (P#229)
L2Cache L#13: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#26: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#26: (P#28)
PU L#104: (P#26)
PU L#105: (P#94)
PU L#106: (P#162)
PU L#107: (P#230)
L1dCache L#27: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#27: (P#29)
PU L#108: (P#27)
PU L#109: (P#95)
PU L#110: (P#163)
PU L#111: (P#231)
L2Cache L#14: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#28: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#28: (P#30)
PU L#112: (P#28)
PU L#113: (P#96)
PU L#114: (P#164)
PU L#115: (P#232)
L1dCache L#29: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#29: (P#31)
PU L#116: (P#29)
PU L#117: (P#97)
PU L#118: (P#165)
PU L#119: (P#233)
L2Cache L#15: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#30: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#30: (P#32)
PU L#120: (P#30)
PU L#121: (P#98)
PU L#122: (P#166)
PU L#123: (P#234)
L1dCache L#31: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#31: (P#33)
PU L#124: (P#31)
PU L#125: (P#99)
PU L#126: (P#167)
PU L#127: (P#235)
L2Cache L#16: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#32: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#32: (P#34)
PU L#128: (P#32)
PU L#129: (P#100)
PU L#130: (P#168)
PU L#131: (P#236)
L1dCache L#33: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#33: (P#35)
PU L#132: (P#33)
PU L#133: (P#101)
PU L#134: (P#169)
PU L#135: (P#237)
L2Cache L#17: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#34: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#34: (P#36)
PU L#136: (P#34)
PU L#137: (P#102)
PU L#138: (P#170)
PU L#139: (P#238)
L1dCache L#35: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#35: (P#37)
PU L#140: (P#35)
PU L#141: (P#103)
PU L#142: (P#171)
PU L#143: (P#239)
L2Cache L#18: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#36: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#36: (P#38)
PU L#144: (P#36)
PU L#145: (P#104)
PU L#146: (P#172)
PU L#147: (P#240)
L1dCache L#37: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#37: (P#39)
PU L#148: (P#37)
PU L#149: (P#105)
PU L#150: (P#173)
PU L#151: (P#241)
L2Cache L#19: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#38: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#38: (P#40)
PU L#152: (P#38)
PU L#153: (P#106)
PU L#154: (P#174)
PU L#155: (P#242)
L1dCache L#39: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#39: (P#41)
PU L#156: (P#39)
PU L#157: (P#107)
PU L#158: (P#175)
PU L#159: (P#243)
L2Cache L#20: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#40: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#40: (P#42)
PU L#160: (P#40)
PU L#161: (P#108)
PU L#162: (P#176)
PU L#163: (P#244)
L1dCache L#41: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#41: (P#43)
PU L#164: (P#41)
PU L#165: (P#109)
PU L#166: (P#177)
PU L#167: (P#245)
L2Cache L#21: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#42: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#42: (P#44)
PU L#168: (P#42)
PU L#169: (P#110)
PU L#170: (P#178)
PU L#171: (P#246)
L1dCache L#43: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#43: (P#45)
PU L#172: (P#43)
PU L#173: (P#111)
PU L#174: (P#179)
PU L#175: (P#247)
L2Cache L#22: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#44: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#44: (P#46)
PU L#176: (P#44)
PU L#177: (P#112)
PU L#178: (P#180)
PU L#179: (P#248)
L1dCache L#45: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#45: (P#47)
PU L#180: (P#45)
PU L#181: (P#113)
PU L#182: (P#181)
PU L#183: (P#249)
L2Cache L#23: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#46: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#46: (P#48)
PU L#184: (P#46)
PU L#185: (P#114)
PU L#186: (P#182)
PU L#187: (P#250)
L1dCache L#47: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#47: (P#49)
PU L#188: (P#47)
PU L#189: (P#115)
PU L#190: (P#183)
PU L#191: (P#251)
L2Cache L#24: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#48: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#48: (P#50)
PU L#192: (P#48)
PU L#193: (P#116)
PU L#194: (P#184)
PU L#195: (P#252)
L1dCache L#49: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#49: (P#51)
PU L#196: (P#49)
PU L#197: (P#117)
PU L#198: (P#185)
PU L#199: (P#253)
L2Cache L#25: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#50: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#50: (P#52)
PU L#200: (P#50)
PU L#201: (P#118)
PU L#202: (P#186)
PU L#203: (P#254)
L1dCache L#51: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#51: (P#53)
PU L#204: (P#51)
PU L#205: (P#119)
PU L#206: (P#187)
PU L#207: (P#255)
L2Cache L#26: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#52: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#52: (P#58)
PU L#208: (P#52)
PU L#209: (P#120)
PU L#210: (P#188)
PU L#211: (P#256)
L1dCache L#53: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#53: (P#59)
PU L#212: (P#53)
PU L#213: (P#121)
PU L#214: (P#189)
PU L#215: (P#257)
L2Cache L#27: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#54: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#54: (P#60)
PU L#216: (P#54)
PU L#217: (P#122)
PU L#218: (P#190)
PU L#219: (P#258)
L1dCache L#55: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#55: (P#61)
PU L#220: (P#55)
PU L#221: (P#123)
PU L#222: (P#191)
PU L#223: (P#259)
L2Cache L#28: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#56: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#56: (P#62)
PU L#224: (P#56)
PU L#225: (P#124)
PU L#226: (P#192)
PU L#227: (P#260)
L1dCache L#57: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#57: (P#63)
PU L#228: (P#57)
PU L#229: (P#125)
PU L#230: (P#193)
PU L#231: (P#261)
L2Cache L#29: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#58: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#58: (P#64)
PU L#232: (P#58)
PU L#233: (P#126)
PU L#234: (P#194)
PU L#235: (P#262)
L1dCache L#59: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#59: (P#65)
PU L#236: (P#59)
PU L#237: (P#127)
PU L#238: (P#195)
PU L#239: (P#263)
L2Cache L#30: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#60: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#60: (P#66)
PU L#240: (P#60)
PU L#241: (P#128)
PU L#242: (P#196)
PU L#243: (P#264)
L1dCache L#61: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#61: (P#67)
PU L#244: (P#61)
PU L#245: (P#129)
PU L#246: (P#197)
PU L#247: (P#265)
L2Cache L#31: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#62: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#62: (P#68)
PU L#248: (P#62)
PU L#249: (P#130)
PU L#250: (P#198)
PU L#251: (P#266)
L1dCache L#63: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#63: (P#69)
PU L#252: (P#63)
PU L#253: (P#131)
PU L#254: (P#199)
PU L#255: (P#267)
L2Cache L#32: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#64: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#64: (P#70)
PU L#256: (P#64)
PU L#257: (P#132)
PU L#258: (P#200)
PU L#259: (P#268)
L1dCache L#65: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#65: (P#71)
PU L#260: (P#65)
PU L#261: (P#133)
PU L#262: (P#201)
PU L#263: (P#269)
L2Cache L#33: (P#-1, size=1024KB, linesize=64, ways=16, Inclusive=1)
L1dCache L#66: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#66: (P#72)
PU L#264: (P#66)
PU L#265: (P#134)
PU L#266: (P#202)
PU L#267: (P#270)
L1dCache L#67: (P#-1, size=32KB, linesize=64, ways=8, Inclusive=0)
Core L#67: (P#73)
PU L#268: (P#67)
PU L#269: (P#135)
PU L#270: (P#203)
PU L#271: (P#271)
INFO (SystemTopology): Thread CPU bindings:
MPI process 0 on host 0 (process 0 of 2 on this host)
OpenMP thread 0: PU set L#{0} P#{0}
OpenMP thread 1: PU set L#{4} P#{1}
OpenMP thread 2: PU set L#{8} P#{2}
OpenMP thread 3: PU set L#{12} P#{3}
OpenMP thread 4: PU set L#{16} P#{4}
OpenMP thread 5: PU set L#{20} P#{5}
OpenMP thread 6: PU set L#{24} P#{6}
OpenMP thread 7: PU set L#{28} P#{7}
OpenMP thread 8: PU set L#{32} P#{8}
OpenMP thread 9: PU set L#{36} P#{9}
OpenMP thread 10: PU set L#{40} P#{10}
OpenMP thread 11: PU set L#{44} P#{11}
OpenMP thread 12: PU set L#{48} P#{12}
OpenMP thread 13: PU set L#{52} P#{13}
OpenMP thread 14: PU set L#{56} P#{14}
OpenMP thread 15: PU set L#{60} P#{15}
OpenMP thread 16: PU set L#{64} P#{16}
MPI process 1 on host 0 (process 1 of 2 on this host)
OpenMP thread 0: PU set L#{136} P#{34}
OpenMP thread 1: PU set L#{140} P#{35}
OpenMP thread 2: PU set L#{144} P#{36}
OpenMP thread 3: PU set L#{148} P#{37}
OpenMP thread 4: PU set L#{152} P#{38}
OpenMP thread 5: PU set L#{156} P#{39}
OpenMP thread 6: PU set L#{160} P#{40}
OpenMP thread 7: PU set L#{164} P#{41}
OpenMP thread 8: PU set L#{168} P#{42}
OpenMP thread 9: PU set L#{172} P#{43}
OpenMP thread 10: PU set L#{176} P#{44}
OpenMP thread 11: PU set L#{180} P#{45}
OpenMP thread 12: PU set L#{184} P#{46}
OpenMP thread 13: PU set L#{188} P#{47}
OpenMP thread 14: PU set L#{192} P#{48}
OpenMP thread 15: PU set L#{196} P#{49}
OpenMP thread 16: PU set L#{200} P#{50}
INFO (SystemTopology): Setting thread CPU bindings:
INFO (SystemTopology): Thread CPU bindings:
MPI process 0 on host 0 (process 0 of 2 on this host)
OpenMP thread 0: PU set L#{0} P#{0}
OpenMP thread 1: PU set L#{4} P#{1}
OpenMP thread 2: PU set L#{8} P#{2}
OpenMP thread 3: PU set L#{12} P#{3}
OpenMP thread 4: PU set L#{16} P#{4}
OpenMP thread 5: PU set L#{20} P#{5}
OpenMP thread 6: PU set L#{24} P#{6}
OpenMP thread 7: PU set L#{28} P#{7}
OpenMP thread 8: PU set L#{32} P#{8}
OpenMP thread 9: PU set L#{36} P#{9}
OpenMP thread 10: PU set L#{40} P#{10}
OpenMP thread 11: PU set L#{44} P#{11}
OpenMP thread 12: PU set L#{48} P#{12}
OpenMP thread 13: PU set L#{52} P#{13}
OpenMP thread 14: PU set L#{56} P#{14}
OpenMP thread 15: PU set L#{60} P#{15}
OpenMP thread 16: PU set L#{64} P#{16}
MPI process 1 on host 0 (process 1 of 2 on this host)
OpenMP thread 0: PU set L#{68} P#{17}
OpenMP thread 1: PU set L#{72} P#{18}
OpenMP thread 2: PU set L#{76} P#{19}
OpenMP thread 3: PU set L#{80} P#{20}
OpenMP thread 4: PU set L#{84} P#{21}
OpenMP thread 5: PU set L#{88} P#{22}
OpenMP thread 6: PU set L#{92} P#{23}
OpenMP thread 7: PU set L#{96} P#{24}
OpenMP thread 8: PU set L#{100} P#{25}
OpenMP thread 9: PU set L#{104} P#{26}
OpenMP thread 10: PU set L#{108} P#{27}
OpenMP thread 11: PU set L#{112} P#{28}
OpenMP thread 12: PU set L#{116} P#{29}
OpenMP thread 13: PU set L#{120} P#{30}
OpenMP thread 14: PU set L#{124} P#{31}
OpenMP thread 15: PU set L#{128} P#{32}
OpenMP thread 16: PU set L#{132} P#{33}
INFO (SystemTopology): Extracting CPU/cache/memory properties:
There are 4 PUs per core (aka hardware SMT threads)
There are 1 threads per core (aka SMT threads used)
Cache (unknown name) has type "data" depth 1
size 32768 linesize 64 associativity 8 stride 4096, for 4 PUs
Cache (unknown name) has type "unified" depth 2
size 1048576 linesize 64 associativity 16 stride 65536, for 8 PUs
Memory has type "local" depth 1
size 102969442304 pagesize 4096, for 272 PUs
INFO (PUGH): Using physical to logical mappings: direct
INFO (PUGH): Using topology generator: automatic
INFO (Vectors): Using vector size 8 for architecture AVX512+AVX512ER (64-bit precision)
--------------------------------------------------------------------------------
Driver provided by PUGH
--------------------------------------------------------------------------------
INFO (PUGH): Not setting up a topology for 1 dimensions
INFO (PUGH): Not setting up a topology for 2 dimensions
INFO (PUGH): Setting up a topology for 3 dimensions
INFO (IOASCII): I/O Method 'IOASCII_1D' registered: output of 1D lines of grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 1D output every 1 iterations
INFO (IOASCII): Periodic 1D output requested for 'GRID::r'
INFO (IOASCII): I/O Method 'IOASCII_2D' registered: output of 2D planes of grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 2D output turned off
INFO (IOASCII): I/O Method 'IOASCII_3D' registered: output of 3D grid functions/arrays to ASCII files
INFO (IOASCII): Periodic 3D output turned off
INFO (MemSpeed): Measuring CPU, cache, memory, and communication speeds:
Single-core measurements (using 1 MPI processes with 1 OpenMP threads each):
CPU frequency:
iterations=1000000... time=0.00407195 sec
iterations=10000000... time=0.0405741 sec
iterations=100000000... time=0.40577 sec
iterations=300000000... time=1.21312 sec
iterations=300000000... time=1.2129 sec
result: 2703.1 GHz
CPU floating point performance:
iterations=1000000... time=0.00408101 sec
iterations=10000000... time=0.0404639 sec
iterations=100000000... time=0.404346 sec
iterations=300000000... time=1.213 sec
result: 31.6571 Gflop/sec
CPU integer performance:
iterations=1000000... time=0.00811601 sec
iterations=10000000... time=0.0809209 sec
iterations=100000000... time=0.808586 sec
iterations=200000000... time=1.61725 sec
result: 1.97867 Giop/sec
Memory allocation performance for local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1... time=0.328679 sec
iterations=3... time=0.938123 sec
iterations=6... time=1.87499 sec
result: 3.43599 GByte/sec
Read latency of D1 cache (for 4 PUs) (using 1*24576 bytes):
iterations=1000... time=0.00032711 sec
iterations=10000... time=0.00273204 sec
iterations=100000... time=0.0271342 sec
iterations=1000000... time=0.270517 sec
iterations=4000000... time=1.08053 sec
result: 2.70133 nsec
Read latency of L2 cache (for 8 PUs) (using 1*786432 bytes):
iterations=1000... time=0.00294495 sec
iterations=10000... time=0.0140121 sec
iterations=100000... time=0.136882 sec
iterations=800000... time=1.02385 sec
result: 12.7981 nsec
Read latency of local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1000... time=0.017643 sec
iterations=10000... time=0.176322 sec
iterations=60000... time=1.05638 sec
result: 176.063 nsec
Read bandwidth of D1 cache (for 4 PUs) (using 1*24576 bytes):
iterations=1... time=2.14577e-06 sec
iterations=10... time=4.91142e-05 sec
iterations=100... time=3.60012e-05 sec
iterations=1000... time=0.000230074 sec
iterations=10000... time=0.00199389 sec
iterations=100000... time=0.019356 sec
iterations=1000000... time=0.192889 sec
iterations=6000000... time=1.15808 sec
result: 127.328 GByte/sec
Read bandwidth of L2 cache (for 8 PUs) (using 1*786432 bytes):
iterations=1... time=0.00011301 sec
iterations=10... time=0.000781059 sec
iterations=100... time=0.00484395 sec
iterations=1000... time=0.0321031 sec
iterations=10000... time=0.291711 sec
iterations=40000... time=1.07934 sec
result: 29.1449 GByte/sec
Read bandwidth of local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1... time=0.139124 sec
iterations=8... time=1.11338 sec
result: 7.71522 GByte/sec
Write latency of D1 cache (for 4 PUs) (using 1*24576 bytes):
iterations=1000... time=1.97887e-05 sec
iterations=10000... time=0.000127792 sec
iterations=100000... time=0.00105596 sec
iterations=1000000... time=0.010298 sec
iterations=10000000... time=0.103325 sec
iterations=100000000... time=1.03489 sec
result: 1.29361 nsec
Write latency of L2 cache (for 8 PUs) (using 1*786432 bytes):
iterations=1000... time=9.10759e-05 sec
iterations=10000... time=0.000571966 sec
iterations=100000... time=0.00375795 sec
iterations=1000000... time=0.0348961 sec
iterations=10000000... time=0.320716 sec
iterations=30000000... time=0.948481 sec
iterations=60000000... time=1.89937 sec
result: 3.95703 nsec
Write latency of local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1000... time=0.000288963 sec
iterations=10000... time=0.00225806 sec
iterations=100000... time=0.022434 sec
iterations=1000000... time=0.222748 sec
iterations=5000000... time=1.11351 sec
result: 27.8377 nsec
Write bandwidth of D1 cache (for 4 PUs) (using 1*24576 bytes):
iterations=1... time=4.05312e-06 sec
iterations=10... time=1.4782e-05 sec
iterations=100... time=8.39233e-05 sec
iterations=1000... time=0.000637054 sec
iterations=10000... time=0.00622296 sec
iterations=100000... time=0.0612509 sec
iterations=1000000... time=0.61436 sec
iterations=2000000... time=1.22711 sec
result: 40.0552 GByte/sec
Write bandwidth of L2 cache (for 8 PUs) (using 1*786432 bytes):
iterations=1... time=0.000114918 sec
iterations=10... time=0.00090003 sec
iterations=100... time=0.00708699 sec
iterations=1000... time=0.0582938 sec
iterations=10000... time=0.526602 sec
iterations=20000... time=1.02799 sec
result: 15.3003 GByte/sec
Write bandwidth of local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1... time=0.133248 sec
iterations=8... time=1.0641 sec
result: 8.07247 GByte/sec
Write bandwidth via cache-bypassing stores for local memory (for 272 PUs) (using 1*1073741824 bytes):
iterations=1... time=0.132953 sec
iterations=8... time=1.06518 sec
result: 8.0643 GByte/sec
Stencil code performance of D1 cache (for 4 PUs) (using 1*12^3 grid points, 1*27648 bytes):
iterations=1... time=7.48634e-05 sec
iterations=10... time=0.000219107 sec
iterations=100... time=0.00188708 sec
iterations=1000... time=0.0183041 sec
iterations=10000... time=0.182543 sec
iterations=60000... time=1.09559 sec
result: 0.094634 Gupdates/sec
Stencil code performance of L2 cache (for 8 PUs) (using 1*37^3 grid points, 1*810448 bytes):
iterations=1... time=0.000602961 sec
iterations=10... time=0.00587201 sec
iterations=100... time=0.0577371 sec
iterations=1000... time=0.575167 sec
iterations=2000... time=1.14345 sec
result: 0.0885966 Gupdates/sec
Stencil code performance of local memory (for 272 PUs) (using 1*406^3 grid points, 1*1070774656 bytes):
iterations=1... time=0.813491 sec
iterations=2... time=1.62696 sec
result: 0.0822682 Gupdates/sec
Single-node measurements (using 2 MPI processes with 17 OpenMP threads each):
CPU frequency:
iterations=1000000... time=0.00417674 sec
iterations=10000000... time=0.0405164 sec
iterations=100000000... time=0.404582 sec
iterations=300000000... time=1.2133 sec
iterations=300000000... time=1.21315 sec
result: 3941.4 GHz
CPU floating point performance:
iterations=1000000... time=0.00405846 sec
iterations=10000000... time=0.0404528 sec
iterations=100000000... time=0.404897 sec
iterations=300000000... time=1.21318 sec
result: 31.6523 Gflop/sec
CPU integer performance:
iterations=1000000... time=0.00815167 sec
iterations=10000000... time=0.0808692 sec
iterations=100000000... time=0.80888 sec
iterations=200000000... time=1.61778 sec
result: 1.97802 Giop/sec
Memory allocation performance for local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1... time=0.0189642 sec
iterations=10... time=0.180968 sec
iterations=60... time=1.08605 sec
result: 59.3198 GByte/sec
Read latency of D1 cache (for 4 PUs) (using 34*24576 bytes):
iterations=1000... time=0.000294622 sec
iterations=10000... time=0.00271222 sec
iterations=100000... time=0.0270828 sec
iterations=1000000... time=0.270048 sec
iterations=4000000... time=1.08007 sec
result: 2.70017 nsec
Read latency of L2 cache (for 8 PUs) (using 18*786432 bytes):
iterations=1000... time=0.00887088 sec
iterations=10000... time=0.0865954 sec
iterations=100000... time=0.839289 sec
iterations=200000... time=1.67403 sec
result: 83.7014 nsec
Read latency of local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1000... time=0.0175604 sec
iterations=10000... time=0.180401 sec
iterations=60000... time=1.05829 sec
result: 176.382 nsec
Read bandwidth of D1 cache (for 4 PUs) (using 34*24576 bytes):
iterations=1... time=1.64088e-06 sec
iterations=10... time=3.12749e-06 sec
iterations=100... time=2.02445e-05 sec
iterations=1000... time=0.000192306 sec
iterations=10000... time=0.00193771 sec
iterations=100000... time=0.019337 sec
iterations=1000000... time=0.193137 sec
iterations=6000000... time=1.15795 sec
result: 127.342 GByte/sec
Read bandwidth of L2 cache (for 8 PUs) (using 18*786432 bytes):
iterations=1... time=0.000100276 sec
iterations=10... time=0.000923627 sec
iterations=100... time=0.00846056 sec
iterations=1000... time=0.0819019 sec
iterations=10000... time=0.819235 sec
iterations=20000... time=1.63804 sec
result: 9.60213 GByte/sec
Read bandwidth of local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1... time=0.261908 sec
iterations=4... time=1.05583 sec
result: 4.06788 GByte/sec
Write latency of D1 cache (for 4 PUs) (using 34*24576 bytes):
iterations=1000... time=2.83367e-05 sec
iterations=10000... time=0.000103172 sec
iterations=100000... time=0.00106769 sec
iterations=1000000... time=0.0110052 sec
iterations=10000000... time=0.108229 sec
iterations=100000000... time=1.07058 sec
result: 1.33823 nsec
Write latency of L2 cache (for 8 PUs) (using 18*786432 bytes):
iterations=1000... time=0.000101798 sec
iterations=10000... time=0.000947525 sec
iterations=100000... time=0.00854621 sec
iterations=1000000... time=0.0806903 sec
iterations=10000000... time=0.791932 sec
iterations=20000000... time=1.58361 sec
result: 9.89757 nsec
Write latency of local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1000... time=0.000648618 sec
iterations=10000... time=0.00287849 sec
iterations=100000... time=0.0226344 sec
iterations=1000000... time=0.219735 sec
iterations=5000000... time=1.10478 sec
result: 27.6194 nsec
Write bandwidth of D1 cache (for 4 PUs) (using 34*24576 bytes):
iterations=1... time=1.99851e-06 sec
iterations=10... time=7.82574e-06 sec
iterations=100... time=6.25218e-05 sec
iterations=1000... time=0.000607806 sec
iterations=10000... time=0.00614851 sec
iterations=100000... time=0.0613123 sec
iterations=1000000... time=0.613361 sec
iterations=2000000... time=1.22768 sec
result: 40.0365 GByte/sec
Write bandwidth of L2 cache (for 8 PUs) (using 18*786432 bytes):
iterations=1... time=0.000110276 sec
iterations=10... time=0.00112304 sec
iterations=100... time=0.0102063 sec
iterations=1000... time=0.0973017 sec
iterations=10000... time=0.977225 sec
iterations=20000... time=1.94673 sec
result: 8.07953 GByte/sec
Write bandwidth of local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1... time=0.192148 sec
iterations=6... time=1.17048 sec
result: 5.50413 GByte/sec
Write bandwidth via cache-bypassing stores for local memory (for 272 PUs) (using 2*1073741824 bytes):
iterations=1... time=0.192612 sec
iterations=6... time=1.17092 sec
result: 5.50206 GByte/sec
cactus_sim: /work/00945/rhaas/stampede2/ET_Next/arrangements/CactusUtils/MemSpeed/src/memspeed.cc:1355: void <unnamed>::measure_stencil_performance(int, int): Assertion `num_allocs % num_active_procs == 0' failed.
cactus_sim: /work/00945/rhaas/stampede2/ET_Next/arrangements/CactusUtils/MemSpeed/src/memspeed.cc:1355: void <unnamed>::measure_stencil_performance(int, int): Assertion `num_allocs % num_active_procs == 0' failed.
and the code that fails is:
assert(num_allocs % num_active_procs == 0);
where num_allocs is defined as
const int num_allocs = (total_pus + mem_pus - 1) / mem_pus;
and num_active_procs as
const int num_active_procs = small_cache ? comm_size : 1;
Since the assert failed, num_active_procs must be comm_size here (with a value of 1 the remainder would always be 0).
This happens in the test with 2 MPI ranks using 17 OpenMP threads each (so an odd number) on the KNL section of Stampede2.
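To make the arithmetic behind the assertion concrete, here is a minimal sketch of the two quoted expressions. The function names are hypothetical, and the log does not show the actual values of total_pus and mem_pus at the failing call. The inputs discussed below are therefore assumptions taken from the topology output (2 processes x 17 threads x 4 PUs per core = 136 PUs, L2 cache shared by 8 PUs), not values confirmed from memspeed.cc.

```cpp
#include <cassert>

// Ceiling division, mirroring the quoted num_allocs expression.
// (Hypothetical helper name; not part of memspeed.cc.)
int count_allocs(int total_pus, int mem_pus) {
  assert(mem_pus > 0);
  return (total_pus + mem_pus - 1) / mem_pus;
}

// Mirrors the quoted num_active_procs expression.
int count_active_procs(bool small_cache, int comm_size) {
  return small_cache ? comm_size : 1;
}

// True when the condition checked by the failing assert() would hold.
bool allocs_divide_evenly(int total_pus, int mem_pus, bool small_cache,
                          int comm_size) {
  return count_allocs(total_pus, mem_pus) %
             count_active_procs(small_cache, comm_size) ==
         0;
}
```

Under these assumed inputs, count_allocs(136, 8) gives 17, which is not divisible by comm_size = 2, so allocs_divide_evenly(136, 8, true, 2) is false. With mem_pus = 4 the count is 34 and the check passes. This would be consistent with an odd thread count per rank tripping the assertion, but it is only an illustration.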
I also noticed that there is some code later on that reads:
if (comm_size > num_active_procs ||
(skip_memsize + memsize) * comm_size > node_memory * 3 / 4) {
printf(" [skipped -- too many MPI processes]\n");
continue;
}
which will trigger whenever num_active_procs is 1 and comm_size is greater than 1 (i.e. not the case that hits the assert).
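The skip condition quoted above can be isolated into a small predicate to see when it fires. This is only a sketch: the function name should_skip is hypothetical, and the 64-bit integer types are an assumption, since the issue does not show how skip_memsize, memsize, and node_memory are declared.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the quoted skip condition. The name and parameter types are
// assumptions for illustration, not taken from memspeed.cc.
bool should_skip(int comm_size, int num_active_procs,
                 std::int64_t skip_memsize, std::int64_t memsize,
                 std::int64_t node_memory) {
  assert(comm_size >= 1 && num_active_procs >= 1);
  return comm_size > num_active_procs ||
         (skip_memsize + memsize) * comm_size > node_memory * 3 / 4;
}
```

Using the node memory size reported in the log (102969442304 bytes) and an assumed skip_memsize of 0, should_skip(2, 1, 0, 1073741824LL, 102969442304LL) is true because 2 > 1, while should_skip(2, 2, 0, 24576, 102969442304LL) and should_skip(1, 1, 0, 1073741824LL, 102969442304LL) are both false.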
This particular test already failed in the Turing release.