[BIOSAL] Results with Xeon and Xeon Phi
Boisvert, Sebastien
boisvert at anl.gov
Mon Nov 3 15:35:02 CST 2014
Here are some results on jenny-mic0 (Intel Xeon Phi Knights Corner).
This is the first test we have run on Intel Xeon Phi.
I basically just stripped out the transport subsystem (George will like that, although the patch is not *clean* since
it is not a compile option) and also stripped zlib (I did not find zlib for KNC).
This is presumably not optimal, but it is an early result.
In particular, it would probably be better to run a couple of runtime nodes on 1 MIC
(to test), or to use some sort of internal routing to avoid all-to-all communication (some sort of polytope or torus inside the Thorium
node).
I ran performance/latency_probe/latency_probe, which is basically a multi-node, multi-core, multi-actor
ping-pong test. The metrics are throughputs at 4 levels: computation, node (Thorium runtime node), worker (CPU core),
and actor.
In particular, here I want to look at the "worker-throughput", which is scale-invariant in an ideal world.
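For reference, here is how the counters reported below relate to each other. This is just the arithmetic implied by the
logged numbers, not code from the probe itself; the values are taken from the 1x4 run below.

/* Standalone sketch of how the derived counters relate to the raw ones;
 * the values are from the 1x4 run below, not code from latency_probe. */
#include <stdio.h>

int main(void)
{
    double message_count = 12000000;
    double elapsed_time = 151.904640;    /* s */
    int node_count = 1;
    int worker_count_per_node = 3;
    int actor_count_per_worker = 100;

    double computation = message_count / elapsed_time;   /* 78996.93 */
    double node = computation / node_count;               /* 78996.93 */
    double worker = node / worker_count_per_node;         /* 26332.31 */
    double actor = worker / actor_count_per_worker;       /* 263.32 */

    printf("computation-throughput = %f messages / s\n", computation);
    printf("node-throughput = %f messages / s\n", node);
    printf("worker-throughput = %f messages / s\n", worker);
    printf("actor-throughput = %f messages / s\n", actor);
    printf("worker-latency = %.0f ns\n", 1e9 / worker);   /* 37976 */
    printf("actor-latency = %.0f ns\n", 1e9 / actor);     /* 3797616 */

    return 0;
}

In an ideal world, worker-throughput would stay flat as the worker count grows; the runs below show that it does not.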
Here is the patch needed to build on KNC with the Intel compiler, based on d9f752a91581:
[boisvert at jenny biosal]$ git diff --stat
Makefile | 2 +-
core/Makefile.mk | 2 +-
core/file_storage/input/buffered_reader.c | 6 ++++--
engine/thorium/Makefile.mk | 2 +-
engine/thorium/transport/transport.c | 7 +++++++
5 files changed, 14 insertions(+), 5 deletions(-)
http://biosal.s3.amazonaws.com/patches/Intel/KNC.patch
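For what it's worth, making this a real compile option should not be hard. Here is a rough sketch of the kind of guard I
have in mind; the function and type names below are made up for illustration, they are not the actual symbols in
engine/thorium/transport/transport.c.

/* Hypothetical compile-time guard for the transport subsystem; the names
 * are made up for illustration. Building with -DTHORIUM_DISABLE_TRANSPORT
 * would give a single-node binary with no MPI dependency. */

struct thorium_transport;
struct thorium_message;

void example_transport_send(struct thorium_transport *transport,
                            struct thorium_message *message)
{
#ifdef THORIUM_DISABLE_TRANSPORT
    /* Single-node build: there is nothing to send off-node. */
    (void)transport;
    (void)message;
#else
    /* ... the regular MPI send path goes here ... */
    (void)transport;
    (void)message;
#endif
}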
####
Intel Xeon Phi Knights Corner (KNC)
Linux 2.6.38.8+mpss3.2.1.
- 61 Xeon Phi KNC cores
jenny-mic0 (no offload, directly on the PCIe card):
vendor_id : GenuineIntel
model name : 0b/01
cpu MHz : 1238.094
cache size : 512 KB
cpu cores : 61
siblings : 244
Obviously, I would love to run perf on jenny-mic0.
Also, I don't know whether clock_gettime is in the vDSO or whether it generates a syscall;
with perf, I would see that easily.
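If perf is not an option, a crude alternative is to time clock_gettime directly. A quick sketch (built natively for the
card, e.g. with the k1om cross toolchain): tens of nanoseconds per call would suggest a vDSO fast path, while hundreds of
nanoseconds or more would suggest a real syscall.

/* Crude clock_gettime cost probe; link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end, ts;
    const long count = 10 * 1000 * 1000;
    long i;
    double elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per clock_gettime call\n", elapsed_ns / count);

    return 0;
}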
1x4
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 4|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 3
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 3
PERFORMANCE_COUNTER actor-count = 300
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 12000000
PERFORMANCE_COUNTER elapsed-time = 151.904640 s
PERFORMANCE_COUNTER computation-throughput = 78996.928558 messages / s
PERFORMANCE_COUNTER node-throughput = 78996.928558 messages / s
PERFORMANCE_COUNTER worker-throughput = 26332.309519 messages / s
PERFORMANCE_COUNTER worker-latency = 37976 ns
PERFORMANCE_COUNTER actor-throughput = 263.323095 messages / s
PERFORMANCE_COUNTER actor-latency = 3797616 ns
1x8
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 8|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 7
PERFORMANCE_COUNTER actor-count = 700
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 28000000
PERFORMANCE_COUNTER elapsed-time = 176.217112 s
PERFORMANCE_COUNTER computation-throughput = 158894.897666 messages / s
PERFORMANCE_COUNTER node-throughput = 158894.897666 messages / s
PERFORMANCE_COUNTER worker-throughput = 22699.271095 messages / s
PERFORMANCE_COUNTER worker-latency = 44054 ns
PERFORMANCE_COUNTER actor-throughput = 226.992711 messages / s
PERFORMANCE_COUNTER actor-latency = 4405427 ns
1x16
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 16|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 15
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 15
PERFORMANCE_COUNTER actor-count = 1500
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 60000000
PERFORMANCE_COUNTER elapsed-time = 275.566235 s
PERFORMANCE_COUNTER computation-throughput = 217733.496928 messages / s
PERFORMANCE_COUNTER node-throughput = 217733.496928 messages / s
PERFORMANCE_COUNTER worker-throughput = 14515.566462 messages / s
PERFORMANCE_COUNTER worker-latency = 68891 ns
PERFORMANCE_COUNTER actor-throughput = 145.155665 messages / s
PERFORMANCE_COUNTER actor-latency = 6889155 ns
1x30
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 30|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 29
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 29
PERFORMANCE_COUNTER actor-count = 2900
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 116000000
PERFORMANCE_COUNTER elapsed-time = 506.929898 s
PERFORMANCE_COUNTER computation-throughput = 228828.483864 messages / s
PERFORMANCE_COUNTER node-throughput = 228828.483864 messages / s
PERFORMANCE_COUNTER worker-throughput = 7890.637375 messages / s
PERFORMANCE_COUNTER worker-latency = 126732 ns
PERFORMANCE_COUNTER actor-throughput = 78.906374 messages / s
PERFORMANCE_COUNTER actor-latency = 12673247 ns
With 1x30, here are the context switches (not a lot actually):
boisvert at jenny-mic0:~$ tail -n 2 /proc/15449/status
voluntary_ctxt_switches: 52
nonvoluntary_ctxt_switches: 393
I looked at the spinlock:
[boisvert at jenny ~]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-objdump -d libpthread.so.0 > libpthread.so.0.s
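For context, a pthread spin lock is just a busy-wait loop, so in the disassembly I expect something along the lines of the
generic sketch below (not the actual glibc code). A thread stuck in such a loop burns CPU without any voluntary context
switch, which is consistent with the counts shown above.

/* Generic test-and-set spin lock, roughly what pthread_spin_lock boils
 * down to; the real glibc/k1om implementation differs in the details. */
typedef volatile int example_spinlock_t;

static void example_spinlock_lock(example_spinlock_t *lock)
{
    /* Atomically swap in 1; if the old value was already 1, wait until
     * the lock looks free, then try again. */
    while (__sync_lock_test_and_set(lock, 1)) {
        while (*lock) {
            /* busy-wait */
        }
    }
}

static void example_spinlock_unlock(example_spinlock_t *lock)
{
    __sync_lock_release(lock);
}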
####
I also have results on Xeon E7 (4 CPUs, 4x8 x86 cores):
Intel Xeon
- 32 Xeon E7 cores
[boisvert at bigmem biosal]$ grep "physical id" /proc/cpuinfo |sort|uniq |wc -l
4
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
cpu MHz : 2128.152
cache size : 24576 KB
cpu cores : 8
siblings : 16
4x8
(configuration: 4 processes, 8 threads per process)
[boisvert at bigmem biosal]$ mpiexec -n 4 ./performance/latency_probe/latency_probe -threads-per-node 8 | tee log
PERFORMANCE_COUNTER node-count = 4
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 28
PERFORMANCE_COUNTER actor-count = 2800
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 112000000
PERFORMANCE_COUNTER elapsed-time = 88.705164 s
PERFORMANCE_COUNTER computation-throughput = 1262609.687743 messages / s
PERFORMANCE_COUNTER node-throughput = 315652.421936 messages / s
PERFORMANCE_COUNTER worker-throughput = 45093.203134 messages / s
PERFORMANCE_COUNTER worker-latency = 22176 ns
PERFORMANCE_COUNTER actor-throughput = 450.932031 messages / s
PERFORMANCE_COUNTER actor-latency = 2217629 ns
1x30
[boisvert at bigmem biosal]$ ./performance/latency_probe/latency_probe -threads-per-node 30 | tee log
[boisvert at bigmem biosal]$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 29
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 29
PERFORMANCE_COUNTER actor-count = 2900
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 116000000
PERFORMANCE_COUNTER elapsed-time = 187.886480 s
PERFORMANCE_COUNTER computation-throughput = 617394.077619 messages / s
PERFORMANCE_COUNTER node-throughput = 617394.077619 messages / s
PERFORMANCE_COUNTER worker-throughput = 21289.450952 messages / s
PERFORMANCE_COUNTER worker-latency = 46971 ns
PERFORMANCE_COUNTER actor-throughput = 212.894510 messages / s
PERFORMANCE_COUNTER actor-latency = 4697161 ns
With 1x30, the Linux kernel itself has scaling issues:
with 'perf top', I can see that spinlock code in the kernel (not in the application) is causing the scalability problem.
On Xeon E7, I am using Linux 2.6.32-431.29.2.el6.x86_64,
so presumably there is a similar problem on KNC.
Samples: 5M of event 'cycles', Event count (approx.): 863917195488
42.42% [kernel] [k] _spin_lock <=============== spinlock in kernel space
15.18% latency_probe [.] core_hash_table_find_bucket
6.38% latency_probe [.] core_hash_table_group_state
5.09% [vsyscall] [.] 0x000000000000014c
1.58% latency_probe [.] core_fast_ring_pop_multiple_producers
1.56% latency_probe [.] core_fast_queue_dequeue
1.45% latency_probe [.] thorium_node_run
1.42% [vdso] [.] 0x000000000000096a
1.35% latency_probe [.] thorium_node_resolve
1.25% libc-2.12.so [.] memcpy
1.10% latency_probe [.] thorium_worker_run
1.02% latency_probe [.] core_murmur_hash_2_64_a
0.97% [kernel] [k] unroll_tree_refs
0.96% libpthread-2.12.so [.] pthread_spin_lock <================ spinlock in userspace (in latency_probe)
0.92% libc-2.12.so [.] __random
0.87% latency_probe [.] core_fast_ring_push_from_producer
0.80% latency_probe [.] core_hash_table_group_key
0.69% latency_probe [.] core_memory_copy