[BIOSAL] Results with Xeon and Xeon Phi
Boisvert, Sebastien
boisvert at anl.gov
Mon Nov 3 15:35:02 CST 2014
Here are some results on jenny-mic0 (Intel Xeon Phi Knights Corner).
This is the first test we have run on Intel Xeon Phi.
I basically just stripped out the transport subsystem (George will like that, although the patch is not *clean* since
it is not a compile option) and also stripped zlib (I did not find zlib for KNC).
This is presumably not optimal, but it is an early result.
In particular, it would probably be better to run a couple of runtime nodes on 1 MIC
(to test), or to use some sort of internal routing to avoid all-to-all communication (some sort of polytope or torus inside the Thorium
node).
I ran performance/latency_probe/latency_probe, which is basically a multi-node, multi-core, multi-actor
ping-pong test. The metrics are throughputs at 4 levels: computation, node (Thorium runtime node), worker (CPU core),
and actor.
In particular, here I want to look at the "worker-throughput", which is scale-invariant in an ideal world.
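For reference, here is how the counters reported below relate to each other. This is just the arithmetic implied by the
logged numbers, not code from the probe itself; the values are taken from the 1x4 run below.

/* Standalone sketch of how the derived counters relate to the raw ones;
 * the values are from the 1x4 run below, not code from latency_probe. */
#include <stdio.h>

int main(void)
{
    double message_count = 12000000;
    double elapsed_time = 151.904640;    /* s */
    int node_count = 1;
    int worker_count_per_node = 3;
    int actor_count_per_worker = 100;

    double computation = message_count / elapsed_time;   /* 78996.93 */
    double node = computation / node_count;               /* 78996.93 */
    double worker = node / worker_count_per_node;         /* 26332.31 */
    double actor = worker / actor_count_per_worker;       /* 263.32 */

    printf("computation-throughput = %f messages / s\n", computation);
    printf("node-throughput = %f messages / s\n", node);
    printf("worker-throughput = %f messages / s\n", worker);
    printf("actor-throughput = %f messages / s\n", actor);
    printf("worker-latency = %.0f ns\n", 1e9 / worker);   /* 37976 */
    printf("actor-latency = %.0f ns\n", 1e9 / actor);     /* 3797616 */

    return 0;
}

In an ideal world, worker-throughput would stay flat as the worker count grows; the runs below show that it does not.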
Here is the patch needed to build on KNC with the Intel compiler, based on d9f752a91581:
[boisvert at jenny biosal]$ git diff --stat
Makefile | 2 +-
core/Makefile.mk | 2 +-
core/file_storage/input/buffered_reader.c | 6 ++++--
engine/thorium/Makefile.mk | 2 +-
engine/thorium/transport/transport.c | 7 +++++++
5 files changed, 14 insertions(+), 5 deletions(-)
http://biosal.s3.amazonaws.com/patches/Intel/KNC.patch
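For what it's worth, making this a real compile option should not be hard. Here is a rough sketch of the kind of guard I
have in mind; the function and type names below are made up for illustration, they are not the actual symbols in
engine/thorium/transport/transport.c.

/* Hypothetical compile-time guard for the transport subsystem; the names
 * are made up for illustration. Building with -DTHORIUM_DISABLE_TRANSPORT
 * would give a single-node binary with no MPI dependency. */

struct thorium_transport;
struct thorium_message;

void example_transport_send(struct thorium_transport *transport,
                            struct thorium_message *message)
{
#ifdef THORIUM_DISABLE_TRANSPORT
    /* Single-node build: there is nothing to send off-node. */
    (void)transport;
    (void)message;
#else
    /* ... the regular MPI send path goes here ... */
    (void)transport;
    (void)message;
#endif
}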
####
Intel Xeon Phi Knights Corner (KNC)
Linux 2.6.38.8+mpss3.2.1.
- 61 Xeon Phi KNC cores
jenny-mic0 (no offload, directly on the PCIe card):
vendor_id : GenuineIntel
model name : 0b/01
cpu MHz : 1238.094
cache size : 512 KB
cpu cores : 61
siblings : 244
Obviously, I would love to run perf on jenny-mic0.
Also, I don't know whether clock_gettime is in the vDSO or whether it generates a syscall;
with perf, I would see that easily.
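If perf is not an option, a crude alternative is to time clock_gettime directly. A quick sketch (built natively for the
card, e.g. with the k1om cross toolchain): tens of nanoseconds per call would suggest a vDSO fast path, while hundreds of
nanoseconds or more would suggest a real syscall.

/* Crude clock_gettime cost probe; link with -lrt on older glibc. */
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end, ts;
    const long count = 10 * 1000 * 1000;
    long i;
    double elapsed_ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < count; i++)
        clock_gettime(CLOCK_MONOTONIC, &ts);
    clock_gettime(CLOCK_MONOTONIC, &end);

    elapsed_ns = (end.tv_sec - start.tv_sec) * 1e9
                 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per clock_gettime call\n", elapsed_ns / count);

    return 0;
}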
1x4
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 4|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 3
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 3
PERFORMANCE_COUNTER actor-count = 300
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 12000000
PERFORMANCE_COUNTER elapsed-time = 151.904640 s
PERFORMANCE_COUNTER computation-throughput = 78996.928558 messages / s
PERFORMANCE_COUNTER node-throughput = 78996.928558 messages / s
PERFORMANCE_COUNTER worker-throughput = 26332.309519 messages / s
PERFORMANCE_COUNTER worker-latency = 37976 ns
PERFORMANCE_COUNTER actor-throughput = 263.323095 messages / s
PERFORMANCE_COUNTER actor-latency = 3797616 ns
1x8
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 8|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 7
PERFORMANCE_COUNTER actor-count = 700
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 28000000
PERFORMANCE_COUNTER elapsed-time = 176.217112 s
PERFORMANCE_COUNTER computation-throughput = 158894.897666 messages / s
PERFORMANCE_COUNTER node-throughput = 158894.897666 messages / s
PERFORMANCE_COUNTER worker-throughput = 22699.271095 messages / s
PERFORMANCE_COUNTER worker-latency = 44054 ns
PERFORMANCE_COUNTER actor-throughput = 226.992711 messages / s
PERFORMANCE_COUNTER actor-latency = 4405427 ns
1x16
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 16|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 15
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 15
PERFORMANCE_COUNTER actor-count = 1500
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 60000000
PERFORMANCE_COUNTER elapsed-time = 275.566235 s
PERFORMANCE_COUNTER computation-throughput = 217733.496928 messages / s
PERFORMANCE_COUNTER node-throughput = 217733.496928 messages / s
PERFORMANCE_COUNTER worker-throughput = 14515.566462 messages / s
PERFORMANCE_COUNTER worker-latency = 68891 ns
PERFORMANCE_COUNTER actor-throughput = 145.155665 messages / s
PERFORMANCE_COUNTER actor-latency = 6889155 ns
1x30
boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 30|tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 29
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 29
PERFORMANCE_COUNTER actor-count = 2900
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 116000000
PERFORMANCE_COUNTER elapsed-time = 506.929898 s
PERFORMANCE_COUNTER computation-throughput = 228828.483864 messages / s
PERFORMANCE_COUNTER node-throughput = 228828.483864 messages / s
PERFORMANCE_COUNTER worker-throughput = 7890.637375 messages / s
PERFORMANCE_COUNTER worker-latency = 126732 ns
PERFORMANCE_COUNTER actor-throughput = 78.906374 messages / s
PERFORMANCE_COUNTER actor-latency = 12673247 ns
With 1x30, here are the context switches (not a lot actually):
boisvert at jenny-mic0:~$ tail -n 2 /proc/15449/status
voluntary_ctxt_switches: 52
nonvoluntary_ctxt_switches: 393
I looked at the spinlock:
[boisvert at jenny ~]$ /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-objdump -d libpthread.so.0 > libpthread.so.0.s
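For context, a pthread spin lock is just a busy-wait loop, so in the disassembly I expect something along the lines of the
generic sketch below (not the actual glibc code). A thread stuck in such a loop burns CPU without any voluntary context
switch, which is consistent with the counts shown above.

/* Generic test-and-set spin lock, roughly what pthread_spin_lock boils
 * down to; the real glibc/k1om implementation differs in the details. */
typedef volatile int example_spinlock_t;

static void example_spinlock_lock(example_spinlock_t *lock)
{
    /* Atomically swap in 1; if the old value was already 1, wait until
     * the lock looks free, then try again. */
    while (__sync_lock_test_and_set(lock, 1)) {
        while (*lock) {
            /* busy-wait */
        }
    }
}

static void example_spinlock_unlock(example_spinlock_t *lock)
{
    __sync_lock_release(lock);
}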
####
I also have results on Xeon E7 (4 CPUs, 4x8 x86 cores):
Intel Xeon
- 32 Xeon E7 cores
[boisvert at bigmem biosal]$ grep "physical id" /proc/cpuinfo |sort|uniq |wc -l
4
vendor_id : GenuineIntel
model name : Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz
cpu MHz : 2128.152
cache size : 24576 KB
cpu cores : 8
siblings : 16
4x8
(configuration: 4 processes, 8 threads per process)
[boisvert at bigmem biosal]$ mpiexec -n 4 ./performance/latency_probe/latency_probe -threads-per-node 8 | tee log
PERFORMANCE_COUNTER node-count = 4
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 28
PERFORMANCE_COUNTER actor-count = 2800
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 112000000
PERFORMANCE_COUNTER elapsed-time = 88.705164 s
PERFORMANCE_COUNTER computation-throughput = 1262609.687743 messages / s
PERFORMANCE_COUNTER node-throughput = 315652.421936 messages / s
PERFORMANCE_COUNTER worker-throughput = 45093.203134 messages / s
PERFORMANCE_COUNTER worker-latency = 22176 ns
PERFORMANCE_COUNTER actor-throughput = 450.932031 messages / s
PERFORMANCE_COUNTER actor-latency = 2217629 ns
1x30
[boisvert at bigmem biosal]$ ./performance/latency_probe/latency_probe -threads-per-node 30 | tee log
[boisvert at bigmem biosal]$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 29
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 29
PERFORMANCE_COUNTER actor-count = 2900
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 116000000
PERFORMANCE_COUNTER elapsed-time = 187.886480 s
PERFORMANCE_COUNTER computation-throughput = 617394.077619 messages / s
PERFORMANCE_COUNTER node-throughput = 617394.077619 messages / s
PERFORMANCE_COUNTER worker-throughput = 21289.450952 messages / s
PERFORMANCE_COUNTER worker-latency = 46971 ns
PERFORMANCE_COUNTER actor-throughput = 212.894510 messages / s
PERFORMANCE_COUNTER actor-latency = 4697161 ns
With 1x30, the Linux kernel itself has scaling issues:
with 'perf top', I can see that spinlock code in the kernel (not in the application) is causing the scalability problem.
On Xeon E7, I am using Linux 2.6.32-431.29.2.el6.x86_64,
so presumably there is a similar problem on KNC.
Samples: 5M of event 'cycles', Event count (approx.): 863917195488
42.42% [kernel] [k] _spin_lock <=============== spinlock in kernel space
15.18% latency_probe [.] core_hash_table_find_bucket
6.38% latency_probe [.] core_hash_table_group_state
5.09% [vsyscall] [.] 0x000000000000014c
1.58% latency_probe [.] core_fast_ring_pop_multiple_producers
1.56% latency_probe [.] core_fast_queue_dequeue
1.45% latency_probe [.] thorium_node_run
1.42% [vdso] [.] 0x000000000000096a
1.35% latency_probe [.] thorium_node_resolve
1.25% libc-2.12.so [.] memcpy
1.10% latency_probe [.] thorium_worker_run
1.02% latency_probe [.] core_murmur_hash_2_64_a
0.97% [kernel] [k] unroll_tree_refs
0.96% libpthread-2.12.so [.] pthread_spin_lock <================ spinlock in userspace (in latency_probe)
0.92% libc-2.12.so [.] __random
0.87% latency_probe [.] core_fast_ring_push_from_producer
0.80% latency_probe [.] core_hash_table_group_key
0.69% latency_probe [.] core_memory_copy