[BIOSAL] Results with Xeon and Xeon Phi

Boisvert, Sebastien boisvert at anl.gov
Tue Nov 4 12:49:38 CST 2014


Hi Fangfang,

I have new results with version 2d045629, and they are very promising.
In summary, we are getting 1.9 M messages / s on 122 threads.

This is equivalent to 4 Intel Xeon E7-4830 processors (2.1 M messages / s) in terms of throughput;
see http://lists.cels.anl.gov/pipermail/biosal/2014-November/000041.html for the Intel Xeon E7-4830 results.

Intel Xeon Phi 7120A tests were done on jenny-mic0.
http://ark.intel.com/products/80555/Intel-Xeon-Phi-Coprocessor-7120A-16GB-1_238-GHz-61-core

boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 61 |tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 60
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 60
PERFORMANCE_COUNTER actor-count = 6000
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 240000000
PERFORMANCE_COUNTER elapsed-time = 200.943560 s
PERFORMANCE_COUNTER computation-throughput = 1194365.223777 messages / s
PERFORMANCE_COUNTER node-throughput = 1194365.223777 messages / s
PERFORMANCE_COUNTER worker-throughput = 19906.087063 messages / s
PERFORMANCE_COUNTER worker-latency = 50235 ns
PERFORMANCE_COUNTER actor-throughput = 199.060871 messages / s
PERFORMANCE_COUNTER actor-latency = 5023588 ns
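
For reference, the derived counters in this run can be reproduced from the raw ones. The sketch below is not taken from latency_probe itself; it assumes throughput is message-count divided by elapsed-time, and that each latency is the reciprocal of the matching throughput expressed in nanoseconds. The variable names are mine.

```python
# Raw counters copied from the 61-thread run above.
worker_count = 60
actor_count_per_worker = 100
message_count_per_actor = 40000
elapsed_time_s = 200.943560

actor_count = worker_count * actor_count_per_worker
message_count = actor_count * message_count_per_actor

# Throughput: messages divided by wall-clock time.
node_throughput = message_count / elapsed_time_s
worker_throughput = node_throughput / worker_count
actor_throughput = node_throughput / actor_count

# Assumption: latency is the reciprocal of throughput, in nanoseconds.
worker_latency_ns = 1e9 / worker_throughput
actor_latency_ns = 1e9 / actor_throughput

print(f"message-count     = {message_count}")
print(f"node-throughput   = {node_throughput:.6f} messages / s")
print(f"worker-throughput = {worker_throughput:.6f} messages / s")
print(f"worker-latency    = {worker_latency_ns:.0f} ns")
print(f"actor-throughput  = {actor_throughput:.6f} messages / s")
print(f"actor-latency     = {actor_latency_ns:.0f} ns")
```

Under these assumptions the values agree with the reported counters to within rounding of the printed elapsed time.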

boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 122 |tee log
boisvert at jenny-mic0:~$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 121
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 121
PERFORMANCE_COUNTER actor-count = 12100
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 484000000
PERFORMANCE_COUNTER elapsed-time = 247.985263 s
PERFORMANCE_COUNTER computation-throughput = 1951728.880732 messages / s
PERFORMANCE_COUNTER node-throughput = 1951728.880732 messages / s
PERFORMANCE_COUNTER worker-throughput = 16129.990750 messages / s
PERFORMANCE_COUNTER worker-latency = 61996 ns
PERFORMANCE_COUNTER actor-throughput = 161.299907 messages / s
PERFORMANCE_COUNTER actor-latency = 6199631 ns

boisvert at jenny-mic0:~$ ./latency_probe -threads-per-node 200 | tee KNC-1x200
boisvert at jenny-mic0:~$ grep COUNTER KNC-1x200
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 199
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 199
PERFORMANCE_COUNTER actor-count = 19900
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 796000000
PERFORMANCE_COUNTER elapsed-time = 420.908741 s
PERFORMANCE_COUNTER computation-throughput = 1891146.280737 messages / s
PERFORMANCE_COUNTER node-throughput = 1891146.280737 messages / s
PERFORMANCE_COUNTER worker-throughput = 9503.247642 messages / s
PERFORMANCE_COUNTER worker-latency = 105227 ns
PERFORMANCE_COUNTER actor-throughput = 95.032476 messages / s
PERFORMANCE_COUNTER actor-latency = 10522718 ns

boisvert at jenny-mic0:~$ grep COUNTER KNC-1x244
PERFORMANCE_COUNTER node-count = 1
PERFORMANCE_COUNTER worker-count-per-node = 243
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 243
PERFORMANCE_COUNTER actor-count = 24300
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 972000000
PERFORMANCE_COUNTER elapsed-time = 561.661087 s
PERFORMANCE_COUNTER computation-throughput = 1730580.988668 messages / s
PERFORMANCE_COUNTER node-throughput = 1730580.988668 messages / s
PERFORMANCE_COUNTER worker-throughput = 7121.732464 messages / s
PERFORMANCE_COUNTER worker-latency = 140415 ns
PERFORMANCE_COUNTER actor-throughput = 71.217325 messages / s
PERFORMANCE_COUNTER actor-latency = 14041527 ns
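
To compare the four runs side by side, here is a small sketch that computes the speedup relative to the 60-worker run and the corresponding parallel efficiency. The throughput values are copied from the logs above; the "efficiency" definition (speedup divided by the ratio of worker counts) is my own choice for illustration, not something latency_probe reports.

```python
# (worker-count, node-throughput in messages / s), copied from the four runs above.
runs = [
    (60, 1194365.223777),
    (121, 1951728.880732),
    (199, 1891146.280737),
    (243, 1730580.988668),
]

base_workers, base_throughput = runs[0]

for workers, throughput in runs:
    speedup = throughput / base_throughput
    # Efficiency: achieved speedup divided by the ideal (linear) speedup.
    efficiency = speedup / (workers / base_workers)
    print(f"{workers:4d} workers: {throughput / 1e6:.2f} M messages / s, "
          f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")

# Configuration with the highest node throughput.
best_workers, best_throughput = max(runs, key=lambda run: run[1])
print(f"best: {best_workers} workers")
```

Throughput peaks at 121 workers (about 2 threads per core on the 61-core card) and drops as more threads are added.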


The code was built without any change from the git tree.
The build commands:
export INTEL_LICENSE_FILE=28518 at ftsn2
source /soft/compilers/intel/composer_xe_2013_sp1.1.106/bin/compilervars.sh
make -j CFLAGS="-mmic -O3 -I." CC=icc CONFIG_MPI=n CONFIG_ZLIB=n


The executable is not compatible with x86-64:

Machine type: k1om, 64 bits (not x86-64 according to readelf)
[boisvert at jenny biosal]$ file performance/latency_probe/latency_probe 
performance/latency_probe/latency_probe: ELF 64-bit LSB executable, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.16, not stripped
[boisvert at jenny biosal]$ readelf performance/latency_probe/latency_probe -a | grep interpreter
      [Requesting program interpreter: /lib64/ld-linux-k1om.so.2]
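
Since file does not print a machine name here, the machine type can also be checked directly from the ELF header. This is a minimal sketch, assuming the usual ELF layout (16-bit e_machine field at offset 18) and the e_machine values 62 for x86-64 and 181 for k1om as defined in the binutils headers. The function names are hypothetical.

```python
import struct

# e_machine values (assumed from the binutils ELF headers).
EM_X86_64 = 62
EM_K1OM = 181


def elf_machine(header: bytes) -> int:
    """Return the e_machine field from the first 20 bytes of an ELF file."""
    if len(header) < 20 or header[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    # EI_DATA (byte 5): 1 = little-endian, 2 = big-endian.
    endian = "<" if header[5] == 1 else ">"
    # e_machine is a 16-bit field at offset 18, right after e_type.
    (machine,) = struct.unpack_from(endian + "H", header, 18)
    return machine


def read_elf_machine(path: str) -> int:
    """Read the ELF header of the file at path and return its e_machine."""
    with open(path, "rb") as f:
        return elf_machine(f.read(20))
```

For a binary built with -mmic, read_elf_machine("performance/latency_probe/latency_probe") would be expected to return EM_K1OM rather than EM_X86_64.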


> From: Fangfang Xia [fangfang.xia at gmail.com]
> Sent: Monday, November 03, 2014 3:47 PM
> To: Boisvert, Sebastien
> Cc: biosal at lists.cels.anl.gov
> Subject: Re: [BIOSAL] Results with Xeon and Xeon Phi
> 
> 
> This is interesting. I’m curious what the call stacks for these spin locks are?
> 
> On Nov 3, 2014, at 3:35 PM, Boisvert, Sebastien <boisvert at anl.gov> wrote:
> 42.42%
>   [kernel]                          [k] _spin_lock    
> 
> 
> 

