[BIOSAL] Progress on the message delivery time
Boisvert, Sebastien
boisvert at anl.gov
Sat Nov 1 14:45:05 CDT 2014
> From: George K. Thiruvathukal [gkt at cs.luc.edu]
> Sent: Saturday, November 01, 2014 10:18 AM
> To: Boisvert, Sebastien
> Cc: biosal at lists.cels.anl.gov
> Subject: Re: [BIOSAL] Progress on the message delivery time
>
>
>
>
> Seb,
>
>
> This is great news.
> So if I am reading this correctly, we are now on par with other actor implementations in terms of order of magnitude (10^6 for the computation messaging rate looks promising).
Yes indeed. It is currently at 1.2 M msg / s at the machine level (32 x86-64 cores).
I think this can be increased to 2 M msg / s by reducing the number of instructions in the critical section (your suggestion) in
core_ring_push_multiple_producer.
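For context, here is a minimal sketch of what a lock-protected multiple-producer ring push can look like, just to make "critical section" concrete (hypothetical code, not the actual core_ring implementation):

#include <pthread.h>

/* Hypothetical multiple-producer ring; not the actual biosal core_ring code. */
struct ring {
    void **cells;
    int capacity;
    int head;               /* next free slot */
    int tail;               /* next slot the consumer reads */
    pthread_spinlock_t lock;
};

/* Push from any producer thread. Every instruction executed while the
 * lock is held makes the other producers spin longer, which is why
 * shrinking this critical section raises the push rate. */
static int ring_push_multiple_producers(struct ring *self, void *element)
{
    int pushed = 0;

    pthread_spin_lock(&self->lock);

    int next = (self->head + 1) % self->capacity;
    if (next != self->tail) {        /* one slot is kept empty to detect "full" */
        self->cells[self->head] = element;
        self->head = next;
        pushed = 1;
    }

    pthread_spin_unlock(&self->lock);

    return pushed;
}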
I am measuring throughput at 4 levels:
computation-throughput
node-throughput
worker-throughput
actor-throughput
In an ideal world, "worker-throughput" is the one we want to maximize, because "actor-throughput" depends on the number of actors per worker
and "node-throughput" increases when you have more CPU cores too (see the arithmetic sketch below).
> Which of your test programs is being used to do this benchmark?
The program is latency_probe.
=> https://github.com/sebhtml/biosal/tree/master/performance/latency_probe <=
It is very compact (one actor script):
[boisvert at bigmem biosal]$ wc -l performance/latency_probe/*
10 performance/latency_probe/main.c
11 performance/latency_probe/Makefile.mk
288 performance/latency_probe/process.c
26 performance/latency_probe/process.h
335 total
To run it:
[boisvert at bigmem biosal]$ time mpiexec -n 4 performance/latency_probe/latency_probe -threads-per-node 8 |tee log
...
real 1m33.844s
user 48m53.256s
sys 0m41.324s
Right now, a worker processes a message every 22 μs:
[boisvert at bigmem biosal]$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 4
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 28
PERFORMANCE_COUNTER actor-count = 2800
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 112000000
PERFORMANCE_COUNTER elapsed-time = 89.486134 s
PERFORMANCE_COUNTER computation-throughput = 1251590.555109 messages / s
PERFORMANCE_COUNTER node-throughput = 312897.638777 messages / s
PERFORMANCE_COUNTER worker-throughput = 44699.662682 messages / s
PERFORMANCE_COUNTER worker-latency = 22371 ns
PERFORMANCE_COUNTER actor-throughput = 446.996627 messages / s
PERFORMANCE_COUNTER actor-latency = 2237153 ns
According to LTTng (events thorium_actor:receive_enter and thorium_actor:receive_exit),
it takes around 3 μs to process ACTION_PING or ACTION_PING_REPLY:
Latency quantiles in a thorium node main loop (ns):
> quantile(r[,1]*1000*1000*1000, c(0.1, 0.3, 0.5, 0.7, 0.9, 0.98, 0.99))
   10%    30%    50%    70%    90%    98%    99%
   902    923    947   1032   2180   5563   7723
$ perf top -e branch-misses
Samples: 394K of event 'branch-misses', Event count (approx.): 121824414
13.09% latency_probe [.] core_hash_table_find_bucket
12.82% latency_probe [.] core_murmur_hash_2_64_a
8.08% libopen-pal.so.6.1.1 [.] 0x000000000005bb30
7.43% libmpi.so.1.5.0 [.] 0x000000000003eb97
4.68% libc-2.12.so [.] memcpy
4.49% latency_probe [.] core_fast_ring_pop_multiple_producers
3.16% latency_probe [.] core_red_black_tree_add
2.90% latency_probe [.] core_hash_table_group_state
2.79% latency_probe [.] thorium_worker_run
2.78% [vdso] [.] 0x00000000000009d2
2.26% perf [.] dso__find_symbol
2.24% [vsyscall] [.] 0x0000000000000400
2.21% perf [.] 0x000000000013c5ba
1.38% latency_probe [.] core_red_black_tree_get_lowest_key
1.38% latency_probe [.] core_fast_queue_dequeue
1.34% latency_probe [.] core_map_get_value
$ perf top -e cache-misses
Samples: 995K of event 'cache-misses', Event count (approx.): 342927320
16.26% libpthread-2.12.so [.] pthread_spin_lock
8.77% libc-2.12.so [.] memcpy
8.75% latency_probe [.] core_fast_ring_pop_multiple_producers
5.34% latency_probe [.] core_memory_copy
5.04% latency_probe [.] core_fast_ring_push_from_producer
4.26% latency_probe [.] core_hash_table_find_bucket
3.08% latency_probe [.] thorium_worker_run
2.81% mca_pml_ob1.so [.] mca_pml_ob1_iprobe
2.74% libc-2.12.so [.] __random
2.36% latency_probe [.] core_fast_ring_pop_from_consumer
1.80% latency_probe [.] thorium_worker_execute_local_delivery
1.69% [vdso] [.] 0x00000000000009bd
1.52% latency_probe [.] core_dynamic_hash_table_destroy
1.52% [vsyscall] [.] 0x0000000000000008
1.52% [kernel] [k] clear_page_c
1.34% latency_probe [.] thorium_node_run
1.28% latency_probe [.] thorium_node_resolve
1.22% latency_probe [.] thorium_node_prepare_received_message
1.07% libc-2.12.so [.] __random_r
Obviously, the spinlock is generating a lot of cache-coherency traffic across the system.
$ perf top -e cpu-cycles
Samples: 1M of event 'cpu-cycles', Event count (approx.): 601021284856
10.61% [vsyscall] [.] 0x000000000000014c
8.42% latency_probe [.] core_hash_table_find_bucket
5.24% libpthread-2.12.so [.] pthread_spin_lock
4.61% libc-2.12.so [.] memcpy
4.30% latency_probe [.] thorium_worker_run
3.94% latency_probe [.] core_fast_ring_pop_multiple_producers
3.46% mca_pml_ob1.so [.] mca_pml_ob1_iprobe
3.25% libpthread-2.12.so [.] 0x000000000000c392
3.14% [vdso] [.] 0x00000000000009eb
2.33% latency_probe [.] core_fast_ring_push_from_producer
2.32% latency_probe [.] core_memory_copy
2.19% latency_probe [.] core_murmur_hash_2_64_a
1.79% latency_probe [.] thorium_node_run
1.76% latency_probe [.] core_hash_table_group_state
1.72% latency_probe [.] core_fast_queue_dequeue
1.67% latency_probe [.] thorium_worker_dequeue_actor
1.34% latency_probe [.] thorium_node_resolve
1.14% latency_probe [.] core_fast_ring_pop_from_consumer
1.07% librt-2.12.so [.] clock_gettime
0.95% latency_probe [.] thorium_worker_free_message
- The [vdso] entry corresponds to kernel routines mapped into userspace to avoid system calls;
- vsyscall is related to the vDSO: it is the older, fixed-address mechanism that the vDSO replaced.
On the system I am using, this is the pthread_spin_lock code (objdump -d):
0000003f1a20c380 <pthread_spin_lock>:
3f1a20c380: f0 ff 0f lock decl (%rdi)
3f1a20c383: 75 0b jne 3f1a20c390 <pthread_spin_lock+0x10>
3f1a20c385: 31 c0 xor %eax,%eax
3f1a20c387: c3 retq
3f1a20c388: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
3f1a20c38f: 00
3f1a20c390: f3 90 pause
3f1a20c392: 83 3f 00 cmpl $0x0,(%rdi)
3f1a20c395: 7f e9 jg 3f1a20c380 <pthread_spin_lock>
3f1a20c397: eb f7 jmp 3f1a20c390 <pthread_spin_lock+0x10>
3f1a20c399: 90 nop
3f1a20c39a: 90 nop
3f1a20c39b: 90 nop
3f1a20c39c: 90 nop
3f1a20c39d: 90 nop
3f1a20c39e: 90 nop
3f1a20c39f: 90 nop
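Rendered as C, that disassembly corresponds roughly to the following (a readability sketch, not the actual glibc source):

/* Sketch of what the disassembly above does; not actual glibc code. */
static void pthread_spin_lock_sketch(volatile int *lock)
{
    /* The lock word is 1 when free, 0 or negative when held. */
    for (;;) {
        /* "lock decl (%rdi)": atomically decrement the lock word.
         * If it was 1 (free), it becomes 0 and we own the lock. */
        if (__atomic_fetch_sub(lock, 1, __ATOMIC_ACQUIRE) == 1)
            return;

        /* Otherwise spin with PAUSE, only reading the lock word,
         * until the holder stores 1 into it again. */
        while (*lock <= 0)
            __builtin_ia32_pause();
    }
}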
The spin_lock in musl-libc:
http://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_spin_lock.c
The a_swap (an atomic exchange) is defined in:
http://git.musl-libc.org/cgit/musl/tree/arch/x86_64/atomic.h#n62
a_spin is defined in this one:
http://git.musl-libc.org/cgit/musl/tree/arch/x86_64/atomic.h#n91
The 'pause' x86 instruction is a hint to the CPU that the code is in a spin-wait loop; it throttles the loop and reduces power usage and memory traffic while spinning.
Linux uses a ticket lock for its spin_lock() implementation.
The pthread spin lock (the IEEE/POSIX spec does not mandate any particular algorithm) is not a ticket lock.
The Linux x86 code:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/spinlock.h#n82
I think we should use a ticket lock too, to avoid having every waiter hammer the lock word with atomic read-modify-write operations.
Basically, with a ticket lock, each requester takes a ticket. It spins as long as current_ticket is not its own ticket
(someone else owns the lock). A sketch follows below.
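Here is a minimal ticket-lock sketch in C (assuming GCC/Clang __atomic builtins and x86 for the PAUSE hint; the names are hypothetical and this is not actual Thorium code):

#include <stdint.h>

/* Hypothetical ticket lock; not the actual Thorium/biosal code. */
struct ticket_lock {
    uint32_t next_ticket;    /* ticket handed to the next requester */
    uint32_t current_ticket; /* ticket that currently owns the lock */
};

static void ticket_lock_init(struct ticket_lock *self)
{
    self->next_ticket = 0;
    self->current_ticket = 0;
}

static void ticket_lock_lock(struct ticket_lock *self)
{
    /* One atomic fetch-and-add takes a ticket; after that, waiters
     * only read the lock word instead of writing it. */
    uint32_t my_ticket = __atomic_fetch_add(&self->next_ticket, 1,
                                            __ATOMIC_RELAXED);

    while (__atomic_load_n(&self->current_ticket, __ATOMIC_ACQUIRE) != my_ticket) {
        __builtin_ia32_pause(); /* x86 PAUSE hint while spinning */
    }
}

static void ticket_lock_unlock(struct ticket_lock *self)
{
    /* Hand the lock to the next ticket in FIFO order. */
    __atomic_store_n(&self->current_ticket, self->current_ticket + 1,
                     __ATOMIC_RELEASE);
}

Compared with the glibc loop above, waiters mostly read the cache line instead of repeatedly writing it, which should also reduce the cache-coherency traffic that shows up in the cache-misses profile.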
>
>
> I spent yesterday in our department/faculty meeting. Almost back in business.
>
>
> George
>
>
>
>
>
> George K. Thiruvathukal, PhD
>
> Professor of Computer Science, Loyola University Chicago
>
> Director, Center for Textual Studies and Digital Humanities
> Guest Faculty, Argonne National Laboratory, Math and Computer Science Division
> Editor in Chief, Computing in
> Science and Engineering (IEEE CS/AIP)
>
> (w) gkt.tv (v)
> 773.829.4872
>
>
>
>
>
>
>
> On Fri, Oct 31, 2014 at 8:48 PM, Boisvert, Sebastien
> <boisvert at anl.gov> wrote:
>
> > ________________________________________
> > From:
> biosal-bounces at lists.cels.anl.gov [biosal-bounces at lists.cels.anl.gov] on behalf of Boisvert,
> Sebastien [boisvert at anl.gov]
> > Sent: Thursday, October 30, 2014 9:26 PM
> > To:
> biosal at lists.cels.anl.gov
> > Subject: [BIOSAL] Progress on the message delivery time
>
> > OK, in the industry, people are getting 50 M messages / second with Akka on a single machine with 48 x86-64 Opteron cores. They are using
> > 2 actors per core (96 actors). The article is online here:
> http://letitcrash.com/post/20397701710/50-million-messages-per-second-on-a-single-machine
> > They are using a throughput setting of 20, which means that when an actor takes control of an x86 core, it can receive (one after the other)
> > up to 20 messages. I suppose therefore that any actor in the system has more than 1 in-flight message to any destination, otherwise the
> > throughput configuration would not do anything because each actor would statistically receive at most 1 message for any time window of, say, a couple of
> > μs. They don't mention in-flight messages in the blog article though, so I might be wrong on that anyway.
> > With the LMAX Disruptor (1 consumer, 1 producer), a throughput of 6 M msg / s is reported.
> > According to
> http://musings-of-an-erlang-priest.blogspot.com/2012/07/i-only-trust-benchmarks-i-have-rigged.html :
> > Erlang can tick at 1 M msg / s and Akka can do 2 M msg / s.
> > In that same article, they are using more than 1 in-flight message per source actor ("so the right way to test this is to push a lot of messages to it").
> > In F#, a single actor processes 4.6 M msg / s. I think, however, that their system is only running one actor, and also the code
> > contains some "shared memory black magic" here:
> http://zbray.com/2012/12/09/building-an-actor-in-f-with-higher-throughput-than-akka-and-erlang-actors/
> > Finally, the Gemini network (Cray XE6) is capable of moving " tens of millions of MPI messages per second"
> http://www.cray.com/Products/Computing/XE/Technology.aspx
> > I think this figure is for the whole system, not for an individual node.
> >
> > In Thorium:
> > Hardware: 32 x86-64 cores (Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz)
> > Rules: 100 actors per core; each source actor is allowed at most 1 in-flight request message (ACTION_PING).
> > Also, there is a timeout of 100 μs for the multiplexer, so messages typically wait at least 100 μs before leaving workers
> > so that they can be multiplexed.
> > ###
> > 4x8, 100 actors per core
> > 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> > Total sent message count: 112000000 (2800 * 40000)
> > Time: 196955912801 nanoseconds (196.955913 s)
> > Computation messaging rate: 568655.179767 messages / second
> > Node messaging rate: 142163.794942 messages / second
> > Worker messaging rate: 20309.113563 messages / second <=====================
> > Actor messaging rate: 203.091136 messages / second
> > With this, the delivery latency (at the worker level) is around 50 μs.
>
>
>
> The actor-level throughput was increased by ~80%:
>
>
> 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> Total sent message count: 112000000 (2800 * 40000)
> Time: 107121732947 nanoseconds (107.121733 s)
> Computation messaging rate: 1045539.471019 messages / second
> Node messaging rate: 261384.867755 messages / second
> Worker messaging rate: 37340.695394 messages / second
> Actor messaging rate: 373.406954 messages / second <=====================
>
>
>
>
> > ###
> > 1x4, 100 actors per core
> > 1 nodes, 3 worker threads (1 * 3), 300 actors (3 * 100)
> > Total sent message count: 12000000 (300 * 40000)
> > Time: 49616199643 nanoseconds (49.616200 s)
> > Computation messaging rate: 241856.492161 messages / second
> > Node messaging rate: 241856.492161 messages / second
> > Worker messaging rate: 80618.830720 messages / second <=====================
> > Actor messaging rate: 806.188307 messages / second
> > With this, the delivery latency (at the worker level) is around 12 μs.
> >
> > I moved the small message multiplexing into workers. Now, the small message demultiplexing and also the message
> > recycling code path must be migrated to workers too (outside of thorium_node).
> > My main goal is to reduce 4x8 down to 12 μs (like 1x4).
> >
> > Thanks.
> >
> > _______________________________________________
> > BIOSAL mailing list
> >
> BIOSAL at lists.cels.anl.gov
> >
> https://lists.cels.anl.gov/mailman/listinfo/biosal
> _______________________________________________
> BIOSAL mailing list
> BIOSAL at lists.cels.anl.gov
> https://lists.cels.anl.gov/mailman/listinfo/biosal
>
>
>
>
>