[BIOSAL] Progress on the message delivery time
Boisvert, Sebastien
boisvert at anl.gov
Sat Nov 1 14:45:05 CDT 2014
> From: George K. Thiruvathukal [gkt at cs.luc.edu]
> Sent: Saturday, November 01, 2014 10:18 AM
> To: Boisvert, Sebastien
> Cc: biosal at lists.cels.anl.gov
> Subject: Re: [BIOSAL] Progress on the message delivery time
>
>
>
>
> Seb,
>
>
> This is great news.
> So if I am reading this correctly, we are now on par with other actor implementations in terms of order of magnitude (10^6 for the computation messaging rate looks promising).
Yes indeed. It is currently at 1.2 M msg / s at the machine level (32 x86-64 cores).
I think this can be increased to 2 M msg / s by reducing the number of instructions in the critical section (your suggestion) in
core_ring_push_multiple_producer.
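For context, here is a minimal sketch of what a lock-protected multiple-producer ring push can look like, just to make "critical section" concrete (hypothetical code, not the actual core_ring implementation):

#include <pthread.h>

/* Hypothetical multiple-producer ring; not the actual biosal core_ring code. */
struct ring {
    void **cells;
    int capacity;
    int head;               /* next free slot */
    int tail;               /* next slot the consumer reads */
    pthread_spinlock_t lock;
};

/* Push from any producer thread. Every instruction executed while the
 * lock is held makes the other producers spin longer, which is why
 * shrinking this critical section raises the push rate. */
static int ring_push_multiple_producers(struct ring *self, void *element)
{
    int pushed = 0;

    pthread_spin_lock(&self->lock);

    int next = (self->head + 1) % self->capacity;
    if (next != self->tail) {        /* one slot is kept empty to detect "full" */
        self->cells[self->head] = element;
        self->head = next;
        pushed = 1;
    }

    pthread_spin_unlock(&self->lock);

    return pushed;
}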
I am measuring throughput at 4 levels:
computation-throughput
node-throughput
worker-throughput
actor-throughput
In an ideal world, "worker-throughput" is the one we want to maximize, because "actor-throughput" depends on the number of actors per worker
and "node-throughput" increases when you have more CPU cores too (see the arithmetic sketch below).
> Which of your test programs is being used to do this benchmark?
The program is latency_probe.
=> https://github.com/sebhtml/biosal/tree/master/performance/latency_probe <=
It is very compact (one actor script):
[boisvert at bigmem biosal]$ wc -l performance/latency_probe/*
10 performance/latency_probe/main.c
11 performance/latency_probe/Makefile.mk
288 performance/latency_probe/process.c
26 performance/latency_probe/process.h
335 total
To run it:
[boisvert at bigmem biosal]$ time mpiexec -n 4 performance/latency_probe/latency_probe -threads-per-node 8 |tee log
...
real 1m33.844s
user 48m53.256s
sys 0m41.324s
Right now, a worker processes a message every 22 μs:
[boisvert at bigmem biosal]$ grep COUNTER log
PERFORMANCE_COUNTER node-count = 4
PERFORMANCE_COUNTER worker-count-per-node = 7
PERFORMANCE_COUNTER actor-count-per-worker = 100
PERFORMANCE_COUNTER worker-count = 28
PERFORMANCE_COUNTER actor-count = 2800
PERFORMANCE_COUNTER message-count-per-actor = 40000
PERFORMANCE_COUNTER message-count = 112000000
PERFORMANCE_COUNTER elapsed-time = 89.486134 s
PERFORMANCE_COUNTER computation-throughput = 1251590.555109 messages / s
PERFORMANCE_COUNTER node-throughput = 312897.638777 messages / s
PERFORMANCE_COUNTER worker-throughput = 44699.662682 messages / s
PERFORMANCE_COUNTER worker-latency = 22371 ns
PERFORMANCE_COUNTER actor-throughput = 446.996627 messages / s
PERFORMANCE_COUNTER actor-latency = 2237153 ns
According to LTTng (events thorium_actor:receive_enter and thorium_actor:receive_exit),
it takes around 3 μs to process ACTION_PING or ACTION_PING_REPLY:
Latency quantiles in a thorium node main loop (ns):
> quantile(r[,1]*1000*1000*1000, c(0.1, 0.3, 0.5, 0.7, 0.9, 0.98, 0.99))
   10%    30%    50%    70%    90%    98%    99%
   902    923    947   1032   2180   5563   7723
$ perf top -e branch-misses
Samples: 394K of event 'branch-misses', Event count (approx.): 121824414
13.09% latency_probe [.] core_hash_table_find_bucket
12.82% latency_probe [.] core_murmur_hash_2_64_a
8.08% libopen-pal.so.6.1.1 [.] 0x000000000005bb30
7.43% libmpi.so.1.5.0 [.] 0x000000000003eb97
4.68% libc-2.12.so [.] memcpy
4.49% latency_probe [.] core_fast_ring_pop_multiple_producers
3.16% latency_probe [.] core_red_black_tree_add
2.90% latency_probe [.] core_hash_table_group_state
2.79% latency_probe [.] thorium_worker_run
2.78% [vdso] [.] 0x00000000000009d2
2.26% perf [.] dso__find_symbol
2.24% [vsyscall] [.] 0x0000000000000400
2.21% perf [.] 0x000000000013c5ba
1.38% latency_probe [.] core_red_black_tree_get_lowest_key
1.38% latency_probe [.] core_fast_queue_dequeue
1.34% latency_probe [.] core_map_get_value
$ perf top -e cache-misses
Samples: 995K of event 'cache-misses', Event count (approx.): 342927320
16.26% libpthread-2.12.so [.] pthread_spin_lock
8.77% libc-2.12.so [.] memcpy
8.75% latency_probe [.] core_fast_ring_pop_multiple_producers
5.34% latency_probe [.] core_memory_copy
5.04% latency_probe [.] core_fast_ring_push_from_producer
4.26% latency_probe [.] core_hash_table_find_bucket
3.08% latency_probe [.] thorium_worker_run
2.81% mca_pml_ob1.so [.] mca_pml_ob1_iprobe
2.74% libc-2.12.so [.] __random
2.36% latency_probe [.] core_fast_ring_pop_from_consumer
1.80% latency_probe [.] thorium_worker_execute_local_delivery
1.69% [vdso] [.] 0x00000000000009bd
1.52% latency_probe [.] core_dynamic_hash_table_destroy
1.52% [vsyscall] [.] 0x0000000000000008
1.52% [kernel] [k] clear_page_c
1.34% latency_probe [.] thorium_node_run
1.28% latency_probe [.] thorium_node_resolve
1.22% latency_probe [.] thorium_node_prepare_received_message
1.07% libc-2.12.so [.] __random_r
Obviously, the spinlock is generating a lot of cache-coherency traffic across the system.
$ perf top -e cpu-cycles
Samples: 1M of event 'cpu-cycles', Event count (approx.): 601021284856
10.61% [vsyscall] [.] 0x000000000000014c
8.42% latency_probe [.] core_hash_table_find_bucket
5.24% libpthread-2.12.so [.] pthread_spin_lock
4.61% libc-2.12.so [.] memcpy
4.30% latency_probe [.] thorium_worker_run
3.94% latency_probe [.] core_fast_ring_pop_multiple_producers
3.46% mca_pml_ob1.so [.] mca_pml_ob1_iprobe
3.25% libpthread-2.12.so [.] 0x000000000000c392
3.14% [vdso] [.] 0x00000000000009eb
2.33% latency_probe [.] core_fast_ring_push_from_producer
2.32% latency_probe [.] core_memory_copy
2.19% latency_probe [.] core_murmur_hash_2_64_a
1.79% latency_probe [.] thorium_node_run
1.76% latency_probe [.] core_hash_table_group_state
1.72% latency_probe [.] core_fast_queue_dequeue
1.67% latency_probe [.] thorium_worker_dequeue_actor
1.34% latency_probe [.] thorium_node_resolve
1.14% latency_probe [.] core_fast_ring_pop_from_consumer
1.07% librt-2.12.so [.] clock_gettime
0.95% latency_probe [.] thorium_worker_free_message
- The [vdso] entry corresponds to kernel routines mapped into userspace to avoid system calls;
- vsyscall is related to the vDSO: it is the older, fixed-address mechanism that the vDSO replaced.
On the system I am using, this is the pthread_spin_lock code (objdump -d):
0000003f1a20c380 <pthread_spin_lock>:
3f1a20c380: f0 ff 0f lock decl (%rdi)
3f1a20c383: 75 0b jne 3f1a20c390 <pthread_spin_lock+0x10>
3f1a20c385: 31 c0 xor %eax,%eax
3f1a20c387: c3 retq
3f1a20c388: 0f 1f 84 00 00 00 00 nopl 0x0(%rax,%rax,1)
3f1a20c38f: 00
3f1a20c390: f3 90 pause
3f1a20c392: 83 3f 00 cmpl $0x0,(%rdi)
3f1a20c395: 7f e9 jg 3f1a20c380 <pthread_spin_lock>
3f1a20c397: eb f7 jmp 3f1a20c390 <pthread_spin_lock+0x10>
3f1a20c399: 90 nop
3f1a20c39a: 90 nop
3f1a20c39b: 90 nop
3f1a20c39c: 90 nop
3f1a20c39d: 90 nop
3f1a20c39e: 90 nop
3f1a20c39f: 90 nop
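Rendered as C, that disassembly corresponds roughly to the following (a readability sketch, not the actual glibc source):

/* Sketch of what the disassembly above does; not actual glibc code. */
static void pthread_spin_lock_sketch(volatile int *lock)
{
    /* The lock word is 1 when free, 0 or negative when held. */
    for (;;) {
        /* "lock decl (%rdi)": atomically decrement the lock word.
         * If it was 1 (free), it becomes 0 and we own the lock. */
        if (__atomic_fetch_sub(lock, 1, __ATOMIC_ACQUIRE) == 1)
            return;

        /* Otherwise spin with PAUSE, only reading the lock word,
         * until the holder stores 1 into it again. */
        while (*lock <= 0)
            __builtin_ia32_pause();
    }
}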
The spin_lock in musl-libc:
http://git.musl-libc.org/cgit/musl/tree/src/thread/pthread_spin_lock.c
The a_swap (an atomic exchange) is defined in:
http://git.musl-libc.org/cgit/musl/tree/arch/x86_64/atomic.h#n62
a_spin is defined in this one:
http://git.musl-libc.org/cgit/musl/tree/arch/x86_64/atomic.h#n91
The 'pause' x86 instruction is a hint to the CPU that the code is in a spin-wait loop; it throttles the loop and reduces power usage and memory traffic while spinning.
Linux uses a ticket lock for its spin_lock() implementation.
The pthread spin lock (the IEEE/POSIX spec does not mandate any particular algorithm) is not a ticket lock.
The Linux x86 code:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/arch/x86/include/asm/spinlock.h#n82
I think we should use a ticket lock too, to avoid having every waiter hammer the lock word with atomic read-modify-write operations.
Basically, with a ticket lock, each requester takes a ticket. It spins as long as current_ticket is not its own ticket
(someone else owns the lock). A sketch follows below.
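Here is a minimal ticket-lock sketch in C (assuming GCC/Clang __atomic builtins and x86 for the PAUSE hint; the names are hypothetical and this is not actual Thorium code):

#include <stdint.h>

/* Hypothetical ticket lock; not the actual Thorium/biosal code. */
struct ticket_lock {
    uint32_t next_ticket;    /* ticket handed to the next requester */
    uint32_t current_ticket; /* ticket that currently owns the lock */
};

static void ticket_lock_init(struct ticket_lock *self)
{
    self->next_ticket = 0;
    self->current_ticket = 0;
}

static void ticket_lock_lock(struct ticket_lock *self)
{
    /* One atomic fetch-and-add takes a ticket; after that, waiters
     * only read the lock word instead of writing it. */
    uint32_t my_ticket = __atomic_fetch_add(&self->next_ticket, 1,
                                            __ATOMIC_RELAXED);

    while (__atomic_load_n(&self->current_ticket, __ATOMIC_ACQUIRE) != my_ticket) {
        __builtin_ia32_pause(); /* x86 PAUSE hint while spinning */
    }
}

static void ticket_lock_unlock(struct ticket_lock *self)
{
    /* Hand the lock to the next ticket in FIFO order. */
    __atomic_store_n(&self->current_ticket, self->current_ticket + 1,
                     __ATOMIC_RELEASE);
}

Compared with the glibc loop above, waiters mostly read the cache line instead of repeatedly writing it, which should also reduce the cache-coherency traffic that shows up in the cache-misses profile.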
>
>
> I spent yesterday in our department/faculty meeting. Almost back in business.
>
>
> George
>
>
>
>
>
> George K. Thiruvathukal, PhD
>
> Professor of Computer Science, Loyola University Chicago
>
> Director, Center for Textual Studies and Digital Humanities
> Guest Faculty, Argonne National Laboratory, Math and Computer Science Division
> Editor in Chief, Computing in
> Science and Engineering (IEEE CS/AIP)
>
> (w) gkt.tv (v)
> 773.829.4872
>
>
>
>
>
>
>
> On Fri, Oct 31, 2014 at 8:48 PM, Boisvert, Sebastien
> <boisvert at anl.gov> wrote:
>
> > ________________________________________
> > From:
> biosal-bounces at lists.cels.anl.gov [biosal-bounces at lists.cels.anl.gov] on behalf of Boisvert,
> Sebastien [boisvert at anl.gov]
> > Sent: Thursday, October 30, 2014 9:26 PM
> > To:
> biosal at lists.cels.anl.gov
> > Subject: [BIOSAL] Progress on the message delivery time
>
> > OK, in the industry, people are getting 50 M messages / second with Akka on a single machine with 48 x86-64 Opteron cores. They are using
> > 2 actors per core (96 actors). The article is online here:
> http://letitcrash.com/post/20397701710/50-million-messages-per-second-on-a-single-machine
> > They are using a throughput setting of 20, which means that when an actor takes control of an x86 core, it can receive (one after the other)
> > up to 20 messages. I suppose therefore that any actor in the system has more than 1 in-flight message to any destination, otherwise the
> > throughput configuration would not do anything because each actor would statistically receive at most 1 message for any time window of, say, a couple of
> > μs. They don't mention in-flight messages in the blog article though, so I might be wrong on that anyway.
> > With the LMAX Disruptor (1 consumer, 1 producer), a throughput of 6 M msg / s is reported.
> > According to
> http://musings-of-an-erlang-priest.blogspot.com/2012/07/i-only-trust-benchmarks-i-have-rigged.html :
> > Erlang can tick at 1 M msg / s and Akka can do 2 M msg / s.
> > In that same article, they are using more than 1 in-flight message per source actor ("so the right way to test this is to push a lot of messages to it").
> > In F#, a single actor processes 4.6 M msg / s. I think, however, that their system is only running one actor, and also the code
> > contains some "shared memory black magic" here:
> http://zbray.com/2012/12/09/building-an-actor-in-f-with-higher-throughput-than-akka-and-erlang-actors/
> > Finally, the Gemini network (Cray XE6) is capable of moving " tens of millions of MPI messages per second"
> http://www.cray.com/Products/Computing/XE/Technology.aspx
> > I think this figure is for the whole system, not for an individual node.
> >
> > In Thorium:
> > Hardware: 32 x86-64 cores (Intel(R) Xeon(R) CPU E7- 4830 @ 2.13GHz)
> > Rules: 100 actors per core; each source actor is allowed at most 1 in-flight request message (ACTION_PING).
> > Also, there is a timeout of 100 μs for the multiplexer, so messages typically wait at least 100 μs before leaving workers
> > so that they can be multiplexed.
> > ###
> > 4x8, 100 actors per core
> > 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> > Total sent message count: 112000000 (2800 * 40000)
> > Time: 196955912801 nanoseconds (196.955913 s)
> > Computation messaging rate: 568655.179767 messages / second
> > Node messaging rate: 142163.794942 messages / second
> > Worker messaging rate: 20309.113563 messages / second <=====================
> > Actor messaging rate: 203.091136 messages / second
> > With this, the delivery latency (at the worker level) is around 50 μs.
>
>
>
> The actor-level throughput was increased by ~80%:
>
>
> 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> Total sent message count: 112000000 (2800 * 40000)
> Time: 107121732947 nanoseconds (107.121733 s)
> Computation messaging rate: 1045539.471019 messages / second
> Node messaging rate: 261384.867755 messages / second
> Worker messaging rate: 37340.695394 messages / second
> Actor messaging rate: 373.406954 messages / second <=====================
>
>
>
>
> > ###
> > 1x4, 100 actors per core
> > 1 nodes, 3 worker threads (1 * 3), 300 actors (3 * 100)
> > Total sent message count: 12000000 (300 * 40000)
> > Time: 49616199643 nanoseconds (49.616200 s)
> > Computation messaging rate: 241856.492161 messages / second
> > Node messaging rate: 241856.492161 messages / second
> > Worker messaging rate: 80618.830720 messages / second <=====================
> > Actor messaging rate: 806.188307 messages / second
> > With this, the delivery latency (at the worker level) is around 12 μs.
> >
> > I moved the small message multiplexing into workers. Now, the small message demultiplexing and also the message
> > recycling code path must be migrated to workers too (outside of thorium_node).
> > My main goal is to reduce 4x8 down to 12 μs (like 1x4).
> >
> > Thanks.
> >
> > _______________________________________________
> > BIOSAL mailing list
> >
> BIOSAL at lists.cels.anl.gov
> >
> https://lists.cels.anl.gov/mailman/listinfo/biosal
> _______________________________________________
> BIOSAL mailing list
> BIOSAL at lists.cels.anl.gov
> https://lists.cels.anl.gov/mailman/listinfo/biosal
>
>
>
>
>