[BIOSAL] Progress on the message delivery time

George K. Thiruvathukal gkt at cs.luc.edu
Sat Nov 1 10:18:47 CDT 2014


Seb,

This is great news. So if I am reading this correctly, we are now on par
with other actor implementations in terms of order of magnitude (10^6
messages per second for the computation messaging rate looks promising).
Which of your test programs is being used for this benchmark?

I spent yesterday in our department/faculty meeting. Almost back in
business.

George


George K. Thiruvathukal, PhD
*Professor of Computer Science*, Loyola University Chicago
*Director*, Center for Textual Studies and Digital Humanities
*Guest Faculty*, Argonne National Laboratory, Math and Computer Science
Division
Editor in Chief, Computing in Science and Engineering
<http://www.computer.org/portal/web/computingnow/cise> (IEEE CS/AIP)
(w) gkt.tv (v) 773.829.4872


On Fri, Oct 31, 2014 at 8:48 PM, Boisvert, Sebastien <boisvert at anl.gov>
wrote:

> > ________________________________________
> > From: biosal-bounces at lists.cels.anl.gov [
> biosal-bounces at lists.cels.anl.gov] on behalf of Boisvert, Sebastien [
> boisvert at anl.gov]
> > Sent: Thursday, October 30, 2014 9:26 PM
> > To: biosal at lists.cels.anl.gov
> > Subject: [BIOSAL] Progress on the message delivery time
> > OK, in the industry, people are getting 50 M messages / second on a
> single machine with 48 x86-64 Opteron cores with Akka. They are using
> > 2 actors per core (96 actors). The article is online here:
> http://letitcrash.com/post/20397701710/50-million-messages-per-second-on-a-single-machine
> > They are using a throughput setting of 20, which means that when an
> actor takes control of an x86 core, it can receive (one after the other)
> > up to 20 messages. I suppose therefore that any actor in the system has
> more than 1 in-flight message to any destination; otherwise the
> > throughput configuration would not do anything, because each actor would
> statistically receive at most 1 message in any time window of, say, a
> couple of μs. They don't mention in-flight messages in the blog article,
> though, so I might be wrong on that anyway.
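
As a rough illustration of what such a throughput setting does, here is a
conceptual sketch in C (this is not Akka's or Thorium's actual code; every
name below is hypothetical): once the worker schedules an actor on a core,
the actor may drain up to 20 queued messages before giving the core back.

/* Conceptual sketch of a "throughput" scheduling budget. All names are
 * hypothetical. */
#define THROUGHPUT 20

struct message;

struct actor {
    /* Returns the next queued message, or NULL when the mailbox is empty. */
    struct message *(*dequeue)(struct actor *self);
    /* User-supplied receive behavior. */
    void (*receive)(struct actor *self, struct message *message);
};

static void worker_run_actor(struct actor *actor)
{
    int i;

    for (i = 0; i < THROUGHPUT; i++) {
        struct message *message = actor->dequeue(actor);

        if (message == NULL)
            break;  /* mailbox drained before the budget was spent */

        actor->receive(actor, message);
    }

    /* the actor now goes back to the ready queue behind the other actors */
}
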
> > With the LMAX Disruptor (1 consumer, 1 producer), a throughput of 6 M
> msg / sec is achieved.
> > According to
> http://musings-of-an-erlang-priest.blogspot.com/2012/07/i-only-trust-benchmarks-i-have-rigged.html :
> > Erlang can tick at 1 M msg / s and Akka can do 2 M msg / s.
> > In that same article, they are using more than 1 in-flight message per
> source actor ("so the right way to test this is to push a lot of messages
> to it").
> > In F#, a single actor processes 4.6 M msg / s. I think, however, that
> their system is only running one actor, and that the code also contains
> > some "shared memory black magic" here:
> http://zbray.com/2012/12/09/building-an-actor-in-f-with-higher-throughput-than-akka-and-erlang-actors/
> > Finally, the Gemini network (Cray XE6) is capable of moving "tens of
> millions of MPI messages per second":
> http://www.cray.com/Products/Computing/XE/Technology.aspx
> > I think this figure is for the whole system, not for an individual node.
> >
> > In Thorium:
> > Hardware: 32 x86-64 cores (Intel(R) Xeon(R) CPU E7-4830 @ 2.13GHz)
> > Rules: 100 actors per core, each source actor is allowed at most 1
> in-flight request message (ACTION_PING).
> > Also, there is a timeout of 100 μs for the multiplexer, so messages
> typically wait at least 100 μs before leaving workers so that they can
> be multiplexed.
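
As a point of reference for how such a multiplexer can work, here is a
minimal sketch in C with hypothetical type and function names (this is not
the actual Thorium source): small messages bound for the same destination
accumulate in one buffer, which is flushed when it fills up or when the
oldest buffered message has waited 100 μs.

/* Sketch of small-message multiplexing with a flush timeout. All names
 * are hypothetical; only the buffering policy matters here. */
#include <stdint.h>
#include <string.h>

#define FLUSH_TIMEOUT_NANOSECONDS 100000   /* 100 μs */
#define BUFFER_CAPACITY 4096

struct multiplexed_buffer {
    char payload[BUFFER_CAPACITY];
    int size;
    uint64_t first_enqueue_time;           /* nanoseconds; valid when size > 0 */
};

/* Assumed to be provided elsewhere: a monotonic clock and the raw send path. */
uint64_t monotonic_time_nanoseconds(void);
void transport_send(int destination_node, const void *payload, int size);

static void multiplexer_flush(struct multiplexed_buffer *buffer, int destination_node)
{
    if (buffer->size == 0)
        return;

    transport_send(destination_node, buffer->payload, buffer->size);
    buffer->size = 0;
}

static void multiplexer_enqueue(struct multiplexed_buffer *buffer, int destination_node,
                                const void *data, int size)
{
    /* Flush first if this small message would not fit. */
    if (buffer->size + size > BUFFER_CAPACITY)
        multiplexer_flush(buffer, destination_node);

    if (buffer->size == 0)
        buffer->first_enqueue_time = monotonic_time_nanoseconds();

    memcpy(buffer->payload + buffer->size, data, size);
    buffer->size += size;
}

/* Polled by the worker loop; enforces the 100 μs upper bound on buffering delay. */
static void multiplexer_test_timeout(struct multiplexed_buffer *buffer, int destination_node)
{
    if (buffer->size > 0
        && monotonic_time_nanoseconds() - buffer->first_enqueue_time >= FLUSH_TIMEOUT_NANOSECONDS)
        multiplexer_flush(buffer, destination_node);
}
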
> > ###
> > 4x8, 100 actors per core
> > 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> > Total sent message count: 112000000 (2800 * 40000)
> > Time: 196955912801 nanoseconds (196.955913 s)
> > Computation messaging rate: 568655.179767 messages / second
> > Node messaging rate: 142163.794942 messages / second
> > Worker messaging rate: 20309.113563 messages / second   <=====================
> > Actor messaging rate: 203.091136 messages / second
> > With this, the delivery latency (at the worker level) is around 50 μs.
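
The derived rates above follow directly from the totals: 112000000 messages /
196.955913 s ≈ 568655 msg/s, and dividing by 4 nodes, 28 workers, or 2800
actors gives the other three lines. The ~50 μs figure is the reciprocal of
the worker rate, since each actor keeps only one message in flight. A small
standalone C check of that arithmetic:

/* Reproducing the derived figures from the 4x8 run quoted above. */
#include <stdio.h>

int main(void)
{
    double messages = 112000000.0;      /* 2800 actors * 40000 messages each */
    double seconds = 196.955913;
    double nodes = 4.0, workers = 28.0, actors = 2800.0;

    double computation_rate = messages / seconds;                   /* ~568655 msg/s */

    printf("computation: %f msg/s\n", computation_rate);
    printf("node:        %f msg/s\n", computation_rate / nodes);    /* ~142164 msg/s */
    printf("worker:      %f msg/s\n", computation_rate / workers);  /* ~20309 msg/s  */
    printf("actor:       %f msg/s\n", computation_rate / actors);   /* ~203 msg/s    */

    /* With 1 in-flight message per actor, the mean spacing between messages
     * handled by one worker approximates the worker-level delivery latency:
     * 1 / 20309 msg/s ≈ 49 μs, i.e. the "around 50 μs" above. */
    printf("worker-level latency: ~%f us\n", 1e6 * workers / computation_rate);

    return 0;
}
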
>
>
> The actor-level throughput was increased by ~80%:
>
>
> 4 nodes, 28 worker threads (4 * 7), 2800 actors (28 * 100)
> Total sent message count: 112000000 (2800 * 40000)
> Time: 107121732947 nanoseconds (107.121733 s)
> Computation messaging rate: 1045539.471019 messages / second
> Node messaging rate: 261384.867755 messages / second
> Worker messaging rate: 37340.695394 messages / second
> Actor messaging rate: 373.406954 messages / second   <=====================
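
For reference, 373.406954 / 203.091136 ≈ 1.84, so the per-actor rate (and every
derived rate, since the total message count is unchanged) went up by roughly
84%, with the run time dropping from about 197 s to about 107 s.
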
>
>
>
> > ###
> > 1x4, 100 actors per core
> > 1 nodes, 3 worker threads (1 * 3), 300 actors (3 * 100)
> > Total sent message count: 12000000 (300 * 40000)
> > Time: 49616199643 nanoseconds (49.616200 s)
> > Computation messaging rate: 241856.492161 messages / second
> > Node messaging rate: 241856.492161 messages / second
> > Worker messaging rate: 80618.830720 messages / second   <=====================
> > Actor messaging rate: 806.188307 messages / second
> > With this, the delivery latency (at the worker level) is around 12 μs.
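
The same arithmetic as for the 4x8 run applies here: 1 / 80618.830720 messages
per second per worker ≈ 12.4 μs between messages at a worker, which matches the
~12 μs figure.
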
> >
> > I moved the small message multiplexing into workers. Now, the small
> message demultiplexing and the message recycling code path must also be
> > migrated to workers (outside of thorium_node).
> > My main goal is to reduce the 4x8 latency down to 12 μs (like the 1x4 run).
> >
> > Thanks.
> >
> _______________________________________________
> BIOSAL mailing list
> BIOSAL at lists.cels.anl.gov
> https://lists.cels.anl.gov/mailman/listinfo/biosal
>