.NET 7.0's Performance Improvements with Dynamic PGO are Incredible

Akka.Remote is 33% faster, Akka.NET v1.5 is 75% faster in-memory.

.NET 7.0 was released to market last week and includes hundreds of major improvements across the board.

I ran Akka.NET’s RemotePingPong benchmark on .NET 7.0 shortly after installing the .NET 7.0 SDK - I’ll take every free lunch I can from the CoreCLR team.

Here’s how the numbers compare between .NET 6.0 and .NET 7.0 for RemotePingPong:

Akka.NET v1.4.45 RemotePingPong performance on .NET 6.0

Akka.NET v1.4.45 performs at roughly 340,000 msg/s on my Gen 1 AMD Ryzen machine built in 2017.

The results on .NET 7.0:

Akka.NET v1.4.45 RemotePingPong performance on .NET 7.0

360,000 msg/s, a roughly 10% improvement. Not bad.

After posting the results on Twitter another Akka.NET user suggested that I try re-running the benchmark with Dynamic Profile-Guided Optimization (PGO) enabled, which received significant investment in .NET 7.0.

The results? A shocking 33% improvement over .NET 6.0.

Akka.NET v1.4.45 RemotePingPong performance on .NET 7.0

A shocking ~450,000 msg/s - all in exchange for doing basically nothing other than adding the following line to my .csproj files for the benchmark:

<PropertyGroup>
    <TieredPGO>true</TieredPGO>
</PropertyGroup>

What’s going on with Dynamic PGO in .NET 7.0?!?

Dynamic Profile-Guided Optimization

Profile-Guided Optimization (PGO) is a compiler optimization technique that collects data on how a specific program executes and subsequently feeds that information into the compiler’s optimization process.

N.B. - I go into this in more details in the corresponding YouTube video “NET 7.0 Performance Improvements with Dynamic PGO are Incredible”.

There are two flavors of PGO used by the .NET Just-in-Time (JIT) compiler:

  • Static PGO: these are optimization decisions that are made before the program runs based on its past performance - this can be used to help reduce program startup time, optimize symbols for shared libraries, and so on.
  • Dynamic PGO: this is the big deal in .NET 7.0. Dynamic PGO allows the JIT to use profiling data to change the way code is compiled in the background at runtime, meaning that the code’s execution performance will change as the JIT utilizes information collected as a single instance of an application runs.

You can see from the PerfView data I gathered that the JIT is much more active when Dynamic PGO is enabled versus when it is not. This data is gathered using our in-memory messaging benchmark, PingPong.

.NET 7.0 JIT Stats without PGO

Jitting Trigger Num Compilations % of total jitted compilations Jit Time msec Jit Time (% of total process CPU)
TOTAL 4814 100.0 2054.1 0.5
Foreground 2232 46.4 434.1 0.1
Multicore JIT Background 0 0.0 0.0 0.0
Tiered Compilation Background 2582 53.6 1619.9 0.4

.NET 7.0 JIT Stats with PGO

Jitting Trigger Num Compilations % of total jitted compilations Jit Time msec Jit Time (% of total process CPU)
TOTAL 4793 100.0 3113.0 1.1
Foreground 2232 46.6 777.1 0.3
Multicore JIT Background 0 0.0 0.0 0.0
Tiered Compilation Background 2561 53.4 2335.9 0.8

Tradeoffs

From the data above we can see that the JIT on .NET 7.0 without PGO only used ~2s of total computation time and 0.5% of CPU. With Dynamic PGO enabled the .NET JIT took 3.1s of total computation time (30% increase) and 1.1% of total CPU (205% increase).

If I dug through the PerfView data more closely I suspect that the Dynamic PGO JIT results would also include larger allocation sizes too.

However, the benefits realized from enabling Dynamic PGO vastly outweigh the costs in our case - the extra second of JIT time was overcome by shaving off 10s of seconds from our benchmark throughput cumulatively.

What Can PGO Optimize?

We have a second benchmark in the Akka.NET project that we use for trying to measure the overhead of our dispatching system and actor base class overhead: PingPong.

This benchmark spins up 32 actors and increases the dispatcher.throughput value during each run - the greater the throughput value, the less context-switching (I explain this in “.NET Systems Programming Learned the Hard Way”), but there are some tradeoffs: when the throughput value grows too large we start seeing signs of actor starvation.

PingPong compares the performance, at each throughput run of two different actor types:

  • ActorBase - the most absolute bare-bones, no-frills actor type in Akka.NET;
  • ReceiveActor - a strongly typed actor that uses Receive<T> delegates and predicates to improve the developer experience and ease-of-use of Akka.NET.

In terms of raw throughput the ActorBase should be faster since it has fewer moving parts, but due to some clever performance optimizations made to the ReceiveActor (delegate compilation and caching) the performance has always been pretty much dead even by our measurements over the years.

Here’s what the performance graphs of these actors on PingPong look like for both .NET 6.0 and .NET 7.0:

PGO Impact: ActorBase vs. ReceiveActor

On .NET 6.0, the maximum throughput of the ActorBase and ReceiveActor both peak around 58 million messages per second.

Akka.NET v1.4.45 PingPong performance on .NET 6.0

On .NET 7.0 - about 64 million messages per second, a roughly 10% improvement over .NET 6.0.

Akka.NET v1.4.45 PingPong performance on .NET 7.0

On .NET 7.0 with Dynamic PGO enabled - 70 million messages per second for the ActorBase, a ~20% improvement over .NET 6.0. But the ReceiveActor performance is about the same as it was without PGO enabled. Hmmm.

Akka.NET v1.4.45 PingPong performance on .NET 7.0 with Dynamic PGO enabled

If we put the raw numbers into a plot, we can see that on .NET 7.0 and .NET 6.0 without PGO the performance numbers are pretty comparable between the two actor implementations.

.NET 6.0

Akka.NET v1.4.45 PingPong performance trends on .NET 6.0

View full size

.NET 7.0

Akka.NET v1.4.45 PingPong performance trends on .NET 7.0

View full size

.NET 7.0 with PGO

But with Dynamic PGO enabled on .NET 7.0, we can see a clear divergence in performance - the ActorBase consistently outperforms the ReceiveActor minus one outlier:

Akka.NET v1.4.45 PingPong performance trends on .NET 7.0 with Dynamic PGO enabled

View full size

Why the Divergence?

Why does the ActorBase now suddenly outperform the ReceiveActor after years of demonstrable parity between the two?

I think the answer lies in how simple the underlying implementations are:

public class ClientActorBase : ActorBase
{
    private readonly IActorRef _actor;
    private readonly TaskCompletionSource<bool> _latch;
    private long _received;
    private readonly long _repeat;
    private long _sent;

    public ClientActorBase(IActorRef actor, long repeat, 
    	TaskCompletionSource<bool> latch)
    {
        _actor = actor;
        _repeat = repeat;
        _latch = latch;
    }

    protected override bool Receive(object message)
    {
        if(message is Messages.Msg)
        {
            _received++;
            if(_sent < _repeat)
            {
                _actor.Tell(message);
                _sent++;
            }
            else if(_received >= _repeat)
            {
                //       Console.WriteLine("done {0}", Self.Path);
                _latch.SetResult(true);
            }
            return true;
        }
        if(message is Messages.Run)
        {
            var msg = new Messages.Msg();
            for(int i = 0; i < Math.Min(1000, _repeat); i++)
            {
                _actor.Tell(msg);
                _sent++;
            }
            return true;
        }
        if(message is Messages.Started)
        {
            Sender.Tell(message);
            return true;
        }
        return false;
    }
}

The ClientActorBase uses simple pattern matching using built-in language constructs - this could be refactored to use a switch statement, but the original structure of the code predates C#7 when that was introduced.

The ClientReceiveActor, on the other hand:

public class ClientReceiveActor : ReceiveActor
{
    public ClientReceiveActor(IActorRef actor, long repeat, TaskCompletionSource<bool> latch)
    {
        var received=0L;
        var sent=0L;
        Receive<Messages.Msg>(m =>
        {
            received++;
            if(sent < repeat)
            {
                actor.Tell(m);
                sent++;
            }
            else if(received >= repeat)
            {
                latch.SetResult(true);
            }
        });
        Receive<Messages.Run>(r =>
        {
            var msg = new Messages.Msg();
            for(int i = 0; i < Math.Min(1000, repeat); i++)
            {
                actor.Tell(msg);
                sent++;
            }
        });
        Receive<Messages.Started>(s => Sender.Tell(s));
    }
}

Why, that looks even simpler than the ActorBase implementation! Very clean, easy to read, etc… But here’s what’s going on underneath the covers:

public void Match<T>(Action<T> handler, Predicate<T> shouldHandle = null) where T : TItem
{
    EnsureCanAdd();
    var handlesType = typeof(T);
    AddHandler(handlesType, PredicateAndHandler.CreateAction(handler, shouldHandle));
    if(handlesType == _itemType && shouldHandle == null)
        _state = State.MatchAnyAdded;
}

private void AddHandler(Type handlesType, PredicateAndHandler predicateAndHandler)
{
    TypeHandler typeHandler;

    //if the previous handler handles the same type, we don't need an entirely new TypeHandler,
    //we can just add the handler to its' list of handlers

    if(!TryGetPreviousTypeHandlerIfItHandlesSameType(handlesType, out typeHandler))
    {
        //Either no previous handler had been added, or it handled a different type.
        //Create a new handler and store it.
        typeHandler = new TypeHandler(handlesType);
        _typeHandlers.Add(typeHandler);

        //The type is part of the signature
        _signature.Add(handlesType);
    }

    //Store the handler (action or func), the predicate the type of handler
    typeHandler.Handlers.Add(predicateAndHandler);

    //The kind of handler (action, action+predicate or fun) is part of the signature
    _signature.Add(predicateAndHandler.HandlerKind);

    //Capture the handler (action or func) and the predicate if specified
    _arguments.AddRange(predicateAndHandler.Arguments);
}

We are performing code emission, delegate compilation, pre-calculation of matching types, and caching of compiled delegates in their respective matching orders internally. This code was written in the era of .NET Framework 4.5 and was state of the art.

However, all of these clever constructs we created to outsmart a much simpler .NET runtime and JIT compiler are now self-defeating in an era where the JIT can simply do it better.

As Rich Lander, one of the senior members of the CoreCLR team put it on Twitter:

With Akka.NET’s internals we will need to literally go back to basics in light of these JIT improvements - clever optimizations don’t benefit from underlying runtime improvements anymore. Simpler is better.

Akka.NET v1.5 on .NET 7.0 with Dynamic PGO (Concurrency Impact on PGO)

All of the data we’ve presented so far has been using Akka.NET v1.4.45, which is the current stable version of Akka.NET - what about the alphas of Akka.NET v1.5 that are under active development?

Akka.NET v1.5 includes a significant performance improvement we can observe in our PingPong benchmarks: Add IThreadPoolWorkItem support to ThreadPoolDispatcher - this change uses newer .NET APIs that allow actors to significantly reduce the amount of context switching. We can observe how significant that is in terms of overall performance in our .NET 7.0 PingPong results using Akka.NET v1.5:

Akka.NET v1.5 PingPong performance trends on .NET 7.0

Approximately 70m messages per second for ActorBase and 64m for ReceiveActor - not bad. This performance is on-par with Akka.NET v1.4 on .NET 7.0 with PGO enabled.

What happens if we enable Dynamic PGO on Akka.NET v1.5?

Akka.NET v1.5 PingPong performance trends on .NET 7.0 with Dynamic PGO enabled

100m messages per second for ActorBase and 97m messages per second for ReceiveActor - a 75% improvement over Akka.NET v1.4 on .NET 6.0. That is incredible!

Here is what I suspect is happening here: tiered JIT compilation and PGO perform are able to gather more accurate data or generate more effective improvements when the same code is being executed on the same CPUs over and over again. I haven’t verified this by looking at the generated code myself or digging through the PerfView events, but that’s my armchair quarterback take.

.NET 7.0 is a very exciting development for Akka.NET users and .NET users alike - I strongly encourage you all to consider upgrading to it ASAP.

If you liked this post, you can share it with your followers or follow us on Twitter!
Written by Aaron Stannard on November 17, 2022

 

 

Observe and Monitor Your Akka.NET Applications with Phobos

Did you know that Phobos can automatically instrument your Akka.NET applications with OpenTelemetry?

Click here to learn more.