.Net 7 performance regression with struct fields #87720

timcassell · 2023-06-17T06:14:05Z

Description

I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseVoid()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }
}

Then I realized I didn't need the field and used a local instead.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseField()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }

    [Benchmark]
    public void PromiseLocal()
    {
        var deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
    }
}

And that yielded these surprising results

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
PromiseField	.NET 6.0	93.05 ns	1.227 ns	1.148 ns	1.00	448 B
PromiseField	.NET 7.0	100.94 ns	2.033 ns	2.342 ns	1.08	425 B

PromiseLocal	.NET 6.0	97.22 ns	1.916 ns	1.792 ns	1.00	438 B
PromiseLocal	.NET 7.0	92.66 ns	1.617 ns	1.513 ns	0.95	361 B

Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:

public struct Deferred
{
    internal readonly Internal.PromiseRefBase.DeferredPromise<Internal.VoidResult> _ref; // class reference.
    internal readonly short _promiseId;
    internal readonly int _deferredId;
}

Configuration

I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000

OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T

The text was updated successfully, but these errors were encountered:

huoyaoyuan · 2023-06-17T06:53:50Z

My result with 8.0 Preview 5:

Method	Runtime	Mean	Error	StdDev	Ratio	RatioSD	Code Size
PromiseField	.NET 6.0	45.46 ns	0.305 ns	0.286 ns	1.00	0.00	452 B
PromiseField	.NET 7.0	45.28 ns	0.523 ns	0.489 ns	1.00	0.01	429 B
PromiseField	.NET 8.0	41.82 ns	0.585 ns	0.547 ns	0.92	0.02	815 B

PromiseLocal	.NET 6.0	44.70 ns	0.296 ns	0.277 ns	1.00	0.00	442 B
PromiseLocal	.NET 7.0	44.05 ns	0.720 ns	0.673 ns	0.99	0.02	365 B
PromiseLocal	.NET 8.0	40.32 ns	0.656 ns	0.614 ns	0.90	0.01	784 B

En3Tho · 2023-06-17T07:05:43Z

@timcassell Kinda offtopic but cheers for Phenom :D Had x4 955 and then 8320 FX that somehow happened to be an overcloking beast (5.2ghz @ 1.55v and crazy 2800 nb freq)

timcassell · 2023-06-17T07:18:15Z

Results with 8.0 preview 5

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
PromiseField	.NET 6.0	94.92 ns	1.167 ns	1.091 ns	1.00	448 B
PromiseField	.NET 7.0	104.39 ns	1.926 ns	1.707 ns	1.10	425 B
PromiseField	.NET 8.0	81.05 ns	1.424 ns	1.332 ns	0.85	825 B

PromiseLocal	.NET 6.0	92.49 ns	0.241 ns	0.201 ns	1.00	438 B
PromiseLocal	.NET 7.0	91.98 ns	1.548 ns	1.448 ns	1.00	361 B
PromiseLocal	.NET 8.0	76.41 ns	0.095 ns	0.080 ns	0.83	784 B

It looks like performance is much better in .Net 8, but still weirdly slower in .Net 7 with the field. I expect it's CPU related. Maybe it's not even worth looking into though since it's apparently already solved in 8.

@timcassell Kinda offtopic but cheers for Phenom :D Had x4 955 and then 8320 FX that somehow happened to be an overcloking beast (5.2ghz @ 1.55v and crazy 2800 nb freq)

@En3Tho Nice. I never upgraded to bulldozer since I heard it was worse than Phenom in some cases, and then just kinda stuck with it. But this thing is really starting to show its age!

En3Tho · 2023-06-17T08:08:45Z

My results are quite wierd but it's still clear that .Net 8 rocks

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22000.2057/21H2/SunValley)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host] : .NET 6.0.18 (6.0.1823.26907), X64 RyuJIT AVX2 DEBUG
  Net6   : .NET 6.0.18 (6.0.1823.26907), X64 RyuJIT AVX2
  Net7   : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  Net8   : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2


|       Method |  Job |  Runtime |     Mean |    Error |   StdDev | Code Size | Allocated |
|------------- |----- |--------- |---------:|---------:|---------:|----------:|----------:|
| PromiseField | Net6 | .NET 6.0 | 21.18 ns | 0.342 ns | 0.320 ns |     457 B |         - |
| PromiseLocal | Net6 | .NET 6.0 | 25.96 ns | 0.322 ns | 0.302 ns |     442 B |         - |
| PromiseField | Net7 | .NET 7.0 | 24.27 ns | 0.309 ns | 0.289 ns |     434 B |         - |
| PromiseLocal | Net7 | .NET 7.0 | 22.08 ns | 0.321 ns | 0.301 ns |     365 B |         - |
| PromiseField | Net8 | .NET 8.0 | 18.87 ns | 0.316 ns | 0.296 ns |     825 B |         - |
| PromiseLocal | Net8 | .NET 8.0 | 19.78 ns | 0.354 ns | 0.331 ns |     784 B |         - |

ghost · 2023-06-18T00:32:56Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseVoid()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }
}

Then I realized I didn't need the field and used a local instead.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseField()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }

    [Benchmark]
    public void PromiseLocal()
    {
        var deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
    }
}

And that yielded these surprising results

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
PromiseField	.NET 6.0	93.05 ns	1.227 ns	1.148 ns	1.00	448 B
PromiseField	.NET 7.0	100.94 ns	2.033 ns	2.342 ns	1.08	425 B

PromiseLocal	.NET 6.0	97.22 ns	1.916 ns	1.792 ns	1.00	438 B
PromiseLocal	.NET 7.0	92.66 ns	1.617 ns	1.513 ns	0.95	361 B

Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:

public struct Deferred
{
    internal readonly Internal.PromiseRefBase.DeferredPromise<Internal.VoidResult> _ref; // class reference.
    internal readonly short _promiseId;
    internal readonly int _deferredId;
}

Configuration

I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000

OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T

Author:	timcassell
Assignees:	-
Labels:	`tenet-performance`, `area-CodeGen-coreclr`, `untriaged`
Milestone:	-

timcassell · 2023-06-20T02:27:04Z

DPGO disabled results (added --envVars DOTNET_TieredPGO:0 arg):

Method	Runtime	Mean	Error	StdDev	Ratio	Code Size
PromiseField	.NET 6.0	90.63 ns	1.215 ns	1.136 ns	1.00	448 B
PromiseField	.NET 7.0	100.46 ns	1.893 ns	1.771 ns	1.11	425 B
PromiseField	.NET 8.0	99.04 ns	1.440 ns	1.347 ns	1.09	415 B

PromiseLocal	.NET 6.0	91.17 ns	0.264 ns	0.206 ns	1.00	438 B
PromiseLocal	.NET 7.0	92.81 ns	1.060 ns	0.992 ns	1.02	361 B
PromiseLocal	.NET 8.0	89.46 ns	0.243 ns	0.227 ns	0.98	349 B

It looks like DPGO is the only thing saving the performance.

jakobbotsch · 2023-07-03T12:40:15Z

I can reproduce it with DOTNET_TieredPGO=0.

Some stats for where time is spent in the benchmark:

net6.0:

Samples for dotnet: 20929 events for Benchmark Intervals
Jitting           : 00.03% 3E+04    samples 1173 methods
  JitInterface    : 00.01% 1E+04    samples
Jit-generated code: 78.84% 8.49E+07 samples
  Jitted code     : 78.84% 8.49E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 78.84% 8.49E+07 samples
  R2R code        : 00.00% 0        samples

01.98%   2.13E+06    ?        Unknown
37.23%   4.01E+07    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
19.02%   2.048E+07   native   coreclr.dll
11.93%   1.285E+07   Tier-1   [bdn-playground]Program+Benchmarks.PromiseVoid()
04.88%   5.26E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
04.22%   4.55E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.74%   4.03E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
03.66%   3.94E+06    Tier-1   [ProtoPromise]Promise+Deferred.Resolve()
03.54%   3.81E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
03.31%   3.56E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
02.91%   3.13E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.07%   2.23E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.Dispose()
01.35%   1.45E+06    Tier-1   [483dcd67-ac92-478c-b9a3-f90f5feb0066]Runnable_0.WorkloadActionUnroll(int64)
00.12%   1.3E+05     native   ntoskrnl.exe

net8.0:

Samples for corerun: 23522 events for Benchmark Intervals
Jitting           : 00.06% 7E+04    samples 1472 methods
  JitInterface    : 00.00% 0        samples
Jit-generated code: 73.93% 9.26E+07 samples
  Jitted code     : 73.93% 9.26E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 73.93% 9.26E+07 samples
  R2R code        : 00.00% 0        samples

12.84%   1.61E+07    ?        Unknown
29.07%   3.64E+07    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
15.05%   1.885E+07   Tier-1   [bdn-playground]Program+Benchmarks.PromiseVoid()
12.95%   1.622E+07   native   coreclr.dll
07.44%   9.32E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
05.12%   6.41E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
04.08%   5.11E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.87%   4.85E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.Dispose()
03.53%   4.42E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
02.44%   3.05E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.16%   2.71E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
01.17%   1.46E+06    Tier-1   [ec707abb-32ce-47a1-a120-6eefa36a8241]Runnable_0.WorkloadActionUnroll(int64)
00.20%   2.5E+05     native   ntoskrnl.exe
00.06%   7E+04       native   clrjit.dll

Codegen diffs: https://www.diffchecker.com/emkNbM31/

Looks like we spend significantly more inside unknown code, presumably various stubs. MaybeDispose codegen looks like it's practically the same, only a bit better. Looks like we inline more in .NET 8.

I'll try to track down what the unknown code is.

jakobbotsch · 2023-07-03T13:36:47Z

Oddly I can only reproduce the regression when running in BDN. When I run my own version of the micro benchmark:

internal class Program
{
    static unsafe void Main(string[] args)
    {
        PromiseBenchmarks pb = new();
        for (int i = 0; i < 100; i++)
        {
            pb.PromiseField();
            if (i >= 30)
                Thread.Sleep(30);
        }

        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < 100_000_000; i++)
        {
            pb.PromiseField();
        }
        timer.Stop();
        Console.WriteLine("{0} ms", timer.ElapsedMilliseconds);
    }

    public class PromiseBenchmarks
    {
        private Promise.Deferred deferred;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public void PromiseField()
        {
            deferred = Promise.NewDeferred();
            deferred.Promise.Forget();
            deferred.Resolve();
            deferred = default;
        }
    }
}

I get results that usually show the .NET 8 to be faster, e.g. 2144ms vs 2233ms.

jakobbotsch · 2023-07-03T15:17:03Z

I don't see anything immediately obvious from our side and on my Intel CPU the .NET 8 result is better in BDN, even with TieredPGO=0. Given that the benchmark is also better in my AMD CPU in a simple custom micro benchmark I think there is some microarchitectural reason for the difference, maybe some kind of alignment when run through BDN.

timcassell · 2023-07-04T02:09:28Z

Interesting. I can also confirm I'm seeing similar results when I run that simple benchmark without BDN, even with my full benchmark.

timcassell · 2023-07-10T23:40:28Z

[Edit] This turned out to be unrelated.

outdated

This seems related. I observed simply adding a [GlobalSetup] to a benchmark class causes default(object) to take more time than without it. I only observed this in net7.0 and net8.0, no other runtime, and not with default primitives.

Type	Method	Runtime	Mean	Error	StdDev	Median
WithGlobalSetup	DefaultClass	.NET 7.0	0.8907 ns	0.1133 ns	0.2340 ns	0.8152 ns
WithoutGlobalSetup	DefaultClass	.NET 7.0	0.0519 ns	0.0560 ns	0.0995 ns	0.0000 ns

public class WithGlobalSetup
{
    [GlobalSetup] public void Setup() { }

    [Benchmark] public object DefaultClass() => default;
}

public class WithoutGlobalSetup
{
    [Benchmark] public object DefaultClass() => default;
}

It seems like a very specific issue with the combination of AMD cpu + net7/8 runtime + default(object) + [GlobalSetup], so I don't think it deserves an issue of its own in BDN repo.

AndreyAkinshin · 2023-07-11T04:28:11Z

Please, file a separate BenchmarkDotNet issue; it's not expected behavior. I believe we should try to workaround this problem somehow.
Could you please share the exact model of your CPU? I failed to reproduce it on my AMD+net7.0.
Could you please share the actual memory addresses in disasm of the target benchmark methods and the starting addresses of the main benchmarking loops in the generated code?
In order to provide a broader context for this investigation, you may try to do the following experiments and check if it makes any difference:
a. Add different amounts of additional methods in the class with different signatures and method bodies (do not annotate them with attributes, just add additional methods to the target class).
b. Add [MemoryRandomization] to both classes.
c. Increase the UnrollFactor to 32, 64, 128.

timcassell · 2023-07-11T19:47:54Z

dotnet/BenchmarkDotNet#2366

jakobbotsch · 2023-11-27T15:49:27Z

I will close this since it seems like a benchmarking artifacts and progress is being made externally on that issue.

timcassell · 2023-11-27T15:52:45Z

@jakobbotsch Progress is not being made on this issue in BDN. That one I linked was a separate issue that I thought was related, but turned out to be unrelated. The changes I made to fix that issue had no effect on this one.

jakobbotsch · 2023-11-27T16:10:05Z

I see. In any case, as I mentioned above, there seems to be a benchmarking artifact or microarchitectural reason for the difference that is causing the difference when run in BDN and only on AMD CPUs. That might be something like cache associativity, code alignment etc. that occur due to the specific way BDN runs the benchmark. These kind of micro architectural effects are next to impossible for us to model within the JIT and not something we typically try to target, unless it is particularly egregious or low-hanging.

timcassell added the tenet-performance Performance related issue label Jun 17, 2023

ghost added the untriaged New issue has not been triaged by the area owner label Jun 17, 2023

dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 17, 2023

danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 18, 2023

JulieLeeMSFT assigned jakobbotsch Jun 19, 2023

jakobbotsch removed the untriaged New issue has not been triaged by the area owner label Jul 3, 2023

jakobbotsch added this to the Future milestone Jul 3, 2023

jakobbotsch closed this as completed Nov 27, 2023

github-actions bot locked and limited conversation to collaborators Dec 28, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.Net 7 performance regression with struct fields #87720

.Net 7 performance regression with struct fields #87720

timcassell commented Jun 17, 2023

huoyaoyuan commented Jun 17, 2023

En3Tho commented Jun 17, 2023 •

edited

Loading

timcassell commented Jun 17, 2023

En3Tho commented Jun 17, 2023 •

edited

Loading

ghost commented Jun 18, 2023

Description

Configuration

timcassell commented Jun 20, 2023 •

edited

Loading

jakobbotsch commented Jul 3, 2023

jakobbotsch commented Jul 3, 2023

jakobbotsch commented Jul 3, 2023

timcassell commented Jul 4, 2023

timcassell commented Jul 10, 2023 •

edited

Loading

AndreyAkinshin commented Jul 11, 2023

timcassell commented Jul 11, 2023

jakobbotsch commented Nov 27, 2023

timcassell commented Nov 27, 2023

jakobbotsch commented Nov 27, 2023

.Net 7 performance regression with struct fields #87720

.Net 7 performance regression with struct fields #87720

Comments

timcassell commented Jun 17, 2023

Description

Configuration

huoyaoyuan commented Jun 17, 2023

En3Tho commented Jun 17, 2023 • edited Loading

timcassell commented Jun 17, 2023

En3Tho commented Jun 17, 2023 • edited Loading

ghost commented Jun 18, 2023

Description

Configuration

timcassell commented Jun 20, 2023 • edited Loading

jakobbotsch commented Jul 3, 2023

jakobbotsch commented Jul 3, 2023

jakobbotsch commented Jul 3, 2023

timcassell commented Jul 4, 2023

timcassell commented Jul 10, 2023 • edited Loading

AndreyAkinshin commented Jul 11, 2023

timcassell commented Jul 11, 2023

jakobbotsch commented Nov 27, 2023

timcassell commented Nov 27, 2023

jakobbotsch commented Nov 27, 2023

En3Tho commented Jun 17, 2023 •

edited

Loading

En3Tho commented Jun 17, 2023 •

edited

Loading

timcassell commented Jun 20, 2023 •

edited

Loading

timcassell commented Jul 10, 2023 •

edited

Loading