Skip to content

.Net 7 performance regression with struct fields #87720

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
timcassell opened this issue Jun 17, 2023 · 16 comments
Closed

.Net 7 performance regression with struct fields #87720

timcassell opened this issue Jun 17, 2023 · 16 comments
Assignees
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Milestone

Comments

@timcassell
Copy link

Description

I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseVoid()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }
}

Then I realized I didn't need the field and used a local instead.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseField()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }

    [Benchmark]
    public void PromiseLocal()
    {
        var deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
    }
}

And that yielded these surprising results

Method Runtime Mean Error StdDev Ratio Code Size
PromiseField .NET 6.0 93.05 ns 1.227 ns 1.148 ns 1.00 448 B
PromiseField .NET 7.0 100.94 ns 2.033 ns 2.342 ns 1.08 425 B
PromiseLocal .NET 6.0 97.22 ns 1.916 ns 1.792 ns 1.00 438 B
PromiseLocal .NET 7.0 92.66 ns 1.617 ns 1.513 ns 0.95 361 B

Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:

public struct Deferred
{
    internal readonly Internal.PromiseRefBase.DeferredPromise<Internal.VoidResult> _ref; // class reference.
    internal readonly short _promiseId;
    internal readonly int _deferredId;
}

Configuration

I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000

OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T

@timcassell timcassell added the tenet-performance Performance related issue label Jun 17, 2023
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jun 17, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jun 17, 2023
@huoyaoyuan
Copy link
Member

My result with 8.0 Preview 5:

Method Runtime Mean Error StdDev Ratio RatioSD Code Size
PromiseField .NET 6.0 45.46 ns 0.305 ns 0.286 ns 1.00 0.00 452 B
PromiseField .NET 7.0 45.28 ns 0.523 ns 0.489 ns 1.00 0.01 429 B
PromiseField .NET 8.0 41.82 ns 0.585 ns 0.547 ns 0.92 0.02 815 B
PromiseLocal .NET 6.0 44.70 ns 0.296 ns 0.277 ns 1.00 0.00 442 B
PromiseLocal .NET 7.0 44.05 ns 0.720 ns 0.673 ns 0.99 0.02 365 B
PromiseLocal .NET 8.0 40.32 ns 0.656 ns 0.614 ns 0.90 0.01 784 B

@En3Tho
Copy link
Contributor

En3Tho commented Jun 17, 2023

@timcassell Kinda offtopic but cheers for Phenom :D Had x4 955 and then 8320 FX that somehow happened to be an overcloking beast (5.2ghz @ 1.55v and crazy 2800 nb freq)

@timcassell
Copy link
Author

Results with 8.0 preview 5

Method Runtime Mean Error StdDev Ratio Code Size
PromiseField .NET 6.0 94.92 ns 1.167 ns 1.091 ns 1.00 448 B
PromiseField .NET 7.0 104.39 ns 1.926 ns 1.707 ns 1.10 425 B
PromiseField .NET 8.0 81.05 ns 1.424 ns 1.332 ns 0.85 825 B
PromiseLocal .NET 6.0 92.49 ns 0.241 ns 0.201 ns 1.00 438 B
PromiseLocal .NET 7.0 91.98 ns 1.548 ns 1.448 ns 1.00 361 B
PromiseLocal .NET 8.0 76.41 ns 0.095 ns 0.080 ns 0.83 784 B

It looks like performance is much better in .Net 8, but still weirdly slower in .Net 7 with the field. I expect it's CPU related. Maybe it's not even worth looking into though since it's apparently already solved in 8.

@timcassell Kinda offtopic but cheers for Phenom :D Had x4 955 and then 8320 FX that somehow happened to be an overcloking beast (5.2ghz @ 1.55v and crazy 2800 nb freq)

@En3Tho Nice. I never upgraded to bulldozer since I heard it was worse than Phenom in some cases, and then just kinda stuck with it. But this thing is really starting to show its age!

@En3Tho
Copy link
Contributor

En3Tho commented Jun 17, 2023

My results are quite wierd but it's still clear that .Net 8 rocks

BenchmarkDotNet=v0.13.5, OS=Windows 11 (10.0.22000.2057/21H2/SunValley)
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=8.0.100-preview.5.23303.2
  [Host] : .NET 6.0.18 (6.0.1823.26907), X64 RyuJIT AVX2 DEBUG
  Net6   : .NET 6.0.18 (6.0.1823.26907), X64 RyuJIT AVX2
  Net7   : .NET 7.0.7 (7.0.723.27404), X64 RyuJIT AVX2
  Net8   : .NET 8.0.0 (8.0.23.28008), X64 RyuJIT AVX2


|       Method |  Job |  Runtime |     Mean |    Error |   StdDev | Code Size | Allocated |
|------------- |----- |--------- |---------:|---------:|---------:|----------:|----------:|
| PromiseField | Net6 | .NET 6.0 | 21.18 ns | 0.342 ns | 0.320 ns |     457 B |         - |
| PromiseLocal | Net6 | .NET 6.0 | 25.96 ns | 0.322 ns | 0.302 ns |     442 B |         - |
| PromiseField | Net7 | .NET 7.0 | 24.27 ns | 0.309 ns | 0.289 ns |     434 B |         - |
| PromiseLocal | Net7 | .NET 7.0 | 22.08 ns | 0.321 ns | 0.301 ns |     365 B |         - |
| PromiseField | Net8 | .NET 8.0 | 18.87 ns | 0.316 ns | 0.296 ns |     825 B |         - |
| PromiseLocal | Net8 | .NET 8.0 | 19.78 ns | 0.354 ns | 0.331 ns |     784 B |         - |

@danmoseley danmoseley added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI and removed needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jun 18, 2023
@ghost
Copy link

ghost commented Jun 18, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

Description

I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseVoid()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }
}

Then I realized I didn't need the field and used a local instead.

public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseField()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }

    [Benchmark]
    public void PromiseLocal()
    {
        var deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
    }
}

And that yielded these surprising results

Method Runtime Mean Error StdDev Ratio Code Size
PromiseField .NET 6.0 93.05 ns 1.227 ns 1.148 ns 1.00 448 B
PromiseField .NET 7.0 100.94 ns 2.033 ns 2.342 ns 1.08 425 B
PromiseLocal .NET 6.0 97.22 ns 1.916 ns 1.792 ns 1.00 438 B
PromiseLocal .NET 7.0 92.66 ns 1.617 ns 1.513 ns 0.95 361 B

Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:

public struct Deferred
{
    internal readonly Internal.PromiseRefBase.DeferredPromise<Internal.VoidResult> _ref; // class reference.
    internal readonly short _promiseId;
    internal readonly int _deferredId;
}

Configuration

I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000

OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T

Author: timcassell
Assignees: -
Labels:

tenet-performance, area-CodeGen-coreclr, untriaged

Milestone: -

@timcassell
Copy link
Author

timcassell commented Jun 20, 2023

DPGO disabled results (added --envVars DOTNET_TieredPGO:0 arg):

Method Runtime Mean Error StdDev Ratio Code Size
PromiseField .NET 6.0 90.63 ns 1.215 ns 1.136 ns 1.00 448 B
PromiseField .NET 7.0 100.46 ns 1.893 ns 1.771 ns 1.11 425 B
PromiseField .NET 8.0 99.04 ns 1.440 ns 1.347 ns 1.09 415 B
PromiseLocal .NET 6.0 91.17 ns 0.264 ns 0.206 ns 1.00 438 B
PromiseLocal .NET 7.0 92.81 ns 1.060 ns 0.992 ns 1.02 361 B
PromiseLocal .NET 8.0 89.46 ns 0.243 ns 0.227 ns 0.98 349 B

It looks like DPGO is the only thing saving the performance.

@jakobbotsch
Copy link
Member

I can reproduce it with DOTNET_TieredPGO=0.

Some stats for where time is spent in the benchmark:

net6.0:

Samples for dotnet: 20929 events for Benchmark Intervals
Jitting           : 00.03% 3E+04    samples 1173 methods
  JitInterface    : 00.01% 1E+04    samples
Jit-generated code: 78.84% 8.49E+07 samples
  Jitted code     : 78.84% 8.49E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 78.84% 8.49E+07 samples
  R2R code        : 00.00% 0        samples

01.98%   2.13E+06    ?        Unknown
37.23%   4.01E+07    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
19.02%   2.048E+07   native   coreclr.dll
11.93%   1.285E+07   Tier-1   [bdn-playground]Program+Benchmarks.PromiseVoid()
04.88%   5.26E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
04.22%   4.55E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.74%   4.03E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
03.66%   3.94E+06    Tier-1   [ProtoPromise]Promise+Deferred.Resolve()
03.54%   3.81E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
03.31%   3.56E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
02.91%   3.13E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.07%   2.23E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.Dispose()
01.35%   1.45E+06    Tier-1   [483dcd67-ac92-478c-b9a3-f90f5feb0066]Runnable_0.WorkloadActionUnroll(int64)
00.12%   1.3E+05     native   ntoskrnl.exe

net8.0:

Samples for corerun: 23522 events for Benchmark Intervals
Jitting           : 00.06% 7E+04    samples 1472 methods
  JitInterface    : 00.00% 0        samples
Jit-generated code: 73.93% 9.26E+07 samples
  Jitted code     : 73.93% 9.26E+07 samples
  MinOpts code    : 00.00% 0        samples
  FullOpts code   : 00.00% 0        samples
  Tier-0 code     : 00.00% 0        samples
  Tier-1 code     : 73.93% 9.26E+07 samples
  R2R code        : 00.00% 0        samples

12.84%   1.61E+07    ?        Unknown
29.07%   3.64E+07    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
15.05%   1.885E+07   Tier-1   [bdn-playground]Program+Benchmarks.PromiseVoid()
12.95%   1.622E+07   native   coreclr.dll
07.44%   9.32E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
05.12%   6.41E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
04.08%   5.11E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.87%   4.85E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.Dispose()
03.53%   4.42E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
02.44%   3.05E+06    Tier-1   [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.16%   2.71E+06    Tier-1   [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
01.17%   1.46E+06    Tier-1   [ec707abb-32ce-47a1-a120-6eefa36a8241]Runnable_0.WorkloadActionUnroll(int64)
00.20%   2.5E+05     native   ntoskrnl.exe
00.06%   7E+04       native   clrjit.dll

Codegen diffs: https://www.diffchecker.com/emkNbM31/

Looks like we spend significantly more inside unknown code, presumably various stubs. MaybeDispose codegen looks like it's practically the same, only a bit better. Looks like we inline more in .NET 8.

I'll try to track down what the unknown code is.

@jakobbotsch
Copy link
Member

Oddly I can only reproduce the regression when running in BDN. When I run my own version of the micro benchmark:

internal class Program
{
    static unsafe void Main(string[] args)
    {
        PromiseBenchmarks pb = new();
        for (int i = 0; i < 100; i++)
        {
            pb.PromiseField();
            if (i >= 30)
                Thread.Sleep(30);
        }

        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < 100_000_000; i++)
        {
            pb.PromiseField();
        }
        timer.Stop();
        Console.WriteLine("{0} ms", timer.ElapsedMilliseconds);
    }

    public class PromiseBenchmarks
    {
        private Promise.Deferred deferred;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public void PromiseField()
        {
            deferred = Promise.NewDeferred();
            deferred.Promise.Forget();
            deferred.Resolve();
            deferred = default;
        }
    }
}

I get results that usually show the .NET 8 to be faster, e.g. 2144ms vs 2233ms.

@jakobbotsch
Copy link
Member

I don't see anything immediately obvious from our side and on my Intel CPU the .NET 8 result is better in BDN, even with TieredPGO=0. Given that the benchmark is also better in my AMD CPU in a simple custom micro benchmark I think there is some microarchitectural reason for the difference, maybe some kind of alignment when run through BDN.

@jakobbotsch jakobbotsch removed the untriaged New issue has not been triaged by the area owner label Jul 3, 2023
@jakobbotsch jakobbotsch added this to the Future milestone Jul 3, 2023
@timcassell
Copy link
Author

Interesting. I can also confirm I'm seeing similar results when I run that simple benchmark without BDN, even with my full benchmark.

@timcassell
Copy link
Author

timcassell commented Jul 10, 2023

[Edit] This turned out to be unrelated.

outdated

This seems related. I observed simply adding a [GlobalSetup] to a benchmark class causes default(object) to take more time than without it. I only observed this in net7.0 and net8.0, no other runtime, and not with default primitives.

Type Method Runtime Mean Error StdDev Median
WithGlobalSetup DefaultClass .NET 7.0 0.8907 ns 0.1133 ns 0.2340 ns 0.8152 ns
WithoutGlobalSetup DefaultClass .NET 7.0 0.0519 ns 0.0560 ns 0.0995 ns 0.0000 ns
public class WithGlobalSetup
{
    [GlobalSetup] public void Setup() { }

    [Benchmark] public object DefaultClass() => default;
}

public class WithoutGlobalSetup
{
    [Benchmark] public object DefaultClass() => default;
}

It seems like a very specific issue with the combination of AMD cpu + net7/8 runtime + default(object) + [GlobalSetup], so I don't think it deserves an issue of its own in BDN repo.

@AndreyAkinshin
Copy link
Member

  1. Please, file a separate BenchmarkDotNet issue; it's not expected behavior. I believe we should try to workaround this problem somehow.
  2. Could you please share the exact model of your CPU? I failed to reproduce it on my AMD+net7.0.
  3. Could you please share the actual memory addresses in disasm of the target benchmark methods and the starting addresses of the main benchmarking loops in the generated code?
  4. In order to provide a broader context for this investigation, you may try to do the following experiments and check if it makes any difference:
    a. Add different amounts of additional methods in the class with different signatures and method bodies (do not annotate them with attributes, just add additional methods to the target class).
    b. Add [MemoryRandomization] to both classes.
    c. Increase the UnrollFactor to 32, 64, 128.

@timcassell
Copy link
Author

@jakobbotsch
Copy link
Member

I will close this since it seems like a benchmarking artifacts and progress is being made externally on that issue.

@timcassell
Copy link
Author

@jakobbotsch Progress is not being made on this issue in BDN. That one I linked was a separate issue that I thought was related, but turned out to be unrelated. The changes I made to fix that issue had no effect on this one.

@jakobbotsch
Copy link
Member

I see. In any case, as I mentioned above, there seems to be a benchmarking artifact or microarchitectural reason for the difference that is causing the difference when run in BDN and only on AMD CPUs. That might be something like cache associativity, code alignment etc. that occur due to the specific way BDN runs the benchmark. These kind of micro architectural effects are next to impossible for us to model within the JIT and not something we typically try to target, unless it is particularly egregious or low-hanging.

@github-actions github-actions bot locked and limited conversation to collaborators Dec 28, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI tenet-performance Performance related issue
Projects
None yet
Development

No branches or pull requests

6 participants