.Net 7 performance regression with struct fields #87720
Comments
My result with 8.0 Preview 5:
@timcassell Kinda off-topic, but cheers for Phenom :D Had an X4 955 and then an FX 8320 that somehow happened to be an overclocking beast (5.2 GHz @ 1.55 V and a crazy 2800 NB freq).
Results with 8.0 preview 5
It looks like performance is much better in .Net 8, but still weirdly slower in .Net 7 with the field. I expect it's CPU related. Maybe it's not even worth looking into though since it's apparently already solved in 8.
@En3Tho Nice. I never upgraded to Bulldozer since I heard it was worse than Phenom in some cases, and then just kinda stuck with it. But this thing is really starting to show its age!
My results are quite weird, but it's still clear that .Net 8 rocks.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
Issue Details
Description
I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.
public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    [Benchmark]
    public void PromiseVoid()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }
}
Then I realized I didn't need the field and used a local instead.
public class PromiseBenchmarks
{
    private Promise.Deferred deferred;

    // Stores the deferred in an instance field.
    [Benchmark]
    public void PromiseField()
    {
        deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
        deferred = default;
    }

    // Same work, but the deferred lives in a local.
    [Benchmark]
    public void PromiseLocal()
    {
        var deferred = Promise.NewDeferred();
        deferred.Promise.Forget();
        deferred.Resolve();
    }
}
And that yielded these surprising results
Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:
public struct Deferred
{
    internal readonly Internal.PromiseRefBase.DeferredPromise<Internal.VoidResult> _ref; // class reference.
    internal readonly short _promiseId;
    internal readonly int _deferredId;
}
Configuration
I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000
OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T
DPGO disabled results (added
It looks like DPGO is the only thing saving the performance.
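For anyone who wants to try a DPGO-off run themselves, one option is to set the `DOTNET_TieredPGO=0` environment variable for the benchmark process (the thread later refers to this as TieredPGO=0). Below is a minimal sketch using a BenchmarkDotNet job. The `Program` class and the `NoPGO` job id are illustrative names, and this is not necessarily how the results above were produced.

```csharp
using BenchmarkDotNet.Configs;
using BenchmarkDotNet.Jobs;
using BenchmarkDotNet.Running;

public static class Program
{
    public static void Main(string[] args)
    {
        // Assumption: DOTNET_TieredPGO=0 turns dynamic PGO off in the
        // benchmark process; "NoPGO" is just a label for the job.
        var config = DefaultConfig.Instance
            .AddJob(Job.Default
                .WithEnvironmentVariable("DOTNET_TieredPGO", "0")
                .WithId("NoPGO"));

        // Runs whichever benchmarks in this assembly the command-line args select.
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args, config);
    }
}
```

Running the same benchmarks with and without this job makes it possible to compare the two configurations side by side in one summary table.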
I can reproduce it with
Some stats for where time is spent in the benchmark:
net6.0:
Samples for dotnet: 20929 events for Benchmark Intervals
Jitting : 00.03% 3E+04 samples 1173 methods
JitInterface : 00.01% 1E+04 samples
Jit-generated code: 78.84% 8.49E+07 samples
Jitted code : 78.84% 8.49E+07 samples
MinOpts code : 00.00% 0 samples
FullOpts code : 00.00% 0 samples
Tier-0 code : 00.00% 0 samples
Tier-1 code : 78.84% 8.49E+07 samples
R2R code : 00.00% 0 samples
01.98% 2.13E+06 ? Unknown
37.23% 4.01E+07 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
19.02% 2.048E+07 native coreclr.dll
11.93% 1.285E+07 Tier-1 [bdn-playground]Program+Benchmarks.PromiseVoid()
04.88% 5.26E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
04.22% 4.55E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.74% 4.03E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
03.66% 3.94E+06 Tier-1 [ProtoPromise]Promise+Deferred.Resolve()
03.54% 3.81E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
03.31% 3.56E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
02.91% 3.13E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.07% 2.23E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase.Dispose()
01.35% 1.45E+06 Tier-1 [483dcd67-ac92-478c-b9a3-f90f5feb0066]Runnable_0.WorkloadActionUnroll(int64)
00.12% 1.3E+05 native ntoskrnl.exe
net8.0:
Samples for corerun: 23522 events for Benchmark Intervals
Jitting : 00.06% 7E+04 samples 1472 methods
JitInterface : 00.00% 0 samples
Jit-generated code: 73.93% 9.26E+07 samples
Jitted code : 73.93% 9.26E+07 samples
MinOpts code : 00.00% 0 samples
FullOpts code : 00.00% 0 samples
Tier-0 code : 00.00% 0 samples
Tier-1 code : 73.93% 9.26E+07 samples
R2R code : 00.00% 0 samples
12.84% 1.61E+07 ? Unknown
29.07% 3.64E+07 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromise`1[Proto.Promises.Internal+VoidResult].MaybeDispose()
15.05% 1.885E+07 Tier-1 [bdn-playground]Program+Benchmarks.PromiseVoid()
12.95% 1.622E+07 native coreclr.dll
07.44% 9.32E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+DeferredPromiseBase`1[Proto.Promises.Internal+VoidResult].TryIncrementDeferredIdAndUnregisterCancelation(int32)
05.12% 6.41E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiterImpl(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
04.08% 5.11E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].Forget(int16)
03.87% 4.85E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase.Dispose()
03.53% 4.42E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase+PromiseForgetSentinel.Handle(class PromiseRefBase,class System.Object,value class State)
02.44% 3.05E+06 Tier-1 [ProtoPromise]Proto.Promises.Internal+PromiseRefBase+PromiseSingleAwait`1[Proto.Promises.Internal+VoidResult].AddWaiter(int16,class HandleablePromiseBase,class HandleablePromiseBase&)
02.16% 2.71E+06 Tier-1 [ProtoPromise]Internal+PromiseRefBase.MaybeReportUnhandledAndDispose(class System.Object,value class State)
01.17% 1.46E+06 Tier-1 [ec707abb-32ce-47a1-a120-6eefa36a8241]Runnable_0.WorkloadActionUnroll(int64)
00.20% 2.5E+05 native ntoskrnl.exe
00.06% 7E+04 native clrjit.dll
Codegen diffs: https://www.diffchecker.com/emkNbM31/
Looks like we spend significantly more time inside unknown code, presumably various stubs. I'll try to track down what the unknown code is.
Oddly, I can only reproduce the regression when running in BDN. When I run my own version of the micro benchmark:
using System;
using System.Diagnostics;
using System.Runtime.CompilerServices;
using System.Threading;
using Proto.Promises;

internal class Program
{
    static void Main(string[] args)
    {
        PromiseBenchmarks pb = new();

        // Warm up, sleeping between later calls so tiered compilation can promote the method.
        for (int i = 0; i < 100; i++)
        {
            pb.PromiseField();
            if (i >= 30)
                Thread.Sleep(30);
        }

        // Timed run.
        Stopwatch timer = Stopwatch.StartNew();
        for (int i = 0; i < 100_000_000; i++)
        {
            pb.PromiseField();
        }
        timer.Stop();
        Console.WriteLine("{0} ms", timer.ElapsedMilliseconds);
    }

    public class PromiseBenchmarks
    {
        private Promise.Deferred deferred;

        [MethodImpl(MethodImplOptions.NoInlining)]
        public void PromiseField()
        {
            deferred = Promise.NewDeferred();
            deferred.Promise.Forget();
            deferred.Resolve();
            deferred = default;
        }
    }
}
I get results that usually show .NET 8 to be faster, e.g. 2144ms vs 2233ms.
I don't see anything immediately obvious from our side, and on my Intel CPU the .NET 8 result is better in BDN, even with TieredPGO=0. Given that the benchmark is also better on my AMD CPU in a simple custom micro benchmark, I think there is some microarchitectural reason for the difference, maybe some kind of alignment when run through BDN.
Interesting. I can also confirm I'm seeing similar results when I run that simple benchmark without BDN, even with my full benchmark.
[Edit] This turned out to be unrelated.
This seems related. I observed simply adding a
public class WithGlobalSetup
{
    [GlobalSetup] public void Setup() { }
    [Benchmark] public object DefaultClass() => default;
}
public class WithoutGlobalSetup
{
    [Benchmark] public object DefaultClass() => default;
}
It seems like a very specific issue with the combination of AMD cpu + net7/8 runtime +
I will close this since it seems like a benchmarking artifact and progress is being made externally on that issue.
@jakobbotsch Progress is not being made on this issue in BDN. The one I linked was a separate issue that I thought was related, but it turned out to be unrelated. The changes I made to fix that issue had no effect on this one.
I see. In any case, as I mentioned above, there seems to be a benchmarking artifact or microarchitectural reason for the difference, which shows up only when run in BDN and only on AMD CPUs. That might be something like cache associativity, code alignment, etc. that occurs due to the specific way BDN runs the benchmark. These kinds of microarchitectural effects are next to impossible for us to model within the JIT and not something we typically try to target, unless it is particularly egregious or low-hanging.
Description
I noticed a ~5-15% performance regression when benchmarking my library ProtoPromise in .Net 6 vs .Net 7. I wasn't sure what the cause could be, so I tried making a simplified benchmark.
Then I realized I didn't need the field and used a local instead.
And that yielded these surprising results
Clearly .Net 7 is generating better code for the actual work, but there's something weird with using the field rather than local. The Promise.Deferred struct is simply this:
Configuration
I ran this benchmark with BenchmarkDotNet arguments --runtimes net6.0 net7.0 --disasm --disasmDepth 10000
OS is Windows 10 x64
CPU is AMD Phenom II X6 1055T