Enable DX12 coopvec tests on CI #6250

Open
jkwak-work opened this issue Feb 3, 2025 · 12 comments

@jkwak-work
Collaborator

The DXC repo that supports coopvec has been released publicly.
https://github.com/NVIDIA-RTX/DirectXShaderCompiler/tree/CooperativeVector

We should start using the dxcompiler.dll built from that repo and enable the related tests.

@jkwak-work jkwak-work added the goal:quality & productivity Quality issues and issues that impact our productivity coding day to day inside slang label Feb 3, 2025
@jkwak-work jkwak-work self-assigned this Feb 3, 2025
@jkwak-work
Collaborator Author

When I used that dxcompiler.dll, I observed a crash on one of the runners, SLANGWIN10X64-1.

When I debugged it with Visual Studio, the call stack looked almost the same as that of a known issue in the DXC repo:
microsoft/DirectXShaderCompiler#6916

The suggested solution is to upgrade the MSVC v143 runtime binaries.
After I upgraded them, it started working.

But the same solution didn't work on another runner, SlangWin10-2.
Since the crash reproduces 100% of the time, I should be able to collect more information by debugging it with Visual Studio.

@jkwak-work
Collaborator Author

To reproduce the issue, I used the following steps:

  1. Log in to the build machine, "SlangWin10X64-1", with RDP.
  2. Clone "https://github.com/shader-slang/slang.git"
  3. Replace external/slang-binaries/bin/windows-x64/dxcompiler.dll with one that supports coopvec
  4. Delete external/slang-binaries/bin/windows-x64/dxil.dll
  5. Build as usual and run the tests.
  6. Observe a few tests failing.
cmake.exe --preset vs2022
cmake.exe --build build --preset release
build/Release/bin/slang-test.exe -use-test-server -server-count 8
.............
.............

===
99% of tests passed (4012/4021), 830 tests ignored
===

failing tests:
---
gfx-unit-test-tool/RayTracingTestAD3D12.internal
gfx-unit-test-tool/mutableRootShaderObjectD3D12.internal
gfx-unit-test-tool/shaderCacheSourceStringD3D12.internal
gfx-unit-test-tool/shaderCacheEvictionD3D12.internal
gfx-unit-test-tool/RayTracingTestBD3D12.internal
gfx-unit-test-tool/shaderCacheSpecializationD3D12.internal
gfx-unit-test-tool/computeSmokeD3D12.internal
gfx-unit-test-tool/mutableShaderObjectD3D12.internal
gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
---
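
A failure list like the one above can be pulled out of a saved log so it can be compared across runners later. The following is a minimal Python sketch, assuming the summary always contains a "failing tests:" line followed by the test names between two "---" lines, as in the output shown here. parse_failing_tests is a hypothetical helper, not part of slang-test.

```python
def parse_failing_tests(log_text: str) -> list[str]:
    """Return the test names listed between the two '---' lines
    that follow the 'failing tests:' marker (layout assumed from the log above)."""
    lines = [line.strip() for line in log_text.splitlines()]
    try:
        start = lines.index("failing tests:")
    except ValueError:
        return []  # no failure section in this log

    failing = []
    dashes_seen = 0
    for line in lines[start + 1:]:
        if line == "---":
            dashes_seen += 1
            if dashes_seen == 2:
                break  # closing delimiter reached
            continue
        if dashes_seen == 1 and line:
            failing.append(line)
    return failing
```

Saving the output of each run to a file and feeding it through this function gives one list per runner, which is easier to diff than raw console output.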

@jkwak-work
Collaborator Author

I realized that there is a build configuration called "MinSizeRel" for compiling dxcompiler.dll.
It generates a 17 MB dxcompiler.dll, which is much smaller than the 70 MB one generated with the "Release" configuration.

@jkwak-work
Collaborator Author

It seems there is one more problem with enabling the DX12 coopvec tests.
The steps are:

  1. Replace the old dxcompiler.dll and dxil.dll with ones that support coopvec.
  2. Enable the "DX12 experimental feature" on the slang-rhi side.
  3. Enable the DX12 tests for coopvec in the GitHub workflow.

Annoyingly and strangely, just replacing dxcompiler.dll and dxil.dll causes a few tests to fail.

And it looks like enabling the "DX12 experimental feature" itself is not as easy as I thought.
PR 6290 shows that many, if not all, DX12 tests fail when the experimental feature is enabled, and this is without step 1.

@jkwak-work
Collaborator Author

It looks like when the "DX12 experimental feature" is enabled, I cannot run the wgpu tests together with the DX12 tests.
They should run separately in the CI workflow.

I cannot reproduce it on my local machine,
but I was able to reproduce the problem on the runner machine.
The issue is observed only when -use-test-server is used and both wgpu and DX12 tests are run together.
It does not reproduce 100% of the time; the rate is about 80%, so it sometimes works fine.
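
To put a number on an intermittent failure like this, one option is to repeat the run and count how often it fails. Below is a rough Python sketch. failure_rate and command_fails are hypothetical helpers, and the sketch assumes the test command exits with a non-zero status when any test fails, which has not been verified for slang-test here.

```python
import subprocess
from typing import Callable, Sequence


def command_fails(cmd: Sequence[str]) -> bool:
    """Run the command once; treat a non-zero exit status as a failure (assumption)."""
    return subprocess.run(cmd, capture_output=True).returncode != 0


def failure_rate(run_once: Callable[[], bool], attempts: int) -> float:
    """Call run_once `attempts` times and return the fraction of calls that reported failure."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = sum(1 for _ in range(attempts) if run_once())
    return failures / attempts


# Possible usage on a runner, with the command line taken from this thread:
# cmd = ["build/Release/bin/slang-test.exe", "-use-test-server", "-server-count", "8"]
# print(failure_rate(lambda: command_fails(cmd), attempts=10))
```

Passing the runner as a callable keeps the counting logic independent of how a single attempt is launched, so the same helper can be reused with a narrower test selection.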

I am starting to think that the initialization of dxcompiler.dll may have bugs when multiple processes try to initialize it simultaneously.
But that still doesn't fully explain what I have seen.

@jkwak-work
Collaborator Author

I think the "DX12 experimental feature" should be turned on only when needed.
I am going to add an option to render-test to toggle it.

@jkwak-work
Collaborator Author

When I tested locally on each runner machine over RDP, I got a few test failures.

The command I used is

build/Release/bin/slang-test.exe -use-test-server -server-count 8

I used commit "a4b538282c5d8ffc6ce2f54597c132d11a52edd2" for the testing.
The commit switches to a dxcompiler.dll and dxil.dll that support coopvec and changes nothing else.

[SLANGWIN10X64-1]

gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
gfx-unit-test-tool/RayTracingTestAD3D12.internal
gfx-unit-test-tool/RayTracingTestBD3D12.internal
tests/bugs/specialize-function-array-args.slang.2 syn (wgpu)
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

[SlangWin10-2]

gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
gfx-unit-test-tool/RayTracingTestBD3D12.internal
gfx-unit-test-tool/RayTracingTestAD3D12.internal
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

[SlangWin4-2]

gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
gfx-unit-test-tool/RayTracingTestAD3D12.internal
gfx-unit-test-tool/RayTracingTestBD3D12.internal
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

[SLANGWIN5]

gfx-unit-test-tool/RayTracingTestBD3D12.internal
gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
gfx-unit-test-tool/RayTracingTestAD3D12.internal
tests/diagnostics/syntax-error-op-line-3.slang.4
tests/diagnostics/syntax-error-intrinsic.slang.4
tests/diagnostics/syntax-error-op-line-2.slang.4
tests/diagnostics/local-line.slang.4
tests/diagnostics/syntax-error-op-line.slang.4
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

[horde]

gfx-unit-test-tool/RayTracingTestBD3D12.internal
gfx-unit-test-tool/mutableRootShaderObjectD3D12.internal
gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

@jkwak-work
Collaborator Author

jkwak-work commented Feb 6, 2025

When I ran the same tests on my local machine, three tests failed:

tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

I guess these may be real problems.
I will investigate them.

And I also noticed that the following three tests are expected to fail.

gfx-unit-test-tool/precompiledTargetModule2Vulkan.internal
gfx-unit-test-tool/RayTracingTestAD3D12.internal
gfx-unit-test-tool/RayTracingTestBD3D12.internal

That means the following tests are the only ones that failed unexpectedly.

[SLANGWIN10X64-1]

tests/bugs/specialize-function-array-args.slang.2 syn (wgpu) // this looks to be an intermittent failure

[SLANGWIN5]

tests/diagnostics/syntax-error-op-line-3.slang.4
tests/diagnostics/syntax-error-intrinsic.slang.4
tests/diagnostics/syntax-error-op-line-2.slang.4
tests/diagnostics/local-line.slang.4
tests/diagnostics/syntax-error-op-line.slang.4

[horde]

gfx-unit-test-tool/mutableRootShaderObjectD3D12.internal   // this looks to be an intermittent failure
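
The filtering above is a set difference: take each runner's failures and remove the ones that are expected to fail and the ones that also fail locally. Here is a small Python sketch of that step using the lists from this thread. unexpected_failures is a hypothetical helper, and the shortened path prefixes are only there to keep the example compact.

```python
GFX = "gfx-unit-test-tool/"
RT = "tests/hlsl-intrinsic/ray-tracing/"
DIAG = "tests/diagnostics/"

EXPECTED = {GFX + "precompiledTargetModule2Vulkan.internal",
            GFX + "RayTracingTestAD3D12.internal",
            GFX + "RayTracingTestBD3D12.internal"}
LOCAL = {RT + "rt-pipeline-intrinsics-ahit.slang.2",
         RT + "rt-pipeline-intrinsics-int.slang.2",
         "tests/metal/texture.slang.4 (dx12)"}

RUNNERS = {
    "SLANGWIN10X64-1": EXPECTED | LOCAL | {
        "tests/bugs/specialize-function-array-args.slang.2 syn (wgpu)"},
    "SlangWin10-2": EXPECTED | LOCAL,
    "SlangWin4-2": EXPECTED | LOCAL,
    "SLANGWIN5": EXPECTED | LOCAL | {
        DIAG + "syntax-error-op-line-3.slang.4",
        DIAG + "syntax-error-intrinsic.slang.4",
        DIAG + "syntax-error-op-line-2.slang.4",
        DIAG + "local-line.slang.4",
        DIAG + "syntax-error-op-line.slang.4"},
    # horde did not report RayTracingTestAD3D12 in this run.
    "horde": (EXPECTED - {GFX + "RayTracingTestAD3D12.internal"}) | LOCAL | {
        GFX + "mutableRootShaderObjectD3D12.internal"},
}


def unexpected_failures(runners: dict, known: set) -> dict:
    """Map each runner to its failures that are not in `known`; omit runners with none."""
    result = {}
    for name, failed in runners.items():
        remaining = sorted(failed - known)
        if remaining:
            result[name] = remaining
    return result


for runner, tests in unexpected_failures(RUNNERS, EXPECTED | LOCAL).items():
    print(runner, tests)
```

Runners whose failures are fully explained by the known set (SlangWin10-2 and SlangWin4-2 here) drop out of the result.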

@jkwak-work
Collaborator Author

I identified the problems in the tests that fail 100% of the time and prepared a fix.

tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-ahit.slang.2
tests/hlsl-intrinsic/ray-tracing/rt-pipeline-intrinsics-int.slang.2
tests/metal/texture.slang.4 (dx12)

I am now looking into the failing Falcor tests.

@jkwak-work
Collaborator Author

There are 12 failing Falcor tests.

[  FAILED  ] 12 tests, listed below.
[  FAILED  ] AABBTests.cpp:AABB (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_DiffuseSpecularBRDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_DisneyDiffuseBRDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_FrostbiteDiffuseBRDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_LambertDiffuseBRDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_LambertDiffuseBTDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_OrenNayarBRDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_SheenBSDF (D3D12)
[  FAILED  ] BSDFTests.cpp:TestBsdf_SpecularMicrofacetBRDF (D3D12)
[  FAILED  ] GeometryHelpersTests.cpp:BoxSubtendedConeAngleAverage (D3D12)
[  FAILED  ] GeometryHelpersTests.cpp:BoxSubtendedConeAngleAverageRandoms (D3D12)
[  FAILED  ] GeometryHelpersTests.cpp:ComputeRayOrigin (D3D12)
12 FAILED TESTS
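
Grouping the failed tests by source file shows where the failures cluster. The Python sketch below assumes each failure line has the form "[  FAILED  ] File.cpp:TestName (API)", as in the summary above. count_failures_by_file is a hypothetical helper, not part of Falcor.

```python
import re
from collections import Counter

# Matches lines such as "[  FAILED  ] AABBTests.cpp:AABB (D3D12)".
# The header line "[  FAILED  ] 12 tests, listed below." has no ".cpp:" and is skipped.
FAILED_LINE = re.compile(
    r"^\[\s*FAILED\s*\]\s+(?P<file>\S+\.cpp):(?P<test>\S+)\s+\((?P<api>[^)]+)\)$"
)


def count_failures_by_file(log_text: str) -> Counter:
    """Count failed tests per test source file."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED_LINE.match(line.strip())
        if match:
            counts[match.group("file")] += 1
    return counts
```

Applied to the twelve lines above, this groups the failures into AABBTests.cpp, BSDFTests.cpp and GeometryHelpersTests.cpp.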

https://github.com/shader-slang/slang/actions/runs/13173367844/job/36767538167?pr=6302

I am not sure how many will also fail in the Falcor image tests.

@jkwak-work
Collaborator Author

Here are the error messages for each failing Falcor test:

AABBTests.cpp:AABB (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Falcor/Utils/Math/AABB.slang(108): error :  operands for short-circuiting logical binary operator must be scalar, for non-scalar types use 'and'
dxc 1.9: note :     return (AABB_valid_0(this_6)) && (all((p_2 >= (this_6.minPoint_0)) && (p_2 <= (this_6.maxPoint_0))));
dxc 1.9: note :                                           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                           and((p_2 >= (this_6.minPoint_0)), (p_2 <= (this_6.maxPoint_0)))
dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Falcor/Utils/Math/AABB.slang(153): error :  operands for short-circuiting logical binary operator must be scalar, for non-scalar types use 'and'
dxc 1.9: note :     return all(((this_12.maxPoint_0) >= (other_0.minPoint_0)) && ((this_12.minPoint_0) <= (other_0.maxPoint_0)));
dxc 1.9: note :                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                and(((this_12.maxPoint_0) >= (other_0.minPoint_0)), ((this_12.minPoint_0) <= (other_0.maxPoint_0)))

BSDFTests.cpp:TestBsdf_DiffuseSpecularBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S26);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S26)

BSDFTests.cpp:TestBsdf_DisneyDiffuseBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S17);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S17)

BSDFTests.cpp:TestBsdf_FrostbiteDiffuseBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S17);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S17)

BSDFTests.cpp:TestBsdf_LambertDiffuseBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S17);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S17)

BSDFTests.cpp:TestBsdf_LambertDiffuseBTDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S17);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S17)

BSDFTests.cpp:TestBsdf_OrenNayarBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S26);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S26)

BSDFTests.cpp:TestBsdf_SheenBSDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S21);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S21)

BSDFTests.cpp:TestBsdf_SpecularMicrofacetBRDF (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Tools/FalcorTest/Tests/Scene/Material/BSDFTests.cs.slang(169): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :             float3 weightErrors_0 = abs(weight_1 - weightRef_0) / (weightRef_0 > 0.0 ? weightRef_0 : _S16);
dxc 1.9: note :                                                                    ~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                    select(weightRef_0 > 0., weightRef_0, _S16)

GeometryHelpersTests.cpp:BoxSubtendedConeAngleAverage (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Falcor/Utils/Geometry/GeometryHelpers.slang(196): error :  operands for short-circuiting logical binary operator must be scalar, for non-scalar types use 'and'
dxc 1.9: note :     if(all((origin_1 >= aabbMin_1) && (origin_1 <= aabbMax_1)))
dxc 1.9: note :            ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :            and((origin_1 >= aabbMin_1), (origin_1 <= aabbMax_1))

GeometryHelpersTests.cpp:BoxSubtendedConeAngleAverageRandoms (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Falcor/Utils/Geometry/GeometryHelpers.slang(196): error :  operands for short-circuiting logical binary operator must be scalar, for non-scalar types use 'and'
dxc 1.9: note :     if(all((origin_1 >= aabbMin_1) && (origin_1 <= aabbMax_1)))
dxc 1.9: note :            ~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~
dxc 1.9: note :            and((origin_1 >= aabbMin_1), (origin_1 <= aabbMax_1))

GeometryHelpersTests.cpp:ComputeRayOrigin (D3D12)

(Info) Created GPU device 'NVIDIA GeForce RTX 3090' using 'Direct3D 12' API (SM6.6).
(Error) GFX Error: D:/sbf/git/slang_gitlab/falcor/Falcor/Source\Falcor/Utils/Geometry/GeometryHelpers.slang(94): warning 42050: bwd_computeRayOrigin has [PreferRecompute] and may have side effects. side effects may execute multiple times. use [PreferRecompute(SideEffectBehavior.Allow)], or mark function with [__NoSideEffect]
void bwd_computeRayOrigin(inout DifferentialPair<float3> pos, inout DifferentialPair<float3> normal, float3.Differential dOut)
     ^~~~~~~~~~~~~~~~~~~~
D:/sbf/git/slang_gitlab/falcor/Falcor/Source\Falcor/Utils/Geometry/GeometryHelpers.slang(87): warning 42050: fwd_computeRayOrigin has [PreferRecompute] and may have side effects. side effects may execute multiple times. use [PreferRecompute(SideEffectBehavior.Allow)], or mark function with [__NoSideEffect]
DifferentialPair<float3> fwd_computeRayOrigin(DifferentialPair<float3> pos, DifferentialPair<float3> normal)
                         ^~~~~~~~~~~~~~~~~~~~
dxc 1.9: D:/sbf/git/slang_gitlab/falcor/Falcor/Source/Falcor/Utils/Geometry/GeometryHelpers.slang(82): error :  condition for short-circuiting ternary operator must be scalar, for non-scalar types use 'select'
dxc 1.9: note :     return (abs(pos_1)) < 0.0625 ? pos_1 + normal_1 * 0.0000457763671875 : asfloat(asint(pos_1) + (pos_1 < 0.0 ? - iOff_0 : iOff_0));
dxc 1.9: note :                                                                                                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~
dxc 1.9: note :                                                                                                    select(pos_1 < 0., -iOff_0, iOff_0)
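
The logs above are long, but the dxc errors in them reduce to two diagnostics: one about the binary logical operator (use 'and') and one about the ternary operator (use 'select'). A Python sketch for grouping the error lines by message follows. It assumes every error line has the form "dxc <version>: <path>(<line>): error : <message>" seen in these logs, and group_dxc_errors is a hypothetical helper.

```python
import re

# Captures the source path, line number and message of a dxc error line.
# "note" lines do not contain "error :" and therefore never match.
DXC_ERROR = re.compile(
    r"dxc \d+\.\d+: (?P<path>.+?)\((?P<line>\d+)\): error :\s*(?P<message>.+)$"
)


def group_dxc_errors(log_text: str) -> dict:
    """Map each distinct error message to the set of (path, line) locations that raised it."""
    by_message = {}
    for line in log_text.splitlines():
        match = DXC_ERROR.search(line)
        if not match:
            continue
        location = (match.group("path"), int(match.group("line")))
        by_message.setdefault(match.group("message").strip(), set()).add(location)
    return by_message
```

Because locations are stored in a set, the same error repeated across several Falcor tests (for example BSDFTests.cs.slang line 169) is counted once.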

It appears that all of the error messages stem from the same problem.
An issue has been filed for it:

@jkwak-work
Collaborator Author

Until that issue is resolved, we cannot upgrade dxcompiler.dll to support coopvec.

If we upgraded without addressing the problem, the new dxcompiler.dll would be a breaking change, which we should avoid.
