Skip to content

Conversation

@martin-frbg
Copy link
Collaborator

No description provided.

@martin-frbg
Copy link
Collaborator Author

@ChipKerchner do we expect casts from bfloat16 to float32 to "just work" for C code on RISCV64 ? AFAICT this is not implemented at least in the cross-compiler setup that this gh workflow uses (even when using latest LLVM with latest riscv-gnu-toolchain), causing test failures as the intermediate result0 = (float)A[ai] * (float) B[bi] in your sbgemm kernel turns the small bfloat16 numbers into huge floats...

@ChipKerchner
Copy link
Contributor

Scalar casting should just work from bfloat16 to float. I don't see any issue. These are the qemu flags I use.

qemu-riscv64 -cpu rv64,g=true,f=true,d=true,c=true,v=true,vlen=256,elen=64,vext_spec=v1.0,zfh=true,zvfh=true,zvfbfwma=true,rvv_ma_all_1s=true,rvv_ta_all_1s=true,zbc=true,zvbc=true -L /home/ckerchner/tools/tt-riscv-toolchain-ae8a01f3/sysroot

@ChipKerchner
Copy link
Contributor

Actually after I sync, I'm seeing a failure in sbgemm - sbgemv seems fine. BTW, I didn't write sbgemm.

@martin-frbg
Copy link
Collaborator Author

Thanks for the flags - unfortunately adding the missing ones did not change the outcome for me. And I'm getting SGEMV FAILURES: 789504 as well with that setup, while the BGEMM test passes (as do all float16 ones). Most likely your TT toolchain is more advanced, and I should just leave out the SB tests in this CI job for now ?
I just noticed the use of plain (float) casts in some of the code, while the tests all go to sbf16tos() for conversions.

@ChipKerchner
Copy link
Contributor

Are you saying that some architectures besides RISC-V are using plain casts to float while others are using a external function?

@ChipKerchner
Copy link
Contributor

BTW, I tried an external function and I'm still getting failures.

@martin-frbg
Copy link
Collaborator Author

Are you saying that some architectures besides RISC-V are using plain casts to float while others are using a external function?

No, on the contrary I see RISC-V using plain casts while everything else uses an external function.
And at least the first few intermediate calculations in the sbgemm_kernel_16x8_zvl256 seem to make more sense now that I've changed them from casts to using the float16to32 wrapper around sbf16tos as in the test helper header

@ChipKerchner
Copy link
Contributor

Strange thing is SHGEMM uses the same type casting and all pass there.

@martin-frbg
Copy link
Collaborator Author

Yes, this got me thinking that maybe there is a conflict between the compiler having (or being expected to have) some "native" support for a floating point "bf16" type and OpenBLAS' fallback solution of assuming bfloat16 is an uint_16.
Replacing all obvious casts with calls to the conversion function did not solve the test errors for me, however - a lot of the result matrix elements became similar enough to their SGEMM counterparts, but not all. And I have no way of finding out if it is the cross-compiler at fault, or qemu-riscv64 10.1 not handling all aspects of bfloat16 correctly. My Banana PI F3 does great for checking fp16 code but appears to lack support for the bfloat16 extensions

@ChipKerchner
Copy link
Contributor

Yes, unfortunately the BananaPi does NOT support the bf16 format.

Another weird thing is the test pass for sizes 1 -> 100 but fail for size = 256.

@ChipKerchner
Copy link
Contributor

Are you sure you don't need to set this environment variable instead of LD_LIBRARY_PATH?

QEMU_LD_PREFIX=/proj_sw/user_dev/ckerchner/tmp/tt-riscv-toolchain-20250709/sysroot

@martin-frbg
Copy link
Collaborator Author

Agree that QEMU_LD_PREFIX would be more elegant than abusing LD_LIBRARY_PATH combined with the ugly hack of crosslinking the riscv64 ld-linux into the host system path. But unfortunately this has no bearing on the main issue that this toolchain (or the most recent stable qemu) appears to produce completely bogus intermediate results (in the 2e5 to 2e6 range) from __riscv_vfwmaccfb16_vf_f32m1(result0, B0,A0,gvl). I trust your statement that it works on actual hardware, but having this in the CI job is going to be useless if basically every matrix element gets flagged as wrong.

@ChipKerchner
Copy link
Contributor

Actually I don't have actual HW for BF16 - it's all QEMU.

Maybe the initialization values should be between [-0.5, +0.5] for the test rather than [+0.5,+1.5]

@martin-frbg
Copy link
Collaborator Author

Curious - that would leave the toolchain difference if you're also using a regular release version of qemu.
No particular preference for the test values, but I note that the range provided by a simple rand/rand_max+0.5 worked well on all other platforms so far (and works for the BGEMM test too). Maybe the conversion between OpenBLAS' bfloat16 and the __bf16 type is doing something unexpected in clang-21.4+riscv-gnu-toolchain

@ChipKerchner
Copy link
Contributor

Maybe additional extensions are required.

https://github.com/riscv/riscv-bfloat16/blob/main/doc/riscv-bfloat16-zvfbfwma.adoc

Zvfbfwma - Vector BF16 widening mul-add
This extension provides a vector widening BF16 mul-add instruction that accumulates into FP32.

This extension requires the Zvfbfmin extension and the Zfbfmin extension.

@martin-frbg
Copy link
Collaborator Author

Hmm, I had always assumed these to be implied by the zvfbfwma. And indeed adding them to the compiler options does not change anything (and they were already in the qemu options).

@martin-frbg martin-frbg merged commit 3a9da52 into OpenMathLib:develop Nov 4, 2025
86 of 88 checks passed
@ChipKerchner
Copy link
Contributor

ChipKerchner commented Nov 4, 2025

I have 2 ideas of why test_sbgemm/v is failing.

  1. There is some conflict between BUILD_BFLOAT16 and BUILD_HFLOAT16
  2. On RISC-V, BF16 type is __bf16 and not bfloat16. Maybe it affects the conversions?

Maybe we should only test BUILD_BFLOAT16 and see if it still fails.

@ChipKerchner
Copy link
Contributor

ChipKerchner commented Nov 4, 2025

This looks wrong in gemmkernel_2x2.c:

             C0[0] = TO_OUTPUT(TO_F32(C0[0])+res0);

C0 is already a float32 and TO_F32 converts BF16 -> F32.

The same in gemv_n/t.c

            y[iy] = TO_OUTPUT(ALPHA * temp + BETA * TO_F32(y[iy]));

y is already a F32.

P.S. FP16 seems correct for TO_F32

@martin-frbg
Copy link
Collaborator Author

I have 2 ideas of why test_sbgemm/v is failing.

  1. There is some conflict between BUILD_BFLOAT16 and BUILD_HFLOAT16
  2. On RISC-V, BF16 type is __bf16 and not bfloat16. Maybe it affects the conversions?

Maybe we should only test BUILD_BFLOAT16 and see if it still fails.

  1. I don't think so - I see the same problem when building just BFLOAT16
  2. I had commented on the __bf16 type 4 days ago. In the sbgemm kernel, assignments from bfloat16 to this type produce an additional truncation of the value to just a single decimal digit. but so far I do not see why subsequent calculations then produce six-digit float32 results

@ChipKerchner
Copy link
Contributor

ChipKerchner commented Nov 4, 2025

There shouldn't be a conversion from bfloat16 to __bf16. It should be picked up as a __bf16 - something like *(__bf16 *)(&B[0]) instead of a cast.

Though maybe it would be easy to change the pointer types to __bf16?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants