Conversation
This removes the redundant `fma_llvm` function, and makes it so systems with Float16 fma can actually use it rather than the Float32 fallback path.
```diff
-function fma(a::Float16, b::Float16, c::Float16)
-    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
+@assume_effects :consistent function fma(x::T, y::T, z::T) where {T<:IEEEFloat}
+    Core.Intrinsics.have_fma(T) ? fma_float(x,y,z) : fma_emulated(x,y,z)
```
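The dispatch above can be condensed into a runnable sketch. Note the names here are mine, not the PR's: `fma16` and `fma16_emulated` are hypothetical, and `Base.fma` stands in for the `fma_float` intrinsic wrapper.

```julia
# Sketch (simplified, hypothetical names) of the PR's dispatch: use the
# hardware fma when the target has one for Float16, otherwise emulate by
# widening to Float32, where the product and sum are exact enough to round
# correctly back to Float16.
fma16_emulated(x, y, z) = Float16(muladd(Float32(x), Float32(y), Float32(z)))

fma16(x::Float16, y::Float16, z::Float16) =
    Core.Intrinsics.have_fma(Float16) ? fma(x, y, z) : fma16_emulated(x, y, z)
```

Either branch returns the correctly rounded result; the difference is only whether the hardware or the Float32 fallback does the work.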
My understanding is that `Core.Intrinsics.have_fma(Float16)` is always false at the moment, because of the check in `julia/src/llvm-cpufeatures.cpp`, lines 50 to 76 (at 9b1ea1a).
Actually, this may work on riscv64 with #57043, but I'm still not entirely sure about what's going on there.
Sure, at which point this is NFC for such architectures, but that can be fixed in a separate PR.
```julia
"""
function fma end

function fma_emulated(a::Float16, b::Float16, c::Float16)
    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
```
I think this can be simplified to

```diff
-    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
+    muladd(a, b, c) #don't use fma if the hardware doesn't have it.
```

LLVM would automatically do the demotion to float as necessary nowadays.
Ironically, on aarch64 with fp16 extension muladd is better than fma on Float16 because it doesn't force the Float16 -> Float32 -> Float16 dance:
```julia-repl
julia> code_native(muladd, NTuple{3,Float16}; debuginfo=:none)
	.text
	.file	"muladd"
	.globl	julia_muladd_1256               // -- Begin function julia_muladd_1256
	.p2align	4
	.type	julia_muladd_1256,@function
julia_muladd_1256:                      // @julia_muladd_1256
; Function Signature: muladd(Float16, Float16, Float16)
// %bb.0:                               // %top
	//DEBUG_VALUE: muladd:x <- $h0
	//DEBUG_VALUE: muladd:x <- $h0
	//DEBUG_VALUE: muladd:y <- $h1
	//DEBUG_VALUE: muladd:y <- $h1
	//DEBUG_VALUE: muladd:z <- $h2
	//DEBUG_VALUE: muladd:z <- $h2
	stp	x29, x30, [sp, #-16]!           // 16-byte Folded Spill
	mov	x29, sp
	fmadd	h0, h0, h1, h2
	ldp	x29, x30, [sp], #16             // 16-byte Folded Reload
	ret
.Lfunc_end0:
	.size	julia_muladd_1256, .Lfunc_end0-julia_muladd_1256
                                        // -- End function
	.type	".L+Core.Float16#1258",@object  // @"+Core.Float16#1258"
	.section	.rodata,"a",@progbits
	.p2align	3, 0x0
".L+Core.Float16#1258":
	.xword	".L+Core.Float16#1258.jit"
	.size	".L+Core.Float16#1258", 8
	.set	".L+Core.Float16#1258.jit", 281472349230944
	.size	".L+Core.Float16#1258.jit", 8
	.section	".note.GNU-stack","",@progbits

julia> code_native(fma, NTuple{3,Float16}; debuginfo=:none)
	.text
	.file	"fma"
	.globl	julia_fma_1259                  // -- Begin function julia_fma_1259
	.p2align	4
	.type	julia_fma_1259,@function
julia_fma_1259:                         // @julia_fma_1259
; Function Signature: fma(Float16, Float16, Float16)
// %bb.0:                               // %top
	//DEBUG_VALUE: fma:a <- $h0
	//DEBUG_VALUE: fma:a <- $h0
	//DEBUG_VALUE: fma:b <- $h1
	//DEBUG_VALUE: fma:b <- $h1
	//DEBUG_VALUE: fma:c <- $h2
	//DEBUG_VALUE: fma:c <- $h2
	stp	x29, x30, [sp, #-16]!           // 16-byte Folded Spill
	fcvt	s0, h0
	fcvt	s1, h1
	mov	x29, sp
	fcvt	s2, h2
	fmadd	s0, s0, s1, s2
	fcvt	h0, s0
	ldp	x29, x30, [sp], #16             // 16-byte Folded Reload
	ret
.Lfunc_end0:
	.size	julia_fma_1259, .Lfunc_end0-julia_fma_1259
                                        // -- End function
	.type	".L+Core.Float16#1261",@object  // @"+Core.Float16#1261"
	.section	.rodata,"a",@progbits
	.p2align	3, 0x0
".L+Core.Float16#1261":
	.xword	".L+Core.Float16#1261.jit"
	.size	".L+Core.Float16#1261", 8
	.set	".L+Core.Float16#1261.jit", 281472349230944
	.size	".L+Core.Float16#1261.jit", 8
	.section	".note.GNU-stack","",@progbits
```
No, `muladd` doesn't guarantee the accuracy that `fma` requires.
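To make the accuracy difference concrete, here is a small example of my own (not from the thread) where a fused multiply-add and a separately rounded multiply-then-add disagree in Float16:

```julia
# Example (mine, not from the PR): double rounding loses the residual that
# a single fused rounding keeps.
a = nextfloat(Float16(1))     # 1 + 2^-10
c = -Float16(1 + 2.0^-9)      # exactly representable in Float16

fused   = fma(a, a, c)        # one rounding: the residual 2^-20 survives
unfused = a * a + c           # a*a rounds to 1 + 2^-9, so the residual is lost
```

Here `fused` is `Float16(2^-20)` while `unfused` is zero, which is exactly the kind of result `muladd` is free to produce but `fma` must not.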
I will also point out that pure fp16 fma is not super useful as an operation. Most accelerators will do an fp16 multiply with an fp32 accumulator (and then potentially round back to fp16 at the end).
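As a sketch of that accelerator-style pattern (my illustration, not code from the PR, and the function name is hypothetical): Float16 operands are multiplied into a Float32 accumulator, with an optional final round back down.

```julia
# Sketch (mine, not from the PR) of fp16-multiply / fp32-accumulate:
# each product is formed exactly in Float32 and accumulated there,
# deferring any rounding to Float16 until the very end.
function dot_f16_f32acc(xs::Vector{Float16}, ys::Vector{Float16})
    acc = 0.0f0
    for (x, y) in zip(xs, ys)
        acc = muladd(Float32(x), Float32(y), acc)   # accumulate in Float32
    end
    return acc   # use Float16(acc) if a Float16 result is wanted
end
```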
sure, but that's not what Base.fma does.
That’s not really true of fp16: on aarch64 it’s a true type which supports everything (with twice the throughput on SIMD). bf16 is like that, though.
The runtime intrinsic will need updating to model this correctly? It currently doesn't handle Float16 at all. See `julia/src/runtime_intrinsics.c`, line 1724 and line 1094 (at 3ba504a).
@vchuravy can we leave that to a separate PR? This is a correct change even if our modeling is overly conservative.
Talked with Valentin and he said that this is good to go.