minor fma cleanup #57041
base: master
Conversation
This removes the redundant `fma_llvm` function, and makes it so systems with Float16 fma can actually use it rather than the Float32 fallback path.
function fma(a::Float16, b::Float16, c::Float16)
    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
@assume_effects :consistent function fma(x::T, y::T, z::T) where {T<:IEEEFloat}
    Core.Intrinsics.have_fma(T) ? fma_float(x,y,z) : fma_emulated(x,y,z)
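(For context, a quick, machine-dependent way to see which branch this definition takes is to query the same intrinsic it calls; this is only an illustrative check, not part of the diff.)

julia> Core.Intrinsics.have_fma(Float64)   # typically true on x86-64 with FMA and on aarch64 -> fma_float path
julia> Core.Intrinsics.have_fma(Float16)   # decides whether Float16 gets the hardware path or fma_emulated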
My understanding is that `Core.Intrinsics.have_fma(Float16)` is always false at the moment, because of julia/src/llvm-cpufeatures.cpp, lines 50 to 76 in 9b1ea1a:
static bool have_fma(Function &intr, Function &caller, const Triple &TT) JL_NOTSAFEPOINT {
    auto unconditional = always_have_fma(intr, TT);
    if (unconditional)
        return *unconditional;
    auto intr_name = intr.getName();
    auto typ = intr_name.substr(strlen("julia.cpu.have_fma."));
    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_ExecutionEngine->getTargetFeatureString();
    SmallVector<StringRef, 128> Features;
    FS.split(Features, ',');
    for (StringRef Feature : Features)
        if (TT.isARM()) {
            if (Feature == "+vfp4")
                return typ == "f32" || typ == "f64";
            else if (Feature == "+vfp4sp")
                return typ == "f32";
        } else if (TT.isX86()) {
            if (Feature == "+fma" || Feature == "+fma4")
                return typ == "f32" || typ == "f64";
        }
    return false;
}
Actually, this may work on riscv64 with #57043, but I'm still not entirely sure about what's going on there.
Sure, at which point this is NFC for such architectures, but that can be fixed in a separate PR.
@@ -276,6 +276,9 @@ significantly more expensive than `x*y+z`. `fma` is used to improve accuracy in
algorithms. See [`muladd`](@ref).
"""
function fma end
function fma_emulated(a::Float16, b::Float16, c::Float16)
    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
I think this can be simplified from
    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
to
    muladd(a, b, c) #don't use fma if the hardware doesn't have it.
LLVM would automatically do the demotion to float as necessary nowadays.
Ironically, on aarch64 with the fp16 extension `muladd` is better than `fma` on Float16, because it doesn't force the Float16 -> Float32 -> Float16 dance:
julia> code_native(muladd, NTuple{3,Float16}; debuginfo=:none)
.text
.file "muladd"
.globl julia_muladd_1256 // -- Begin function julia_muladd_1256
.p2align 4
.type julia_muladd_1256,@function
julia_muladd_1256: // @julia_muladd_1256
; Function Signature: muladd(Float16, Float16, Float16)
// %bb.0: // %top
//DEBUG_VALUE: muladd:x <- $h0
//DEBUG_VALUE: muladd:x <- $h0
//DEBUG_VALUE: muladd:y <- $h1
//DEBUG_VALUE: muladd:y <- $h1
//DEBUG_VALUE: muladd:z <- $h2
//DEBUG_VALUE: muladd:z <- $h2
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
mov x29, sp
fmadd h0, h0, h1, h2
ldp x29, x30, [sp], #16 // 16-byte Folded Reload
ret
.Lfunc_end0:
.size julia_muladd_1256, .Lfunc_end0-julia_muladd_1256
// -- End function
.type ".L+Core.Float16#1258",@object // @"+Core.Float16#1258"
.section .rodata,"a",@progbits
.p2align 3, 0x0
".L+Core.Float16#1258":
.xword ".L+Core.Float16#1258.jit"
.size ".L+Core.Float16#1258", 8
.set ".L+Core.Float16#1258.jit", 281472349230944
.size ".L+Core.Float16#1258.jit", 8
.section ".note.GNU-stack","",@progbits
julia> code_native(fma, NTuple{3,Float16}; debuginfo=:none)
.text
.file "fma"
.globl julia_fma_1259 // -- Begin function julia_fma_1259
.p2align 4
.type julia_fma_1259,@function
julia_fma_1259: // @julia_fma_1259
; Function Signature: fma(Float16, Float16, Float16)
// %bb.0: // %top
//DEBUG_VALUE: fma:a <- $h0
//DEBUG_VALUE: fma:a <- $h0
//DEBUG_VALUE: fma:b <- $h1
//DEBUG_VALUE: fma:b <- $h1
//DEBUG_VALUE: fma:c <- $h2
//DEBUG_VALUE: fma:c <- $h2
stp x29, x30, [sp, #-16]! // 16-byte Folded Spill
fcvt s0, h0
fcvt s1, h1
mov x29, sp
fcvt s2, h2
fmadd s0, s0, s1, s2
fcvt h0, s0
ldp x29, x30, [sp], #16 // 16-byte Folded Reload
ret
.Lfunc_end0:
.size julia_fma_1259, .Lfunc_end0-julia_fma_1259
// -- End function
.type ".L+Core.Float16#1261",@object // @"+Core.Float16#1261"
.section .rodata,"a",@progbits
.p2align 3, 0x0
".L+Core.Float16#1261":
.xword ".L+Core.Float16#1261.jit"
.size ".L+Core.Float16#1261", 8
.set ".L+Core.Float16#1261.jit", 281472349230944
.size ".L+Core.Float16#1261.jit", 8
.section ".note.GNU-stack","",@progbits
No. `muladd` doesn't guarantee the accuracy that `fma` requires.
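For concreteness, a small Float64 illustration of the guarantee at stake (not from the PR, just illustrative): `fma` performs a single rounding of the exact product-and-sum, a separate multiply and add rounds twice, and `muladd` is allowed to compile to either form, so its last bit is not specified.

julia> x = 1.0 + 2.0^-27;   # chosen so the exact x*x has a low-order 2^-54 term

julia> fma(x, x, -1.0) == x*x - 1.0   # false: fma keeps the 2^-54 term that the two-rounding version drops
false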
I will also point out that pure f16 fma is not super useful as an operation. Most of the accelerators will do fp16 multiply with an f32 accumulator (and then potentially round back to f16 at the end).
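A sketch of that accumulator pattern (hypothetical `dot16` helper, not code from this PR): Float16 inputs are widened, the running sum stays in Float32, and only the final result is rounded back to Float16.

function dot16(xs::AbstractVector{Float16}, ys::AbstractVector{Float16})
    acc = 0.0f0                                           # Float32 accumulator
    @inbounds for i in eachindex(xs, ys)
        acc = muladd(Float32(xs[i]), Float32(ys[i]), acc) # fp16 multiply, f32 accumulate
    end
    return Float16(acc)                                   # potentially round back to f16 at the end
end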
Sure, but that's not what `Base.fma` does.
That's not really true of fp16: on aarch64 it's a true type which supports everything (with twice the throughput on SIMD). bf16 is like that, though.
The runtime intrinsic will need updating to model this correctly? It currently doesn't handle Float16 at all. See julia/src/runtime_intrinsics.c, lines 1724 and 1094 in 3ba504a.
@vchuravy can we leave that to a separate PR? This is a correct change even if our modeling is overly conservative.