minor fma cleanup #57041

Open · oscardssmith wants to merge 1 commit into master
Conversation

oscardssmith (Member)

This removes the redundant `fma_llvm` function, and makes it so systems with Float16 fma can actually use it rather than the Float32 fallback path.
@oscardssmith added the performance (Must go faster), maths (Mathematical functions), and float16 labels on Jan 14, 2025
-function fma(a::Float16, b::Float16, c::Float16)
-    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
+@assume_effects :consistent function fma(x::T, y::T, z::T) where {T<:IEEEFloat}
+    Core.Intrinsics.have_fma(T) ? fma_float(x,y,z) : fma_emulated(x,y,z)
Contributor

My understanding is that Core.Intrinsics.have_fma(Float16) is always false at the moment because

static bool have_fma(Function &intr, Function &caller, const Triple &TT) JL_NOTSAFEPOINT {
    auto unconditional = always_have_fma(intr, TT);
    if (unconditional)
        return *unconditional;
    auto intr_name = intr.getName();
    auto typ = intr_name.substr(strlen("julia.cpu.have_fma."));
    Attribute FSAttr = caller.getFnAttribute("target-features");
    StringRef FS =
        FSAttr.isValid() ? FSAttr.getValueAsString() : jl_ExecutionEngine->getTargetFeatureString();
    SmallVector<StringRef, 128> Features;
    FS.split(Features, ',');
    for (StringRef Feature : Features)
        if (TT.isARM()) {
            if (Feature == "+vfp4")
                return typ == "f32" || typ == "f64";
            else if (Feature == "+vfp4sp")
                return typ == "f32";
        } else if (TT.isX86()) {
            if (Feature == "+fma" || Feature == "+fma4")
                return typ == "f32" || typ == "f64";
        }
    return false;
}
only checks Float32/Float64 extensions. @gbaraldi is that correct?
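For reference, this is easy to check from the Julia REPL on a given machine; the session below is illustrative and consistent with the code above (actual results depend on the host CPU and may change):

julia> Core.Intrinsics.have_fma(Float16)
false

julia> Core.Intrinsics.have_fma(Float64)   # true on hardware with an FMA unit
true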

Contributor

Actually, this may work on riscv64 with #57043, but I'm still not entirely sure about what's going on there.

Member Author

Sure, at which point this is NFC (no functional change) for such architectures, but that can be fixed in a separate PR.

@@ -276,6 +276,9 @@ significantly more expensive than `x*y+z`. `fma` is used to improve accuracy in
 algorithms. See [`muladd`](@ref).
 """
 function fma end
+function fma_emulated(a::Float16, b::Float16, c::Float16)
+    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
giordano (Contributor), Jan 14, 2025

I think this can be simplified to

Suggested change:
-    Float16(muladd(Float32(a), Float32(b), Float32(c))) #don't use fma if the hardware doesn't have it.
+    muladd(a, b, c) #don't use fma if the hardware doesn't have it.

LLVM would automatically do the demotion to float as necessary nowadays.

giordano (Contributor), Jan 14, 2025

Ironically, on aarch64 with the fp16 extension, muladd is better than fma on Float16 because it doesn't force the Float16 -> Float32 -> Float16 dance:

julia> code_native(muladd, NTuple{3,Float16}; debuginfo=:none)
        .text
        .file   "muladd"
        .globl  julia_muladd_1256               // -- Begin function julia_muladd_1256
        .p2align        4
        .type   julia_muladd_1256,@function
julia_muladd_1256:                      // @julia_muladd_1256
; Function Signature: muladd(Float16, Float16, Float16)
// %bb.0:                               // %top
        //DEBUG_VALUE: muladd:x <- $h0
        //DEBUG_VALUE: muladd:x <- $h0
        //DEBUG_VALUE: muladd:y <- $h1
        //DEBUG_VALUE: muladd:y <- $h1
        //DEBUG_VALUE: muladd:z <- $h2
        //DEBUG_VALUE: muladd:z <- $h2
        stp     x29, x30, [sp, #-16]!           // 16-byte Folded Spill
        mov     x29, sp
        fmadd   h0, h0, h1, h2
        ldp     x29, x30, [sp], #16             // 16-byte Folded Reload
        ret
.Lfunc_end0:
        .size   julia_muladd_1256, .Lfunc_end0-julia_muladd_1256
                                        // -- End function
        .type   ".L+Core.Float16#1258",@object  // @"+Core.Float16#1258"
        .section        .rodata,"a",@progbits
        .p2align        3, 0x0
".L+Core.Float16#1258":
        .xword  ".L+Core.Float16#1258.jit"
        .size   ".L+Core.Float16#1258", 8

.set ".L+Core.Float16#1258.jit", 281472349230944
        .size   ".L+Core.Float16#1258.jit", 8
        .section        ".note.GNU-stack","",@progbits

julia> code_native(fma, NTuple{3,Float16}; debuginfo=:none)
        .text
        .file   "fma"
        .globl  julia_fma_1259                  // -- Begin function julia_fma_1259
        .p2align        4
        .type   julia_fma_1259,@function
julia_fma_1259:                         // @julia_fma_1259
; Function Signature: fma(Float16, Float16, Float16)
// %bb.0:                               // %top
        //DEBUG_VALUE: fma:a <- $h0
        //DEBUG_VALUE: fma:a <- $h0
        //DEBUG_VALUE: fma:b <- $h1
        //DEBUG_VALUE: fma:b <- $h1
        //DEBUG_VALUE: fma:c <- $h2
        //DEBUG_VALUE: fma:c <- $h2
        stp     x29, x30, [sp, #-16]!           // 16-byte Folded Spill
        fcvt    s0, h0
        fcvt    s1, h1
        mov     x29, sp
        fcvt    s2, h2
        fmadd   s0, s0, s1, s2
        fcvt    h0, s0
        ldp     x29, x30, [sp], #16             // 16-byte Folded Reload
        ret
.Lfunc_end0:
        .size   julia_fma_1259, .Lfunc_end0-julia_fma_1259
                                        // -- End function
        .type   ".L+Core.Float16#1261",@object  // @"+Core.Float16#1261"
        .section        .rodata,"a",@progbits
        .p2align        3, 0x0
".L+Core.Float16#1261":
        .xword  ".L+Core.Float16#1261.jit"
        .size   ".L+Core.Float16#1261", 8

.set ".L+Core.Float16#1261.jit", 281472349230944
        .size   ".L+Core.Float16#1261.jit", 8
        .section        ".note.GNU-stack","",@progbits

Member Author

No, muladd doesn't guarantee the accuracy that fma requires.
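For context (an illustrative aside, not part of the thread): muladd is free to compile to either a fused or an unfused multiply-add, so it carries no single-rounding guarantee, whereas fma does. A classic Float64 case where the two differ:

julia> x = 1.0 + 2.0^-27; y = 1.0 - 2.0^-27;

julia> x * y - 1.0          # two roundings: the product rounds to 1.0, the residual is lost
0.0

julia> fma(x, y, -1.0)      # single rounding keeps the exact residual, -2^-54
-5.551115123125783e-17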

Member

I will also point out that pure f16 fma is not super useful as an operation. Most of the accelerators will do fp16 multiply with an f32 accumulator (and then potentially round back to f16 at the end).
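As a rough sketch of the pattern described above (a hypothetical helper, not part of Base or this PR): Float16 operands are widened, multiplied and accumulated in Float32, and only rounded back to Float16 at the end.

# Hypothetical illustration of fp16-multiply / fp32-accumulate.
function dot_f16_acc32(x::AbstractVector{Float16}, y::AbstractVector{Float16})
    acc = 0.0f0                        # Float32 accumulator
    @inbounds for i in eachindex(x, y)
        acc = muladd(Float32(x[i]), Float32(y[i]), acc)
    end
    return Float16(acc)                # optionally round back to Float16
end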

Member Author

sure, but that's not what Base.fma does.

Member

That’s not really true of fp16: on aarch64 it’s a true type which supports everything (with twice the throughput on SIMD). bf16 is like that, though.

vchuravy (Member)

The runtime intrinsic will need updating to model this correctly? It currently doesn't handle Float16 at all.

JL_DLLEXPORT jl_value_t *jl_have_fma(jl_value_t *typ)

JL_DLLEXPORT jl_value_t *jl_cpu_has_fma(int bits)

oscardssmith (Member Author)

@vchuravy can we leave that to a separate PR? this is a correct change even if our modeling is overly conservative.
