Calling value_ptr() and memcpy to correctly build buffers of vectors is a pain; build some maths helpers to manipulate vectors, implemented as raw blocks of memory instead. This will allow GLM to be dropped as a dependency.
Return a reference to the operand modified, if applicable
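A minimal sketch of what such a raw-memory helper could look like (the `add` name and signature are assumptions, not the project's actual API), including the return-a-reference convention noted above:

```cpp
#include <cstddef>

// Sketch of a raw-memory vector helper: a vector is a plain float array,
// so callers never need value_ptr() or memcpy to build buffers.
template <std::size_t Size>
auto& add(float (&augend)[Size], const float (&addend)[Size])
{
    for (std::size_t i = 0; i < Size; ++i)
        augend[i] += addend[i];
    return augend;  // Reference to the operand that was modified
}
```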
C++26 port:
Wait for libstdc++ to support <simd>
Swap <experimental/simd> for <simd>, drop linter override and experimental regex
Convert the existing functions to the new syntax
Mark functions as constexpr where possible
Make use of partial loads / stores, permute and gather / scatter to allow vectorisation
C++26 optimisations:
Make use of new C++26 SIMD features:
Partial loads and stores
Gather / scatter loads and stores
Permute
<cmath> overloads
sum_to and multiply_to
Promote vectors, matrices and quaternions with bit_ceil
This isn't necessary if compilers can handle this automatically
This includes templated sizes and fixed-size vectors
Determine whether a helper for automatic full / partial stores is required
Speed up set(vector, vector, scalar)
Load the second vector into the width of the first with a partial load and default value of the scalar
Store to the first vector
Handle equivalent sizes or cases where the destination is smaller
Consider reordering arguments for destination argument position consistency
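A scalar model of the set(vector, vector, scalar) semantics described above, including the equal-size and smaller-destination cases; a SIMD version would replace the loops with one partial load (using the scalar as the default value) and one store:

```cpp
#include <algorithm>
#include <cstddef>

// Scalar sketch: copy the source into the destination and fill any
// remaining destination elements with the scalar. Handles equal sizes
// and destinations smaller than the source.
template <std::size_t DestSize, std::size_t SrcSize>
void set(float (&dest)[DestSize], const float (&src)[SrcSize], float fill)
{
    constexpr std::size_t copied = std::min(DestSize, SrcSize);
    for (std::size_t i = 0; i < copied; ++i)
        dest[i] = src[i];            // Elements covered by the source
    for (std::size_t i = copied; i < DestSize; ++i)
        dest[i] = fill;              // Remainder defaults to the scalar
}
```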
Speed up set([vector / matrix], scalar)
Broadcast the scalar to a width-rounded vector, then use a partial store
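A scalar model of the broadcast-then-store pattern, with a fixed width of 4 standing in for the native register width (an assumption for illustration):

```cpp
#include <cstddef>

// Sketch: broadcast the scalar into one register-sized chunk, issue
// full-width stores for whole chunks, then a partial store for the tail.
template <std::size_t Size>
void set(float (&dest)[Size], float value)
{
    constexpr std::size_t width = 4;       // Assumed native SIMD width
    float chunk[width];
    for (std::size_t i = 0; i < width; ++i)
        chunk[i] = value;                  // Broadcast once

    std::size_t i = 0;
    for (; i + width <= Size; i += width)  // Full-width stores
        for (std::size_t j = 0; j < width; ++j)
            dest[i + j] = chunk[j];
    for (std::size_t j = 0; i + j < Size; ++j)
        dest[i + j] = chunk[j];            // Partial store for the tail
}
```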
Matrix transpose using (load -> permute -> store) or ([gather / load] -> [store / scatter])
(load -> permute -> store) might be easier with width rounding
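The load -> permute -> store idea can be modelled in scalar code: treat the 4x4 matrix as 16 contiguous floats and write them back through a fixed permutation, which is exactly what an in-register SIMD permute would do (row-major storage is an assumption here):

```cpp
#include <cstddef>

// Load -> permute -> store sketch for a 4x4 transpose over a flat,
// row-major block of 16 floats.
void transpose4x4(const float (&in)[16], float (&out)[16])
{
    for (std::size_t row = 0; row < 4; ++row)
        for (std::size_t col = 0; col < 4; ++col)
            out[col * 4 + row] = in[row * 4 + col];  // Permuted store
}
```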
Vectorise equals by comparing floating point vectors
Started on simp-cmp; it seems to have an issue with the experimental SIMD library
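A scalar model of the vectorised equals: a SIMD version would compare all lanes in one instruction and reduce the resulting mask with all_of, which the loop below mimics lane by lane:

```cpp
#include <cstddef>

// Scalar sketch of a vectorised equals: build the comparison "mask" one
// lane at a time, then reduce it (the all_of step).
template <std::size_t Size>
bool equal(const float (&a)[Size], const float (&b)[Size])
{
    bool all = true;
    for (std::size_t i = 0; i < Size; ++i)
        all &= (a[i] == b[i]);  // One lane of the comparison mask
    return all;                 // all_of over the mask
}
```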
Speed up cross
Load the inputs, then permute a copy of each
Apply the arithmetic
Reorder and store the output
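The permute pattern above follows the standard cross-product identity cross(a, b) = a.yzx * b.zxy - a.zxy * b.yzx; a scalar sketch, with the permuted copies made explicit:

```cpp
#include <cstddef>

// Cross product via permuted copies of each input; a SIMD version would
// keep everything in registers between the permutes and the final store.
void cross(const float (&a)[3], const float (&b)[3], float (&out)[3])
{
    const float aYzx[3]{a[1], a[2], a[0]};  // Permuted copy of a
    const float bZxy[3]{b[2], b[0], b[1]};  // Permuted copy of b
    const float aZxy[3]{a[2], a[0], a[1]};
    const float bYzx[3]{b[1], b[2], b[0]};
    for (std::size_t i = 0; i < 3; ++i)
        out[i] = aYzx[i] * bZxy[i] - aZxy[i] * bYzx[i];
}
```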
Speed up normalise
Square the vector elements
Rotate by one element to a copy, sum the vectors
Repeat with a two element rotation for lengths 3 and 4
Investigate guarding this behind checks for AVX and sufficient native vector size
Fall back to a reduce by addition otherwise
std::sqrt on the vector, then divide the original by it and store
Look for related functions
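A scalar walk-through of the normalise steps above for length 4: the rotate-by-one and rotate-by-two additions model the log2(n) SIMD reduction, after which every lane holds the full sum of squares:

```cpp
#include <cmath>
#include <cstddef>

// Scalar sketch of normalise: square, sum via rotated additions, square
// root, then divide the original vector by the length.
void normalise(float (&v)[4])
{
    float sq[4];
    for (std::size_t i = 0; i < 4; ++i)
        sq[i] = v[i] * v[i];                 // Square the elements

    float rot1[4];                           // Rotate by one element, add
    for (std::size_t i = 0; i < 4; ++i)
        rot1[i] = sq[(i + 1) % 4] + sq[i];
    float rot2[4];                           // Rotate by two elements, add
    for (std::size_t i = 0; i < 4; ++i)
        rot2[i] = rot1[(i + 2) % 4] + rot1[i];

    const float length = std::sqrt(rot2[0]); // Every lane holds the sum
    for (std::size_t i = 0; i < 4; ++i)
        v[i] /= length;
}
```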
List out all functions to vectorise / optimise and work through them
Look into assembly optimisations
_mm_dp_ps(), _mm_dp_pd() and _mm256_dp_ps() for vector and quaternion dot products
These might be faster than the current vector and quaternion length calculation: take the dot product of the vector with itself, then call _mm_sqrt_ps(), _mm_sqrt_pd(), _mm256_sqrt_ps() or _mm256_sqrt_pd()
Verify performance and inspect assembly for optimisations
Target znver4, znver3 and ARM SVE(2)
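The length-via-dot-product idea in portable scalar form; on x86 the two steps would map onto _mm_dp_ps() plus _mm_sqrt_ps(), but as noted above, whether that beats the existing length code needs benchmarking:

```cpp
#include <cmath>
#include <cstddef>

// length(v) = sqrt(dot(v, v)) — the structure the dot-product intrinsics
// would accelerate. Names and signatures here are illustrative only.
template <std::size_t Size>
float dot(const float (&a)[Size], const float (&b)[Size])
{
    float sum = 0.0f;
    for (std::size_t i = 0; i < Size; ++i)
        sum += a[i] * b[i];
    return sum;
}

template <std::size_t Size>
float length(const float (&v)[Size])
{
    return std::sqrt(dot(v, v));  // Dot product with itself, then sqrt
}
```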
Rewrite GLM-dependent functions:
Rewrite matrix operations to be GLM-independent:
transpose, multiply
Use scatter / gather or load, permute, store
determinant, inverse
Hard-coding an equation for each element, specialised for each size might be faster than using a general algorithm
rotate, scale, translate
These can be derived, or have well known algorithms
lookAt, perspective
These both have well-known algorithms
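Hard-coded determinants of the kind suggested above, specialised per size (row-major, flat storage is an assumption); a 4x4 version would expand along a row using the 3x3 form as cofactors:

```cpp
// Determinants hard-coded per matrix size instead of a general algorithm.
float determinant2x2(const float (&m)[4])
{
    return m[0] * m[3] - m[1] * m[2];
}

float determinant3x3(const float (&m)[9])
{
    // Cofactor expansion along the first row
    return m[0] * (m[4] * m[8] - m[5] * m[7])
         - m[1] * (m[3] * m[8] - m[5] * m[6])
         + m[2] * (m[3] * m[7] - m[4] * m[6]);
}
```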
Rewrite quaternion operations to be GLM-independent:
fromEuler, toEuler
fromPitchYawRoll, toPitchYawRoll
dot, conjugate, length, normalise, inverse
multiply (quaternion-quaternion), multiply (quaternion-vector)
toMatrix
Consider <linalg> instead of manual implementations
Dependencies:
enable-arch-control with ARCH=x86-64 for debugging until it's fixed properly, tracked in Compiler and tooling workarounds / fixes #33
Introduction:
Vectorise with C++26's upcoming <simd> header
Existing helpers (formatVector(), random vector fill) could then be simplified
Matrices from 2x2 up to 4x4, floating-point types only
Matrix functions: data, copy, copyCast, equal, set, diagonal, identity, add, sub, transpose, multiply, determinant, inverse, rotate, scale, translate, lookAt, perspective
Quaternion functions: data, copy, copyCast, fromEuler, toEuler, fromPitchYawRoll, toPitchYawRoll, dot, conjugate, length, normalise, inverse, multiply (quaternion-quaternion), multiply (quaternion-vector), toMatrix
GLM removal:
Remove GLM on the remove-glm branch
Clean up:
Re-evaluate -fno-math-errno and -march=native