Replace GLM with custom vectorised maths #37

@stuarthayhurst

Description

Calling value_ptr() and memcpy to correctly build buffers of vectors is a pain. Build some maths helpers to manipulate vectors instead, with the vectors implemented as raw blocks of memory. This'll allow GLM to be dropped as a dependency.

Vectorise with C++26's upcoming <simd> header.

Dependencies:

Introduction:

  • Add vector types and operations
  • Disable 8-bit and 16-bit types
    • This should reduce build times and the time spent fixing issues with those types
    • Helper functions (formatVector(), random vector fill) could then be simplified
    • Type promotions wouldn't have to be considered either
  • Add column-major matrix types and operations
    • Sizes from 2x2 up to 4x4, floating-point types only
    • Operations should include:
      • data, copy, copyCast, equal, set, diagonal, identity
      • add, sub
      • transpose, multiply
      • determinant, inverse
      • rotate, scale, translate
      • lookAt, perspective
  • Add quaternion types and operations
    • Templated type for floats and doubles
    • Operations should include:
      • data, copy, copyCast
      • fromEuler, toEuler
      • fromPitchYawRoll, toPitchYawRoll
      • dot, conjugate, length, normalise, inverse
      • multiply (quaternion-quaternion), multiply (quaternion-vector)
      • toMatrix
  • Provide degrees and radians conversions
  • Return a reference to the operand modified, if applicable
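
As a rough illustration of the shape this could take, here is a minimal sketch of vectors as raw blocks of memory, with operations that return a reference to the operand they modified. Every name in it (the `sketch` namespace, `Vec`, `set`, `add`, `sub`, `radians`, `degrees`) is a placeholder chosen for the example, not the project's actual API, and the loops are plain scalar code rather than SIMD.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// All names here are hypothetical; the issue does not fix an API.
namespace sketch {
  // A vector is a raw, fixed-size block of memory, so it can be copied into
  // a buffer as-is, without value_ptr()
  template <typename T, std::size_t size>
  using Vec = T[size];

  // Set every element of dest to value, return the modified operand
  template <typename T, std::size_t size>
  Vec<T, size>& set(Vec<T, size>& dest, T value) {
    for (std::size_t i = 0; i < size; i++) {
      dest[i] = value;
    }
    return dest;
  }

  // Element-wise dest += other, return the modified operand
  template <typename T, std::size_t size>
  Vec<T, size>& add(Vec<T, size>& dest, const Vec<T, size>& other) {
    for (std::size_t i = 0; i < size; i++) {
      dest[i] += other[i];
    }
    return dest;
  }

  // Element-wise dest -= other, return the modified operand
  template <typename T, std::size_t size>
  Vec<T, size>& sub(Vec<T, size>& dest, const Vec<T, size>& other) {
    for (std::size_t i = 0; i < size; i++) {
      dest[i] -= other[i];
    }
    return dest;
  }

  template <typename T>
  constexpr T pi = static_cast<T>(3.14159265358979323846L);

  // Degrees to radians
  template <typename T>
  constexpr T radians(T angle) {
    return angle * (pi<T> / static_cast<T>(180));
  }

  // Radians to degrees
  template <typename T>
  constexpr T degrees(T angle) {
    return angle * (static_cast<T>(180) / pi<T>);
  }
}
```

Returning the modified operand lets calls nest, for example `sketch::add(sketch::set(a, 1.0f), b)`, at the cost of making it less obvious at the call site which argument is written to.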

C++26 port:

  • Wait for libstdc++ to support <simd>
  • Swap <experimental/simd> for <simd>, drop linter override and experimental regex
  • Convert the existing functions to the new syntax
  • Mark functions as constexpr where possible
  • Make use of partial loads / stores, permute and gather / scatter to allow vectorisation
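
To make the partial load / store idea concrete, here is a scalar model of the behaviour, written in portable C++ rather than against the `<simd>` API. A 3-element vector doesn't fill a 4-lane register, so a partial load reads only the valid elements and fills the remaining lanes with a default, and a partial store writes back only the valid lanes. The names `roundWidth`, `partialLoad` and `partialStore` are made up for this sketch, and `roundWidth` is a hand-rolled stand-in for `std::bit_ceil`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helpers modelling the behaviour in scalar code; not the <simd> API.
namespace sketch {
  // Round n up to the next power of two (stand-in for std::bit_ceil)
  constexpr std::size_t roundWidth(std::size_t n) {
    std::size_t width = 1;
    while (width < n) {
      width *= 2;
    }
    return width;
  }

  // Read count elements from src, fill the remaining lanes with fill
  template <std::size_t width, typename T>
  std::array<T, width> partialLoad(const T* src, std::size_t count, T fill) {
    std::array<T, width> lanes{};
    for (std::size_t i = 0; i < width; i++) {
      lanes[i] = (i < count) ? src[i] : fill;
    }
    return lanes;
  }

  // Write only the first count lanes to dest, leaving memory past them untouched
  template <std::size_t width, typename T>
  void partialStore(const std::array<T, width>& lanes, T* dest, std::size_t count) {
    for (std::size_t i = 0; i < count && i < width; i++) {
      dest[i] = lanes[i];
    }
  }
}
```

With these, a 3-element vector can be processed at width `roundWidth(3) == 4` without reading or writing past its end. The same load-with-default shape would cover building a wider vector from a narrower one plus a scalar.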

C++26 optimisations:

  • Make use of new C++26 SIMD features:
    • Partial loads and stores
    • Gather / scatter loads and stores
    • Permute
    • <cmath> overloads
    • sum_to and multiply_to
  • Promote vectors, matrices and quaternions with bit_ceil
    • This isn't necessary if compilers can handle this automatically
    • This includes templated sizes and fixed-size vectors
  • Determine whether a helper for automatic full / partial stores is required
  • Speed up set(vector, vector, scalar)
    • Load the second vector into the width of the first with a partial load and default value of the scalar
    • Store to the first vector
    • Handle equivalent sizes or cases where the destination is smaller
    • Consider reordering arguments for destination argument position consistency
  • Speed up set([vector / matrix], scalar)
    • Broadcast the scalar to a width-rounded vector, then use a partial store
  • Matrix transpose using (load -> permute -> store) or ([gather / load] -> [store / scatter])
    • (load -> permute -> store) might be easier with width rounding
  • Vectorise equals by comparing floating point vectors
    • Started on simp-cmp, seems to have an issue with the experimental SIMD library
  • Speed up cross
    • Load the inputs, then permute a copy of each
    • Apply the arithmetic
    • Reorder and store the output
  • Speed up normalise
    • Square the vector elements
    • Rotate by one element to a copy, sum the vectors
      • Repeat with a two element rotation for lengths 3 and 4
      • Investigate guarding this behind checks for AVX and sufficient native vector size
        • Fall back to a reduce by addition otherwise
    • std::sqrt on the vector, then divide the original by it and store
    • Look for related functions
  • List out all functions to vectorise / optimise and work through them
  • Look into assembly optimisations
    • _mm_dp_ps(), _mm_dp_pd() and _mm256_dp_ps() for vector and quaternion dot products
    • These might also speed up vector and quaternion length calculations: take the dot product of the operand with itself, then call _mm_sqrt_ps(), _mm_sqrt_pd(), _mm256_sqrt_ps() or _mm256_sqrt_pd()
  • Verify performance and inspect assembly for optimisations
    • Target znver4, znver3 and ARM SVE(2)
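
The cross and normalise steps listed above can be checked with a scalar model before committing to real SIMD code. The sketch below emulates a 4-lane register with `std::array`. `permute` and `rotate` are hypothetical helpers standing in for the SIMD permute operations, and 3-element inputs are assumed to be zero-padded to 4 lanes, so that the rotate-and-sum reduction doesn't count any element twice.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Scalar model of 4-lane operations; all names are hypothetical.
namespace sketch {
  using Lanes = std::array<float, 4>;

  // out[i] = in[idx[i]]
  inline Lanes permute(const Lanes& in, const std::array<std::size_t, 4>& idx) {
    return {in[idx[0]], in[idx[1]], in[idx[2]], in[idx[3]]};
  }

  // Rotate lanes left by shift elements
  inline Lanes rotate(const Lanes& in, std::size_t shift) {
    Lanes out{};
    for (std::size_t i = 0; i < 4; i++) {
      out[i] = in[(i + shift) % 4];
    }
    return out;
  }

  // dest = a x b: permute a copy of each input, apply the arithmetic, reorder
  inline void cross(const float (&a)[3], const float (&b)[3], float (&dest)[3]) {
    const Lanes la = {a[0], a[1], a[2], 0.0f};
    const Lanes lb = {b[0], b[1], b[2], 0.0f};
    const std::array<std::size_t, 4> yzx = {1, 2, 0, 3};
    const Lanes aYzx = permute(la, yzx);
    const Lanes bYzx = permute(lb, yzx);

    // t = a * b.yzx - a.yzx * b, which holds (z, x, y) of the result
    Lanes t{};
    for (std::size_t i = 0; i < 4; i++) {
      t[i] = (la[i] * bYzx[i]) - (aYzx[i] * lb[i]);
    }

    // Reorder to (x, y, z) and store the first three lanes
    const Lanes result = permute(t, yzx);
    for (std::size_t i = 0; i < 3; i++) {
      dest[i] = result[i];
    }
  }

  // Normalise v in place: square, rotate-and-sum by 1 then 2, sqrt, divide
  inline void normalise(float (&v)[3]) {
    Lanes sum = {v[0] * v[0], v[1] * v[1], v[2] * v[2], 0.0f};
    for (std::size_t shift = 1; shift <= 2; shift++) {
      const Lanes rotated = rotate(sum, shift);
      for (std::size_t i = 0; i < 4; i++) {
        sum[i] += rotated[i];
      }
    }

    // Every lane now holds the squared length
    for (std::size_t i = 0; i < 3; i++) {
      v[i] /= std::sqrt(sum[i]);
    }
  }
}
```

The reduction leaves the squared length in every lane, so the divide needs no separate broadcast. The sketch does not handle zero-length input, where the divide would produce NaN.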

Rewrite GLM-dependent functions:

  • Rewrite matrix operations to be GLM-independent:
    • transpose, multiply
      • Use scatter / gather or load, permute, store
    • determinant, inverse
      • Hard-coding an equation for each element, specialised for each size, might be faster than using a general algorithm
    • rotate, scale, translate
      • These can be derived, or have well-known algorithms
    • lookAt, perspective
      • These both have well-known algorithms
  • Rewrite quaternion operations to be GLM-independent:
  • Benchmark alternative implementations for operations that can be implemented with scatter / gather / permute
    • Consider using <linalg> instead of manual implementations
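
As an example of the hard-coded, per-size approach to determinant and inverse, here is a minimal sketch with one overload per matrix size. It assumes matrices are stored as `m[column][row]`. The names `Mat`, `determinant` and `inverse`, and the choice to report a singular matrix through a `bool` return, are assumptions made for the example rather than the project's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical names; matrices are assumed to be indexed as m[column][row].
namespace sketch {
  template <typename T, std::size_t n>
  using Mat = T[n][n];

  // 2x2 determinant, written out directly
  template <typename T>
  T determinant(const Mat<T, 2>& m) {
    return (m[0][0] * m[1][1]) - (m[1][0] * m[0][1]);
  }

  // 3x3 determinant by cofactor expansion, written out directly
  template <typename T>
  T determinant(const Mat<T, 3>& m) {
    return (m[0][0] * ((m[1][1] * m[2][2]) - (m[2][1] * m[1][2])))
         - (m[1][0] * ((m[0][1] * m[2][2]) - (m[2][1] * m[0][2])))
         + (m[2][0] * ((m[0][1] * m[1][2]) - (m[1][1] * m[0][2])));
  }

  // 2x2 inverse, one equation per element
  // Returns false and leaves dest untouched if m is singular
  template <typename T>
  bool inverse(const Mat<T, 2>& m, Mat<T, 2>& dest) {
    const T det = determinant(m);
    if (det == static_cast<T>(0)) {
      return false;
    }

    dest[0][0] = m[1][1] / det;
    dest[0][1] = -m[0][1] / det;
    dest[1][0] = -m[1][0] / det;
    dest[1][1] = m[0][0] / det;
    return true;
  }
}
```

Overload resolution picks the right equation from the array bounds, so there is no runtime dispatch on size. Whether this actually beats a general algorithm would still need benchmarking.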

GLM removal:

  • Started on remove-glm
  • Convert codebase to new maths functions
    • Remove explicit client code / build system dependencies
  • Rewrite any maths functions that are dependent on GLM
  • Remove from build system, docs, workflows and pkgconf
  • Remove GLM header filter for clang-tidy

Clean up:

  • Compile with -fno-math-errno and -march=native
  • Mention the vectorised maths in the README
  • Bump minimum versions (GCC, clang, clang-tidy) for docs and workflows

Metadata

Assignees: No one assigned

Labels: feature request (New feature or request), waiting (Waiting on something else to progress)

Status: In Progress

Milestone: No milestone