This document captures the architectural decisions, design patterns, and implementation strategies for the libneo Fortran library.
Problem: Inner subroutines that access variables from their parent scope create trampolines, which cause:
- Performance overhead from indirect function calls
- Poor compiler optimization opportunities
- Security issues on some systems (executable stack required)
- Cache misses and instruction pipeline stalls
Solution: Refactor all inner subroutines to module-level procedures:
- Move inner subroutines to module scope
- Pass all required data explicitly as arguments
- Use derived types to bundle related parameters when argument lists become long
Implementation Strategy:
- Phase 1 - Discovery: Identify all inner subroutines accessing outer variables
- Phase 2 - Tests: Ensure adequate tests exist before refactoring
- Phase 3 - Refactoring: Move inner subroutines to module level
- Phase 4 - Validation: Verify performance improvements and correctness
- Location:
integrate_RZ_along_fieldlinecontains inner subroutinefieldline_derivative - Issue:
fieldline_derivativeaccessesfieldparameter from outer scope - Solution: Move to module level, pass
fieldas argument or use context parameter - Tests: Exists in
test/poincare/test_poincare.f90 - Priority: HIGH - Used in field line integration (performance critical)
Issue: #119 Approach:
- The inner subroutine
fieldline_derivativeis already using the context pattern correctly - It receives
fieldthrough thecontextparameter in the interface - However, it's still an inner subroutine which can cause compiler issues
- Solution: Keep using context pattern but move subroutine to module level
Refactoring Steps:
- Move
fieldline_derivativefrom insideintegrate_RZ_along_fieldlineto module level - Rename to
poincare_fieldline_derivativeto avoid naming conflicts - Update the call to
odeint_allroutinesto use the module-level procedure - Verify tests still pass
Issue: #119 Results:
- Comprehensive scan completed: Only ONE instance found
- Location:
src/poincare.f90-fieldline_derivativeinner subroutine - No other inner subroutines with contains blocks in the codebase
- Many files have multiple contains blocks but they are type definitions or module-level (both OK)
Issue: #119 Approach:
- Create benchmark for poincare integration before refactoring
- Measure cache misses using perf tools
- Compare performance after refactoring
- Document improvements in issue
The library uses an abstract field interface (field_t) with concrete implementations for different field types. This allows polymorphic behavior while maintaining performance.
For ODE integration and similar callbacks, we use a context parameter pattern that allows passing arbitrary data to callback functions without global state.
- Use column-major ordering for arrays (Fortran default)
- Prefer Structure of Arrays (SoA) over Array of Structures (AoS)
- Allocate large arrays as allocatable for heap storage
- Avoid inner subroutines accessing outer variables
- Use
pureandelementalwhere possible - Pass data explicitly rather than through module variables
- Keep functions under 100 lines (target 50)
- All refactored code must have tests before modification
- Tests must verify correctness and key performance characteristics where applicable
- Use the existing test framework in
test/
- Primary: CMake with Ninja
- Secondary: fpm (Fortran Package Manager)
- All changes must pass CI pipeline
Support GPU-ready inner loops without forcing a specific backend on downstream codes, and without accidental device-host copies inside tight iteration loops.
The batch spline layer now includes many-point evaluation APIs alongside the existing single-point batch-over-quantities routines:
- 1D:
evaluate_batch_splines_1d_many(spl, x(:), y(:, :)) - 2D:
evaluate_batch_splines_2d_many(spl, x(2, :), y(:, :)) - 3D:
evaluate_batch_splines_3d_many(spl, x(3, :), y(:, :))
For each dimension there is also a resident variant intended for use inside a downstream-managed device data region:
- 1D:
evaluate_batch_splines_1d_many_resident - 2D:
evaluate_batch_splines_2d_many_resident - 3D:
evaluate_batch_splines_3d_many_resident
The resident variants contain !$acc kernels and use present(...) so they can
run without implicit transfers when the caller has already placed the arrays on
device.
The default construct_batch_splines_{1,2,3}d(...) entry points remain host-safe and
produce coefficients that match the existing scalar spline construction used in the
test suite. When built with OpenACC enabled, these constructors also place the
coefficient arrays on the OpenACC device via !$acc enter data copyin(...) so the
subsequent many-point evaluation wrappers can offload without additional setup.
For build-on-device performance experiments and for downstream code that wants to keep construction on the GPU, the explicit device constructor entry points remain available:
construct_batch_splines_2d_resident_deviceconstruct_batch_splines_3d_resident_device
Downstream must keep the data resident across the whole accelerated algorithm:
- Enter a data region that places the coefficient arrays and inputs on the device.
- Call the resident routine inside the region.
- Only update back to host at the boundary where host-side code needs the results.
The same compiler and OpenACC runtime must be used for both libneo and downstream. For Fortran, this effectively means compiling both projects with the same compiler because module files are not ABI compatible across compilers.
- If a kernel uses
present(...), calling it without an active device data region will fail at runtime. - Prefer explicit resident variants and keep host-safe variants separate to avoid silent per-call transfers.
- Avoid implicit copies in hot loops: create persistent data regions in the downstream algorithm and keep arrays present across iterations.
OpenMP target offload can be competitive, but libgomp NVPTX may clamp launch geometry
based on kernel metadata and resource usage, which can reduce occupancy and hurt
performance versus OpenACC for the same math kernel. Control data lifetime with
explicit target enter data / target exit data and avoid implicit maps inside
tight loops.
See CODING_STANDARD.md for detailed coding standards.