DESIGN.md - libneo Architecture and Design

Overview

This document captures the architectural decisions, design patterns, and implementation strategies for the libneo Fortran library.

Performance Optimizations

Trampoline Elimination (Inner Subroutines)

Problem: Inner subroutines that access variables from their parent scope create trampolines, which cause:

Performance overhead from indirect function calls
Poor compiler optimization opportunities
Security issues on some systems (executable stack required)
Cache misses and instruction pipeline stalls

Solution: Refactor all inner subroutines to module-level procedures:

Move inner subroutines to module scope
Pass all required data explicitly as arguments
Use derived types to bundle related parameters when argument lists become long

Implementation Strategy:

Phase 1 - Discovery: Identify all inner subroutines accessing outer variables
Phase 2 - Tests: Ensure adequate tests exist before refactoring
Phase 3 - Refactoring: Move inner subroutines to module level
Phase 4 - Validation: Verify performance improvements and correctness

Identified Cases

poincare.f90

Location: integrate_RZ_along_fieldline contains inner subroutine fieldline_derivative
Issue: fieldline_derivative accesses field parameter from outer scope
Solution: Move to module level, pass field as argument or use context parameter
Tests: Exists in test/poincare/test_poincare.f90
Priority: HIGH - Used in field line integration (performance critical)

Implementation Tasks

Task 1: Refactor poincare.f90

Issue: #119 Approach:

The inner subroutine fieldline_derivative is already using the context pattern correctly
It receives field through the context parameter in the interface
However, it's still an inner subroutine which can cause compiler issues
Solution: Keep using context pattern but move subroutine to module level

Refactoring Steps:

Move fieldline_derivative from inside integrate_RZ_along_fieldline to module level
Rename to poincare_fieldline_derivative to avoid naming conflicts
Update the call to odeint_allroutines to use the module-level procedure
Verify tests still pass

Task 2: Codebase Audit (COMPLETED)

Issue: #119 Results:

Comprehensive scan completed: Only ONE instance found
Location: src/poincare.f90 - fieldline_derivative inner subroutine
No other inner subroutines with contains blocks in the codebase
Many files have multiple contains blocks but they are type definitions or module-level (both OK)

Task 3: Performance Benchmarking

Issue: #119 Approach:

Create benchmark for poincare integration before refactoring
Measure cache misses using perf tools
Compare performance after refactoring
Document improvements in issue

Architecture Patterns

Field Interface Pattern

The library uses an abstract field interface (field_t) with concrete implementations for different field types. This allows polymorphic behavior while maintaining performance.

Context Parameter Pattern

For ODE integration and similar callbacks, we use a context parameter pattern that allows passing arbitrary data to callback functions without global state.

Performance Guidelines

Memory Layout

Use column-major ordering for arrays (Fortran default)
Prefer Structure of Arrays (SoA) over Array of Structures (AoS)
Allocate large arrays as allocatable for heap storage

Function Design

Avoid inner subroutines accessing outer variables
Use pure and elemental where possible
Pass data explicitly rather than through module variables
Keep functions under 100 lines (target 50)

Testing Strategy

Testing Requirements

All refactored code must have tests before modification
Tests must verify correctness and key performance characteristics where applicable
Use the existing test framework in test/

Build System

Primary: CMake with Ninja
Secondary: fpm (Fortran Package Manager)
All changes must pass CI pipeline

GPU Offload (OpenACC / OpenMP target)

Goal

Support GPU-ready inner loops without forcing a specific backend on downstream codes, and without accidental device-host copies inside tight iteration loops.

Current production shape

The batch spline layer now includes many-point evaluation APIs alongside the existing single-point batch-over-quantities routines:

1D: evaluate_batch_splines_1d_many(spl, x(:), y(:, :))
2D: evaluate_batch_splines_2d_many(spl, x(2, :), y(:, :))
3D: evaluate_batch_splines_3d_many(spl, x(3, :), y(:, :))

For each dimension there is also a resident variant intended for use inside a downstream-managed device data region:

1D: evaluate_batch_splines_1d_many_resident
2D: evaluate_batch_splines_2d_many_resident
3D: evaluate_batch_splines_3d_many_resident

The resident variants contain !$acc kernels and use present(...) so they can run without implicit transfers when the caller has already placed the arrays on device.

Construction and data residency

The default construct_batch_splines_{1,2,3}d(...) entry points remain host-safe and produce coefficients that match the existing scalar spline construction used in the test suite. When built with OpenACC enabled, these constructors also place the coefficient arrays on the OpenACC device via !$acc enter data copyin(...) so the subsequent many-point evaluation wrappers can offload without additional setup.

For build-on-device performance experiments and for downstream code that wants to keep construction on the GPU, the explicit device constructor entry points remain available:

construct_batch_splines_2d_resident_device
construct_batch_splines_3d_resident_device

How downstream should use OpenACC with no copies

Downstream must keep the data resident across the whole accelerated algorithm:

Enter a data region that places the coefficient arrays and inputs on the device.
Call the resident routine inside the region.
Only update back to host at the boundary where host-side code needs the results.

The same compiler and OpenACC runtime must be used for both libneo and downstream. For Fortran, this effectively means compiling both projects with the same compiler because module files are not ABI compatible across compilers.

OpenACC directive pitfalls

If a kernel uses present(...), calling it without an active device data region will fail at runtime.
Prefer explicit resident variants and keep host-safe variants separate to avoid silent per-call transfers.
Avoid implicit copies in hot loops: create persistent data regions in the downstream algorithm and keep arrays present across iterations.

OpenMP target offload lessons

OpenMP target offload can be competitive, but libgomp NVPTX may clamp launch geometry based on kernel metadata and resource usage, which can reduce occupancy and hurt performance versus OpenACC for the same math kernel. Control data lifetime with explicit target enter data / target exit data and avoid implicit maps inside tight loops.

Code Standards

See CODING_STANDARD.md for detailed coding standards.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DESIGN.md - libneo Architecture and Design

Overview

Performance Optimizations

Trampoline Elimination (Inner Subroutines)

Identified Cases

poincare.f90

Implementation Tasks

Task 1: Refactor poincare.f90

Task 2: Codebase Audit (COMPLETED)

Task 3: Performance Benchmarking

Architecture Patterns

Field Interface Pattern

Context Parameter Pattern

Performance Guidelines

Memory Layout

Function Design

Testing Strategy

Testing Requirements

Build System

GPU Offload (OpenACC / OpenMP target)

Goal

Current production shape

Construction and data residency

How downstream should use OpenACC with no copies

OpenACC directive pitfalls

OpenMP target offload lessons

Code Standards

FilesExpand file tree

DESIGN.md

Latest commit

History

DESIGN.md

File metadata and controls

DESIGN.md - libneo Architecture and Design

Overview

Performance Optimizations

Trampoline Elimination (Inner Subroutines)

Identified Cases

poincare.f90

Implementation Tasks

Task 1: Refactor poincare.f90

Task 2: Codebase Audit (COMPLETED)

Task 3: Performance Benchmarking

Architecture Patterns

Field Interface Pattern

Context Parameter Pattern

Performance Guidelines

Memory Layout

Function Design

Testing Strategy

Testing Requirements

Build System

GPU Offload (OpenACC / OpenMP target)

Goal

Current production shape

Construction and data residency

How downstream should use OpenACC with no copies

OpenACC directive pitfalls

OpenMP target offload lessons

Code Standards