GitHub - iamjosephmj/Slim: Pure-Kotlin ARM64 NEON runtime for Android. Within 7% of hand-tuned JNI native code on 7 devices (Android 11–17, Pixel/Samsung/Oppo), 6.95× over scalar Kotlin. No per-call JNI, no NDK build, no separate .so.

   ███████╗██╗     ██╗███╗   ███╗
   ██╔════╝██║     ██║████╗ ████║
   ███████╗██║     ██║██╔████╔██║
   ╚════██║██║     ██║██║╚██╔╝██║
   ███████║███████╗██║██║ ╚═╝ ██║
   ╚══════╝╚══════╝╚═╝╚═╝     ╚═╝

Write ARM64 NEON code in Kotlin. Run it on Android. No JNI per call.

Installation · Quick start · Production readiness · Architecture · Cookbook

Important

Slim is a runtime SIMD compiler for Android that lets you write ARM64 NEON instructions inline in Kotlin and have them executed by the Android Runtime (ART) as if they were JIT-compiled Kotlin — no JNI hop, no separate .so, no NDK build, no scheduler in the way. The kernel runs at NEON-native throughput; the framing is plain Kotlin function calls.

Table of contents

🧬 What it looks like
⚡ Why
📦 Installation
🚀 Quick start
🧩 Core concepts
🍳 Examples
🛡️ Production readiness
📊 Performance
📱 Supported devices
🔬 How it works
📖 Background
⚠️ Caveats and limitations
📚 Documentation
🤝 Contributing
📄 License
🙏 Acknowledgments

🧬 What it looks like

val pixels = Floats(myFloatArray)

slim(pixels) {
    loadImm32(W4, java.lang.Float.floatToRawIntBits(0.5f))
    dup(V0, X4, S4)              // v0 = 0.5 × 4 (broadcast)
    loadImm32(W3, pixels.size)
    mov(X1, X0)

    val loop = bindLabel()
    ld1(V2, X1, S4)              // v2 = pixels[i..i+3]
    fmul(V2, V2, V0, S4)         // v2 *= 0.5
    st1(V2, X1, S4)
    add(X1, X1, 16)
    sub(W3, W3, 4)
    cbnz(W3, loop)
}

println(pixels[0])               // result

That's the whole API. Two functions — Slim.initialize(context) once at startup, then slim(data) { ... } anywhere. Inside the block, raw ARM64 NEON: registers, instructions, vector arrangements, condition codes. The runtime handles JIT memory, ART internals, and dispatch.

slim { ... }   →   encode NEON   →   memfd R/X   →   exec
   Kotlin DSL        ~5 µs         shared pages    native
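The kernel above hands `0.5f` to `loadImm32` as an integer because the immediate is a raw 32-bit pattern, not a Kotlin `Float`. As a minimal sketch in plain Kotlin (no Slim dependency; `floatImmediate` and `broadcast4` are illustrative names, not part of the library), this is the value that ends up in `W4` and what a four-lane broadcast of it represents:

```kotlin
// Plain Kotlin, no Slim dependency. Helper names are hypothetical.

// The IEEE-754 bit pattern a kernel would load into a W register.
fun floatImmediate(value: Float): Int = java.lang.Float.floatToRawIntBits(value)

// Models a broadcast across four 32-bit lanes: every lane holds the same bit pattern.
fun broadcast4(bits: Int): FloatArray =
    FloatArray(4) { java.lang.Float.intBitsToFloat(bits) }
```

For `0.5f` the pattern is `0x3F000000` (sign 0, biased exponent 126, zero mantissa).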

⚡ Why

If you've written SIMD on Android, you've used one of these:

Approach	Problem
JNI + NDK + `<arm_neon.h>`	Per-call JNI overhead (~100 ns), C++ build pipeline, separate `.so` per ABI, no runtime codegen.
RenderScript	Deprecated since API 31. Compute kernels only, opaque scheduler.
Vulkan compute	Powerful but verbose. ~200 lines of boilerplate for a SAXPY. Driver overhead on small kernels.
Pure Kotlin/Java	ART's JIT tries hard, but its auto-vectorization covers only simple loops. 5–10× slower than NEON for tight loops.

Slim sits in a gap. You write NEON instructions in Kotlin, the runtime JIT-compiles them into native code, and ART dispatches the kernel via a hijacked entry-point — no JNI, no separate build artifact, no scheduler in the way.

Note

Measured on Samsung S24 (Android 16, Cortex-X4). SAXPY-style brightness kernel over a 16 MB float buffer.

Path	Time	Throughput	Speedup
Hot-path Kotlin scalar (JIT-compiled)	5.32 ms	3.0 GB/s	1.0×
Slim with `FloatArray` (eager copy)	2.22 ms	7.2 GB/s	2.4×
Slim with `Floats` (zero-copy)	0.76 ms	23.4 GB/s	6.95×

Concurrency: 200 dispatches across 4 coroutines complete in 67 ms with zero races. The probe pool serves up to 8 in-flight kernels.

Tip

Slim ships with a built-in disassembler that decodes compiled kernels back to canonical ARM64 assembly with resolved label names and Kotlin file:line annotations — see Disassembler & debug view.

📦 Installation

Slim ships via JitPack — no Maven Central account required, no signing keys, builds straight from GitHub tags.

// settings.gradle.kts
dependencyResolutionManagement {
    repositories {
        mavenCentral()
        maven(url = "https://jitpack.io")
    }
}

// app/build.gradle.kts
android {
    defaultConfig {
        minSdk = 26
        ndk { abiFilters += "arm64-v8a" }
    }
}

dependencies {
    implementation("com.github.iamjosephmj:Slim:0.1.2")
}

Warning

Status: 0.1.2 — V1 internal release. Public API shape is stable (the Slim / slim {} surface won't change incompatibly), but the underlying engine is still validating against new Android releases.

🚀 Quick start

import android.app.Application
import androidx.lifecycle.ViewModel
import io.simdkt.slim.Slim
import io.simdkt.slim.slim
import io.simdkt.slim.Floats

class MyApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        Slim.initialize(this)             // once
    }
}

class MyViewModel : ViewModel() {
    suspend fun darken(input: FloatArray): FloatArray {
        val pixels = Floats(input)         // wrap once for zero-copy
        slim(pixels) {
            loadImm32(W3, pixels.size)
            mov(X1, X0)
            val loop = bindLabel()
            ld1(V0, X1, S4)
            fmul(V0, V0, V0, S4)           // square each lane
            st1(V0, X1, S4)
            add(X1, X1, 16)
            sub(W3, W3, 4)
            cbnz(W3, loop)
        }
        return pixels.toFloatArray()
    }
}
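The loop in `darken` subtracts 4 from `W3` on every iteration and exits only when `W3` reaches exactly zero, so it implicitly assumes the element count is a multiple of the lane count (4 for `S4`). A minimal sketch of one way to guarantee that before wrapping the data, in plain Kotlin; `roundUpToLanes` and `padToLanes` are hypothetical helpers, not part of Slim:

```kotlin
// Hypothetical helpers (not part of Slim): pad a buffer so a 4-lane loop terminates exactly.

// Rounds `count` up to the next multiple of `lanes`.
fun roundUpToLanes(count: Int, lanes: Int = 4): Int {
    require(count >= 0 && lanes > 0) { "count must be >= 0 and lanes > 0" }
    return (count + lanes - 1) / lanes * lanes
}

// Copies `input` into an array whose length is a multiple of `lanes`; the tail is zero-filled.
fun padToLanes(input: FloatArray, lanes: Int = 4): FloatArray =
    input.copyOf(roundUpToLanes(input.size, lanes))
```

After the kernel runs on the padded data, only the first `input.size` elements carry meaningful results.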

🧩 Core concepts

`Slim.initialize(context)`

One-time runtime setup. Call it from Application.onCreate, or at the latest before the first slim {} call. Idempotent. Returns false on devices where the runtime can't bring up a working dispatch path; Slim.lastError holds the diagnostic. Always check the return value and have a scalar fallback ready (see Production readiness).

`slim(data) { ... }`

The kernel entry point, a suspend function. The body is the kernel: one ARM64 NEON instruction per Kotlin statement. The runtime auto-injects a prologue (sets x0 to the data buffer's native address) and an epilogue (ret), so your code is pure NEON.

Accepts data in several shapes:

Type	Cost per call	Use when
`FloatArray` / `IntArray` / `ByteArray`	2 heap↔native copies (~2 ms / 16 MB)	One-shot kernels. Convenient.
`Floats` / `Ints` / `Bytes`	Zero copy	Hot paths, repeated calls.
`ByteBuffer` (direct)	Zero copy	Already managing your own native buffer.
`Long` (raw native pointer)	Zero copy	JNI / `Unsafe` / `mmap` callers.

`Floats` / `Ints` / `Bytes`

Direct-buffer-backed array substitutes. Look like Kotlin arrays (data[i], data[i] = x, fill { ... }); pass to slim {} zero-copy.

// Three ways to construct (distinct names so the snippet compiles as one scope)
val zeroed = Floats(width * height * 4)                  // zero-filled
val ramp = Floats(width * height * 4) { it.toFloat() }   // generator-filled
val pixels = Floats(myFloatArray)                        // copy from heap (one-time)

pixels[0] = 1.0f
val out: FloatArray = pixels.toFloatArray()
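To make "direct-buffer-backed" concrete, here is a minimal sketch of such a wrapper built on `java.nio`. `DirectFloats` is a hypothetical stand-in written for illustration; it is not Slim's `Floats` class, and the real one may differ.

```kotlin
import java.nio.ByteBuffer
import java.nio.ByteOrder
import java.nio.FloatBuffer

// Hypothetical stand-in for a direct-buffer-backed float array; not Slim's actual Floats class.
class DirectFloats(val size: Int) {
    // Off-heap storage in native byte order, so native code would see plain 32-bit floats.
    private val buffer: FloatBuffer = ByteBuffer
        .allocateDirect(size * Float.SIZE_BYTES)
        .order(ByteOrder.nativeOrder())
        .asFloatBuffer()

    constructor(source: FloatArray) : this(source.size) {
        buffer.put(source)   // one-time heap -> native copy
        buffer.clear()       // resets position and limit; contents are kept
    }

    operator fun get(index: Int): Float = buffer.get(index)

    operator fun set(index: Int, value: Float) {
        buffer.put(index, value)
    }

    fun toFloatArray(): FloatArray {
        val out = FloatArray(size)
        buffer.duplicate().get(out)   // duplicate() leaves the shared position untouched
        return out
    }
}
```

Because the storage lives outside the Java heap, its address stays fixed for the lifetime of the buffer, which is what lets a native kernel work on it in place without a copy.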

Inside `slim {}`

Every ARM64 register, vector arrangement, condition code, and instruction helper is in scope:

slim(data) {
    // 98 registers in scope: X0..X30, W0..W30, V0..V31, XZR, SP, WZR, WSP
    mov(X1, X0)
    movz(W3, 1024)

    // 8 vector arrangements: B8/B16/H4/H8/S2/S4/D1/D2
    ld1(V0, X1, S4)
    fmla(V1, V0, V2, S4)

    // 16 condition codes: EQ, NE, CS, CC, MI, PL, VS, VC, HI, LS, GE, LT, GT, LE, AL, NV
    csel(X4, X5, X6, GT)

    // Labels for branches (forward and backward)
    val loop = bindLabel()
    sub(W3, W3, 1)
    cbnz(W3, loop)
}

For instructions not yet bound (specialized SVE, crypto, etc.), use raw(opcode) with the underlying encoder helper:

slim(data) {
    raw(io.simdkt.nativekt.engine.Arm64.someExoticInstruction(X0, X1))
}
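An A64 instruction is a single 32-bit word assembled from fixed bit-fields, which is what an encoder helper ultimately produces. The sketch below builds three words by hand; the function names are hypothetical and are not Slim's `Arm64` helpers, and treating the opcode as a Kotlin `Int` is an assumption. The three results match the hex words shown for `mov`, `add`, and `sub` in the disassembler output in this README.

```kotlin
// Hypothetical encoders for illustration; Slim's own Arm64 helpers may differ.

// MOV Xd, Xm is an alias of ORR Xd, XZR, Xm (64-bit, shifted register, no shift).
// Register number 31 encodes XZR in this form.
fun encodeMovX(rd: Int, rm: Int): Int =
    0xAA000000.toInt() or (rm shl 16) or (31 shl 5) or rd

// ADD Xd, Xn, #imm12 (64-bit, no shift).
fun encodeAddImmX(rd: Int, rn: Int, imm12: Int): Int {
    require(imm12 in 0..4095) { "imm12 out of range" }
    return 0x91000000.toInt() or (imm12 shl 10) or (rn shl 5) or rd
}

// SUB Wd, Wn, #imm12 (32-bit, no shift).
fun encodeSubImmW(rd: Int, rn: Int, imm12: Int): Int {
    require(imm12 in 0..4095) { "imm12 out of range" }
    return 0x51000000 or (imm12 shl 10) or (rn shl 5) or rd
}

// Formats a word the way a disassembly listing shows it.
fun hex(word: Int): String = "%08x".format(word)
```

For example, `hex(encodeMovX(1, 0))` gives `aa0003e1`, the word for `mov x1, x0`.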

Disassembler & debug view

When a kernel produces unexpected output, inspect what actually got compiled. Slim ships with a full ARM64 disassembler that decodes the emitted bytes back to canonical assembly and — with Slim.debug = true — annotates every instruction with the originating Kotlin source line.

Slim.debug = true                         // opt-in source-line capture

val asm: String = Slim.preview {
    mov(X1, X0)
    val loop = bindLabel("loop")          // named labels resolve in branch operands
    ld1(V0, X1, S4)
    fmul(V0, V0, V0, S4)
    st1(V0, X1, S4)
    add(X1, X1, 16)
    sub(W3, W3, 4)
    cbnz(W3, loop)
}

println(asm)

Output:

  0000  aa0003e1  mov    x1, x0               // MyKernel.kt:42
loop:
  0004  4c407820  ld1    {v0.4s}, [x1]        // MyKernel.kt:44
  0008  6e20dc00  fmul   v0.4s, v0.4s, v0.4s  // MyKernel.kt:45
  000c  4c007820  st1    {v0.4s}, [x1]        // MyKernel.kt:46
  0010  91004021  add    x1, x1, #0x10        // MyKernel.kt:47
  0014  51001063  sub    w3, w3, #4           // MyKernel.kt:48
  0018  35ffff63  cbnz   w3, loop             // MyKernel.kt:49

Each line shows: byte offset, hex opcode, mnemonic, operands, and the originating file:line. Forward and backward branch targets are resolved to label names when bound with bindLabel("name"); anonymous labels render as L0, L1, … For an already-compiled kernel, call disassemble() on the handle directly:

val handle = compileMyKernel(...)
println(handle.disassemble())

What the disassembler covers:

All 207 encoder helpers — branches, data-processing (immediate and register), GP and SIMD load/store, NEON FP and integer, system / hint.
Canonical alias rewriting — emits mov xN, xM instead of orr xN, xzr, xM, cmp instead of subs xzr, tst instead of ands xzr, mul / mneg instead of madd / msub with xzr, lsl / lsr / asr immediate instead of ubfm / sbfm, cset / csetm, sign- and zero-extends. Output matches llvm-objdump defaults.
Resolved label names in branch operands (cbnz w3, loop instead of cbnz w3, .-20).
Source file:line annotation when Slim.debug == true. Off by default — overhead is ~1–3 µs per emitted instruction (stack-walk to identify the user frame), so leave it off in production.

Correctness guarantees:

150+ paired golden-byte tests — every encoder assertEnc has a paired assertDec, cross-validated against clang+llvm-objdump.
14 property-based round-trip tests (random valid inputs per family, encode → decode → assert).
1000-opcode negative test — random 32-bit ints never throw; unknown encodings return Operand.Unknown and a ? mnemonic.
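The encode → decode → assert pattern can be sketched for a single instruction family. The code below is a hypothetical mini encoder/decoder for 64-bit ADD/SUB immediate (no shift), written for illustration only; it is not Slim's test suite, and names such as `AddSubImm` are assumptions.

```kotlin
import kotlin.random.Random

// Hypothetical model of one instruction family: 64-bit ADD/SUB immediate, no shift.
data class AddSubImm(val isSub: Boolean, val rd: Int, val rn: Int, val imm12: Int)

fun encodeAddSub(i: AddSubImm): Int {
    val base = if (i.isSub) 0xD1000000.toInt() else 0x91000000.toInt()
    return base or (i.imm12 shl 10) or (i.rn shl 5) or i.rd
}

// Returns null for any word outside this family instead of throwing.
fun decodeAddSub(word: Int): AddSubImm? {
    // Keep only the fixed bits: sf, S, the 100010 class bits, and sh (bit 30 selects ADD vs SUB).
    val fixedBits = word and 0xBFC00000.toInt()
    if (fixedBits != 0x91000000.toInt()) return null
    return AddSubImm(
        isSub = ((word ushr 30) and 1) == 1,
        rd = word and 0x1F,
        rn = (word ushr 5) and 0x1F,
        imm12 = (word ushr 10) and 0xFFF
    )
}

// Property: every valid instruction survives encode -> decode unchanged.
fun roundTripHolds(samples: Int, rng: Random = Random(42)): Boolean =
    (0 until samples).all {
        val insn = AddSubImm(rng.nextBoolean(), rng.nextInt(32), rng.nextInt(32), rng.nextInt(4096))
        decodeAddSub(encodeAddSub(insn)) == insn
    }

// Negative property: arbitrary 32-bit words never make the decoder throw.
fun decodeNeverThrows(samples: Int, rng: Random = Random(7)): Boolean =
    (0 until samples).all { runCatching { decodeAddSub(rng.nextInt()) }.isSuccess }
```

Seeding the generators keeps any failure reproducible, which matters more for this kind of test than the exact sample count.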

See the Cookbook → Debugging your kernel section for the worked example, and the rest of the cookbook for the full integration story (closure capture, suspend composition, reactive pipelines, value extraction).

🍳 Examples

Brightness adjustment (SAXPY: y = a·x + b)

suspend fun brighten(pixels: Floats, a: Float, b: Float) {
    val aBits = java.lang.Float.floatToRawIntBits(a)
    val bBits = java.lang.Float.floatToRawIntBits(b)
    slim(pixels) {
        loadImm32(W4, aBits)
        dup(V0, X4, S4)              // v0 = a × 4
        loadImm32(W4, bBits)
        dup(V1, X4, S4)              // v1 = b × 4
        loadImm32(W3, pixels.size)
        mov(X1, X0)

        val loop = bindLabel()
        ld1(V2, X1, S4)
        fmul(V2, V2, V0, S4)         // v2 *= a
        fadd(V2, V2, V1, S4)         // v2 += b
        st1(V2, X1, S4)
        add(X1, X1, 16)
        sub(W3, W3, 4)
        cbnz(W3, loop)
    }
}

Color invert (uint8 RGBA, byte lanes)

suspend fun invertRgb(pixels: Bytes) {
    slim(pixels) {
        loadImm32(W3, pixels.size)
        mov(X1, X0)
        // Load 0xFF into every lane via dup of imm
        movz(W4, 0xFF)
        dup(V1, X4, B16)             // v1 = 0xFF × 16

        val loop = bindLabel()
        ld1(V0, X1, B16)             // 16 bytes
        sub(V0, V1, V0, B16)         // v0 = 255 - v0 (vector sub)
        st1(V0, X1, B16)
        add(X1, X1, 16)
        sub(W3, W3, 16)
        cbnz(W3, loop)
    }
}

Concurrent dispatch from coroutines

suspend fun processFrames(frames: List<Floats>) = coroutineScope {
    frames.map { frame ->
        async(Dispatchers.Default) {
            slim(frame) {
                // kernel — runs on the coroutine's worker thread
            }
        }
    }.awaitAll()
}

The probe pool serves up to 8 concurrent dispatches; beyond that, threads block on slot acquisition.
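"Up to 8 concurrent dispatches, then block" is the behaviour of a counting semaphore. The sketch below models that with `java.util.concurrent.Semaphore` and plain threads; `SlotPool` and `measurePeak` are hypothetical names, and Slim's probe pool may well be built differently.

```kotlin
import java.util.concurrent.Semaphore
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Hypothetical model of a fixed-size slot pool; Slim's probe pool may be built differently.
class SlotPool(slots: Int = 8) {
    private val permits = Semaphore(slots)
    private val inFlight = AtomicInteger(0)
    val peakInFlight = AtomicInteger(0)

    // Blocks the calling thread until a slot is free, runs `work`, then releases the slot.
    fun <T> withSlot(work: () -> T): T {
        permits.acquire()
        try {
            val now = inFlight.incrementAndGet()
            peakInFlight.accumulateAndGet(now) { a, b -> maxOf(a, b) }
            return work()
        } finally {
            inFlight.decrementAndGet()
            permits.release()
        }
    }
}

// Runs `callers` threads against the pool and reports the highest concurrency observed.
fun measurePeak(callers: Int, slots: Int = 8): Int {
    val pool = SlotPool(slots)
    val threads = List(callers) { thread { pool.withSlot { Thread.sleep(5) } } }
    threads.forEach { it.join() }
    return pool.peakInFlight.get()
}
```

With 32 callers and 8 slots, the observed peak never exceeds 8; the remaining callers wait in `acquire()` until a slot is released.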

More recipes

The Cookbook is a long read explaining the SIMD↔Kotlin integration model — closure capture, encode-time evaluation, suspend composition, value extraction patterns, reactive Flow pipelines, conditional dispatch — alongside worked recipes for SAXPY, dot product, color filters, threshold, blur, and more.

🛡️ Production readiness

Slim takes liberties with the runtime to deliver native-throughput SIMD without JNI. The engineering bet is that the runtime is allowed to refuse, and you handle that. Three things make this safe to ship:

Kill-switch via `Slim.initialize()`

initialize() returns false on devices where the dispatch path can't be brought up. Slim.lastError reports which step gave up. Wire a scalar fallback in one line:

class MyApplication : Application() {
    override fun onCreate() {
        super.onCreate()
        val ready = Slim.initialize(this)
        FastPath.brighten = if (ready) ::slimBrighten else ::scalarBrighten
        if (!ready) Log.i("Slim", "fell back: ${Slim.lastError}")
    }
}

No exception, no crash — just a clean signal to use your fallback path. You get NEON throughput where it works and JIT'd Kotlin where it doesn't.
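The snippet above refers to `FastPath`, `slimBrighten`, and `scalarBrighten` without defining them. One possible shape for the holder and the scalar side is sketched below. These definitions are assumptions made for illustration: the function type is simplified to a non-suspending call on `FloatArray`, whereas a Slim-backed implementation would take `Floats` and be a suspend function.

```kotlin
// Hypothetical definitions for the names used above; the README does not show them.

// Scalar fallback: the same y = a*x + b as the NEON brightness kernel, applied in place.
fun scalarBrighten(pixels: FloatArray, a: Float, b: Float) {
    for (i in pixels.indices) {
        pixels[i] = pixels[i] * a + b
    }
}

// Holds whichever implementation was selected at startup; defaults to the scalar path.
object FastPath {
    @Volatile
    var brighten: (FloatArray, Float, Float) -> Unit = ::scalarBrighten
}
```

Call sites then use `FastPath.brighten(pixels, a, b)` and never need to know which path was chosen.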

Four-tier graceful bypass cascade

On API 28+, ART blocks reflective access to internals. Slim attempts four progressively-narrower techniques, falling through tier by tier until one succeeds. The first tier that survives wins; offsets are cached at <cacheDir>/nk_policy.bin so subsequent app launches skip discovery entirely (~3 ms cold start with a warm cache vs ~10 ms uncached).

Tier	Mechanism	Used on
1	`setHiddenApiExemptions` reflection	API 28–29 stock
2	Meta-reflection via `ClassLoader` chain	API 30–32
3	`Os.mmap` direct ELF parse	API 33–35
4	`art::Runtime::instance_` ELF lookup + memory probe	API 36+, novel ROMs

Confirmed across AOSP-derived Android 8–17 (Pixel, Samsung One UI, Oppo ColorOS). If all four fail, initialize() returns false cleanly.
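The control flow of such a cascade (try tiers in order, first success wins, remember the winner for next launch, report the last failure otherwise) can be sketched independently of any bypass technique. Everything below is hypothetical: the tier bodies are stubs supplied by the caller, and `Tier`, `CascadeResult`, and `runCascade` are illustrative names, not Slim internals.

```kotlin
// Hypothetical model of a fall-through cascade; tier bodies are caller-supplied stubs,
// not real hidden-API bypass code.
class Tier(val name: String, val attempt: () -> Boolean)

class CascadeResult(val winner: String?, val lastError: String?)

fun runCascade(tiers: List<Tier>, cachedWinner: String? = null): CascadeResult {
    // A winner cached by an earlier launch is tried first; the stable sort keeps the rest in order.
    val ordered = tiers.sortedByDescending { it.name == cachedWinner }
    var lastError: String? = null
    for (tier in ordered) {
        // A tier that returns false or throws is skipped, and its failure is recorded.
        val failure = try {
            if (tier.attempt()) return CascadeResult(tier.name, null)
            "returned false"
        } catch (e: Exception) {
            e.message ?: e.javaClass.simpleName
        }
        lastError = "${tier.name}: $failure"
    }
    return CascadeResult(null, lastError)
}
```

A `null` winner corresponds to the case where initialization reports failure and the caller switches to its scalar fallback.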

Anti-tamper compatibility

SDK / runtime check	Likely outcome
Google Play Integrity API	✅ No interaction (no native lib, no DEX modification)
Stock ROMs (no anti-tamper)	✅ Confirmed compatible

If you ship with a hardening / RASP SDK, run a smoke test on your build pipeline before shipping — the reflection Slim performs is the kind such SDKs are designed to flag. The kill-switch above means worst case is a quiet fallback to scalar code — not a crash.

What "production" means here

Slim 0.1.2 is shipping inside internal apps. The dispatch mechanism has been validated across the supported API range. The encoder ships with 150+ paired golden-byte tests, 14 property-based round-trip tests, and 1000-opcode negative tests (random bytes never throw). The remaining risks are vendor-ROM novelty (mitigated by the cascade + kill-switch) and anti-tamper interaction (above).

Tip

For any code path where Slim might be invoked, gate it on Slim.initialize()'s return value. If the dispatch path fails on a user's device, you want a scalar fallback, not a 1-star review.

📊 Performance

Slim sits between scalar Kotlin (the floor — JIT-compiled, no SIMD) and hand-tuned native C++ with NEON intrinsics (the ceiling — what you'd otherwise ship in a .so). Two numbers worth knowing: how close to the native ceiling, and how much over the scalar floor.

vs hand-tuned native NEON — multi-device

How close does Slim's runtime-emitted code get to hand-written native NEON compiled with clang -O3? The :bench module answers fair-and-square:

Same algorithm. Both backends run an 8-stage fused NEON pipeline (invert → contrast → brighten(40) → darken(20), repeated). All 8 stages execute back-to-back in NEON registers per 16-byte chunk, so each byte is loaded once and stored once for the whole pipeline.
Same instruction stream. Slim's kernel is emitted at runtime from a Kotlin DSL; JNI's is written by hand using arm_neon.h and compiled with clang -O3 -march=armv8-a+simd. Algorithmically identical.
Same memory pattern, same thread, same input. Both backends operate on the same direct ByteBuffer, on the bench thread, with no fan-out or thread pool. A correctness gate enforces byte-identical output before any timing happens.
Only variable: dispatch mechanism. JNI uses a registered native trampoline. Slim uses the ART entry-point hijack described in How it works.

Ran on 7 real devices via a cloud test farm:

[Screenshots: Pixel 10 Pro XL (Android 17), Galaxy A54 5G (Android 16), Oppo Reno13 F (Android 15), Galaxy A23 5G (Android 14), Galaxy Note20 (Android 13), Galaxy S20 FE 2022 (Android 12), Oppo A94 5G (Android 11)]

Device	Android	Dispatch baseline (JNI / Slim)	1080p (2 MB)	4K (8 MB)
Pixel 10 Pro XL	17	0.73 µs / 10.5 µs	11% slower	6% slower
Galaxy A54 5G	16	1.15 µs / 16.7 µs	13% slower	7% slower
Oppo Reno13 F	15	— / 20.8 µs	12% slower	TIE
Galaxy A23 5G	14	1.41 µs / 19.2 µs	9% slower	TIE
Galaxy Note20	13	1.42 µs / 26.4 µs	TIE	TIE
Galaxy S20 FE 2022	12	— / 15.7 µs	13% slower	TIE
Oppo A94 5G	11	1.54 µs / 20.2 µs	7% slower	TIE

TIE = Slim within 5% of JNI on that cell. Across 7 devices, three vendors (Google, Samsung, Oppo), and seven Android versions (11 → 17), Slim never loses by more than 13% at 1080p, and matches JNI on 5 of 7 devices at 4K.

The "dispatch baseline" column is an empty-kernel call (placeholder prologue + ret on the Slim side, no-op native function on the JNI side). It measures pure call overhead. JNI is structurally faster there — sub-2 µs vs 10–26 µs — and that gap is real and stable. At production workload sizes, the gap amortizes into the noise.

Run the bench yourself

./gradlew :bench:installDebug
adb shell am start -n com.example.slim.bench/.BenchActivity

Tap Run. The on-device UI shows a dispatch-baseline card and one card per image size with median, p95, p99, and throughput. Copy CSV exports the raw rows to your clipboard.

vs scalar Kotlin

The reason a SIMD runtime exists at all. On a Samsung S24 (Cortex-X4, Android 16), a 1024×1024 RGBA-as-float kernel applying y = 0.5·x:

xychart-beta
    title "Throughput (GB/s) — higher is better"
    x-axis ["Kotlin scalar", "Slim (FloatArray)", "Slim (Floats, zero-copy)"]
    y-axis "GB/s" 0 --> 25
    bar [3.0, 7.2, 23.4]

Path	Time (ms)	Throughput	Notes
Kotlin scalar	5.32	3.0 GB/s	Hot-path JIT'd, best of 10
Slim with `FloatArray` (eager copy)	2.22	7.2 GB/s	Includes 2× heap↔native copy
Slim with `Floats` (zero-copy)	0.76	23.4 GB/s	6.95× over Kotlin

Operational characteristics

Cold start	Per-call overhead	Concurrent dispatch
~3 ms with warm caches; ~10 ms uncached	~3 µs (probe-slot + EP patch + invoke)	~3 K calls/sec (4 coroutines × 50 calls = 200 dispatches in 67 ms)

Tip

Probe pool serves up to 8 in-flight kernels before blocking. Different kernels run in parallel; same-kernel calls serialize via a per-handle Mutex.

📱 Supported devices


API	26+ (Android 8.0 and up)
ABI	`arm64-v8a` only
Confirmed on-device	AOSP-derived Android 8–17 across Pixel, Samsung One UI, and Oppo ColorOS — see the JNI-parity bench for a 7-device sweep. The bypass cascade gracefully falls through technique-by-technique on novel ROMs; if all four fail, `Slim.initialize` returns `false` and `lastError` reports which step gave up.

The runtime requires:

A library libart.so whose .dynsym exports art::Runtime::instance_ (universally true on AOSP-derived ART since API 28).
ART's "quick" dispatch path with entry_point_from_quick_compiled_code_ in ArtMethod at offset 0x18 (or 0x10/0x20/0x28/0x30/0x08 — the probe walks them).
memfd_create syscall (Linux 3.17+, present on all supported APIs).

🔬 How it works

Slim sits on top of three pieces of ART internals plumbing — setup once, encode per kernel, dispatch per call:

flowchart TB
    subgraph Setup["Slim.initialize() — once per process"]
        H1["Hidden-API bypass<br/>(4-tier cascade)"]
        H2["Locate ArtMethod offsets<br/>(probe entry_point_ field)"]
        H1 --> H2
    end

    subgraph Encode["slim &#123; ... &#125; — per kernel"]
        E1["Encode NEON instructions<br/>(two-pass label fixup)"]
        E2["Write bytes → memfd R/W"]
        E3["mmap memfd R/X<br/>(shared physical pages, no flush)"]
        E1 --> E2 --> E3
    end

    subgraph Dispatch["Per call — no JNI"]
        D1["Patch ArtMethod.entry_point_<br/>→ R/X page address"]
        D2["ART quick-dispatch jumps<br/>into shellcode"]
        D3["NEON kernel runs at native speed"]
        D4["ret → restore entry_point_"]
        D1 --> D2 --> D3 --> D4
    end

    Setup ==> Encode
    Encode ==> Dispatch

The dispatch path itself, traced as a sequence:

sequenceDiagram
    autonumber
    participant K as Kotlin call site
    participant ART as ART runtime
    participant AM as ArtMethod
    participant RX as memfd R/X page

    Note over K,RX: Slim.initialize() done once — bypass passed, offsets cached

    K->>+ART: slim(data) { ... }  (suspend)
    ART->>AM: peek entry_point_from_quick_compiled_code_
    Note over AM: original pointer saved
    ART->>AM: poke entry_point_ → R/X page
    ART->>+RX: jump (zero JNI hop)
    Note over RX: NEON kernel runs at native speed
    RX-->>-ART: ret
    ART->>AM: restore entry_point_
    ART-->>-K: resume coroutine

1. memfd dual-map JIT memory — A memfd is mapped twice: once R/W (for writing instruction bytes) and once R/X (for execution). The pages share physical memory; allocating the R/X mapping after the R/W writes complete dodges I-cache staleness without an explicit flush. This is the "JIT executor" everyone reinvents on Android.

2. ART entry-point hijack dispatch — Every Java/Kotlin method has an ArtMethod struct in the runtime; the field entry_point_from_quick_compiled_code_ is a function pointer that ART's "quick" dispatch path jumps through. Slim overwrites that pointer with the address of your shellcode, calls the corresponding Method reflectively (which jumps directly into the JIT'd code via ART's normal dispatch), then restores the pointer. Zero JNI on the dispatch path. The patch/unpatch is ~200 ns of Unsafe.peekLong / pokeLong calls.

3. Hidden-API bypass — On API 28+, ART blocks reflective access to libcore.io.Os.mmap, ArtMethod fields, and setHiddenApiExemptions. Slim defeats this with the four-tier cascade described in Production readiness. The last tier — used on API 36 — locates the art::Runtime singleton by ELF-parsing libart.so for art::Runtime::instance_, then probes the Runtime's memory for the hidden_api_policy_ field and writes kDisabled. The discovered offset is cached at <cacheDir>/nk_policy.bin.

For the full architectural walkthrough — including how the encoder's two-pass label fixup works, how the kernel cache is keyed, and the concurrency model — see docs/ARCHITECTURE.md.
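As a rough picture of what a two-pass label fixup involves, the sketch below assembles a tiny program containing one forward and one backward branch. Pass 1 assigns a word index to each label; pass 2 emits the words and turns each branch target into a signed 19-bit word offset. This is a hypothetical illustration limited to `cbz`/`cbnz` on W registers; the types and names are assumptions and Slim's encoder may work differently.

```kotlin
// Hypothetical two-pass assembler for CBZ/CBNZ on W registers; Slim's encoder may differ.
sealed interface Item
data class Word(val bits: Int) : Item        // an already-encoded instruction
data class Label(val name: String) : Item    // marks a position, emits nothing
data class CondBranch(val nonZero: Boolean, val rt: Int, val target: String) : Item

fun assemble(items: List<Item>): IntArray {
    // Pass 1: assign a word index to every label.
    val labelIndex = HashMap<String, Int>()
    var count = 0
    for (item in items) {
        if (item is Label) labelIndex[item.name] = count else count++
    }
    // Pass 2: emit words, resolving branch targets to signed 19-bit word offsets.
    val out = IntArray(count)
    var pc = 0
    for (item in items) {
        when (item) {
            is Label -> {}
            is Word -> out[pc++] = item.bits
            is CondBranch -> {
                val target = labelIndex[item.target] ?: error("unbound label ${item.target}")
                val delta = target - pc
                require(delta in -(1 shl 18) until (1 shl 18)) { "branch out of range" }
                val base = if (item.nonZero) 0x35000000 else 0x34000000   // CBNZ / CBZ, 32-bit
                out[pc++] = base or ((delta and 0x7FFFF) shl 5) or item.rt
            }
        }
    }
    return out
}
```

The forward branch is the reason a single pass is not enough: when `cbz w3, end` is reached, the position of `end` is not yet known.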

📖 Background

This is the longer story behind Slim — the boundaries it crosses, why they're crossable, and what it felt like figuring that out. The SDK quick-start is up top; everything below is for systems-curious readers.

This started as something I bumped into while reading about userspace boundaries on Android — the invisible lines the OS and the runtime draw inside your own process. The JNI boundary between Kotlin and native code. The W^X boundary that says no page is both writable and executable. The hidden-API boundary that locks you out of ART's internals starting on API 28.

What grabbed me is that most of these boundaries are convention, not silicon — they're enforced by checks running in the same process you are. ART, for instance, links every Kotlin method to a function pointer (entry_point_from_quick_compiled_code_) that the dispatcher reads on every call. If you can flip that pointer to a page of your own ARM64 machine code, the runtime jumps into your code instead of the JIT-compiled body — no JNI hop, no separate .so, no NDK build. Your kernel returns, the runtime keeps going like nothing happened.

Slim is a working answer to "what if the boundary between Kotlin and native code is just a writable pointer?", packaged as a small SDK so I could reuse the trick for tight SIMD kernels without paying the NDK's setup cost on every project.

Tip

If you came for the architecture story, keep going to How it works and the Architecture doc, which walks every line we cross: memfd dual-map (W^X), entry-point hijack (managed/native), four-tier hidden-API bypass.

⚠️ Caveats and limitations

Note

Most "is this safe to ship?" questions are answered in Production readiness. The items below are scope and feature limits, not safety concerns.

Single-writer per KernelHandle

The high-level slim {} API serializes calls on the same compiled kernel via a per-handle Mutex: different kernels run in parallel, calls to the same kernel do not. Giving each worker its own data buffer does not change this, because the cache returns the same kernel handle and each worker waits for the mutex on its turn (buffer-parallel, kernel-serial).

No kernel preemption

Once dispatched, a kernel runs to its ret. Coroutine cancellation only takes effect when control returns.

arm64 only

ARMv7 (armeabi-v7a) is not supported. The encoder is AArch64-specific; ARMv7 would be a parallel effort (~3,400 new lines). Most current Android phones are arm64; Wear / TV / IoT may not be.

No SVE / SME / crypto instructions in the encoder

ARMv8.2-A scope (FP16, dot product, saturating arithmetic) is covered; SVE2 is a V3-class addition.

No compile-time codegen

Kernel encoding happens at runtime (~5 µs per slim {} body). For sub-µs hot paths, a Kotlin compiler plugin that pre-encodes slim {} blocks at build time is on the roadmap.

📚 Documentation

Doc	What's in it
`README.md`	This file — overview + quick start.
Guide (live)	"Write your first NEON kernel" — line-by-line walkthrough, disassembly, runnable benchmark. Start here if you're new to NEON.
Cookbook (live)	The integration model — closure capture, suspend composition, value extraction, reactive `Flow` pipelines — plus worked recipes (SAXPY, dot product, color filters, blur, threshold, debugging). Long read.
`docs/ARCHITECTURE.md`	How the runtime works internally: memfd dual-map, EP hijack, hidden-API bypass, encoder, label assembler, kernel cache.
`docs/CONTRIBUTING.md`	Adding encoder helpers, the testing pattern, ART-internals work.

🤝 Contributing

PRs welcome. The most common contributions:

New encoder helpers — adding to the ARM64 instruction coverage. See docs/CONTRIBUTING.md for the golden-byte test pattern.
New slim {} recipes — interesting NEON kernels for the cookbook.
Per-vendor bypass tweaks — if the four-tier cascade fails on your device, the logcat from Slim.initialize tells us which tier; PRs with new fallback paths or per-vendor fixes are great.

For larger work (encoder restructuring, V3 compile-time plugin), open an issue first to discuss design.

📄 License

Apache 2.0. See LICENSE.

🙏 Acknowledgments

The hidden-API bypass cascade builds on techniques from the broader Android reflection community — particularly LSPosed's AndroidHiddenApiBypass (meta-reflection technique) and Pine (@canyie/pine) for ART internals documentation. The ARM64 instruction encoder cross-checks against LLVM's AArch64InstPrinter golden bytes via clang+llvm-objdump.

Slim · Built for tight SIMD on Android.
