
Commit 08644b0

add some alternative autodiff approaches
1 parent 3be2910 commit 08644b0

File tree

src/ecosystem.md (1 file changed, +30 −3 lines)
# History and ecosystem

Enzyme started as a PhD project of William Moses and Valentin Churavy and was initially able to differentiate the LLVM IR generated by a subset of C and Julia. It has since been extended with frontends for additional languages. Enzyme is an LLVM incubator project and intends to ask for upstreaming later in 2024.

## Enzyme frontends

We hope that as part of the nightly releases Rust-Enzyme can mature relatively fast.

## Non-alternatives

The key to the performance of our solution is that AD is performed after compiler optimizations have been applied (and Enzyme is able to run additional optimizations afterwards). This observation is mostly language-independent; it is motivated in the first Enzyme paper (covering C/C++/Julia) and also mentioned towards the end of this Java autodiff [case study](https://github.com/openjdk/babylon-docs/blob/master/site/articles/auto-diff.md).
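As a sketch of why the ordering matters, consider the `normalize` example that motivates the first Enzyme paper (reproduced here in Rust; the function names are illustrative):

```rust
// `mag` is O(n), so calling it inside the loop makes `normalize` O(n^2).
fn mag(x: &[f64]) -> f64 {
    x.iter().map(|v| v * v).sum::<f64>().sqrt()
}

fn normalize(x: &[f64], out: &mut [f64]) {
    for i in 0..x.len() {
        out[i] = x[i] / mag(x);
    }
}

// After LICM (loop-invariant code motion) the compiler effectively produces
// the version below. Differentiating *after* this optimization yields a
// gradient that also computes `mag` (and its adjoint) only once; naively
// differentiating the unoptimized source bakes the O(n^2) behavior into the
// derivative, where it is much harder to remove again.
fn normalize_opt(x: &[f64], out: &mut [f64]) {
    let m = mag(x);
    for i in 0..x.len() {
        out[i] = x[i] / m;
    }
}
```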

### Wrapping cargo instead of modifying rustc

This can be achieved without modifying rustc, as was demonstrated in [oxide-enzyme](https://github.com/enzymeAD/oxide-enzyme). The approach works roughly as follows (a condensed sketch is given after the list):

0) We let users specify a list of functions which they want to differentiate, together with the corresponding configuration ([example](https://github.com/EnzymeAD/oxide-enzyme/blob/main/example/rev/build.rs)).
1) We manually emit the optimized LLVM IR of our Rust program and all its dependencies.
2) We llvm-link all files into a single module (equivalent to fat LTO).
3) We call Enzyme to differentiate the requested functions.
4) We adjust the linker visibility of the new functions and create an archive that exports them.
5) We terminate this cargo invocation (this can e.g. be achieved with -Zlink-only).
6) We call cargo a second time, this time providing our archive as an additional linker argument. The functions provided by the archive exactly match the extern fn declarations created through our macro ([here](https://github.com/EnzymeAD/oxide-enzyme/blob/main/example/rev/src/main.rs)).
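A condensed sketch of steps 1)–4) as a driver program. This is a hypothetical illustration, not the actual oxide-enzyme implementation: the tool invocations, the Enzyme plugin name, and the `target/release/deps` scan are assumptions.

```rust
// Hypothetical driver for steps 1)-4); flags and paths are illustrative.
use std::fs;
use std::process::Command;

fn run(cmd: &str, args: &[String]) {
    let status = Command::new(cmd).args(args).status().expect("failed to spawn");
    assert!(status.success(), "{cmd} {args:?} failed");
}

fn main() {
    let s = |x: &str| x.to_string();

    // 1) Emit optimized LLVM IR for the crate and all dependencies
    //    (requires build-std so the standard library is also visible as IR).
    run("cargo", &[s("rustc"), s("--release"), s("--"), s("--emit=llvm-ir")]);

    // 2) llvm-link every emitted .ll file into one module (fat-LTO style).
    let mut args: Vec<String> = fs::read_dir("target/release/deps")
        .expect("first cargo invocation must run before this step")
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .filter(|p| p.extension().map_or(false, |e| e == "ll"))
        .map(|p| p.display().to_string())
        .collect();
    args.extend([s("-o"), s("combined.bc")]);
    run("llvm-link", &args);

    // 3) Run Enzyme over the combined module to generate the derivatives.
    //    (Plugin name and pass spelling differ between Enzyme/LLVM versions.)
    run("opt", &[s("-load-pass-plugin=LLVMEnzyme.so"), s("-passes=enzyme"),
                 s("combined.bc"), s("-o"), s("combined_ad.bc")]);

    // 4) Compile and archive so the new derivative symbols are exported;
    //    a second cargo invocation (steps 5/6) links against this archive.
    run("llc", &[s("-filetype=obj"), s("combined_ad.bc"), s("-o"), s("combined_ad.o")]);
    run("llvm-ar", &[s("rcs"), s("liboxide_ad.a"), s("combined_ad.o")]);
}
```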
This PoC required the use of `build-std` in order to see the LLVM IR of functions from the standard library.
An alternative would have been to provide Enzyme with rules for differentiating every function from the Rust standard library, which seems undesirable.

This approach also assumes that linking LLVM IR generated by two different cargo invocations, and passing Rust objects between them, works reliably.

This approach is further limited in compile times and reliability; see the example at the bottom left of this [poster](https://c.wsmoses.com/posters/Enzyme-llvmdev.pdf). LLVM types are often too limited to determine the correct derivative (e.g. an opaque `ptr`), so Enzyme has to run a usage analysis to determine the relevant type of a variable. This can be time-consuming (we encountered multiple cases with more than 1000x longer compile times), and it can be unreliable if Enzyme fails to deduce the correct type of a variable due to insufficient usages. When calling Enzyme from within rustc, we are able to provide high-level type information to Enzyme.
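As a small illustration of the type problem (the functions are hypothetical): with opaque pointers, both functions below lower to the same `define ... (ptr, i64)` signature, so Enzyme can only recover the pointee type by analyzing how the memory is used.

```rust
// With LLVM's opaque pointers, both signatures lower to `(ptr, i64)`;
// whether the buffer holds f32 or f64 values is only visible from the
// loads that use it. If those usages are themselves ambiguous (e.g.
// bit-level manipulation), Enzyme's type analysis may fail or have to
// search for a long time.
unsafe fn sum_f32(x: *const f32, n: usize) -> f32 {
    (0..n).map(|i| *x.add(i)).sum()
}

unsafe fn sum_f64(x: *const f64, n: usize) -> f64 {
    (0..n).map(|i| *x.add(i)).sum()
}
```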
### Rust level autodiff

Various Rust libraries for training neural networks exist (burn, candle, dfdx, rai, autograph).
We talked with developers from burn, rai, and autograph about comparing autodiff performance on the Microsoft [ADBench](https://github.com/microsoft/ADBench/) benchmark suite. After some investigation, all three concluded that supporting such cases would require significant redesigns of their projects, which they cannot afford in the foreseeable future.
When training neural networks, we often deal with a few large variables (tensors) and a small set of functions (layers) that dominate the runtime. Under these conditions it is possible to amortize some inefficiencies by making only the most expensive operations efficient. Such optimizations stop working once we look at the larger set of applications in scientific computing and HPC, as sketched below.
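To make the last point concrete, here is a minimal, hypothetical scalar tape (deliberately not the API of any of the crates above): a framework layer such as a matmul records one node for millions of FLOPs, while scalar-heavy scientific code pays the recording overhead on every single operation.

```rust
// Minimal reverse-mode tape; purely illustrative, not a real crate API.
#[derive(Clone, Copy)]
struct Var(usize);

#[derive(Default)]
struct Tape {
    // value, parent indices, local partial derivatives
    nodes: Vec<(f64, [usize; 2], [f64; 2])>,
}

impl Tape {
    fn leaf(&mut self, v: f64) -> Var {
        self.nodes.push((v, [usize::MAX; 2], [0.0, 0.0]));
        Var(self.nodes.len() - 1)
    }
    // Every scalar multiply allocates and records a node: for code that is
    // mostly scalar arithmetic, this bookkeeping dwarfs the actual FLOP.
    fn mul(&mut self, a: Var, b: Var) -> Var {
        let (va, vb) = (self.nodes[a.0].0, self.nodes[b.0].0);
        self.nodes.push((va * vb, [a.0, b.0], [vb, va]));
        Var(self.nodes.len() - 1)
    }
}

fn main() {
    let mut t = Tape::default();
    let (x, y) = (t.leaf(3.0), t.leaf(4.0));
    let _z = t.mul(x, y); // one node per FLOP here...
    // ...whereas a tensor framework records one node for a whole matmul
    // (~2*M*N*K FLOPs), amortizing the overhead NN training relies on.
}
```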
