Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AOT] Memory Info Restore Mechanism with Better Performance #4113

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Jiax-cn
Copy link
Contributor

@Jiax-cn Jiax-cn commented Feb 26, 2025

I found that my program runs slower in WAMR AOT mode compared to other WASM runtimes, e.g. WAVM. I compared their LLVM IRs and found that WAMR emits more load operations of memory base.

In WAMR, functions with mem_space_unchanged keep memory base address in mem_base_addr of AOTMemInfo, while others keep the address of memory base address in that field. When emit instructions like load/store, the former use the base address directly, while the later should load base address from its address at first. This reload is redundant when there is no possibility of changing memory between two consecutive load/store instructions:

%mem_base0 = load ptr, ptr %mem_base_addr_offset, align 8 ; reload
; load/store on %mem_base0
%mem_base1 = load ptr, ptr %mem_base_addr_offset, align 8 ; redundant reload
; load/store on %mem_base1

Optimization passes won’t recognize this redundancy because the reloaded memory base is accessed within the context.

In WAVM, the base address is reloaded when the memory possibly changes, e.g. after calling another function or after memory.grow. This can be redundant if there are no subsequent load/store instructions, but the dead code elimination pass handles this:

%mem_base0 = load ptr, ptr %mem_base_addr_offset, align 8 ; reload
; load/store on %mem_base0
; emit memory.grow
%mem_base1 = load ptr, ptr %mem_base_addr_offset, align 8 ; reload
; load/store on %mem_base1

Performance

Here is a sample C++ program substr.cc:

#include <string>
extern "C" void TestPerformance(const std::string& s) {
    s.substr(5);
}

int main() {
    for (int i = 0; i < 10007989; i++) {
        TestPerformance("Hello World");
    }
    return 0;
}

Compiled with emcc (version: 3.1.59 (0e4c5994eb5b8defd38367a416d0703fd506ad81))

emcc -O3 -g2 -s EXPORTED_FUNCTIONS='["_TestPerformance"]' ./substr.cc -o substr.wasm

Then ran wamrc and iwasm(linux) and compared the performance:

product-mini/platforms/posix/main.c:

#include <time.h>
static const void *
app_instance_main(wasm_module_inst_t module_inst)
{
    const char *exception;
    struct timespec s, e;
    clock_gettime(CLOCK_MONOTONIC, &s);
    wasm_application_execute_main(module_inst, app_argc, app_argv);
    clock_gettime(CLOCK_MONOTONIC, &e);
    printf("cost: %ld us", e.tv_sec * 1000000 + e.tv_nsec / 1000 - s.tv_sec * 1000000 - s.tv_nsec / 1000);
    exception = wasm_runtime_get_exception(module_inst);
    return exception;
}

result:

commit e3dcf4f

cost: 349913 us

commit 3f268e5

cost: 308787 us

IR(optimized) comparison:

image

@lum1n0us
Copy link
Collaborator

Thank you for the keen observation and the intriguing analysis of the root cause.

IIUC, the rationale behind storing mem_base_addr in AOTMemInfo is to avoid unnecessary reloading.

// in aot_check_memory_overflow()
    /* Get memory base address and memory data size */
    if (func_ctx->mem_space_unchanged
#if WASM_ENABLE_SHARED_MEMORY != 0
        || is_shared_memory
#endif
    ) {
        mem_base_addr = func_ctx->mem_info[0].mem_base_addr;  // This branch should be used in the majority of cases.
    }
    else {
        if (!(mem_base_addr = LLVMBuildLoad2(
                  comp_ctx->builder, OPQ_PTR_TYPE,
                  func_ctx->mem_info[0].mem_base_addr, "mem_base"))) {
            aot_set_last_error("llvm build load failed.");
            goto fail;
        }
    }

Using the address of the memory base address does not allow optimization passes to recognize the pattern and decide to eliminate superfluous load instructions.

However, I concur that the conditions for setting mem_space_unchanged to true are so stringent that it rarely occurs in practice, particularly due to the has_op_func_call condition. If you simply set mem_space_unchanged to true in create_memory_info(), you would achieve the desired improvement.

// in create_memory_info
    bool mem_space_unchanged = true; // (!func->has_op_memory_grow && !func->has_op_func_call) || (!module->possible_memory_grow);

Therefore, I believe the concept of "reloading the base address when the memory might change" is excellent.

If you're in agreement with my perspective, we can begin refactoring the PR by concentrating on "reloading the base address when the memory might change" and eliminating "keeping the address of the memory base address."

@Jiax-cn
Copy link
Contributor Author

Jiax-cn commented Feb 27, 2025

@lum1n0us Completely agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants