Skip to content

Conversation

@YiyanZhai
Copy link

This PR adds the cachedPrefixes field in MLCEngineConfig, allowing users to cache system prompts when creating MLCEngine. It reduces redundant processing of repeated instructions.

Example usage in CreateMLCEngine:

await webllm.CreateMLCEngine(
  selectedModel,
  {
    initProgressCallback: initProgressCallback,
    logLevel: "INFO",
    cachedPrefixes: [
      [ { role: "system", content: "You are a helpful assistant running in the user's browser. You need to answer questions ... " }, ]
    ],
  },
  {
    context_window_size: 2048,
  }
);

Copy link
Member

@CharlieFRuan CharlieFRuan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the hard work! Added some comments. Please add an E2E example under examples/. I will take another pass afterwards. Thanks again!

initProgressCallback?: InitProgressCallback;
logitProcessorRegistry?: Map<string, LogitProcessor>;
logLevel?: LogLevel;
cachedPrefixes?: ChatCompletionMessageParam[][];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add docs to MLCEngineConfig, specifying the behavior of cachedPrefixes (e.g. will prefill when loading the engine to create the prefixes' KV, will only dispose these KV when reloading the engine). Perhaps we can also mark this as experimental to signify potential future API/behavior change

initProgressCallback?: InitProgressCallback;
logitProcessorRegistry?: Map<string, LogitProcessor>;
logLevel?: LogLevel;
cachedPrefixes?: ChatCompletionMessageParam[][];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also add an examples/cached_prefixes? Where we can demonstrate the prefill time difference between using cachedPrefixes and not using it. We should also test whether the behavior is expected in multi-turn conversation.

if (this.seqIdToPrefix.size === 0) {
this.fclearKVCaches(this.kvCache);
} else {
this.fKVCacheRemoveSequence!(this.kvCache, new tvmjs.Scalar(0, "int64"));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that we have multiple sequence IDs, let's make a constant, say CHAT_SEQUENCE_ID=0 (or maybe a better naming), instead of using a magic number 0 that may be hard to keep track of

src/llm_chat.ts Outdated

// If a match is found, fork the sequence
if (matchedSeqId !== -1 && maxMatchedLen > 0) {
console.log(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use log.info() instead of console.log()

src/llm_chat.ts Outdated
this.tvm.endScope();
} else if (seqID !== 0) {
// If no match is found, add the new sequence to the KV cache
console.log("Adding prefix to KV cache: ", seqID);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use log.info() instead of console.log()

@YiyanZhai
Copy link
Author

Thanks for the feedback! I’ve addressed the comments. Let me know if any further changes are needed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants