
Optimize speed for LLM inference #1859


Description

@HalfienP

How can I optimize LLM inference speed using prefix caching?
Should I use a dynamic cache or a static cache?
And when using prefix caching, can I change the order from

  • llm_input = sos_emb, embedding, text (prompt_text + target_text), task_id_emb, prompt_speech_token_emb
    to
  • llm_input = sos_emb, embedding, prompt_text, task_id_emb, prompt_speech_token_emb, target_text
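
For reference, a minimal sketch of prefix caching with Hugging Face transformers, assuming a recent transformers version; the checkpoint name and prompt strings below are placeholders, not the actual model or inputs from this repo:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; substitute the LLM actually used for inference.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# 1) Prefill: run the shared prefix (e.g. the prompt part of the input) once
#    and keep its key/value cache.
prefix_ids = tokenizer("shared prompt text", return_tensors="pt").input_ids
with torch.no_grad():
    prefill = model(prefix_ids, use_cache=True)
prefix_cache = prefill.past_key_values  # reused for requests that share this prefix

# 2) Decode: only the new (target) tokens are passed through the model;
#    attention still sees the cached prefix keys/values, so the prefix
#    is not recomputed.
target_ids = tokenizer(" target text", return_tensors="pt", add_special_tokens=False).input_ids
with torch.no_grad():
    out = model(target_ids, past_key_values=prefix_cache, use_cache=True)
next_token = out.logits[:, -1].argmax(dim=-1)
```

This sketch uses the default dynamic cache, which grows with the sequence; a static cache pre-allocates a fixed-size buffer and is mainly useful when compiling the decode step. Note that, depending on the transformers version, the cache object may be updated in place, so it may need to be copied before being reused across multiple requests.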
