Allow new tokens to be initialized to an existing token's embeddings #1061

@benjaminking

When we add a new token to the tokenizer, its embedding is currently initialized randomly. We should add a feature that lets the user specify that a given new token (for example, the zero-width joiner, U+200D) should have its embedding initialized to match that of an existing token (for example, space). These mappings should be stored in the config file.
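To make the request concrete, here is a minimal sketch of the intended behavior in plain Python. It is not tied to the project's actual tokenizer or model classes: `INIT_FROM`, `add_tokens`, the list-of-lists embedding table, and the random fallback are all hypothetical names and choices made for illustration, and the real config key and format are still to be decided.

```python
import random

# Hypothetical config fragment: new token -> existing token whose embedding it copies.
INIT_FROM = {"\u200d": " "}  # zero-width joiner starts out as a copy of space


def add_tokens(vocab, embeddings, new_tokens, init_from, seed=0):
    """Append new_tokens to vocab/embeddings, copying embeddings where a mapping exists."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    for token in new_tokens:
        if token in vocab:
            continue  # already known; leave its embedding alone
        source = init_from.get(token)
        if source is not None:
            if source not in vocab:
                raise KeyError(f"source token {source!r} is not in the vocabulary")
            # Copy the row so the two embeddings can diverge during training.
            row = list(embeddings[vocab[source]])
        else:
            # Fallback for unmapped tokens: small random values (an assumed default).
            row = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        vocab[token] = len(embeddings)
        embeddings.append(row)
    return vocab, embeddings


vocab = {" ": 0, "a": 1}
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
# U+200D is mapped to space; U+200C has no mapping and gets a random embedding.
add_tokens(vocab, embeddings, ["\u200d", "\u200c"], INIT_FROM)
```

In a real model the same idea would apply to a row of the embedding matrix after it has been resized for the new vocabulary; the sketch only shows the lookup-and-copy logic and the error case where the mapped source token does not exist.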

(If this works well, we can consider deeper integration or further capabilities, such as initializing a new token to the average of multiple embeddings.)
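As a rough illustration of the averaging idea, the sketch below initializes a new token to the element-wise mean of several existing embeddings. The helper name `mean_embedding` and the list-of-lists embedding table are assumptions for the example, not part of any existing code.

```python
def mean_embedding(vocab, embeddings, sources):
    """Return the element-wise mean of the embeddings of the given source tokens."""
    if not sources:
        raise ValueError("at least one source token is required")
    rows = [embeddings[vocab[s]] for s in sources]
    # zip(*rows) walks the rows column by column, one embedding dimension at a time.
    return [sum(column) / len(rows) for column in zip(*rows)]


vocab = {" ": 0, "-": 1}
embeddings = [[0.0, 2.0], [2.0, 4.0]]

# Hypothetical mapping: the zero-width joiner starts as the mean of space and hyphen.
vocab["\u200d"] = len(embeddings)
embeddings.append(mean_embedding(vocab, embeddings, [" ", "-"]))
# embeddings[2] is now [1.0, 3.0]
```

A single-source mapping is then just the special case of a one-element source list, so one config format could cover both behaviors.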

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request)
    pipeline 4: train (Issue related to training a model)

    Projects

    Status

    🔖 Ready

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests