Allow new tokens to be initialized to an existing token's embeddings #1061

When we add a new token to the tokenizer, it is currently initialized with a random embedding. We should build a feature that lets the user specify that, when a particular new token is added (for example, the zero-width joiner), its embedding is initialized to match the embedding of an existing token (for example, space). These mappings should be stored in the config file.
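A minimal sketch of how this could work, written in plain Python so it does not depend on any particular framework. Everything named here is hypothetical: the `new_token_init` config key, the `add_tokens` helper, and the representation of the embedding table as a list of rows are assumptions for illustration, not the project's actual API.

```python
import json
import random

# Hypothetical config fragment: maps each new token to the existing token
# whose embedding it should copy. The key name "new_token_init" is an assumption.
CONFIG_JSON = '{"new_token_init": {"\\u200d": " "}}'


def add_tokens(vocab, embeddings, new_tokens, init_map, seed=0):
    """Append new tokens, copying a mapped source embedding when one is configured."""
    rng = random.Random(seed)
    dim = len(embeddings[0])
    for token in new_tokens:
        if token in vocab:
            continue  # already present; leave its embedding untouched
        source = init_map.get(token)
        if source is not None:
            if source not in vocab:
                raise KeyError(f"source token {source!r} is not in the vocabulary")
            # Copy the row so the two embeddings can diverge during training.
            row = list(embeddings[vocab[source]])
        else:
            # No mapping configured: fall back to random initialization.
            row = [rng.gauss(0.0, 0.02) for _ in range(dim)]
        vocab[token] = len(embeddings)
        embeddings.append(row)
    return vocab, embeddings


init_map = json.loads(CONFIG_JSON)["new_token_init"]
vocab = {" ": 0, "a": 1}
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
add_tokens(vocab, embeddings, ["\u200d", "<new>"], init_map)
```

Here the zero-width joiner starts with a copy of the space embedding, while `<new>`, which has no mapping, still gets a random vector. Raising an error when the source token is missing is one possible design choice; silently falling back to random initialization would be another.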

(If this works well, we can consider deeper integration, or further capabilities such as initializing a new token to the average of several existing embeddings.)
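For the averaging idea, one possible interpretation is an element-wise mean over the embeddings of several existing tokens. The sketch below assumes the same list-of-rows embedding table; the `mean_embedding` helper is hypothetical.

```python
def mean_embedding(vocab, embeddings, source_tokens):
    """Return the element-wise mean of the embeddings of source_tokens."""
    if not source_tokens:
        raise ValueError("need at least one source token")
    rows = [embeddings[vocab[t]] for t in source_tokens]
    # zip(*rows) walks the rows column by column.
    return [sum(col) / len(rows) for col in zip(*rows)]


vocab = {" ": 0, "-": 1}
embeddings = [[1.0, 2.0], [3.0, 6.0]]
vector = mean_embedding(vocab, embeddings, [" ", "-"])
```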
