When we add a new token to the tokenizer, its embedding is currently initialized randomly. We should build a feature that lets the user specify that when a particular new token is added (for example, the zero-width joiner), its embedding is initialized as a copy of the embedding of an existing token (for example, the space character). These mappings should be stored in the config file.
(If this works, we can consider deeper integration or further capabilities, such as initializing from the average of multiple embeddings.)
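
To make the request concrete, here is a minimal sketch of the intended behaviour on a toy vocabulary and embedding table. It is not tied to the project's actual tokenizer or model classes: the config key `embedding_init`, the function `add_token`, and the plain list-of-lists embedding table are all hypothetical names chosen for illustration, and the random fallback (Gaussian with standard deviation 0.02) is an assumption standing in for whatever the current initializer does.

```python
import json
import random

# Hypothetical config fragment: maps each new token to the existing token
# whose embedding it should copy. The key name "embedding_init" is an assumption.
CONFIG_JSON = '{"embedding_init": {"\\u200d": " "}}'


def add_token(vocab, embeddings, token, init_map, rng):
    """Add `token` to the vocab and return its id.

    If `init_map` names an existing source token, the new row is a copy of
    that token's embedding; otherwise it falls back to random initialization.
    """
    if token in vocab:
        return vocab[token]
    dim = len(embeddings[0])
    source = init_map.get(token)
    if source is not None and source in vocab:
        # Copy the row so the two embeddings can diverge during training.
        row = list(embeddings[vocab[source]])
    else:
        # Stand-in for the current behaviour: random initialization.
        row = [rng.gauss(0.0, 0.02) for _ in range(dim)]
    vocab[token] = len(embeddings)
    embeddings.append(row)
    return vocab[token]


init_map = json.loads(CONFIG_JSON)["embedding_init"]
rng = random.Random(0)

# Toy vocabulary and embedding table (3-dimensional vectors).
vocab = {" ": 0, "a": 1}
embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

zwj_id = add_token(vocab, embeddings, "\u200d", init_map, rng)  # mapped: copies the space row
other_id = add_token(vocab, embeddings, "b", init_map, rng)     # unmapped: random row
```

In this sketch a mapping whose source token is missing from the vocabulary silently falls back to random initialization; a real implementation might prefer to raise an error or log a warning so that a typo in the config does not go unnoticed.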