FLORES‐200 Language Code Resolution for NMT Engine
John Lambert edited this page Jan 30, 2024 · 6 revisions
The NMT engine in Serval is based on the NLLB-200 model. In the NLLB model, languages are identified by a FLORES-200 code of the form {language}_{script}, where the language is an ISO 639-3 code and the script is an ISO 15924 code. In order to have the best chance of matching a language code used in NLLB-200, Serval attempts to convert the IETF language tags specified for an engine to a FLORES-200 code.
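As a rough illustration of the {language}_{script} form, the sketch below checks only the shape of a code: three lowercase letters (ISO 639-3), an underscore, and four letters with an initial capital (ISO 15924). The function name `looks_like_flores_200` is hypothetical and not part of Serval, and the check does not confirm that a code is actually in the NLLB-200 language list.

```python
import re

# Shape only: ISO 639-3 codes are three lowercase letters, and ISO 15924
# codes are four letters written with an initial capital.
FLORES_200_SHAPE = re.compile(r"[a-z]{3}_[A-Z][a-z]{3}")


def looks_like_flores_200(code: str) -> bool:
    """Return True if `code` has the {language}_{script} shape (hypothetical helper)."""
    return FLORES_200_SHAPE.fullmatch(code) is not None


print(looks_like_flores_200("eng_Latn"))  # True
print(looks_like_flores_200("en-US"))     # False
```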
The language tag is converted to a FLORES-200 code using the following algorithm:
- Extract the language and script subtags from the language tag.
- Find the correct ISO 639-3 language code:
  - If the language subtag is already an ISO 639-3 code, then use as-is.
  - If the language subtag is a macrolanguage, then convert it to the closest ISO 639-3 code according to the following mapping:
    - ar -> arb
    - ms -> zsm
    - lv -> lvs
    - ne -> npi
    - sw -> swh
  - If the language subtag is cmn (Mandarin Chinese), then convert to zho.
  - If the language subtag is an ISO 639-1 code, then convert it to the corresponding ISO 639-3 code.
- Find the correct ISO 15924 script code:
  - If the script subtag is specified, then use as-is.
  - Find the default script for a language tag by searching the SLDR langtags.json file. If the language tag or language subtag matches a tag in the tags field of a language entry, then use the corresponding script field.
  - If the script is Kore, then convert to Hang.
- If an ISO 639-3 code or ISO 15924 code cannot be found, then use the language tag as the FLORES-200 code.
- Construct the FLORES-200 code from the ISO 639-3 language code and ISO 15924 script code: {language}_{script}.
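The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Serval's actual implementation: all function names are hypothetical, `ISO_639_1_TO_3` is a tiny illustrative subset of the real ISO 639-1 to ISO 639-3 table, and `LANGTAGS` stands in for a few entries of langtags.json (only the tags and script fields mentioned above are modeled). The sketch also assumes that any three-letter language subtag is already an ISO 639-3 code, that the macrolanguage and cmn rules are checked before the as-is rule, and that the Kore to Hang conversion applies whether the script was given explicitly or looked up.

```python
# Macrolanguage mapping, copied from the list above.
MACROLANGUAGES = {"ar": "arb", "ms": "zsm", "lv": "lvs", "ne": "npi", "sw": "swh"}

# Assumption: illustrative subset of the ISO 639-1 -> ISO 639-3 table.
ISO_639_1_TO_3 = {"en": "eng", "es": "spa", "fr": "fra", "ko": "kor"}

# Assumption: hypothetical stand-in for a few langtags.json entries.
LANGTAGS = [
    {"tags": ["en", "en-US"], "script": "Latn"},
    {"tags": ["es"], "script": "Latn"},
    {"tags": ["ar", "arb"], "script": "Arab"},
    {"tags": ["ko"], "script": "Kore"},
]


def split_tag(tag):
    """Return (language subtag, script subtag or None) for a language tag."""
    parts = tag.replace("_", "-").split("-")
    language = parts[0].lower()
    # A script subtag is four letters, e.g. "Latn".
    script = next(
        (p.title() for p in parts[1:] if len(p) == 4 and p.isalpha()), None
    )
    return language, script


def resolve_language(subtag):
    """Return an ISO 639-3 code for the language subtag, or None if unknown."""
    if subtag in MACROLANGUAGES:
        return MACROLANGUAGES[subtag]
    if subtag == "cmn":
        return "zho"
    if len(subtag) == 3:
        return subtag  # assumed to already be an ISO 639-3 code
    return ISO_639_1_TO_3.get(subtag)


def default_script(tag, language_subtag):
    """Look up a default script by full tag first, then by language subtag."""
    for candidate in (tag, language_subtag):
        for entry in LANGTAGS:
            if candidate in entry["tags"]:
                return entry["script"]
    return None


def to_flores_200(tag):
    """Convert a language tag to a FLORES-200 code, or return it unchanged."""
    language_subtag, script = split_tag(tag)
    language = resolve_language(language_subtag)
    if script is None:
        script = default_script(tag, language_subtag)
    if language is None or script is None:
        return tag  # fall back to the original language tag
    if script == "Kore":
        script = "Hang"
    return f"{language}_{script}"


print(to_flores_200("en"))        # eng_Latn
print(to_flores_200("ar"))        # arb_Arab
print(to_flores_200("ko"))        # kor_Hang
print(to_flores_200("cmn-Hans"))  # zho_Hans
print(to_flores_200("xx"))        # xx
```

With the sample data above, a tag such as xx resolves to neither a language code nor a script, so it is returned unchanged, matching the fallback step in the list.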