TICO-19 Terminologies from Facebook and Google added; partial normali…
fititnt committed Nov 11, 2021
1 parent 8ae6918 commit 4c946e2
Showing 224 changed files with 61,489 additions and 51 deletions.
3 changes: 0 additions & 3 deletions .gitignore
@@ -9,6 +9,3 @@ data/original/tico19-testset.zip
 !.gitignore
 !README.md
 tmp/
-
-# temp
-data/original/terminology/facebook/*.csv
21 changes: 19 additions & 2 deletions CHANGELOG.md
@@ -2,9 +2,26 @@
 
 ## [Unreleased]
 ### Added
-- TODO
+- TODO: Fix Facebook terminology usage of "_XX" as suffix
 
-## [0.9.0] - 2020-11-11
+## [1.0.0] - 2021-11-11
 ### Added
 - **Fiat lux!**
 - Draft of scripts to download data from TICO-19 original sources
+- `data/original/terminology/facebook`: TICO-19 terminology from Facebook
+- Uses data from `tico-19/tico-19.github.io/data/terminologies/f_*`, with
+following data normalizations, using as example `f_en-pt_XX.csv` to
+`en_pt-XX.csv`:
+- Restrict `-` language tags delimiter, as per
+[IETF Best Current Practice 47](https://tools.ietf.org/rfc/bcp/bcp47.txt)
+an common usage in industry.
+- Use single `_` for other types of delimiter when necessary. No known
+industry convention on this decision.
+- In the case of language pair on file names this means unambiguously
+separating one language code from another.
+- Remove prefix `f_`, since now is inferred from folder path.
+- `data/original/terminology/google`: TICO-19 terminology from Google
+- Uses data from `tico-19/tico-19.github.io/data/terminologies/g_*`, with
+following data normalizations, using as example `g_en_pt-BR.csv` to
+`en_pt-BR.csv`:
+- Remove prefix `g_`, since now is inferred from folder path.
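
The file name normalizations described in the CHANGELOG entry above can be sketched roughly as follows. This is an illustrative sketch only, not the script used in this commit: the function names `normalize_facebook_name` and `normalize_google_name` are hypothetical, and the Facebook rule assumes that the first `-` in the upstream name separates the source language from the target, and that the source language carries no region subtag.

```python
def normalize_facebook_name(filename: str) -> str:
    """Hypothetical helper: map e.g. 'f_en-pt_XX.csv' to 'en_pt-XX.csv'."""
    prefix, suffix = "f_", ".csv"
    if not (filename.startswith(prefix) and filename.endswith(suffix)):
        raise ValueError(f"unexpected Facebook terminology file name: {filename}")
    # Drop the 'f_' prefix (now implied by the folder) and the extension.
    stem = filename[len(prefix):-len(suffix)]
    # Assumption: the first '-' separates source language from target language.
    source, sep, target = stem.partition("-")
    if not sep:
        raise ValueError(f"no language pair delimiter found in: {filename}")
    # Inside a language tag use '-' (BCP 47); between the two tags use '_'.
    return f"{source}_{target.replace('_', '-')}{suffix}"


def normalize_google_name(filename: str) -> str:
    """Hypothetical helper: map e.g. 'g_en_pt-BR.csv' to 'en_pt-BR.csv'."""
    prefix = "g_"
    if not filename.startswith(prefix):
        raise ValueError(f"unexpected Google terminology file name: {filename}")
    # The Google names already use '_' between tags and '-' inside them,
    # so only the prefix is removed.
    return filename[len(prefix):]


if __name__ == "__main__":
    print(normalize_facebook_name("f_en-pt_XX.csv"))
    print(normalize_google_name("g_en_pt-BR.csv"))
```

Both helpers raise `ValueError` on names that do not match the expected pattern, so an unexpected upstream file would surface as an error rather than be silently renamed.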
4 changes: 2 additions & 2 deletions README.md
@@ -1,5 +1,5 @@
 # tico-19-hxltm
-**[draft] Public domain datasets from
+**[working-draft] Public domain datasets from
 [Translation Initiative for COVID-19](tico-19.github.io) on the format
 HXLTM (Multilingual Terminology in Humanitarian Language Exchange).**
 
@@ -8,7 +8,7 @@ HXLTM (Multilingual Terminology in Humanitarian Language Exchange).**
 
 ## License
 
-[![Public Domain](https://i.creativecommons.org/p/zero/1.0/88x31.png)](UNLICENSE)
+[![Public Domain](https://i.creativecommons.org/p/zero/1.0/88x31.png)](LICENSE)
 
 To the extent possible under law, [Etica.AI](https://github.com/EticaAI),
 already based on the work of the
1 change: 0 additions & 1 deletion data/original
Submodule original deleted from d97615