projects.md (+4 -15)
@@ -17,23 +17,13 @@ header:
We will initiate a hackathon to centralize many NLP datasets in Indonesian and local languages. Resources for Indonesian languages are diverse and scattered, so a unified location that joins multiple sources while keeping the data as close as possible to its original form can greatly help accessibility. We propose a unified schema for dataset extraction, so that as many datasets as possible can be implemented and data processing stays reproducible. Stay tuned for the next update!
# Past Projects
-
Currently, we have built **5 new benchmarks** to support NLP research on Indonesian languages and published papers in top NLP conferences. You can check this page for more details.
So far, we have built **5 new benchmarks** to support NLP research on Indonesian languages, published papers in top NLP conferences, and provided an overview of the current state of NLP research for Indonesia. You can check this page for more details.
## 2022
-
### NusaX
+
### Enabling NLP research in local languages
-
NusaX is a high-quality multilingual parallel corpus for Indonesian local languages elicited by native speakers. NusaX covers 12 languages, Indonesian, English, and 10 Indonesian local languages, namely Acehnese, Balinese, Banjarese, Buginese, Madurese, Minangkabau, Javanese, Ngaju, Sundanese, and Toba Batak.
+
NLP research in regional languages is still limited. In this project, we set out to kickstart NLP in regional languages by creating *NusaX*, a high-quality multilingual parallel corpus for Indonesian local languages elicited by native speakers. NusaX covers 12 languages: Indonesian, English, and 10 Indonesian local languages, namely Acehnese, Balinese, Banjarese, Buginese, Madurese, Minangkabau, Javanese, Ngaju, Sundanese, and Toba Batak.
<i class="fas fa-book" aria-hidden="true"></i> **Paper:** NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages [Preprint arXiv 2022](https://arxiv.org/pdf/2205.15960.pdf){: .btn .btn--info .btn--small }
{: .notice}
@@ -44,9 +34,8 @@ NusaX is a high-quality multilingual parallel corpus for Indonesian local langua
We provide an overview of the current state of NLP research for Indonesia’s 700+ languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also other underrepresented languages.
+
Additionally, we provide an overview of the current state of NLP research for Indonesia’s 700+ local languages. We highlight challenges in Indonesian NLP and how these affect the performance of current NLP systems. Finally, we provide general recommendations to help develop NLP technology not only for languages of Indonesia but also for other underrepresented languages.
<i class="fas fa-book" aria-hidden="true"></i> **Paper:** One Country, 700+ Languages: NLP Challenges for Underrepresented Languages and Dialects in Indonesia [ACL 2022](https://aclanthology.org/2022.acl-long.500.pdf){: .btn .btn--info .btn--small }