Data is Better Together

Data is Better Together is a collaboration between 🤗 Hugging Face, 🏓 Argilla, and the Open-Source ML community. We aim to empower the open-source community to build impactful datasets collectively.

What have we done so far?

The community has created DIBT/10k_prompts_ranked, a dataset of 10k prompts ranked by quality, as part of Data is Better Together.
The community has translated these prompts into the following languages (see here for current efforts):
- Dutch

What are we currently working on?

We are working on several strands of work. Here are the currently active projects.

1. Prompt ranking

Our first DIBT activity focuses on ranking the quality of prompts. We have already released version 1.0 of this dataset as DIBT/10k_prompts_ranked. So far, over 385 people have contributed annotations, and we are continuing to collect more!

- Follow the progress of this effort in this dashboard.
- You can contribute to the ranking of prompts here.
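
To make the ranking idea concrete, here is a minimal sketch of how quality ratings from several annotators could be combined into a per-prompt score and then sorted. It is an illustration only, not the project's actual pipeline: the record layout and the names `prompt`, `annotator`, `rating`, and `aggregate_ratings` are assumptions made for this example, and the sample annotations are made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate_ratings(annotations):
    """Group ratings by prompt and return the average rating and response count."""
    by_prompt = defaultdict(list)
    for record in annotations:
        by_prompt[record["prompt"]].append(record["rating"])
    return {
        prompt: {"avg_rating": mean(ratings), "num_responses": len(ratings)}
        for prompt, ratings in by_prompt.items()
    }


# Hypothetical annotations; field names and values are illustrative only.
annotations = [
    {"prompt": "Explain recursion to a beginner.", "annotator": "a1", "rating": 5},
    {"prompt": "Explain recursion to a beginner.", "annotator": "a2", "rating": 4},
    {"prompt": "hi", "annotator": "a1", "rating": 1},
]

summary = aggregate_ratings(annotations)

# Rank prompts from highest to lowest average rating.
ranked = sorted(summary, key=lambda p: summary[p]["avg_rating"], reverse=True)
print(ranked)
```

Averaging is the simplest possible aggregation. With few responses per prompt, the response count is worth keeping next to the average so that low-coverage prompts can be prioritized for further annotation.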

2. Multilingual Prompt Evaluation Project (MPEP)

There are not enough language-specific benchmarks for open LLMs! We want to create a leaderboard for more languages by leveraging the community! You can find more information about this project in the MPEP README.

Contribute translations

Want to contribute translations? Currently, these translation efforts are underway:

Current Translation Efforts

Want to work on a language that's not listed? You can follow the steps to set up a new annotation effort by going to prompt_translation/ and checking out the three notebooks:

- In the first one, you'll learn how to set up a prompt translation space using Argilla and Hugging Face Spaces.
- In the second one, you'll see how to upload the prompt translation data for the language of your choice.
- In the third one, we show how to set up a dashboard to track the annotation efforts easily.

3. Domain Specific Datasets

This project aims to bootstrap the creation of more domain-specific datasets for training models. The goal is to create a set of tools that help users to collaborate with domain experts. Find out more in the Domain Specific Datasets README.

4. DPO/ORPO datasets for more languages

Currently, many languages do not have DPO datasets openly shared on the Hugging Face Hub. The DIBT/preference_data_by_language Space gives you an overview of DPO dataset coverage by language. At the time of this commit, there are 14 languages with DPO datasets available on the Hugging Face Hub.

The goal of this project is to help foster a community of people building more DPO datasets for different languages. Find out more in the DPO/ORPO datasets README.
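
As a rough illustration of what a language-coverage overview involves, the sketch below tallies how many datasets list each language and reports which languages of interest have none. It is not how the DIBT/preference_data_by_language Space is implemented: the metadata entries, the dataset ids, the `languages` field, and the `language_coverage` helper are all hypothetical, and the data is hard-coded rather than fetched from the Hub.

```python
from collections import Counter


def language_coverage(datasets_metadata):
    """Count how many datasets list each language code."""
    counts = Counter()
    for entry in datasets_metadata:
        # Use a set so a language repeated within one entry is counted once.
        for lang in set(entry.get("languages", [])):
            counts[lang] += 1
    return counts


# Hypothetical metadata for DPO datasets; ids and language tags are made up.
metadata = [
    {"id": "example-org/dpo-dataset-a", "languages": ["en"]},
    {"id": "example-org/dpo-dataset-b", "languages": ["nl", "en"]},
    {"id": "example-org/dpo-dataset-c", "languages": ["nl"]},
]

coverage = language_coverage(metadata)

# Languages we care about that have no DPO dataset in the sample metadata.
wanted = ["en", "nl", "sw"]
missing = [lang for lang in wanted if coverage[lang] == 0]
print(dict(coverage), missing)
```

A list like `missing` is the kind of output that can help a community decide which language to target next.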

Other guides

The Data is Better Together community has created several guides to support efforts to create valuable datasets via the open-source community. Currently, we have the following guides:

Creating a KTO preference dataset
