Auto dataset concatenation prototype#128

Merged
jlamypoirier merged 4 commits into main from auto_concatenate
Jan 27, 2025

@jlamypoirier
Collaborator

✨ Description

Fixes: #120. A basic approach, to be refined in #123.

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

@jlamypoirier jlamypoirier marked this pull request as ready for review January 23, 2025 01:00
Collaborator

@tscholak tscholak left a comment

LGTM

Comment on lines +169 to +171
class GPTComposedDatasetConfig(GPTIndexedDatasetConfig):
    _abstract: typing.ClassVar[bool] = False
    type_: typing.ClassVar[str | None] = "composed"

Suggested change
- class GPTComposedDatasetConfig(GPTIndexedDatasetConfig):
-     _abstract: typing.ClassVar[bool] = False
-     type_: typing.ClassVar[str | None] = "composed"
+ class GPTConcatenatedMemmapConfig(GPTIndexedDatasetConfig):
+     _abstract: typing.ClassVar[bool] = False
+     type_: typing.ClassVar[str | None] = "concatenated_memmap"

for your convenience, so that we can merge this easily.

from fast_llm.data.data.gpt.data import GPTData
from fast_llm.data.dataset.gpt.config import (
    GPTBlendedDatasetConfig,
    GPTComposedDatasetConfig,

Suggested change
-     GPTComposedDatasetConfig,
+     GPTConcatenatedMemmapConfig,

Comment on lines +408 to +409
{"type": "composed", "path": _DATASET_PREFIX_MIX_COMPOSED},
GPTComposedDatasetConfig,

Suggested change
- {"type": "composed", "path": _DATASET_PREFIX_MIX_COMPOSED},
- GPTComposedDatasetConfig,
+ {"type": "concatenated_memmap", "path": _DATASET_PREFIX_MIX_CONCATENATED_MEMMAP},
+ GPTConcatenatedMemmapConfig,


- DATASET_PREFIX_MIX_1 = DATASET_PREFIX.with_name("blended_mix_1")
+ _DATASET_PREFIX_MIX_1 = DATASET_PREFIX.with_name("blended_mix_1")
+ _DATASET_PREFIX_MIX_COMPOSED = DATASET_CACHE / "composed"

Suggested change
- _DATASET_PREFIX_MIX_COMPOSED = DATASET_CACHE / "composed"
+ _DATASET_PREFIX_MIX_CONCATENATED_MEMMAP = DATASET_CACHE / "concatenated_memmap"
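
The two path expressions in this hunk behave differently: `with_name` replaces the last path component, while `/` appends a new one. A small sketch with `pathlib`, assuming made-up base paths (the real `DATASET_CACHE` and `DATASET_PREFIX` values are defined in the test helpers and are not shown in this thread):

```python
import pathlib

# Hypothetical base locations, chosen only for illustration.
DATASET_CACHE = pathlib.PurePosixPath("/tmp/fast_llm_tests/dataset")
DATASET_PREFIX = DATASET_CACHE / "common" / "tokens"

# with_name swaps only the final component ("tokens" becomes "blended_mix_1").
mix_1 = DATASET_PREFIX.with_name("blended_mix_1")

# The / operator appends a component under the cache directory.
concatenated = DATASET_CACHE / "concatenated_memmap"
```

With these assumed bases, `mix_1` is a sibling of the dataset prefix, whereas `concatenated` is a child of the cache directory.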

@jlamypoirier jlamypoirier merged commit 6dc77a0 into main Jan 27, 2025
3 of 4 checks passed
@jlamypoirier jlamypoirier deleted the auto_concatenate branch January 27, 2025 20:33

Development

Successfully merging this pull request may close these issues.

[feat] Generate concatenated datasets automatically

2 participants