-
Notifications
You must be signed in to change notification settings - Fork 284
feat(core): add Intl.Segmenter for grapheme cluster handling #482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds Unicode grapheme cluster handling using Intl.Segmenter to properly detect character boundaries for emoji, CJK characters, and other complex Unicode sequences. This is foundational work for improving input handling in the terminal UI.
Key Changes
- New
grapheme-segmenter.tsutility module with polyfill support forIntl.Segmenter - Integration with
parse.keypress.tsto replace simple length checks with proper grapheme cluster detection - Added
@formatjs/intl-segmenterdependency for environments lacking native support
Reviewed changes
Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| packages/core/src/lib/grapheme-segmenter.ts | New utility module providing grapheme segmentation with polyfill fallback and singleton pattern for performance |
| packages/core/src/lib/parse.keypress.ts | Updated keypress parser to use isSingleGrapheme() for proper Unicode character boundary detection |
| packages/core/package.json | Added @formatjs/intl-segmenter polyfill dependency |
| bun.lock | Lock file updates for new dependency and unrelated Vue package upgrades |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let segmenter: Intl.Segmenter | null = null | ||
|
|
||
| if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") { | ||
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) | ||
| } | ||
|
|
||
| export function getGraphemeSegmenter(): Intl.Segmenter { | ||
| if (segmenter) return segmenter | ||
|
|
||
| if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") { | ||
| segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" }) | ||
| return segmenter | ||
| } | ||
|
|
||
| throw new Error( | ||
| "Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter", | ||
| ) | ||
| } | ||
|
|
||
| export function isSingleGrapheme(s: string): boolean { | ||
| if (s.length === 0) return false | ||
| if (s.length === 1) return true | ||
|
|
||
| const first = s.charCodeAt(0) | ||
| if (first < 128) { | ||
| const second = s.charCodeAt(1) | ||
| if (second < 128) return false | ||
| } | ||
|
|
||
| const iter = getGraphemeSegmenter().segment(s)[Symbol.iterator]() | ||
| iter.next() | ||
| return iter.next().done === true | ||
| } | ||
|
|
||
| export function firstGrapheme(str: string): string { | ||
| if (str.length === 0) return "" | ||
| const firstCode = str.charCodeAt(0) | ||
| if (firstCode < 128) { | ||
| if (str.length === 1 || str.charCodeAt(1) < 128) return str[0]! | ||
| } | ||
|
|
||
| const segments = getGraphemeSegmenter().segment(str) | ||
| const first = segments[Symbol.iterator]().next() | ||
| return first.done ? "" : first.value.segment | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new grapheme-segmenter.ts module lacks test coverage. Given that this is foundational work for proper emoji and CJK text handling (as stated in the PR description), comprehensive tests should be added to verify:
- Single ASCII character detection
- Multi-byte emoji handling (e.g., 👨👩👧👦, 🏳️🌈)
- CJK character handling
- Combining diacritical marks (e.g., é composed of e + ́)
- Zero-width joiners and other special Unicode sequences
- Edge cases like empty strings and strings with mixed content
The existing pattern in this codebase shows that most modules have corresponding test files (e.g., parse.keypress.test.ts, RGBA.test.ts).
| if (segmenter) return segmenter | ||
|
|
||
| if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") { | ||
| segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" }) | ||
| return segmenter | ||
| } | ||
|
|
||
| throw new Error( | ||
| "Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter", | ||
| ) |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a potential race condition in the singleton pattern. If getGraphemeSegmenter() is called concurrently before the segmenter is initialized, multiple Intl.Segmenter instances could be created. While this is unlikely to cause functional issues (since they're functionally equivalent), it defeats the purpose of the singleton pattern.
Consider using a more robust initialization pattern, or document that this is acceptable for this use case.
| if (segmenter) return segmenter | |
| if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") { | |
| segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" }) | |
| return segmenter | |
| } | |
| throw new Error( | |
| "Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter", | |
| ) | |
| if (!segmenter) { | |
| if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") { | |
| segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" }) | |
| } else { | |
| throw new Error( | |
| "Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter", | |
| ) | |
| } | |
| } | |
| return segmenter |
| if (s.length === 1) return true | ||
|
|
||
| const first = s.charCodeAt(0) | ||
| if (first < 128) { |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The optimization for ASCII characters (code < 128) is good, but there's a subtle issue: the check at line 26-27 assumes that if the second character is also ASCII, the string can't be a single grapheme. However, this doesn't account for the case where s.length might be 1 (which is already handled at line 22).
More importantly, this logic could be clearer. The fast path at line 22 already handles the common case of length 1. The ASCII check could be moved to be more explicit about what it's optimizing.
| if (first < 128) { | |
| // Fast path: if the first two code units are ASCII, they are necessarily | |
| // two separate grapheme clusters, so the string can't be a single grapheme. | |
| if (first < 128 && s.length >= 2) { |
| } else if (isSingleGrapheme(s)) { | ||
| // Single grapheme: special characters, emoji, CJK, etc. | ||
| key.name = s |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The integration of isSingleGrapheme() in the keypress parser lacks test coverage for the new behavior. Tests should be added to verify that:
- Multi-byte emoji (e.g., "👨👩👧👦") are correctly identified as single graphemes
- CJK characters are properly handled
- Characters with combining marks (e.g., "é" as e + combining acute) work correctly
- The existing behavior for ASCII and special characters is preserved
This is particularly important since this change affects a critical part of input handling.
| if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") { | ||
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) | ||
| } | ||
|
|
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Top-level await in a module can cause issues with module loading and bundling. This will make the module asynchronous, which means any file that imports from this module will also need to handle the asynchronous nature of the import.
Consider one of these alternatives:
- Move the polyfill loading into the
getGraphemeSegmenter()function and make it async - Use dynamic import only when needed (lazy loading)
- Use a synchronous initialization pattern with a flag to track if initialization has been attempted
The current approach could break compatibility with certain bundlers and runtimes that don't support top-level await.
| if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") { | |
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) | |
| } |
| await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {}) | ||
| } |
Copilot
AI
Jan 7, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The polyfill import error is silently caught and ignored. If the polyfill fails to load and the runtime doesn't support Intl.Segmenter natively, subsequent calls to getGraphemeSegmenter() will throw an error with a potentially misleading message suggesting to "install @formatjs/intl-segmenter" even though it's already installed.
Consider logging the error or providing more context about why the polyfill failed to load. This will help with debugging in production environments.
- Add grapheme-segmenter.ts with getGraphemeSegmenter(), isSingleGrapheme(), firstGrapheme() - Use @formatjs/intl-segmenter polyfill for environments without native support - Update parse.keypress.ts to use isSingleGrapheme() for proper emoji/CJK detection
a4e10bf to
318db8b
Compare
|
Superseded by #483 which now includes grapheme segmentation along with stdin-buffer improvements. |
Adds
Intl.Segmenterbased grapheme cluster handling for proper Unicode text segmentation.Changes
grapheme-segmenter.tsutility usingIntl.Segmenterparse.keypress.tsfor correct character boundary detectionThis is foundational work for proper emoji and CJK text handling in input buffers.