Skip to content

Conversation

@GreyElaina
Copy link

Adds Intl.Segmenter based grapheme cluster handling for proper Unicode text segmentation.

Changes

  • Add grapheme-segmenter.ts utility using Intl.Segmenter
  • Integrate with parse.keypress.ts for correct character boundary detection

This is foundational work for proper emoji and CJK text handling in input buffers.

Copilot AI review requested due to automatic review settings January 7, 2026 02:40
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Unicode grapheme cluster handling using Intl.Segmenter to properly detect character boundaries for emoji, CJK characters, and other complex Unicode sequences. This is foundational work for improving input handling in the terminal UI.

Key Changes

  • New grapheme-segmenter.ts utility module with polyfill support for Intl.Segmenter
  • Integration with parse.keypress.ts to replace simple length checks with proper grapheme cluster detection
  • Added @formatjs/intl-segmenter dependency for environments lacking native support

Reviewed changes

Copilot reviewed 3 out of 4 changed files in this pull request and generated 6 comments.

File Description
packages/core/src/lib/grapheme-segmenter.ts New utility module providing grapheme segmentation with polyfill fallback and singleton pattern for performance
packages/core/src/lib/parse.keypress.ts Updated keypress parser to use isSingleGrapheme() for proper Unicode character boundary detection
packages/core/package.json Added @formatjs/intl-segmenter polyfill dependency
bun.lock Lock file updates for new dependency and unrelated Vue package upgrades

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 1 to 85
let segmenter: Intl.Segmenter | null = null

if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") {
await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {})
}

export function getGraphemeSegmenter(): Intl.Segmenter {
if (segmenter) return segmenter

if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") {
segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" })
return segmenter
}

throw new Error(
"Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter",
)
}

export function isSingleGrapheme(s: string): boolean {
if (s.length === 0) return false
if (s.length === 1) return true

const first = s.charCodeAt(0)
if (first < 128) {
const second = s.charCodeAt(1)
if (second < 128) return false
}

const iter = getGraphemeSegmenter().segment(s)[Symbol.iterator]()
iter.next()
return iter.next().done === true
}

export function firstGrapheme(str: string): string {
if (str.length === 0) return ""
const firstCode = str.charCodeAt(0)
if (firstCode < 128) {
if (str.length === 1 || str.charCodeAt(1) < 128) return str[0]!
}

const segments = getGraphemeSegmenter().segment(str)
const first = segments[Symbol.iterator]().next()
return first.done ? "" : first.value.segment
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new grapheme-segmenter.ts module lacks test coverage. Given that this is foundational work for proper emoji and CJK text handling (as stated in the PR description), comprehensive tests should be added to verify:

  1. Single ASCII character detection
  2. Multi-byte emoji handling (e.g., 👨‍👩‍👧‍👦, 🏳️‍🌈)
  3. CJK character handling
  4. Combining diacritical marks (e.g., é composed of e + ́)
  5. Zero-width joiners and other special Unicode sequences
  6. Edge cases like empty strings and strings with mixed content

The existing pattern in this codebase shows that most modules have corresponding test files (e.g., parse.keypress.test.ts, RGBA.test.ts).

Copilot uses AI. Check for mistakes.
Comment on lines 8 to 39
if (segmenter) return segmenter

if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") {
segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" })
return segmenter
}

throw new Error(
"Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter",
)
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a potential race condition in the singleton pattern. If getGraphemeSegmenter() is called concurrently before the segmenter is initialized, multiple Intl.Segmenter instances could be created. While this is unlikely to cause functional issues (since they're functionally equivalent), it defeats the purpose of the singleton pattern.

Consider using a more robust initialization pattern, or document that this is acceptable for this use case.

Suggested change
if (segmenter) return segmenter
if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") {
segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" })
return segmenter
}
throw new Error(
"Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter",
)
if (!segmenter) {
if (typeof Intl !== "undefined" && typeof (Intl as any).Segmenter === "function") {
segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" })
} else {
throw new Error(
"Intl.Segmenter is not available. Please ensure your runtime supports it or install @formatjs/intl-segmenter",
)
}
}
return segmenter

Copilot uses AI. Check for mistakes.
if (s.length === 1) return true

const first = s.charCodeAt(0)
if (first < 128) {
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The optimization for ASCII characters (code < 128) is good, but there's a subtle issue: the check at line 26-27 assumes that if the second character is also ASCII, the string can't be a single grapheme. However, this doesn't account for the case where s.length might be 1 (which is already handled at line 22).

More importantly, this logic could be clearer. The fast path at line 22 already handles the common case of length 1. The ASCII check could be moved to be more explicit about what it's optimizing.

Suggested change
if (first < 128) {
// Fast path: if the first two code units are ASCII, they are necessarily
// two separate grapheme clusters, so the string can't be a single grapheme.
if (first < 128 && s.length >= 2) {

Copilot uses AI. Check for mistakes.
Comment on lines +302 to 304
} else if (isSingleGrapheme(s)) {
// Single grapheme: special characters, emoji, CJK, etc.
key.name = s
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The integration of isSingleGrapheme() in the keypress parser lacks test coverage for the new behavior. Tests should be added to verify that:

  1. Multi-byte emoji (e.g., "👨‍👩‍👧‍👦") are correctly identified as single graphemes
  2. CJK characters are properly handled
  3. Characters with combining marks (e.g., "é" as e + combining acute) work correctly
  4. The existing behavior for ASCII and special characters is preserved

This is particularly important since this change affects a critical part of input handling.

Copilot uses AI. Check for mistakes.
Comment on lines 3 to 6
if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") {
await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {})
}

Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Top-level await in a module can cause issues with module loading and bundling. This will make the module asynchronous, which means any file that imports from this module will also need to handle the asynchronous nature of the import.

Consider one of these alternatives:

  1. Move the polyfill loading into the getGraphemeSegmenter() function and make it async
  2. Use dynamic import only when needed (lazy loading)
  3. Use a synchronous initialization pattern with a flag to track if initialization has been attempted

The current approach could break compatibility with certain bundlers and runtimes that don't support top-level await.

Suggested change
if (typeof Intl === "undefined" || typeof (Intl as any).Segmenter !== "function") {
await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {})
}

Copilot uses AI. Check for mistakes.
Comment on lines 4 to 5
await import("@formatjs/intl-segmenter/polyfill-force.js").catch(() => {})
}
Copy link

Copilot AI Jan 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The polyfill import error is silently caught and ignored. If the polyfill fails to load and the runtime doesn't support Intl.Segmenter natively, subsequent calls to getGraphemeSegmenter() will throw an error with a potentially misleading message suggesting to "install @formatjs/intl-segmenter" even though it's already installed.

Consider logging the error or providing more context about why the polyfill failed to load. This will help with debugging in production environments.

Copilot uses AI. Check for mistakes.
- Add grapheme-segmenter.ts with getGraphemeSegmenter(), isSingleGrapheme(), firstGrapheme()
- Use @formatjs/intl-segmenter polyfill for environments without native support
- Update parse.keypress.ts to use isSingleGrapheme() for proper emoji/CJK detection
@GreyElaina GreyElaina force-pushed the core-grapheme-segmenter branch from a4e10bf to 318db8b Compare January 7, 2026 03:06
@GreyElaina
Copy link
Author

Superseded by #483 which now includes grapheme segmentation along with stdin-buffer improvements.

@GreyElaina GreyElaina closed this Jan 7, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant