Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(web): demo page for current state of dictionary-based wordbreaking feature 🔬 #10973

Draft
wants to merge 12 commits into
base: epic/dict-breaker
Choose a base branch
from

Conversation

jahorton
Copy link
Contributor

@jahorton jahorton commented Mar 11, 2024

This PR currently provides a demo representing current efforts to resolve #5025. Significant progress has been made, but there's notably more to do before we can call the effort complete.

⚠️ This page is intended as a demo of the intermediate state of this feature. We will probably NOT want to actually merge it, though. ⚠️

Known tasks that would still need to be done:

  • The new wordbreaker cannot yet be specified for use by any Developer-compiled lexical model

  • If a typo occurs while typing a long word that has a prefix which is a valid shorter word, sometimes the wordbreaker will auto-break at the end of that prefix, meaning that we can no longer correct as expected.

    • Suppose when typing basketball, we typed basker. bask is a perfectly valid English word. basker is technically legal English but is likely quite rare.
    • If basker is not available, we then end up with bask | er, thus with corrections and predictions based on er rather than basker.
    • So... how to mitigate and/or work around this issue?
    • We could probably adjust aspects of feat(common/models/wordbreakers): fuse adjacent unmatched characters when dictionary-breaking 🔬  #12141 to help as a first-pass attempt: "if unmatched and immediately pre-caret, merge with prior word".
      • Would likely be best to try both versions, though. Hmm. We don't currently have support for that sort of variability, though...

This PR is devoted to the demo / testing-host page that supports a state in which the wordbreaker is usable outside of the worker, able to report the state of wordbreaking as context is edited. I aimed to create something like the Developer JS-keyboard test-host has for character map viewing. To do so means building some extra JS bundles in order to support it, and I'm not sure if these specific changes (for the specialized page) should ever fully land.

Anyway, about the page:

image

Test page link - currently bottom of the testing-index's second section, labeled "Demo / test-harness for dictionary-based wordbreaking". (Link matches commit 9b580aa.)

The test page allows retrieving any Keyman keyboard from the cloud while retrieving the lexical model file locally via file-selector. The lexical model will then be 'lightly hacked' - its backing data will be retrieved (via known private field) and fed to the prototype dict wordbreaker, regardless of whatever the model's actual wordbreaking setting is. (Note: no model is pre-loaded for the page.) Predictive-text itself is not activated, though.

This page should help make assessment easier, allowing us to experiment with how it operates with the data from any existing lexical model. I've already tested it out with our English (MTNT) model quite a bit and found the results fairly enlightening; they've already led to a bug fix or two.

A few sample runs, using English:

dict-breaker sample 1

But how well does it handle stuff not in the lexical-model?

dict-breaker sample 2

Things not in the lexical model:

  • joshua
    • Note: in the built version I have locally, josh is an entry. (I didn't edit it in, either...)
  • i / I
  • am
  • 38

Notably, short words not in the model... naturally aren't available as reference points when doing word-breaking.

These facts combined make some behaviors a bit interesting - note how it acts when and I am (minus the spaces) is typed:

  • and | ia
  • an | diam (because "an" and "diam[ond]" starts to look like a real possibility)
  • When diam is no longer the most recent thing in context, it reverts back to and | iam - diam isn't a word, after all.

As for actual predictive-text use though... note that it's currently fairly impacted by any misspellings in recent context, possibly triggering undesiredly-early wordbreaks when typos occur. That's something that'll need to be addressed in some way.

Known issues:

  • I haven't linked in the search-term keyer yet.
    • Mapping "keyed" spans to their pre-keyed equivalents in the actual text doesn't sound trivial, though - leaving it out definitely helped with quicker prototyping.
  • The patterns for Trie model initialization and for dict-breaker specification aren't currently compatible.
    • The Trie model wants a fully-built wordbreaker at initialization time.
    • The dict-breaker wants the root LexiconTraversal from the Trie, which (currently) isn't available until the Trie model is initialized.

@jahorton jahorton added this to the Future milestone Mar 11, 2024
@keymanapp-test-bot keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Mar 11, 2024
@keymanapp-test-bot
Copy link

keymanapp-test-bot bot commented Mar 11, 2024

User Test Results

Test specification and instructions

ERROR: user tests have not yet been defined

Test Artifacts

@jahorton
Copy link
Contributor Author

It appears that when hosted by TC, it's serving up the .mjs script as application/octet-stream, which gets rejected by the browser. sigh

I wonder if renaming all the relevant files with .js instead would be enough...

@jahorton
Copy link
Contributor Author

OK, that did the trick. TC needs any JS script to be .js; it doesn't like .mjs file-extensions.

@mcdurdin
Copy link
Member

@mhosken @srl295 would like your feedback on this

@mcdurdin mcdurdin requested a review from srl295 March 11, 2024 14:47
@srl295
Copy link
Member

srl295 commented Mar 11, 2024

@mhosken @srl295 would like your feedback on this

On my todo to analyze the chain here

@mcdurdin
Copy link
Member

@mhosken is not logged into GitHub but he gave me verbal feedback that this looks good.

@mcdurdin
Copy link
Member

Just realised this is targeting beta but needs to be targeting master -- we won't be able to merge this in during B17S4 as it's too close to release.

@jahorton
Copy link
Contributor Author

Please note that I milestoned it "Future." I never intended this to land in 17.0. I'll rebase to master once the current state of beta is merged into it.

@mcdurdin mcdurdin modified the milestones: Future, 18.0 Mar 11, 2024
@mcdurdin
Copy link
Member

I've set the milestone to 18.0 (unless you really don't want it to merge until 19.0 or later?)

Copy link
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes some sense to me

the long comment probably should go in a design doc?

let rootTraversal = model.traverseFromRoot();
assert.isDefined(rootTraversal);

let smpA = smpForUnicode(0x1d5ba);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

U+1D5BA MATHEMATICAL SANS-SERIF SMALL A

nice..

let smpE = smpForUnicode(0x1d5be);

// Just to be sure our utility function is working right.
assert.equal(smpA + smpP + 'pl' + smpE, "𝖺𝗉pl𝖾");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a comment here that SMP chars are being used.

if (pos >= text.length) {
return text.length;
} else if (isStartOfSurrogatePair(text[pos])) {
return pos + 2;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if pos+2 is off the end of the buffer?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fcn may not be needed with other suggestion.

Copy link
Contributor Author

@jahorton jahorton Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what happens if pos+2 is off the end of the buffer?

This gives the index you'd provide to the end param of a substring operation... though admittedly assuming it'll be followed up by the low surrogate, rather than checking.

It's been in place like this for a while, admittedly. Not that this fact excuses leaving things like this now that we've noticed...

Comment on lines 99 to 101
// Whenever we have a space or a ZWNJ (U+200C), we'll assume a 100%-confirmed wordbreak
// at that location. We only need to "guess" at anything between 'em.
const sections = splitOnWhitespace(fullText);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this match @mhosken's "tent" model?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... I'm not sure what you mean by "tent" model. Could you elaborate?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my recollection of the khmer word break improvements, some logic for detecting whether text was pre-broken. and it affected more than just text surrounding the space (hence a 'tent'). Might need to ask him

// - but... how to clear that cache on model change?
// - duh! validate the passed-in root traversal. If unequal, is diff model. ezpz.

return [];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return [];

dead code?

@jahorton jahorton changed the base branch from beta to master March 27, 2024 06:26
@jahorton jahorton force-pushed the feat/web/dict-breaker-demo-host branch from 9b580aa to 8bfeeee Compare March 27, 2024 06:26
@jahorton
Copy link
Contributor Author

Just realised this is targeting beta but needs to be targeting master[...]

Done.

…/models/wordbreakers/fuse-dict-unmatched-chars
…nmatched-chars' into feat/web/dict-breaker-demo-host
Base automatically changed from feat/common/models/wordbreakers/fuse-dict-unmatched-chars to epic/dict-breaker August 27, 2024 03:16
@mcdurdin mcdurdin modified the milestones: 18.0, 19.0 Nov 27, 2024
@mcdurdin mcdurdin changed the title feat(web): adds demo page for current state of dictionary-based wordbreaking feature 🔬 feat(web): demo page for current state of dictionary-based wordbreaking feature 🔬 Nov 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: No status
Development

Successfully merging this pull request may close these issues.

feat(common/models/wordbreakers): dictionary-based word breaking
3 participants