feat(web): demo page for current state of dictionary-based wordbreaking feature 🔬 #10973

jahorton · 2024-03-11T08:10:27Z

This PR provides a demo of current efforts to resolve #5025. Significant progress has been made, but there is still notable work to do before we can call the effort complete.

⚠️ This page is intended as a demo of the intermediate state of this feature. We will probably NOT want to actually merge it, though. ⚠️

Known tasks that would still need to be done:

- The new wordbreaker cannot yet be specified for use by any Developer-compiled lexical model.
  - This is not straightforward: our existing Trie model pattern requires that the wordbreaker be built first... but this wordbreaker requires a pre-built Trie to be provided to its constructor.
  - Fortunately, PR refactor(common/models): split trie-model implementation into separate components 📖 #12080 on epic/user-dict provides a great path forward: we can instantiate a Trie first, pass that to the breaker, then pass both off to the final TrieModel.
- If a typo occurs while typing a long word whose prefix is a valid shorter word, the wordbreaker will sometimes auto-break at the end of that prefix, meaning that we can no longer correct as expected.
  - Suppose that when typing basketball, we typed basker. bask is a perfectly valid English word. basker is technically legal English but is likely quite rare.
  - If basker is not in the lexicon, we end up with bask | er, and thus with corrections and predictions based on er rather than basker.
  - So... how to mitigate and/or work around this issue?
  - We could probably adjust aspects of feat(common/models/wordbreakers): fuse adjacent unmatched characters when dictionary-breaking 🔬 #12141 to help as a first-pass attempt: "if unmatched and immediately pre-caret, merge with prior word".
    - Would likely be best to try both versions, though. Hmm. We don't currently have support for that sort of variability...
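
A minimal sketch of the "if unmatched and immediately pre-caret, merge with prior word" idea, for discussion only. The `Token` shape and the function name `mergeUnmatchedBeforeCaret` are hypothetical and not taken from #12141; the sketch also assumes the caret sits at the end of the context and that the last two tokens are directly adjacent (no whitespace between them).

```typescript
// Hypothetical token shape; the real wordbreaker's span type may differ.
interface Token {
  text: string;
  matched: boolean; // true when the token is an entry in the lexicon
}

// Assumes the caret is at the end of the context and that the final two
// tokens are adjacent in the original text.
function mergeUnmatchedBeforeCaret(tokens: Token[]): Token[] {
  if (tokens.length < 2) {
    return tokens.slice();
  }

  const last = tokens[tokens.length - 1];
  if (last.matched) {
    // The pre-caret token is a known word; leave the breaks alone.
    return tokens.slice();
  }

  const prior = tokens[tokens.length - 2];
  // The merged token is treated as unmatched so that correction can act on
  // the whole span (e.g. "basker") rather than only on its tail ("er").
  const merged: Token = { text: prior.text + last.text, matched: false };
  return [...tokens.slice(0, -2), merged];
}

const mergedTokens = mergeUnmatchedBeforeCaret([
  { text: 'bask', matched: true },
  { text: 'er', matched: false }
]);
```

With this input, the two tokens collapse into a single unmatched `basker` token. Whether an unconditional merge is desirable is exactly the open question above; it would also merge cases where the break was correct.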

This PR is devoted to the demo / testing-host page, which supports a state in which the wordbreaker is usable outside of the worker and can report the state of wordbreaking as context is edited. I aimed to create something like what the Developer JS-keyboard test-host has for character map viewing. Doing so requires building some extra JS bundles, and I'm not sure if these specific changes (for the specialized page) should ever fully land.

Anyway, about the page:

Test page link - currently bottom of the testing-index's second section, labeled "Demo / test-harness for dictionary-based wordbreaking". (Link matches commit 9b580aa.)

The test page allows retrieving any Keyman keyboard from the cloud while retrieving the lexical model file locally via file-selector. The lexical model will then be 'lightly hacked' - its backing data will be retrieved (via known private field) and fed to the prototype dict wordbreaker, regardless of whatever the model's actual wordbreaking setting is. (Note: no model is pre-loaded for the page.) Predictive-text itself is not activated, though.

This page should help make assessment easier, allowing us to experiment with how it operates with the data from any existing lexical model. I've already tested it out with our English (MTNT) model quite a bit and found the results fairly enlightening; they've already led to a bug fix or two.

A few sample runs, using English:

But how well does it handle stuff not in the lexical-model?

Things not in the lexical model:

- joshua
  - Note: in the built version I have locally, josh is an entry. (I didn't edit it in, either...)
- i / I
- am
- 38

Notably, short words not in the model... naturally aren't available as reference points when doing word-breaking.

These facts combined make some behaviors a bit interesting. Note how it acts when "and I am" (minus the spaces, i.e. andiam) is typed:

- and | ia
- an | diam (because "an" and "diam[ond]" starts to look like a real possibility)
- When diam is no longer the most recent thing in context, it reverts back to and | iam, since diam isn't a word, after all.

As for actual predictive-text use, though... note that it's currently noticeably affected by any misspellings in recent context, which can trigger undesirably early wordbreaks when typos occur. That's something that'll need to be addressed in some way.

Known issues:

- I haven't linked in the search-term keyer yet.
  - Mapping "keyed" spans to their pre-keyed equivalents in the actual text doesn't sound trivial, though - leaving it out definitely helped with quicker prototyping.
- The patterns for Trie model initialization and for dict-breaker specification aren't currently compatible.
  - The Trie model wants a fully-built wordbreaker at initialization time.
  - The dict-breaker wants the root LexiconTraversal from the Trie, which (currently) isn't available until the Trie model is initialized.
    - I have a pretty decent idea on how to resolve this, but it would need some work - both in the models department and in the model-compiler.
    - Again, refactor(common/models): split trie-model implementation into separate components 📖 #12080's changes should help here.
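
To make the ordering problem concrete, here is a rough sketch of the "Trie first, then breaker, then model" construction order. Every name here (`ToyTrie`, `makeDictBreaker`, `ToyTrieModel`) is a hypothetical stand-in, not the API from #12080. The lexicon is backed by a plain `Set` rather than a real trie, and the breaker is a naive greedy longest-match, not the prototype's algorithm.

```typescript
type WordBreaker = (text: string) => string[];

// Stand-in for the lexicon structure; a Set is enough to show the ordering.
class ToyTrie {
  private readonly words: Set<string>;
  readonly maxLength: number;

  constructor(entries: string[]) {
    this.words = new Set(entries);
    this.maxLength = entries.reduce((max, w) => Math.max(max, w.length), 0);
  }

  has(word: string): boolean {
    return this.words.has(word);
  }
}

// Step 2: the breaker needs the already-built lexicon.
function makeDictBreaker(lexicon: ToyTrie): WordBreaker {
  return (text: string): string[] => {
    const pieces: string[] = [];
    let pos = 0;
    while (pos < text.length) {
      // Try the longest candidate first; fall back to a single code unit.
      let len = Math.max(1, Math.min(lexicon.maxLength, text.length - pos));
      while (len > 1 && !lexicon.has(text.substring(pos, pos + len))) {
        len--;
      }
      pieces.push(text.substring(pos, pos + len));
      pos += len;
    }
    return pieces;
  };
}

// Step 3: the model receives both, so neither has to construct the other.
class ToyTrieModel {
  constructor(readonly lexicon: ToyTrie, readonly breaker: WordBreaker) {}
}

const toyLexicon = new ToyTrie(['bask', 'basket', 'ball']); // Step 1
const toyModel = new ToyTrieModel(toyLexicon, makeDictBreaker(toyLexicon));
const toyBreaks = toyModel.breaker('basketball');
```

The point is only that the model no longer owns construction of its lexicon, which removes the circular dependency described above.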

keymanapp-test-bot · 2024-03-11T08:10:32Z

User Test Results

Test specification and instructions

ERROR: user tests have not yet been defined

Test Artifacts

Linux

jahorton · 2024-03-11T09:09:26Z

It appears that when hosted by TC, it's serving up the .mjs script as application/octet-stream, which gets rejected by the browser. sigh

I wonder if renaming all the relevant files with .js instead would be enough...

jahorton · 2024-03-11T09:33:06Z

OK, that did the trick. TC needs any JS script to be .js; it doesn't like .mjs file-extensions.

mcdurdin · 2024-03-11T14:46:49Z

@mhosken @srl295 would like your feedback on this

srl295 · 2024-03-11T14:47:51Z

> @mhosken @srl295 would like your feedback on this

On my todo to analyze the chain here

mcdurdin · 2024-03-11T14:54:40Z

@mhosken is not logged into GitHub but he gave me verbal feedback that this looks good.

mcdurdin · 2024-03-11T14:55:54Z

Just realised this is targeting beta but needs to be targeting master -- we won't be able to merge this in during B17S4 as it's too close to release.

jahorton · 2024-03-11T15:20:13Z

Please note that I milestoned it "Future." I never intended this to land in 17.0. I'll rebase to master once the current state of beta is merged into it.

mcdurdin · 2024-03-11T15:48:16Z

I've set the milestone to 18.0 (unless you really don't want it to merge until 19.0 or later?)

srl295

makes some sense to me

the long comment probably should go in a design doc?

srl295 · 2024-03-12T02:13:33Z

common/models/templates/test/test-trie-traversal.js

+    let rootTraversal = model.traverseFromRoot();
+    assert.isDefined(rootTraversal);
+
+    let smpA = smpForUnicode(0x1d5ba);


U+1D5BA MATHEMATICAL SANS-SERIF SMALL A

nice..

srl295 · 2024-03-12T02:15:31Z

common/models/templates/test/test-trie-traversal.js

+    let smpE = smpForUnicode(0x1d5be);
+
+    // Just to be sure our utility function is working right.
+    assert.equal(smpA + smpP + 'pl' + smpE, "𝖺𝗉pl𝖾");


I'd add a comment here that SMP chars are being used.
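
For readers unfamiliar with SMP handling, a small sketch of what a helper like `smpForUnicode` presumably does. Its actual implementation isn't shown in this excerpt, so `smpFromCodePoint` below is a hypothetical stand-in. It builds the UTF-16 surrogate pair for a supplementary-plane code point by hand.

```typescript
// Hypothetical stand-in for the test utility; not the PR's implementation.
function smpFromCodePoint(codePoint: number): string {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('Not a supplementary-plane code point');
  }

  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits -> low surrogate
  return String.fromCharCode(high, low);
}

// U+1D5BA MATHEMATICAL SANS-SERIF SMALL A occupies two UTF-16 code units.
const sketchSmpA = smpFromCodePoint(0x1d5ba);
```

This is why tests over such strings need care: `sketchSmpA.length` is 2 even though it is a single character to the user.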

srl295 · 2024-03-12T02:16:37Z

common/models/wordbreakers/src/default/index.ts

+  if (pos >= text.length) {
+    return text.length;
+  } else if (isStartOfSurrogatePair(text[pos])) {
+    return pos + 2;


what happens if pos+2 is off the end of the buffer?

fcn may not be needed with other suggestion.

> what happens if pos+2 is off the end of the buffer?

This gives the index you'd provide to the end param of a substring operation... though admittedly assuming it'll be followed up by the low surrogate, rather than checking.

It's been in place like this for a while, admittedly. Not that this fact excuses leaving things like this now that we've noticed...
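
One possible shape for a version that checks rather than assumes, as a sketch only. The function and helper names are hypothetical, and it works on code units rather than on `text[pos]` as the existing code does. It only advances by two when a low surrogate really follows, so the result never exceeds `text.length`.

```typescript
function isHighSurrogate(codeUnit: number): boolean {
  return codeUnit >= 0xd800 && codeUnit <= 0xdbff;
}

function isLowSurrogate(codeUnit: number): boolean {
  return codeUnit >= 0xdc00 && codeUnit <= 0xdfff;
}

// Returns the index just past the code point starting at `pos`, suitable as
// the `end` argument of substring().  Hypothetical name and signature.
function nextCodePointBoundary(text: string, pos: number): number {
  if (pos >= text.length) {
    return text.length;
  }

  const first = text.charCodeAt(pos);
  const hasFollowingLow =
    pos + 1 < text.length && isLowSurrogate(text.charCodeAt(pos + 1));

  // An unpaired high surrogate is treated as a single (malformed) unit.
  return isHighSurrogate(first) && hasFollowingLow ? pos + 2 : pos + 1;
}
```

A lone high surrogate at the very end of the buffer then yields `pos + 1`, which stays in bounds.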


srl295 · 2024-03-12T02:22:04Z

common/models/wordbreakers/src/dict/index.ts

+  // Whenever we have a space or a ZWNJ (U+200C), we'll assume a 100%-confirmed wordbreak
+  // at that location.  We only need to "guess" at anything between 'em.
+  const sections = splitOnWhitespace(fullText);


does this match @mhosken's "tent" model?

... I'm not sure what you mean by "tent" model. Could you elaborate?

My recollection is that the Khmer word-break improvements included some logic for detecting whether text was pre-broken, and it affected more than just the text surrounding the space (hence a 'tent'). Might need to ask him.
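
For reference, a sketch of the "confirmed break" split that the diff comment describes. The body of `splitOnWhitespace` isn't shown in this excerpt, so the `Section` shape and `splitOnConfirmedBreaks` below are hypothetical: whitespace and ZWNJ (U+200C) are treated as certain boundaries, and each remaining run is returned with its start offset so that later guesses can be mapped back to the full text.

```typescript
// Hypothetical result shape; the PR's actual return type may differ.
interface Section {
  text: string;
  start: number; // offset of the section within the full text
}

function splitOnConfirmedBreaks(fullText: string): Section[] {
  const sections: Section[] = [];
  // Runs of anything that is neither whitespace nor ZWNJ (U+200C).
  const pattern = /[^\s\u200c]+/g;

  let match: RegExpExecArray | null;
  while ((match = pattern.exec(fullText)) !== null) {
    sections.push({ text: match[0], start: match.index });
  }
  return sections;
}

const sketchSections = splitOnConfirmedBreaks('and iam\u200cok');
```

Only the text inside each section would then need dictionary-based guessing. This sketch does nothing "tent"-like: a confirmed break here has no influence on neighbouring guesses.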

srl295 · 2024-03-12T02:25:48Z

common/models/wordbreakers/src/dict/index.ts

+  //     - but... how to clear that cache on model change?
+  //       - duh!  validate the passed-in root traversal.  If unequal, is diff model.  ezpz.
+
+  return [];


Suggested change

return [];

dead code?
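
On the caching idea in that design comment ("validate the passed-in root traversal. If unequal, is diff model"), a sketch of how that check could look. `BreakCache` and its methods are hypothetical; the root traversal is treated as an opaque object and compared by reference.

```typescript
// Hypothetical cache; the root traversal is opaque and compared by identity.
class BreakCache {
  private root: object | null = null;
  private readonly entries = new Map<string, string[]>();

  get(root: object, text: string): string[] | undefined {
    this.validate(root);
    return this.entries.get(text);
  }

  set(root: object, text: string, breaks: string[]): void {
    this.validate(root);
    this.entries.set(text, breaks);
  }

  // A different root traversal means a different model: drop everything.
  private validate(root: object): void {
    if (this.root !== root) {
      this.root = root;
      this.entries.clear();
    }
  }
}

const sketchCache = new BreakCache();
const sketchRootA: object = {};
const sketchRootB: object = {};
sketchCache.set(sketchRootA, 'andiam', ['and', 'iam']);
```

Looking up `'andiam'` with `sketchRootA` returns the cached breaks; looking it up with `sketchRootB` clears the cache and returns `undefined`. Identity comparison assumes the model hands out a stable root traversal object, which would need confirming.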

jahorton · 2024-03-27T06:27:34Z

> Just realised this is targeting beta but needs to be targeting master[...]

Done.


jahorton added this to the Future milestone Mar 11, 2024

keymanapp-test-bot bot added the user-test-missing User tests have not yet been defined for the PR label Mar 11, 2024

github-actions bot added web/ common/ common/models/ common/web/ common/models/types/ common/models/templates/ common/models/wordbreakers/ feat labels Mar 11, 2024

mcdurdin requested a review from srl295 March 11, 2024 14:47

mcdurdin modified the milestones: Future, 18.0 Mar 11, 2024

srl295 reviewed Mar 12, 2024

jahorton changed the base branch from beta to master March 27, 2024 06:26

jahorton force-pushed the feat/web/dict-breaker-demo-host branch from 9b580aa to 8bfeeee Compare March 27, 2024 06:26

jahorton mentioned this pull request Mar 27, 2024

feat(common/models): add Trie string-encoding + decoding methods 💾 #11088

Merged

jahorton force-pushed the feat/web/dict-breaker-demo-host branch from 8bfeeee to f18a5bf Compare June 25, 2024 05:18

This was referenced Jun 25, 2024

refactor(common/models): move TS priority-queue implementation to web-utils 📚 #11867

Merged

feat(web): provide lexicon probabilities directly on the search path 📚 #11868

Merged

github-actions bot added common/models/templates/ common/models/wordbreakers/ common/web/ and removed common/web/ common/models/templates/ common/models/wordbreakers/ labels Aug 9, 2024

chore(common/models): design comment cleanup

c48941e

github-actions bot added common/models/templates/ common/web/ and removed common/web/ common/models/templates/ labels Aug 9, 2024

keymanapp-test-bot bot added the epic-dict-breaker label Aug 9, 2024

jahorton added 2 commits August 23, 2024 10:01

chore(common/models/wordbreakers): Merge base branch into feat/common…

cd0e73c

…/models/wordbreakers/fuse-dict-unmatched-chars

chore(web): Merge branch 'feat/common/models/wordbreakers/fuse-dict-u…

4cc6f4a

…nmatched-chars' into feat/web/dict-breaker-demo-host

github-actions bot added windows/ android/ developer/ linux/ developer/compilers/ common/resources/ Build infrastructure android/engine/ android/app/ core/ Keyman Core developer/ide/ windows/config/ docs labels Aug 23, 2024

Base automatically changed from feat/common/models/wordbreakers/fuse-dict-unmatched-chars to epic/dict-breaker August 27, 2024 03:16

mcdurdin modified the milestones: 18.0, 19.0 Nov 27, 2024

mcdurdin changed the title ~~feat(web): adds demo page for current state of dictionary-based wordbreaking feature 🔬~~ feat(web): demo page for current state of dictionary-based wordbreaking feature 🔬 Nov 27, 2024
