-
-
Notifications
You must be signed in to change notification settings - Fork 112
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(web): demo page for current state of dictionary-based wordbreaking feature 🔬 #10973
base: epic/dict-breaker
Are you sure you want to change the base?
Conversation
User Test ResultsTest specification and instructions ERROR: user tests have not yet been defined Test Artifacts |
It appears that when hosted by TC, it's serving up the I wonder if renaming all the relevant files with .js instead would be enough... |
OK, that did the trick. TC needs any JS script to be |
@mhosken is not logged into GitHub but he gave me verbal feedback that this looks good. |
Just realised this is targeting beta but needs to be targeting master -- we won't be able to merge this in during B17S4 as it's too close to release. |
Please note that I milestoned it "Future." I never intended this to land in 17.0. I'll rebase to master once the current state of beta is merged into it. |
I've set the milestone to 18.0 (unless you really don't want it to merge until 19.0 or later?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes some sense to me
the long comment probably should go in a design doc?
let rootTraversal = model.traverseFromRoot(); | ||
assert.isDefined(rootTraversal); | ||
|
||
let smpA = smpForUnicode(0x1d5ba); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
U+1D5BA MATHEMATICAL SANS-SERIF SMALL A
nice..
let smpE = smpForUnicode(0x1d5be); | ||
|
||
// Just to be sure our utility function is working right. | ||
assert.equal(smpA + smpP + 'pl' + smpE, "𝖺𝗉pl𝖾"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a comment here that SMP chars are being used.
if (pos >= text.length) { | ||
return text.length; | ||
} else if (isStartOfSurrogatePair(text[pos])) { | ||
return pos + 2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if pos+2 is off the end of the buffer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fcn may not be needed with other suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what happens if pos+2 is off the end of the buffer?
This gives the index you'd provide to the end
param of a substring operation... though admittedly assuming it'll be followed up by the low surrogate, rather than checking.
It's been in place like this for a while, admittedly. Not that this fact excuses leaving things like this now that we've noticed...
// Whenever we have a space or a ZWNJ (U+200C), we'll assume a 100%-confirmed wordbreak | ||
// at that location. We only need to "guess" at anything between 'em. | ||
const sections = splitOnWhitespace(fullText); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this match @mhosken's "tent" model?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... I'm not sure what you mean by "tent" model. Could you elaborate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my recollection of the khmer word break improvements, some logic for detecting whether text was pre-broken. and it affected more than just text surrounding the space (hence a 'tent'). Might need to ask him
// - but... how to clear that cache on model change? | ||
// - duh! validate the passed-in root traversal. If unequal, is diff model. ezpz. | ||
|
||
return []; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return []; |
dead code?
9b580aa
to
8bfeeee
Compare
Done. |
8bfeeee
to
f18a5bf
Compare
…/models/wordbreakers/fuse-dict-unmatched-chars
…nmatched-chars' into feat/web/dict-breaker-demo-host
This PR currently provides a demo representing current efforts to resolve #5025. Significant progress has been made, but there's notably more to do before we can call the effort complete.
Known tasks that would still need to be done:
The new wordbreaker cannot yet be specified for use by any Developer-compiled lexical model
epic/user-dict
provides a great path forward - we can instantiate a Trie first, pass that to the breaker, then pass both off to the finalTrieModel
.If a typo occurs while typing a long word that has a prefix which is a valid shorter word, sometimes the wordbreaker will auto-break at the end of that prefix, meaning that we can no longer correct as expected.
basketball
, we typedbasker
.bask
is a perfectly valid English word.basker
is technically legal English but is likely quite rare.basker
is not available, we then end up withbask | er
, thus with corrections and predictions based oner
rather thanbasker
.This PR is devoted to the demo / testing-host page that supports a state in which the wordbreaker is usable outside of the worker, able to report the state of wordbreaking as context is edited. I aimed to create something like the Developer JS-keyboard test-host has for character map viewing. To do so means building some extra JS bundles in order to support it, and I'm not sure if these specific changes (for the specialized page) should ever fully land.
Anyway, about the page:
Test page link - currently bottom of the testing-index's second section, labeled "Demo / test-harness for dictionary-based wordbreaking". (Link matches commit 9b580aa.)
The test page allows retrieving any Keyman keyboard from the cloud while retrieving the lexical model file locally via file-selector. The lexical model will then be 'lightly hacked' - its backing data will be retrieved (via known private field) and fed to the prototype
dict
wordbreaker, regardless of whatever the model's actual wordbreaking setting is. (Note: no model is pre-loaded for the page.) Predictive-text itself is not activated, though.This page should help make assessment easier, allowing us to experiment with how it operates with the data from any existing lexical model. I've already tested it out with our English (MTNT) model quite a bit and found the results fairly enlightening; they've already led to a bug fix or two.
A few sample runs, using English:
But how well does it handle stuff not in the lexical-model?
Things not in the lexical model:
joshua
josh
is an entry. (I didn't edit it in, either...)i
/I
am
38
Notably, short words not in the model... naturally aren't available as reference points when doing word-breaking.
These facts combined make some behaviors a bit interesting - note how it acts when
and I am
(minus the spaces) is typed:and | ia
an | diam
(because "an" and "diam[ond]" starts to look like a real possibility)diam
is no longer the most recent thing in context, it reverts back toand | iam
-diam
isn't a word, after all.As for actual predictive-text use though... note that it's currently fairly impacted by any misspellings in recent context, possibly triggering undesiredly-early wordbreaks when typos occur. That's something that'll need to be addressed in some way.
Known issues:
LexiconTraversal
from the Trie, which (currently) isn't available until the Trie model is initialized.