Clicking the "Color" field in the card editor causes the "Pinyin" field's coloring to incorrectly change #55

Open
Jacobinski opened this issue Jan 15, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Jacobinski

Describe the bug

For certain words, such as 宾馆 and 可能, the extension initially generates a correctly colored set of pinyin, but if the user clicks into the "Color" field of the card (maybe other fields reproduce this too?), the pinyin coloring incorrectly changes. Please find two demonstration videos attached below.

Screen.Recording.2024-01-15.at.2.47.53.PM.mov
Screen.Recording.2024-01-15.at.2.45.30.PM.mov

Reproduction

  1. Install the Chinese Support 3 add-on and disable all other add-ons. Perform the standard setup procedures from the README document in this GitHub repo to add a Chinese (Basic) card-type to your Anki.
  2. Create a new Chinese (Basic) card, enable Chinese Support V3, then copy-and-paste 可能 into the Hanzi field. It should auto-populate the other card fields.
  3. Expand the Pinyin HTML and verify that it is <span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->
  4. While having the Pinyin HTML expanded, click on the Color field and observe that the Pinyin HTML incorrectly changes to <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->

Expected behavior

The Pinyin does not change into an incorrect version of itself.

Specs

  • OS: macOS 14.2.1 (23C71)
  • Anki Version: 23.12.1 (1a1d4d54)
  • Chinese Support Version: v0.17.0
@Jacobinski (Author)

There is an interesting unit test in the code, test_issue_78, which seems to check for something similar to my issue. Here is the bug report associated with the test: jdlorimer/chinese-support-redux#78.

class FormatPinyin(Base):
    def test_issue_78(self):
        note = {'Hanzi': '壮观', 'Pinyin': 'zhuàngguān'}
        expected = (
            '<span class="tone4">zhuàng</span>'
            '<span class="tone1">guān</span> '
            '<!-- zhuang guan -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)

It's not clear to me that this was ever "correctly" fixed. Running a git blame on the test shows that it was added in jdlorimer/chinese-support-redux@7a88d51, which seems to be a hack to work around the issue.

@Jacobinski (Author)

This issue can be reproduced by adding the following new unit test to the class FormatPinyin(Base) in test_behavior.py.

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)

When run with $ make test, the test fails with the following error:

======================================================== FAILURES =========================================================
_______________________________________________ FormatPinyin.test_issue_55 ________________________________________________

self = <tests.test_behavior.FormatPinyin testMethod=test_issue_55>

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
>       self.assertEqual(note['Pinyin'], expected)
E       AssertionError: '<span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->' != '<span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->'
E       - <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->
E       ?                       -                                              -
E       + <span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->
E       ?                                                  +                  +

tests/test_behavior.py:60: AssertionError

@Jacobinski (Author) commented Jan 16, 2024

Another discovery:

If you add a new test for 可能 to class Transcribe(Base) in test_transcribe.py, the test suite passes.

(screenshot of the added Transcribe test; not preserved in this transcript)
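Based on the next comment, the added test was presumably along these lines (the method name is illustrative):

class Transcribe(Base):
    ...
    def test_keneng(self):
        self.assertEqual(transcribe(['可能'], 'pinyin', 'simp'), ['kě néng'])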

However, if you add a test for the same word to class SplitTranscript(Base), the test suite fails

(screenshot of the failing SplitTranscript test; not preserved in this transcript)
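Again based on the next comment, that test was presumably something like:

class SplitTranscript(Base):
    ...
    def test_keneng(self):
        self.assertEqual(split_transcript(' kěnéng', 'pinyin'), ['kě néng'])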

@Jacobinski (Author)

Based on the above comment: the transcribe(['可能'], 'pinyin', 'simp') == ['kě néng'] test passes, while the split_transcript(' kěnéng', 'pinyin') == ['kě néng'] test fails, which suggests that the issue lies in split_transcript.

def split_transcript(transcript, target, grouped=True):
    assert isinstance(transcript, str)
    if target not in ['pinyin', 'pinyin_tw', 'jyutping']:
        raise NotImplementedError(target)
    def _split(pattern, s):
        if search(f'^{pattern}$', s, IGNORECASE):
            return s
        remainder = s.replace("'", '')
        done = []
        while True:
            found = False
            for i in range(len(remainder), 0, -1):
                if search(f'^{pattern}$', remainder[:i], IGNORECASE):
                    done.append(remainder[:i])
                    remainder = remainder[i:]
                    found = True
                    break
            if found and remainder:
                continue
            elif remainder:
                done.append(remainder)
                break
            else:
                break
        return ' '.join(done)
    separated = []
    for text in split(NOT_PINYIN_REGEX, transcript):
        if target in ['pinyin', 'pinyin_tw']:
            text = _split(PINYIN_REGEX, text)
        elif target == 'jyutping':
            text = _split(JYUTPING_REGEX, text)
        if grouped:
            separated.append(text)
        else:
            separated.extend(text.split())
    return list(filter(lambda s: s.strip(), separated))

It seems like what this function does is attempt to use regular expressions to split the pinyin word kěnéng into multiple pieces based on initials and finals. Unfortunately, kěn éng and kě néng are both valid splits of the word, so it is up to luck whether the correct split is chosen for any given word.
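
To make the ambiguity concrete, here is a toy greedy splitter (a deliberate simplification of _split above, using a hard-coded syllable set instead of PINYIN_REGEX):

# Simplified illustration only; the real code matches PINYIN_REGEX against
# successively shorter prefixes, taking the longest match first.
SYLLABLES = {'ke', 'ken', 'ne', 'neng', 'eng'}  # tiny subset for the demo

def greedy_split(s):
    done = []
    while s:
        for i in range(len(s), 0, -1):  # try the longest prefix first
            if s[:i] in SYLLABLES:
                done.append(s[:i])
                s = s[i:]
                break
        else:
            done.append(s)  # no syllable matched; keep the remainder
            break
    return done

print(greedy_split('keneng'))  # ['ken', 'eng'] -- the wrong segmentation
# ['ke', 'neng'] is equally valid, and is what transcribe() produces from 可能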

Perhaps we can avoid attempting to split the pinyin based on a regular expression and instead always rely on the Hanzi field as a source of truth?

Jacobinski added a commit to Jacobinski/anki-chinese-support-3 that referenced this issue Jan 16, 2024
…e the transcription

The original transcription uses the hanzi as a source-of-truth, which will typically be more accurate than the `reformat_transcript()` function, which uses a regular-expression approach to splitting the pinyin.

Certain edge cases, such as 可能, will have the original pinyin correctly split into "kě néng", but the `reformat_transcript()` function will incorrectly change it to "kěn éng", as that is also a valid pinyin string. We attempt to avoid this issue by never changing the pinyin field if it matches the original transcription. If it does not match the original transcription, the user probably changed it intentionally, so running `reformat_transcript()` is desired.

See Gustaf-C#55 for more information.
Jacobinski added a commit to Jacobinski/anki-chinese-support-3 that referenced this issue Jan 16, 2024
Before this change, there are two different ways that the code interacts with
`pinyin` in a note:

  1. If the `pinyin` field is empty, it uses the `hanzi` field as a
     source-of-truth to generate the `pinyin` field.

  2. If the `pinyin` field is non-empty, it takes the contents of
     `pinyin` and runs `reformat_transcript()` on it. The idea of this
     function is that it will update (split, colorize, etc) the `pinyin`
     field with new information (that the user provides). This function
     does the splitting using a regular-expression.

We have observed a bug in this logic that occurs for some words, such as
可能 which initially see the correct pinyin populated ("kě néng") but
have this pinyin incorrectly change ("kěn éng") as a result of running
the `reformat_transcript()` function on them. This bug can occur for any
pinyin in which there are multiple acceptable regular expression splits.

Bug report: Gustaf-C#55

This commit attempts to put a bandaid over the issue by avoiding repopulating
the pinyin field for words if the user did not change the original hanzi
transcription. Unit tests and documentation are also included in the commit.
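
As context, a hedged sketch of the two pre-existing paths described in the commit message above. update_pinyin() and the import paths are illustrative assumptions, not the add-on's actual editor hook; transcribe() and reformat_transcript() are the functions shown earlier in this thread.

from chinese.behavior import reformat_transcript  # assumed module layout
from chinese.transcribe import transcribe

def update_pinyin(note):
    if not note['Pinyin']:
        # Path 1: empty field -- generate the pinyin from the hanzi.
        note['Pinyin'] = ' '.join(transcribe([note['Hanzi']], 'pinyin', 'simp'))
    else:
        # Path 2: non-empty field -- re-split and recolor whatever is there,
        # which is where the regex ambiguity described above bites.
        reformat_transcript(note, 'pinyin', 'pinyin')
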
@Jacobinski (Author)

This is actually kind of tricky to work around, since we cannot always rely on the hanzi field as a source of truth. In many cases there are multiple valid pinyin readings for a given hanzi (e.g. 大都 has two: dàdōu, "for the most part", and dàdū, "metropolis"), and if the extension chooses the wrong pinyin, the user should be able to overwrite it without the hanzi acting as a source of truth. And once the user overwrites it, we can no longer rely on the hanzi field as a source of truth.

I think the simplest way to work around the issue for now is to treat the hanzi as a source of truth unless the pinyin is changed by the user. That way reformat_transcript() will not overwrite the transcription, which avoids my reported bug. If the user changes the pinyin, the new value can act as the source of truth and we can run reformat_transcript() against it. This will still have issues for pinyin with multiple valid regular-expression splits, but it should reduce the number of occurrences of this bug. Additionally, users can manually put spaces in their pinyin to avoid the issue.

Please find my implementation of this fix in #56
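
For reference, a minimal sketch of that idea follows. It is an illustration only, not the actual code in #56; strip_pinyin_markup() is a hypothetical helper that removes the tone <span> tags and the trailing HTML comment, and the import paths are assumptions.

from chinese.behavior import reformat_transcript  # assumed module layout
from chinese.transcribe import transcribe

def maybe_reformat_pinyin(note):
    # Hanzi-derived transcription, e.g. transcribe(['可能'], 'pinyin', 'simp')
    # returns ['kě néng'] as shown earlier in this thread.
    original = ' '.join(transcribe([note['Hanzi']], 'pinyin', 'simp'))

    # strip_pinyin_markup() is hypothetical: it would drop the <span> tags and
    # the <!-- ... --> comment, leaving only the toned syllables.
    if strip_pinyin_markup(note['Pinyin']) == original:
        # The pinyin still matches the hanzi transcription, so the user has not
        # edited it; leave it alone instead of re-splitting it with regexes.
        return

    # The user changed the pinyin: treat their value as the source of truth.
    reformat_transcript(note, 'pinyin', 'pinyin')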

@Gustaf-C added the bug label Jan 18, 2024
BorisNA added a commit to BorisNA/anki-chinese-support-3 that referenced this issue Dec 19, 2024
* Add comment backlinking `test_issue_78` to original issue

* Avoid `reformat_transcript()` on original pinyin

Co-authored-by: Jacob Budzis <[email protected]>