Clicking the "Color" field in the card editor causes the "Pinyin" field's coloring to incorrectly change #55

Open
Jacobinski opened this issue Jan 15, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@Jacobinski

Describe the bug

For certain words, such as 宾馆 and 可能, the extension initially generates a correctly colored set of pinyin, but if the user clicks into the "Color" field of the card (maybe other fields reproduce this too?), the pinyin coloring incorrectly changes. Please find two demonstration videos attached below.

Screen.Recording.2024-01-15.at.2.47.53.PM.mov
Screen.Recording.2024-01-15.at.2.45.30.PM.mov

Reproduction

  1. Install the Chinese Support 3 add-on and disable all other add-ons. Perform the standard setup procedures from the README document in this GitHub repo to add a Chinese (Basic) card-type to your Anki.
  2. Create a new Chinese (Basic) card, enable Chinese Support V3, then copy-and-paste 可能 into the Hanzi field. It should auto-populate the other card fields.
  3. Expand the Pinyin HTML and verify that it is <span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->
  4. While having the Pinyin HTML expanded, click on the Color field and observe that the Pinyin HTML incorrectly changes to <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->

Expected behavior

The Pinyin does not change into an incorrect version of itself.

Specs

  • OS: macOS 14.2.1 (23C71)
  • Anki Version: 23.12.1 (1a1d4d54)
  • Chinese Support Version: v0.17.0
@Jacobinski (Author)

There is an interesting unit test in the code, test_issue_78, which seems to check for something similar to my issue. Here is the bug report associated with the test: jdlorimer/chinese-support-redux#78.

class FormatPinyin(Base):
    def test_issue_78(self):
        note = {'Hanzi': '壮观', 'Pinyin': 'zhuàngguān'}
        expected = (
            '<span class="tone4">zhuàng</span>'
            '<span class="tone1">guān</span> '
            '<!-- zhuang guan -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)

It's not clear to me that this was ever "correctly" fixed. Running a git blame on the test shows that it was added in jdlorimer/chinese-support-redux@7a88d51, which seems to be a hack to work around the issue.

@Jacobinski (Author)

This issue can be reproduced by adding the following new unit test to the class FormatPinyin(Base) in test_behavior.py.

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)
        reformat_transcript(note, 'pinyin', 'pinyin')
        self.assertEqual(note['Pinyin'], expected)

When run with $ make test, the test fails with the following error:

======================================================== FAILURES =========================================================
_______________________________________________ FormatPinyin.test_issue_55 ________________________________________________

self = <tests.test_behavior.FormatPinyin testMethod=test_issue_55>

    def test_issue_55(self):
        note = {'Hanzi': '可能', 'Pinyin': 'kěnéng'}
        expected = (
            '<span class="tone3">kě</span>'
            '<span class="tone2">néng</span> '
            '<!-- ke neng -->'
        )
        reformat_transcript(note, 'pinyin', 'pinyin')
>       self.assertEqual(note['Pinyin'], expected)
E       AssertionError: '<span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->' != '<span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->'
E       - <span class="tone3">kěn</span><span class="tone2">éng</span> <!-- ken eng -->
E       ?                       -                                              -
E       + <span class="tone3">kě</span><span class="tone2">néng</span> <!-- ke neng -->
E       ?                                                  +                  +

tests/test_behavior.py:60: AssertionError

@Jacobinski (Author) commented Jan 16, 2024

Another discovery:

If you add a new test for 可能 to class Transcribe(Base) in test_transcribe.py, the test suite passes.

(screenshot of the added Transcribe test; not preserved in this transcript)
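Based on the next comment, the added test was presumably along these lines (the method name is illustrative):

class Transcribe(Base):
    ...
    def test_keneng(self):
        self.assertEqual(transcribe(['可能'], 'pinyin', 'simp'), ['kě néng'])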

However, if you add a test for the same word to class SplitTranscript(Base), the test suite fails

(screenshot of the failing SplitTranscript test; not preserved in this transcript)
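Again based on the next comment, that test was presumably something like:

class SplitTranscript(Base):
    ...
    def test_keneng(self):
        self.assertEqual(split_transcript(' kěnéng', 'pinyin'), ['kě néng'])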

@Jacobinski (Author)

Based on the above comment: the transcribe(['可能'], 'pinyin', 'simp') == ['kě néng'] test passes, while the split_transcript(' kěnéng', 'pinyin') == ['kě néng'] test fails, which suggests that the issue lies in split_transcript.

def split_transcript(transcript, target, grouped=True):
    assert isinstance(transcript, str)
    if target not in ['pinyin', 'pinyin_tw', 'jyutping']:
        raise NotImplementedError(target)
    def _split(pattern, s):
        if search(f'^{pattern}$', s, IGNORECASE):
            return s
        remainder = s.replace("'", '')
        done = []
        while True:
            found = False
            for i in range(len(remainder), 0, -1):
                if search(f'^{pattern}$', remainder[:i], IGNORECASE):
                    done.append(remainder[:i])
                    remainder = remainder[i:]
                    found = True
                    break
            if found and remainder:
                continue
            elif remainder:
                done.append(remainder)
                break
            else:
                break
        return ' '.join(done)
    separated = []
    for text in split(NOT_PINYIN_REGEX, transcript):
        if target in ['pinyin', 'pinyin_tw']:
            text = _split(PINYIN_REGEX, text)
        elif target == 'jyutping':
            text = _split(JYUTPING_REGEX, text)
        if grouped:
            separated.append(text)
        else:
            separated.extend(text.split())
    return list(filter(lambda s: s.strip(), separated))

It seems like what this function does is attempt to use regular expressions to split the pinyin word kěnéng into multiple pieces based on initials and finals. Unfortunately, kěn éng and kě néng are both valid splits of the word, so it is up to luck whether the correct split is chosen for any given word.
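
To make the ambiguity concrete, here is a toy greedy splitter (a deliberate simplification of _split above, using a hard-coded syllable set instead of PINYIN_REGEX):

# Simplified illustration only; the real code matches PINYIN_REGEX against
# successively shorter prefixes, taking the longest match first.
SYLLABLES = {'ke', 'ken', 'ne', 'neng', 'eng'}  # tiny subset for the demo

def greedy_split(s):
    done = []
    while s:
        for i in range(len(s), 0, -1):  # try the longest prefix first
            if s[:i] in SYLLABLES:
                done.append(s[:i])
                s = s[i:]
                break
        else:
            done.append(s)  # no syllable matched; keep the remainder
            break
    return done

print(greedy_split('keneng'))  # ['ken', 'eng'] -- the wrong segmentation
# ['ke', 'neng'] is equally valid, and is what transcribe() produces from 可能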

Perhaps we can avoid attempting to split the pinyin based on a regular expression and instead always rely on the Hanzi field as a source of truth?

Jacobinski added a commit to Jacobinski/anki-chinese-support-3 that referenced this issue Jan 16, 2024
…e the transcription

The original transcription uses the hanzi as a source-of-truth, which will typically be more accurate than the `reformat_transcript()` function, which uses a regular-expression approach to splitting the pinyin.

Certain edge cases, such as 可能, will have the original pinyin correctly split into "kě néng", but the `reformat_transcript()` function will incorrectly change it to "kěn éng", as that is also a valid pinyin string. We attempt to avoid this issue by never changing the pinyin field if it matches the original transcription. If it does not match the original transcription, the user probably changed it intentionally, so running `reformat_transcript()` is desired.

See Gustaf-C#55 for more information.
Jacobinski added a commit to Jacobinski/anki-chinese-support-3 that referenced this issue Jan 16, 2024
Before this change, there are two different ways that the code interacts with
`pinyin` in a note:

  1. If the `pinyin` field is empty, it uses the `hanzi` field as a
     source-of-truth to generate the `pinyin` field.

  2. If the `pinyin` field is non-empty, it takes the contents of
     `pinyin` and runs `reformat_transcript()` on it. The idea of this
     function is that it will update (split, colorize, etc) the `pinyin`
     field with new information (that the user provides). This function
     does the splitting using a regular-expression.

We have observed a bug in this logic that occurs for some words, such as
可能 which initially see the correct pinyin populated ("kě néng") but
have this pinyin incorrectly change ("kěn éng") as a result of running
the `reformat_transcript()` function on them. This bug can occur for any
pinyin in which there are multiple acceptable regular expression splits.

Bug report: Gustaf-C#55

This commit attempts to put a bandaid over the issue by avoiding repopulating
the pinyin field for words if the user did not change the original hanzi
transcription. Unit tests and documentation are also included in the commit.
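
As context, a hedged sketch of the two pre-existing paths described in the commit message above. update_pinyin() and the import paths are illustrative assumptions, not the add-on's actual editor hook; transcribe() and reformat_transcript() are the functions shown earlier in this thread.

from chinese.behavior import reformat_transcript  # assumed module layout
from chinese.transcribe import transcribe

def update_pinyin(note):
    if not note['Pinyin']:
        # Path 1: empty field -- generate the pinyin from the hanzi.
        note['Pinyin'] = ' '.join(transcribe([note['Hanzi']], 'pinyin', 'simp'))
    else:
        # Path 2: non-empty field -- re-split and recolor whatever is there,
        # which is where the regex ambiguity described above bites.
        reformat_transcript(note, 'pinyin', 'pinyin')
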
@Jacobinski (Author)

This is actually kind of tricky to work around, since we cannot always rely on the hanzi field as a source of truth. In many cases there are multiple valid pinyin readings for a given hanzi (e.g. 大都 has two: dàdōu, "for the most part", and dàdū, "metropolis"), and if the extension chooses the wrong pinyin, the user should be able to overwrite it without the hanzi acting as a source of truth. And once the user overwrites it, we can no longer rely on the hanzi field as a source of truth.

I think the simplest way to work around the issue for now is to treat the hanzi as a source of truth unless the pinyin is changed by the user. That way reformat_transcript() will not overwrite the transcription, which avoids my reported bug. If the user changes the pinyin, the new value can act as the source of truth and we can run reformat_transcript() against it. This will still have issues for pinyin with multiple valid regular-expression splits, but it should reduce the number of occurrences of this bug. Additionally, users can manually put spaces in their pinyin to avoid the issue.

Please find my implementation of this fix in #56
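
For reference, a minimal sketch of that idea follows. It is an illustration only, not the actual code in #56; strip_pinyin_markup() is a hypothetical helper that removes the tone <span> tags and the trailing HTML comment, and the import paths are assumptions.

from chinese.behavior import reformat_transcript  # assumed module layout
from chinese.transcribe import transcribe

def maybe_reformat_pinyin(note):
    # Hanzi-derived transcription, e.g. transcribe(['可能'], 'pinyin', 'simp')
    # returns ['kě néng'] as shown earlier in this thread.
    original = ' '.join(transcribe([note['Hanzi']], 'pinyin', 'simp'))

    # strip_pinyin_markup() is hypothetical: it would drop the <span> tags and
    # the <!-- ... --> comment, leaving only the toned syllables.
    if strip_pinyin_markup(note['Pinyin']) == original:
        # The pinyin still matches the hanzi transcription, so the user has not
        # edited it; leave it alone instead of re-splitting it with regexes.
        return

    # The user changed the pinyin: treat their value as the source of truth.
    reformat_transcript(note, 'pinyin', 'pinyin')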

@Gustaf-C added the bug label Jan 18, 2024
BorisNA added a commit to BorisNA/anki-chinese-support-3 that referenced this issue Dec 19, 2024
* Add comment backlinking `test_issue_78` to original issue

* Avoid `reformat_transcript()` on original pinyin

Co-authored-by: Jacob Budzis <[email protected]>