gh-132121: Always escape non-printable characters in `pygettext` #132122

tomasr8 · 2025-04-05T12:56:39Z

Pretty much the title, all non-printable characters are always escaped regardless whether --escape is passed or not.

Issue: pygettext: Always escape non-printable characters regardless of the --escape option #132121

serhiy-storchaka

Could you please measure performance of escape_ascii() for non-ASCII strings of different length (10 to 100000)?

tomasr8 · 2025-04-05T16:55:13Z

This is on a PGO build:

main

n=10
        only printable: 0.0015040868893265724
        only non-printable: 0.001876825001090765
        mix (90% printable, 10% non-printable): 0.001469448208808899
        mix (50% printable, 50% non-printable): 0.0016370960511267185
n=100
        only printable: 0.010028427001088858
        only non-printable: 0.015033604810014367
        mix (90% printable, 10% non-printable): 0.010826029814779758
        mix (50% printable, 50% non-printable): 0.012500355951488018
n=1000
        only printable: 0.09287006920203567
        only non-printable: 0.14058534195646644
        mix (90% printable, 10% non-printable): 0.09682066598907113
        mix (50% printable, 50% non-printable): 0.11390575394034386
n=10000
        only printable: 0.9122045831754804
        only non-printable: 1.4113427978008986
        mix (90% printable, 10% non-printable): 0.9659982230514288
        mix (50% printable, 50% non-printable): 1.1028782678768039
n=100000
        only printable: 12.202279687859118
        only non-printable: 13.532787967007607
        mix (90% printable, 10% non-printable): 12.229883643100038
        mix (50% printable, 50% non-printable): 12.140505719929934

PR

n=10
        only printable: 0.0016840400639921427
        only non-printable: 0.008518028073012829
        mix (90% printable, 10% non-printable): 0.002215402899309993
        mix (50% printable, 50% non-printable): 0.0051144990138709545
n=100
        only printable: 0.012747738044708967
        only non-printable: 0.08456205087713897
        mix (90% printable, 10% non-printable): 0.02049739588983357
        mix (50% printable, 50% non-printable): 0.050314649008214474
n=1000
        only printable: 0.12691838503815234
        only non-printable: 0.8395978400949389
        mix (90% printable, 10% non-printable): 0.20092108193784952
        mix (50% printable, 50% non-printable): 0.4972433662042022
n=10000
        only printable: 1.2568156099878252
        only non-printable: 8.38266294496134
        mix (90% printable, 10% non-printable): 2.0531771059613675
        mix (50% printable, 50% non-printable): 4.823235515970737
n=100000
        only printable: 12.481141783064231
        only non-printable: 82.42004368198104
        mix (90% printable, 10% non-printable): 20.128975569969043
        mix (50% printable, 50% non-printable): 49.69893314293586

test code:

def test_escape_ascii(string):
    for _ in range(100):
        escape_ascii(string, 'utf-8')

make_escapes(True)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n=}")
    string = 'α' * n
    print(f"\tonly printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = '\302\200' * n
    print(f"\tonly non-printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * ((9*n)//10) + '\302\200' * (n//10)
    print(f"\tmix (90% printable, 10% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * (n//2) + '\302\200' * (n//2)
    print(f"\tmix (50% printable, 50% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")

Strings that only contain non-printable characters are much slower with this PR, though they should be quite uncommon. I also tested mixed strings (50% printable/non-printable and 90%/10% printable/non-printable) where the numbers look more reasonable but maybe there's a way to optimize the function?

serhiy-storchaka · 2025-04-05T17:33:41Z

Thanks. It is expected that non-printable characters should be slower, but non-ASCII non-printable characters should be extremely rare. I worried about printable characters. How is this in comparison with the current code?

The time grow looks linear, but this may be CPython specific optimization, on other implementations it may be quadratic.

I am asking because you changed the implementation to use += in a loop instead of str.join() with a generator -- this may be slower or faster, depending on data and Python implementation. It does not matter if the total time is instant, but if it is increased from seconds to minutes, we have a problem.

serhiy-storchaka · 2025-04-05T17:36:56Z

Ah, I see, you already tested it against main and PR. The difference is insignificant. Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

tomasr8 · 2025-04-05T19:04:18Z

Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

Will do, either today or tomorrow :)

tomasr8 · 2025-04-06T08:56:52Z

On PyPy 3.11 it regresses by quite a bit (I'll only include n=100000 as that's the interesting one, but I can give you numbers for all ns):

Main

n=100000
	only printable: 0.2000591829419136
	only non-printable: 0.39696668391115963
	mix (90% printable, 10% non-printable): 0.22258107410743833
	mix (50% printable, 50% non-printable): 0.31844047689810395

PR

n=100000
	only printable: 63.42232683789916
	only non-printable: 1022.779372680001
	mix (90% printable, 10% non-printable): 88.12281004199758
	mix (50% printable, 50% non-printable): 344.05332187889144

This is a pretty big regression. I rewrote the function back to a list comprehension and the numbers look much better:

PR (list comprehension)

n=100000
	only printable: 0.26949335308745503
	only non-printable: 2.3480193230789155
	mix (90% printable, 10% non-printable): 0.47037211689166725
	mix (50% printable, 50% non-printable): 1.307479243958369

def escape_ascii(s, encoding):
    return ''.join(escapes[ord(c)] if ord(c) < 128 else c
                   if c.isprintable() else escape_nonascii(c, encoding)
                   for c in s)

I'll update the PR to keep the list comprehension.

serhiy-storchaka · 2025-04-06T12:52:48Z

Thank you. So there was a difference.

BTW, there is an error in your benchmarks for non-printable characters. \302\200 is two characters, so the code processes twice more non-printable characters than printable ones. Don't worry, these results don't matter.

The changes in make_escapes() look unnecessary.

python-cla-bot · 2025-04-06T13:55:33Z

All commit authors signed the Contributor License Agreement.

tomasr8 · 2025-04-06T16:56:35Z

\302\200 is two characters, so the code processes twice more non-printable characters than printable ones.

Ah, you're right, thanks for pointing it out!

The changes in make_escapes() look unnecessary.

This change makes it so that the escapes list is always populated all the way to 256. It is needed because escape_ascii can now call escape_nonascii which relies on the escapes list having values all the way up to 256.

serhiy-storchaka · 2025-04-06T20:14:32Z

It is needed because escape_ascii can now call escape_nonascii which relies on the escapes list having values all the way up to 256.

Ah, good point!

LGTM. Thank you for your PR.

tomasr8 · 2025-04-06T20:55:22Z

Thanks for the review!

…ythonGH-132122)

tomasr8 added 2 commits April 5, 2025 14:40

Always escape non-printable characters

54c64a7

Add news entry

b9e37eb

bedevere-app bot added the awaiting review label Apr 5, 2025

bedevere-app bot mentioned this pull request Apr 5, 2025

pygettext: Always escape non-printable characters regardless of the --escape option #132121

Closed

Ap7972118447 approved these changes Apr 5, 2025

View reviewed changes

bedevere-app bot added awaiting core review and removed awaiting review labels Apr 5, 2025

tomasr8 requested a review from serhiy-storchaka April 5, 2025 13:57

serhiy-storchaka reviewed Apr 5, 2025

View reviewed changes

Ap7972118447 approved these changes Apr 5, 2025

View reviewed changes

Keep list comprehension

8c2150f

serhiy-storchaka approved these changes Apr 6, 2025

View reviewed changes

bedevere-app bot added awaiting merge and removed awaiting core review labels Apr 6, 2025

serhiy-storchaka merged commit a693eaa into python:main Apr 6, 2025
43 checks passed

bedevere-app bot removed the awaiting merge label Apr 6, 2025

tomasr8 deleted the pygettext-nonprintable branch April 6, 2025 20:55

seehwan pushed a commit to seehwan/cpython that referenced this pull request Apr 16, 2025

pythongh-132121: Always escape non-printable characters in pygettext (p…

c8fce85

…ythonGH-132122)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

gh-132121: Always escape non-printable characters in `pygettext` #132122

gh-132121: Always escape non-printable characters in `pygettext` #132122

Uh oh!

tomasr8 commented Apr 5, 2025 •

edited by bedevere-app bot

Loading

Uh oh!

serhiy-storchaka left a comment

Uh oh!

tomasr8 commented Apr 5, 2025

Uh oh!

serhiy-storchaka commented Apr 5, 2025

Uh oh!

serhiy-storchaka commented Apr 5, 2025

Uh oh!

tomasr8 commented Apr 5, 2025

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

serhiy-storchaka commented Apr 6, 2025

Uh oh!

python-cla-bot bot commented Apr 6, 2025

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

serhiy-storchaka commented Apr 6, 2025

Uh oh!

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

Uh oh!

Uh oh!

gh-132121: Always escape non-printable characters in pygettext #132122

gh-132121: Always escape non-printable characters in pygettext #132122

Uh oh!

Conversation

tomasr8 commented Apr 5, 2025 • edited by bedevere-app bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

serhiy-storchaka left a comment

Choose a reason for hiding this comment

Uh oh!

tomasr8 commented Apr 5, 2025

Uh oh!

serhiy-storchaka commented Apr 5, 2025

Uh oh!

serhiy-storchaka commented Apr 5, 2025

Uh oh!

tomasr8 commented Apr 5, 2025

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

serhiy-storchaka commented Apr 6, 2025

Uh oh!

python-cla-bot bot commented Apr 6, 2025

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

serhiy-storchaka commented Apr 6, 2025

Uh oh!

Uh oh!

tomasr8 commented Apr 6, 2025

Uh oh!

Uh oh!

gh-132121: Always escape non-printable characters in `pygettext` #132122

gh-132121: Always escape non-printable characters in `pygettext` #132122

tomasr8 commented Apr 5, 2025 •

edited by bedevere-app bot

Loading