Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-132121: Always escape non-printable characters in pygettext #132122

Merged
merged 3 commits into from
Apr 6, 2025

Conversation

tomasr8
Copy link
Member

@tomasr8 tomasr8 commented Apr 5, 2025

Pretty much the title, all non-printable characters are always escaped regardless whether --escape is passed or not.

Copy link
Member

@serhiy-storchaka serhiy-storchaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please measure performance of escape_ascii() for non-ASCII strings of different length (10 to 100000)?

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 5, 2025

This is on a PGO build:

main

n=10
        only printable: 0.0015040868893265724
        only non-printable: 0.001876825001090765
        mix (90% printable, 10% non-printable): 0.001469448208808899
        mix (50% printable, 50% non-printable): 0.0016370960511267185
n=100
        only printable: 0.010028427001088858
        only non-printable: 0.015033604810014367
        mix (90% printable, 10% non-printable): 0.010826029814779758
        mix (50% printable, 50% non-printable): 0.012500355951488018
n=1000
        only printable: 0.09287006920203567
        only non-printable: 0.14058534195646644
        mix (90% printable, 10% non-printable): 0.09682066598907113
        mix (50% printable, 50% non-printable): 0.11390575394034386
n=10000
        only printable: 0.9122045831754804
        only non-printable: 1.4113427978008986
        mix (90% printable, 10% non-printable): 0.9659982230514288
        mix (50% printable, 50% non-printable): 1.1028782678768039
n=100000
        only printable: 12.202279687859118
        only non-printable: 13.532787967007607
        mix (90% printable, 10% non-printable): 12.229883643100038
        mix (50% printable, 50% non-printable): 12.140505719929934

PR

n=10
        only printable: 0.0016840400639921427
        only non-printable: 0.008518028073012829
        mix (90% printable, 10% non-printable): 0.002215402899309993
        mix (50% printable, 50% non-printable): 0.0051144990138709545
n=100
        only printable: 0.012747738044708967
        only non-printable: 0.08456205087713897
        mix (90% printable, 10% non-printable): 0.02049739588983357
        mix (50% printable, 50% non-printable): 0.050314649008214474
n=1000
        only printable: 0.12691838503815234
        only non-printable: 0.8395978400949389
        mix (90% printable, 10% non-printable): 0.20092108193784952
        mix (50% printable, 50% non-printable): 0.4972433662042022
n=10000
        only printable: 1.2568156099878252
        only non-printable: 8.38266294496134
        mix (90% printable, 10% non-printable): 2.0531771059613675
        mix (50% printable, 50% non-printable): 4.823235515970737
n=100000
        only printable: 12.481141783064231
        only non-printable: 82.42004368198104
        mix (90% printable, 10% non-printable): 20.128975569969043
        mix (50% printable, 50% non-printable): 49.69893314293586

test code:

def test_escape_ascii(string):
    for _ in range(100):
        escape_ascii(string, 'utf-8')

make_escapes(True)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"{n=}")
    string = 'α' * n
    print(f"\tonly printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = '\302\200' * n
    print(f"\tonly non-printable: {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * ((9*n)//10) + '\302\200' * (n//10)
    print(f"\tmix (90% printable, 10% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")
    string = 'α' * (n//2) + '\302\200' * (n//2)
    print(f"\tmix (50% printable, 50% non-printable): {timeit.timeit(partial(test_escape_ascii, string), number=10)}")

Strings that only contain non-printable characters are much slower with this PR, though they should be quite uncommon. I also tested mixed strings (50% printable/non-printable and 90%/10% printable/non-printable) where the numbers look more reasonable but maybe there's a way to optimize the function?

@serhiy-storchaka
Copy link
Member

Thanks. It is expected that non-printable characters should be slower, but non-ASCII non-printable characters should be extremely rare. I worried about printable characters. How is this in comparison with the current code?

The time grow looks linear, but this may be CPython specific optimization, on other implementations it may be quadratic.

I am asking because you changed the implementation to use += in a loop instead of str.join() with a generator -- this may be slower or faster, depending on data and Python implementation. It does not matter if the total time is instant, but if it is increased from seconds to minutes, we have a problem.

@serhiy-storchaka
Copy link
Member

Ah, I see, you already tested it against main and PR. The difference is insignificant. Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 5, 2025

Could you please test it on PyPy? Just to be sure that the worst case is not so bad.

Will do, either today or tomorrow :)

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 6, 2025

On PyPy 3.11 it regresses by quite a bit (I'll only include n=100000 as that's the interesting one, but I can give you numbers for all ns):

Main

n=100000
	only printable: 0.2000591829419136
	only non-printable: 0.39696668391115963
	mix (90% printable, 10% non-printable): 0.22258107410743833
	mix (50% printable, 50% non-printable): 0.31844047689810395

PR

n=100000
	only printable: 63.42232683789916
	only non-printable: 1022.779372680001
	mix (90% printable, 10% non-printable): 88.12281004199758
	mix (50% printable, 50% non-printable): 344.05332187889144

This is a pretty big regression. I rewrote the function back to a list comprehension and the numbers look much better:

PR (list comprehension)

n=100000
	only printable: 0.26949335308745503
	only non-printable: 2.3480193230789155
	mix (90% printable, 10% non-printable): 0.47037211689166725
	mix (50% printable, 50% non-printable): 1.307479243958369
def escape_ascii(s, encoding):
    return ''.join(escapes[ord(c)] if ord(c) < 128 else c
                   if c.isprintable() else escape_nonascii(c, encoding)
                   for c in s)

I'll update the PR to keep the list comprehension.

@serhiy-storchaka
Copy link
Member

Thank you. So there was a difference.

BTW, there is an error in your benchmarks for non-printable characters. \302\200 is two characters, so the code processes twice more non-printable characters than printable ones. Don't worry, these results don't matter.

The changes in make_escapes() look unnecessary.

@python-cla-bot
Copy link

python-cla-bot bot commented Apr 6, 2025

All commit authors signed the Contributor License Agreement.

CLA signed

@tomasr8
Copy link
Member Author

tomasr8 commented Apr 6, 2025

\302\200 is two characters, so the code processes twice more non-printable characters than printable ones.

Ah, you're right, thanks for pointing it out!

The changes in make_escapes() look unnecessary.

This change makes it so that the escapes list is always populated all the way to 256. It is needed because escape_ascii can now call escape_nonascii which relies on the escapes list having values all the way up to 256.

@serhiy-storchaka
Copy link
Member

It is needed because escape_ascii can now call escape_nonascii which relies on the escapes list having values all the way up to 256.

Ah, good point!

LGTM. Thank you for your PR.

@serhiy-storchaka serhiy-storchaka merged commit a693eaa into python:main Apr 6, 2025
43 checks passed
@tomasr8
Copy link
Member Author

tomasr8 commented Apr 6, 2025

Thanks for the review!

@tomasr8 tomasr8 deleted the pygettext-nonprintable branch April 6, 2025 20:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants