
Encoding issue in request.py #464

Open
rjstanford opened this issue Apr 4, 2024 · 2 comments

Comments

@rjstanford

I'm not entirely sure what the intent is here, so I hesitate to file a PR. We saw some errors thrown by our webapp (using gunicorn) and traced them to request.encget():

  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/webob/request.py", line 495, in url
    url = self.path_url
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/webob/request.py", line 467, in path_url
    bpath_info = bytes_(self.path_info, self.url_encoding)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/webob/descriptors.py", line 70, in fget
    return req.encget(key, encattr=encattr)
  File "/layers/google.python.pip/pip/lib/python3.9/site-packages/webob/request.py", line 165, in encget
    return bytes_(val, 'latin-1').decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 66: invalid start byte

My read of util.bytes_ is that, when passed a string, it performs val.encode() on it. So the following code in encget():

return bytes_(val, "latin-1").decode(encoding)

is the same as doing:

return val.encode("latin-1", "strict").decode(encoding)

Based on our exception we can see that the value of encoding is "utf-8", which gives us:

return val.encode("latin-1", "strict").decode("utf-8")

or with a specific example that will fail:

x = "À".encode('latin-1').decode('utf-8')

I'm not sure why we'd ever be explicitly encoding a string as latin-1 and then decoding it as UTF-8 in the first place -- a simpler return val.encode(encoding) would seem more appropriate here -- but again, there's probably nuance that I'm not understanding, hence the issue report.
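For anyone who wants to reproduce this outside the webapp, here is a minimal sketch. It assumes my reading of bytes_ above is right; encget_sketch is a hypothetical stand-in for the quoted line, not webob's actual code.

```python
def encget_sketch(val: str, encoding: str = "utf-8") -> str:
    # Hypothetical stand-in for the quoted line in encget(),
    # assuming bytes_(val, "latin-1") behaves like val.encode("latin-1").
    return val.encode("latin-1", "strict").decode(encoding)


# "À" is U+00C0, so its latin-1 encoding is the single byte 0xC0.
raw = "À".encode("latin-1")

try:
    encget_sketch("À")
    failed = False
    reason = None
except UnicodeDecodeError as exc:
    # 0xC0 can never start a valid UTF-8 sequence, hence the decode error.
    failed = True
    reason = exc.reason
```

The failure here is on the same byte (0xc0) as in the traceback, which suggests the request path contained a raw 0xC0 that is not part of a valid UTF-8 sequence.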

@rjstanford
Author

This is on released version 1.8.7, btw. I see that there's been some unreleased development since then.

@digitalresistor
Member

This is because HTTP doesn't officially support Unicode in request paths. As explained in https://peps.python.org/pep-3333/#unicode-issues, all HTTP paths/URIs should be treated as latin-1.
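A rough sketch of the round trip that PEP describes, to show why the encode-then-decode pair exists. The function names server_side and app_side are made up for illustration, and the server half is only an approximation of what a WSGI server such as gunicorn does with the request path.

```python
from urllib.parse import unquote_to_bytes


def server_side(raw_path: bytes) -> str:
    # Assumed server behaviour: percent-decode the path to bytes, then
    # expose those bytes to the app as a latin-1 "native string".
    return unquote_to_bytes(raw_path).decode("latin-1")


def app_side(path_info: str, encoding: str = "utf-8") -> str:
    # Undo the latin-1 wrapping to get the original bytes back,
    # then decode them with the encoding the app actually expects.
    return path_info.encode("latin-1").decode(encoding)


# A path whose percent-escapes are valid UTF-8 survives the round trip.
good = app_side(server_side(b"/caf%C3%A9"))

# A path carrying a raw 0xC0 byte is not valid UTF-8, so decoding fails.
try:
    app_side(server_side(b"/caf%C0"))
    bad_failed = False
except UnicodeDecodeError:
    bad_failed = True
```

Under these assumptions the latin-1 step is lossless (every byte maps to exactly one code point), so the error only appears when the client sends bytes that are not valid in the target encoding.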
