email.parser.BytesParser.parse() cannot handle binary data that include \x0d \x0a correctly. #128949

Open
tnakamot opened this issue Jan 17, 2025 · 2 comments
Labels
stdlib Python modules in the Lib dir topic-email type-bug An unexpected behavior, bug, or error

Comments

@tnakamot

tnakamot commented Jan 17, 2025

Bug report

Bug description:

I would like to extract a binary file in a multipart MIME file using email.parser.BytesParser, but a byte sequence "0x0d 0x0a" (CR + LF) in the binary file is replaced by "0x0a" (LF). Below is a minimal reproducible example.

from email.parser import BytesParser
from email.policy import default
from io import BytesIO

# One bytes literal per line of the MIME file; every line ends with CR+LF.
mime_file_byte_array = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/mixed; boundary="MIME_boundary-1";\r\n'
    b'\r\n'
    b'--MIME_boundary-1\r\n'
    b'Content-Type: application/octet-stream\r\n'
    b'Content-Location: test.bin\r\n'
    b'\r\n'
    b'a\r\nb\r\n'
    b'--MIME_boundary-1--\r\n'
    b'\r\n'
)
fp = BytesIO(mime_file_byte_array)
parser = BytesParser(policy=default)
msg = parser.parse(fp)

parts = [part for part in msg.walk()]
binary_data = parts[1].get_payload(decode=True)

print('===== Beginning of Original MIME File =====')
print(mime_file_byte_array.decode())
print('===== End of Original MIME File =====')
print('')
print('===== test.bin after parse =====')
print(binary_data)
print('===== test.bin after parse =====')

As the headers of the first part show (Content-Type: application/octet-stream, Content-Location: test.bin), the multipart MIME file contains a binary file, "test.bin", whose content is b"a\r\nb".
Therefore, the variable binary_data should contain b"a\r\nb", but it actually contains b"a\nb".

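This is probably because the TextIOWrapper created in BytesParser.parse() uses the default newline=None, which enables universal newlines mode and translates CR+LF to LF on read (observed here on Linux).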

fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape')

When I replaced the above line with the line below, the problem went away. However, this change may have side effects that I cannot foresee.

fp = TextIOWrapper(fp, encoding='ascii', errors='surrogateescape', newline='')
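
The same idea can be tried without patching the standard library, by wrapping the binary stream yourself and passing it to email.parser.Parser. This is only a minimal sketch of a workaround: the sample message is a shortened copy of the reproducer above, and it assumes the wrapper is the only place where the line endings get translated.

```python
from email.parser import Parser
from email.policy import default
from io import BytesIO, TextIOWrapper

raw = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/mixed; boundary="MIME_boundary-1"\r\n'
    b'\r\n'
    b'--MIME_boundary-1\r\n'
    b'Content-Type: application/octet-stream\r\n'
    b'Content-Location: test.bin\r\n'
    b'\r\n'
    b'a\r\nb\r\n'
    b'--MIME_boundary-1--\r\n'
)

# Same wrapper arguments as in BytesParser.parse(), plus newline='' so that
# line endings are handed to the parser untranslated.
text_fp = TextIOWrapper(BytesIO(raw), encoding='ascii',
                        errors='surrogateescape', newline='')
try:
    msg = Parser(policy=default).parse(text_fp)
finally:
    # Detach so the wrapper does not close the underlying BytesIO.
    text_fp.detach()

# walk() yields the multipart container first, then the binary part.
payload = list(msg.walk())[1].get_payload(decode=True)
print(payload)
```

With this wrapper the CR+LF inside test.bin should survive parsing, while the CR+LF that precedes the closing boundary is still treated as part of the boundary.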

CPython versions tested on:

3.10

Operating systems tested on:

Linux

@tnakamot tnakamot added the type-bug An unexpected behavior, bug, or error label Jan 17, 2025
@picnixz picnixz added stdlib Python modules in the Lib dir topic-email labels Jan 17, 2025
@RanKKI
Contributor

RanKKI commented Feb 16, 2025

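Using the parameter newline='' resolves this issue. By default (when newline is None), the wrapper runs in universal newlines mode: on input, '\r\n' and '\r' are translated to '\n', and on output, '\n' is translated to os.linesep. When newline is set to an empty string, the wrapper does not modify line endings in either direction.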

cpython/Modules/_io/textio.c

Lines 1085 to 1089 in a7d41a8

* On output, if newline is None, any '\n' characters written are
translated to the system default line separator, os.linesep. If
newline is '' or '\n', no translation takes place. If newline is any
of the other legal values, any '\n' characters written are translated
to the given string.
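
The read side of this behavior is easy to check in isolation. The snippet below is a small illustration only; the helper name read_all is made up for this example, and the wrapper arguments other than newline mirror the ones quoted in the report.

```python
from io import BytesIO, TextIOWrapper

raw = b"a\r\nb"


def read_all(newline):
    """Read `raw` through a TextIOWrapper using the given newline mode."""
    wrapper = TextIOWrapper(BytesIO(raw), encoding="ascii",
                            errors="surrogateescape", newline=newline)
    return wrapper.read()


translated = read_all(None)  # universal newlines: CR+LF becomes LF
untouched = read_all("")     # line endings are returned as-is

print(repr(translated))
print(repr(untouched))
```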

However, according to RFC 2046, section 4.1.1, text/* types must always use CRLF line endings:

The canonical form of any MIME "text" subtype MUST always represent a
line break as a CRLF sequence.

So the current implementation appears to have a bug, and I don't see an easy way to fix it.

@tnakamot
Author

Thank you for your comment, @RanKKI. Yes, it would be understandable if this translation from CRLF to os.linesep were applied only to the text/* types. I guess that some Python applications that use the email.parser module may rely on this line break translation. So, to maintain backward compatibility, a possible solution is to disable the translation when the content type is not text/*.
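
To make the proposal concrete, here is a rough sketch of the intended behavior done as post-processing in user code rather than inside the parser. The helper extract_payload and the two-part sample message are hypothetical and not part of the email package. The sketch uses BytesParser.parsebytes(), which as far as I can tell decodes the bytes directly instead of going through TextIOWrapper and so leaves line endings alone, and it assumes that translating only text/* parts to LF is the behavior existing applications rely on.

```python
from email.parser import BytesParser
from email.policy import default


def extract_payload(part):
    """Return the decoded payload, translating CR+LF to LF only for text/*."""
    data = part.get_payload(decode=True)
    if data is not None and part.get_content_maintype() == "text":
        data = data.replace(b"\r\n", b"\n")
    return data


raw = (
    b'MIME-Version: 1.0\r\n'
    b'Content-Type: multipart/mixed; boundary="B"\r\n'
    b'\r\n'
    b'--B\r\n'
    b'Content-Type: text/plain\r\n'
    b'\r\n'
    b'x\r\ny\r\n'
    b'--B\r\n'
    b'Content-Type: application/octet-stream\r\n'
    b'\r\n'
    b'a\r\nb\r\n'
    b'--B--\r\n'
)

msg = BytesParser(policy=default).parsebytes(raw)

# Skip the multipart container and keep only the leaf parts.
leaves = [part for part in msg.walk() if not part.is_multipart()]
payloads = [extract_payload(part) for part in leaves]
print(payloads)
```

The text/plain part comes back with LF line endings, while the application/octet-stream part keeps its CR+LF bytes.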
