
Commit 386539d

UTF7: enable detection of empty-document with byte-order-mark (#717)
* Tests: add detection coverage for UTF7 with BOM

  This means that it would never provide detection of UTF7 at runtime, because the other prefix would always take priority.

* 🐛 Allow UTF7 BOM to detect empty-content case

  Checking for this special-case UTF7 BOM before the other cases allows us to detect an empty document, which Python 3 encodes into ASCII with a trailing minus symbol ('-').
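The ordering issue described in the commit message can be sketched with a simplified first-match prefix scan. This is an illustration under assumptions, not the library's own code: `first_matching_mark`, `MARKS_BEFORE`, and `MARKS_AFTER` are hypothetical names, and the mark lists are the ones from the diff written as ASCII literals.

```python
from typing import List, Optional

# UTF-7 marks from the diff, as ASCII (b"\x2b\x2f\x76\x38" == b"+/v8").
MARKS_BEFORE = [b"+/v8", b"+/v9", b"+/v+", b"+/v/", b"+/v8-"]  # old order
MARKS_AFTER = [b"+/v8-", b"+/v8", b"+/v9", b"+/v+", b"+/v/"]   # new order


def first_matching_mark(payload: bytes, marks: List[bytes]) -> Optional[bytes]:
    """Return the first mark that prefixes the payload (hypothetical helper)."""
    for mark in marks:
        if payload.startswith(mark):
            return mark
    return None


# An "empty" UTF-7 document: only the byte-order mark U+FEFF.
empty_doc = "\uFEFF".encode("utf-7")

old_mark = first_matching_mark(empty_doc, MARKS_BEFORE)
new_mark = first_matching_mark(empty_doc, MARKS_AFTER)

# Bytes left over once the matched mark is stripped from the payload.
leftover_old = empty_doc[len(old_mark):]
leftover_new = empty_doc[len(new_mark):]

print(old_mark, leftover_old)
print(new_mark, leftover_new)
```

With the old order the four-byte mark matches first, so the five-byte mark is unreachable and a stray `-` is left behind as content. Listing the longer mark first lets the whole payload be consumed as a BOM.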
1 parent 5478b84 commit 386539d

2 files changed

Lines changed: 6 additions & 1 deletion


src/charset_normalizer/constant.py

Lines changed: 1 addition & 1 deletion
@@ -9,11 +9,11 @@
 ENCODING_MARKS: dict[str, bytes | list[bytes]] = {
     "utf_8": BOM_UTF8,
     "utf_7": [
+        b"\x2b\x2f\x76\x38\x2d",
         b"\x2b\x2f\x76\x38",
         b"\x2b\x2f\x76\x39",
         b"\x2b\x2f\x76\x2b",
         b"\x2b\x2f\x76\x2f",
-        b"\x2b\x2f\x76\x38\x2d",
     ],
     "gb18030": b"\x84\x31\x95\x33",
     "utf_32": [BOM_UTF32_BE, BOM_UTF32_LE],

tests/test_base_detection.py

Lines changed: 5 additions & 0 deletions
@@ -33,6 +33,7 @@ def test_bool_matches():
     [
         (b"\xfe\xff", "utf_16"),
         ("\uFEFF".encode("gb18030"), "gb18030"),
+        ("\uFEFF".encode("utf-7"), "utf_7"),
         (b"\xef\xbb\xbf", "utf_8"),
         ("".encode("utf_32"), "utf_32"),
     ],
@@ -90,6 +91,10 @@ def test_md_triggered_but_with_bom_or_sig(payload, expected_encoding):
             "我没有埋怨,磋砣的只是一些时间。".encode("utf_8_sig"),
             "utf_8",
         ),
+        (
+            ("\uFEFF" + "🐕").encode("utf-7"),
+            "utf_7",
+        ),
     ],
 )
 def test_content_with_bom_or_sig(payload, expected_encoding):
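To see what the two new test payloads look like on the wire, the following standard-library sketch encodes them and checks which prefix each begins with. The variable names are illustrative and are not taken from the test suite. The snippet only inspects bytes and does not call the detector.

```python
# Payloads built the same way as the new test parameters.
bom_only = "\uFEFF".encode("utf-7")
bom_and_text = ("\uFEFF" + "🐕").encode("utf-7")

# The BOM-only payload ends with the '-' that closes the base64 run.
bom_only_ends_with_minus = bom_only.endswith(b"-")

# When more non-ASCII text follows the BOM, the fourth byte differs,
# which is why several four-byte marks are listed for UTF-7.
text_payload_prefix = bom_and_text[:4]

# Both payloads decode back to the original strings.
round_trip_bom = bom_only.decode("utf-7")
round_trip_text = bom_and_text.decode("utf-7")

print(bom_only, bom_and_text)
```

The first payload exercises the five-byte mark for an otherwise empty document, while the second exercises one of the four-byte marks followed by real content.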
