9 changes: 9 additions & 0 deletions NOTICE
@@ -2,8 +2,17 @@ Copyright (2023) Databricks, Inc.

This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file.

____________________
This Software contains code from the following open source projects, licensed under the Apache 2.0 license:

Databricks SDK for Python - https://github.com/databricks/databricks-sdk-py
Copyright 2023 Databricks, Inc. All rights reserved.
License - https://github.com/databricks/databricks-sdk-py/blob/main/LICENSE


____________________
This Software contains code from the following open source projects, licensed under the GNU Lesser GPL v2:

chardet - https://github.com/chardet/chardet
Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard
License - https://github.com/chardet/chardet/blob/main/LICENSE
Comment on lines +14 to +18
@gueniai: If we proceed with this, this will need review.

@sundarshankar89: EOL at EOF

5 changes: 4 additions & 1 deletion pyproject.toml
@@ -19,7 +19,10 @@ classifiers = [
dependencies = ["databricks-sdk>=0.16.0"]

[project.optional-dependencies]
-yaml = ["PyYAML>=6.0.0,<7.0.0"]
+yaml = [
+    "PyYAML>=6.0.0,<7.0.0",
+    "chardet>=5.1.0,<6.0.0",
+]

[project.urls]
Issues = "https://github.com/databrickslabs/blueprint/issues"
9 changes: 8 additions & 1 deletion src/databricks/labs/blueprint/paths.py
@@ -18,6 +18,7 @@
from typing import BinaryIO, Literal, NoReturn, TextIO, TypeVar
from urllib.parse import quote_from_bytes as urlquote_from_bytes

import chardet
This import means it's not an optional dependency, which is why the downstream projects are failing.
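
One way to address the comment above is to guard the import so that the module still loads when chardet is absent. The following is a minimal sketch, not the change made in this PR; `guess_encoding` is a hypothetical helper name, and falling back to the locale default is an assumption that mirrors the pre-existing behaviour.

```python
import locale

try:
    import chardet  # only available when the optional extra is installed
except ImportError:
    chardet = None  # detection is skipped; the locale default is used instead


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: guess an encoding, or fall back to the locale default."""
    if chardet is None:
        return locale.getpreferredencoding()
    result = chardet.detect(data)
    # chardet reports None as the encoding when it cannot make a guess.
    return result["encoding"] or locale.getpreferredencoding()
```

With this shape, downstream projects that do not install the extra would keep the old locale-based fallback rather than failing at import time.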

from databricks.sdk import WorkspaceClient
from databricks.sdk.errors import DatabricksError, ResourceDoesNotExist
from databricks.sdk.service.files import FileInfo
@@ -1150,14 +1151,20 @@ def decode_with_bom(
a text-based IO wrapper that will decode the underlying binary-mode file as text.
"""
use_encoding: str | None
_chardet_confidence_threshold: float = 0.6
Should this threshold be determined by the client, or controlled by the common library?

 if encoding is not None:
     use_encoding = encoding
 else:
     use_encoding = _detect_encoding_bom(file, preserve_position=True)
     if use_encoding is None and detect_xml:
         use_encoding = _detect_encoding_xml(file, preserve_position=True)
     if use_encoding is None:
-        use_encoding = locale.getpreferredencoding()
+        result = chardet.detect(file.read())
+        use_encoding = result["encoding"] or locale.getpreferredencoding()
+        if result["confidence"] < _chardet_confidence_threshold:
+            logger.debug(f"Low confidence ({result['confidence']}) in detected encoding: {result}")
+            use_encoding = locale.getpreferredencoding()
+        file.seek(0)
 return io.TextIOWrapper(file, encoding=use_encoding, errors=errors, newline=newline)
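
The hunk above reads the whole file and then seeks to offset 0, whereas the BOM and XML detectors are called with `preserve_position=True`. The sketch below shows one possible alternative that restores the caller's original offset, bounds the amount read, and lets the caller override the threshold. It is an illustration only: `sniff_encoding`, `detector`, and `sample_size` are hypothetical names, and sampling from the current position rather than the start of the file is an assumption.

```python
import io
import locale
from typing import BinaryIO, Callable


def sniff_encoding(
    file: BinaryIO,
    detector: Callable[[bytes], dict],
    *,
    threshold: float = 0.6,
    sample_size: int = 64 * 1024,
) -> str:
    """Hypothetical helper: detect an encoding without moving the stream position."""
    position = file.tell()
    try:
        # Read a bounded sample instead of the entire file.
        sample = file.read(sample_size)
    finally:
        # Restore the caller's offset, which is not necessarily 0.
        file.seek(position)
    result = detector(sample)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < threshold:
        # No guess, or a low-confidence one: use the locale default.
        return locale.getpreferredencoding()
    return encoding
```

A caller with chardet installed could pass `chardet.detect` as `detector`; keeping `threshold` as a keyword argument with a library default is one way to let clients tune it without each of them having to choose a value.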


1 change: 1 addition & 0 deletions tests/unit/test_paths.py
@@ -1128,6 +1128,7 @@ def test_read_xml_file_default_utf8(tmp_path: Path, monkeypatch) -> None:
path.write_text(example, encoding="utf-8")

# Verify the monkey-patching means we're not defaulting to UTF-8.
# With chardet this would likely work, unless the confidence score is below 0.6; for this example it is 0.506.
monkeypatch.setattr(locale, "getpreferredencoding", lambda: "Windows-1252")
assert locale.getpreferredencoding() != "UTF-8"
assert read_text(path, detect_xml=False) != example
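
To illustrate why the final assertion still holds, the sketch below replays the decision with the numbers from the comment: a confidence of 0.506 is below the 0.6 threshold, so the patched locale encoding is used and the UTF-8 bytes decode to different text. The sample string is a hypothetical stand-in for the test's `example`, which is not shown in this hunk.

```python
# Hypothetical stand-in for the test's XML example; it only needs a non-ASCII character.
example = "<?xml version='1.0'?><name>café</name>"
raw = example.encode("utf-8")

confidence = 0.506  # score reported for the example, per the comment in the test
threshold = 0.6     # value of _chardet_confidence_threshold in the hunk above

# Below the threshold, the detected encoding is discarded in favour of the
# (monkey-patched) locale default, Windows-1252.
encoding = "utf-8" if confidence >= threshold else "windows-1252"
decoded = raw.decode(encoding)
```

Here `decoded` contains `cafÃ©` rather than `café`, which is why `read_text(path, detect_xml=False) != example` continues to pass.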