created a speculative encoding detector #312
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -2,8 +2,17 @@ Copyright (2023) Databricks, Inc. |
| | | |
| | | This Software includes software developed at Databricks (https://www.databricks.com/) and its use is subject to the included LICENSE file. |
| | | |
| | | ____________________ |
| | | This Software contains code from the following open source projects, licensed under the Apache 2.0 license: |
| | | |
| | | Databricks SDK for Python - https://github.com/databricks/databricks-sdk-py |
| | | Copyright 2023 Databricks, Inc. All rights reserved. |
| | | License - https://github.com/databricks/databricks-sdk-py/blob/main/LICENSE |
| | | |
| | | |
| | | ____________________ |
| | | This Software contains code from the following open source projects, licensed under the GNU Lesser GPL v2: |
| | | |
| | | chardet - https://github.com/chardet/chardet |
| | | Copyright 2005-2024 Mark Pilgrim, Maintainer: Dan Blanchard |
| | | License - https://github.com/chardet/chardet/blob/main/LICENSE |

Contributor
@sundarshankar89: EOL at EOF

| Original file line number | Diff line number | Diff line change |
|---|---|---|
| | | @@ -18,6 +18,7 @@ |
| | | from typing import BinaryIO, Literal, NoReturn, TextIO, TypeVar |
| | | from urllib.parse import quote_from_bytes as urlquote_from_bytes |
| | | |
| | | + import chardet |

Contributor
This import means chardet is not an optional dependency, which is why the downstream projects are failing.

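
One way to address the concern above would be to guard the import so the module still loads when chardet is absent. This is a minimal sketch, assuming chardet were declared as an optional extra; `guess_encoding` is a hypothetical helper name and is not part of this PR.

```python
import locale

try:
    import chardet  # third-party; only available when the optional extra is installed
except ImportError:
    chardet = None  # detection is skipped and the locale default is used instead


def guess_encoding(data: bytes) -> str:
    """Hypothetical helper: best-effort encoding guess that degrades without chardet."""
    if chardet is not None:
        # chardet.detect() returns a dict whose "encoding" entry may be None.
        detected = chardet.detect(data).get("encoding")
        if detected:
            return detected
    return locale.getpreferredencoding()
```

With this shape, downstream projects that do not install chardet would simply fall back to the previous behaviour, the locale's preferred encoding.
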
| | | from databricks.sdk import WorkspaceClient |
| | | from databricks.sdk.errors import DatabricksError, ResourceDoesNotExist |
| | | from databricks.sdk.service.files import FileInfo |
| | | @@ -1150,14 +1151,20 @@ def decode_with_bom( |
| | | a text-based IO wrapper that will decode the underlying binary-mode file as text. |
| | | """ |
| | | use_encoding: str \| None |
| | | + _chardet_confidence_threshold: float = 0.6 |

Collaborator
Author
Should this be determined by the client, or controlled by the common library?

| | | if encoding is not None: |
| | | use_encoding = encoding |
| | | else: |
| | | use_encoding = _detect_encoding_bom(file, preserve_position=True) |
| | | if use_encoding is None and detect_xml: |
| | | use_encoding = _detect_encoding_xml(file, preserve_position=True) |
| | | if use_encoding is None: |
| | | - use_encoding = locale.getpreferredencoding() |
| | | + result = chardet.detect(file.read()) |
| | | + use_encoding = result["encoding"] or locale.getpreferredencoding() |
| | | + if result["confidence"] < _chardet_confidence_threshold: |
| | | + logger.debug(f"Low confidence ({result['confidence']}) in detected encoding: {result}") |
| | | + use_encoding = locale.getpreferredencoding() |
| | | + file.seek(0) |
| | | return io.TextIOWrapper(file, encoding=use_encoding, errors=errors, newline=newline) |
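
The fallback added in this hunk (detect, compare the confidence with a threshold, otherwise use the locale default) can be sketched in isolation as follows. The names `pick_encoding` and `Detector` are hypothetical, and passing the detector in as an argument is an assumption made so the sketch runs without chardet installed. Unlike the diff, which rewinds with `file.seek(0)`, the sketch restores the offset the file had on entry; that mirrors the `preserve_position=True` helpers but is also an assumption about the intended behaviour.

```python
import io
import locale
from typing import Any, BinaryIO, Callable, Mapping

# Hypothetical alias: any callable shaped like chardet.detect().
Detector = Callable[[bytes], Mapping[str, Any]]


def pick_encoding(file: BinaryIO, detector: Detector, threshold: float = 0.6) -> str:
    """Guess an encoding, falling back to the locale default on low confidence."""
    position = file.tell()
    try:
        result = detector(file.read())
    finally:
        # Restore the caller's offset rather than assuming it was 0.
        file.seek(position)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0
    if encoding is None or confidence < threshold:
        return locale.getpreferredencoding()
    return encoding
```

If chardet is installed, a call might look like `pick_encoding(open(path, "rb"), chardet.detect)`. Exposing `threshold` as a parameter with a library default would leave the choice between client and common library open.
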

@gueniai: If we proceed with this, this will need review.