
Add new JSON type.#478

Open
gsergey418 wants to merge 4 commits into mymarilyn:master from gsergey418:newjson

Conversation

@gsergey418 commented Feb 17, 2025

Add support for the new JSON type. My company uses this driver lib to interact with CH and we rely heavily on the JSON type. Any help would be greatly appreciated since I don't know much about writing database drivers.

Checklist:

  • Add tests that demonstrate the correct behavior of the change. Tests should fail without the change.
  • Add or update relevant docs, in the docs folder and in code.
  • Ensure PR doesn't contain untouched code reformatting: spaces, etc.
  • Run flake8 and fix issues.
  • Run pytest and verify that no tests fail. See https://clickhouse-driver.readthedocs.io/en/latest/development.html.

@gsergey418 changed the title from "Add a test and column for the new JSON type." to "Add new JSON type." Feb 17, 2025
@gsergey418 gsergey418 marked this pull request as ready for review February 27, 2025 14:37
@gsergey418
Author

@xzkostyan I've marked the PR as ready, please review.

@gsergey418
Author

Not ready yet: after some testing with our software, we found that this column has trouble parsing Array(JSON) and Array(Array) values. I'll fix this in the next few days; the majority of the code should still be good for review.

@gsergey418
Author

Seems to be ready now; I added additional tests to check compatibility with the CLI.

@gsergey418
Author

There's a remaining problem: it's not compatible with the CLI when using array-nested dictionaries at more than five layers of nesting. I probably won't add that functionality, since it's too niche a use case.

@gsergey418
Author

gsergey418 commented Mar 4, 2025

The problem was the inability to deserialize shared JSON when max_dynamic_paths and max_dynamic_types are exceeded; that case uses a separate, undocumented format I sadly don't have time to work on right now. The dynamic JSON serialization I implemented appears to be fully compatible with itself, and it works well for objects inserted with other libraries/clients that don't exceed the dynamic-paths limit (1024 dynamic paths by default, divided by 4 for every level of array-nested dictionary). Requesting review @xzkostyan
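As a rough illustration of the budget described above (this sketch is not part of the driver or of ClickHouse itself; it only restates the numbers from this comment), the effective dynamic-paths limit at a given array-nesting depth works out to:

```python
# Illustrative sketch only, based on the rule stated in the comment above:
# 1024 dynamic paths by default, divided by 4 for every level of
# array-nested dictionary. Not part of clickhouse-driver.
def effective_dynamic_paths(depth, base=1024):
    """Dynamic-paths budget for a JSON object nested `depth` arrays deep."""
    return base // (4 ** depth)

for depth in range(4):
    print(depth, effective_dynamic_paths(depth))
# depth 0 -> 1024, depth 1 -> 256, depth 2 -> 64, depth 3 -> 16
```

Under this rule the budget shrinks quickly with nesting, which is consistent with deeply array-nested dictionaries hitting the limit first.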

@jhKessler

Any indication on when this will be merged?

@KingMohan45

+1

@chrisgoddard

Any update on this? This is a big blocker for me currently.

@gsergey418
Author

> Any update on this? This is a big blocker for me currently.

Not yet, the maintainer has to review the PR. You can try this out for now and let me know if it works.

$ pip uninstall clickhouse-driver
$ pip install git+https://github.com/gsergey418/clickhouse-driver@newjson#egg=clickhouse-driver

@etos

etos commented May 5, 2025

+1

@yuangezhizao

> Any update on this? This is a big blocker for me currently.
>
> Not yet, the maintainer has to review the PR. You can try this out for now and let me know if it works.
>
> $ pip uninstall clickhouse-driver
> $ pip install git+https://github.com/gsergey418/clickhouse-driver@newjson#egg=clickhouse-driver

In my test case, it works. Thanks for the PR. 👍

clickhouse-driver execute result

/Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/.venv/bin/python /Users/yuangezhizao/Documents/GitLab/<rm>-capacity-poc/<rm>_capacity_poc/clickhouse/test_new_json_type.py 
DEBUG:clickhouse_driver.connection:Connecting. Database: default. User: default
DEBUG:clickhouse_driver.connection:Connecting to <rm>:9001
DEBUG:clickhouse_driver.connection:Connected to ClickHouse server version 25.4.2, revision: 54477
DEBUG:clickhouse_driver.connection:Query: DROP TABLE IF EXISTS `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000023
INFO:__main__:Table deleted
DEBUG:clickhouse_driver.connection:Query: 
                CREATE TABLE `default`.test_new_json_type
                (
                    `id` UInt64,
                    `json_data` JSON()
                )
                ENGINE = MergeTree()
                ORDER BY id;
                
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000024
INFO:__main__:Table created
DEBUG:clickhouse_driver.connection:Query: INSERT INTO `default`.test_new_json_type (id, json_data) VALUES
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000028
DEBUG:clickhouse_driver.streams.native:Writing column id
DEBUG:clickhouse_driver.streams.native:Writing column json_data
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000856
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:tzlocal:/etc/localtime found
DEBUG:tzlocal:1 found:
 {'/etc/localtime is a symlink to': 'Asia/Shanghai'}
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000056
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:Data created
DEBUG:clickhouse_driver.connection:Query: SELECT id, JSONExtractKeysAndValuesRaw(toJSONString(json_data)) FROM `default`.test_new_json_type
DEBUG:clickhouse_driver.connection:Block "" send time: 0.000015
DEBUG:clickhouse_driver.streams.native:Reading column id
DEBUG:clickhouse_driver.streams.native:Reading column JSONExtractKeysAndValuesRaw(toJSONString(json_data))
DEBUG:clickhouse_driver.streams.native:Reading column host_name
DEBUG:clickhouse_driver.streams.native:Reading column current_time
DEBUG:clickhouse_driver.streams.native:Reading column thread_id
DEBUG:clickhouse_driver.streams.native:Reading column type
DEBUG:clickhouse_driver.streams.native:Reading column name
DEBUG:clickhouse_driver.streams.native:Reading column value
INFO:__main__:ID: 1, Extracted JSON: [('attr', '{"attr16":"100"}')]
INFO:__main__:ID: 2, Extracted JSON: [('attr', '{"attr16":"string"}')]
INFO:__main__:ID: 3, Extracted JSON: [('attr', '{"attr16":[["aaaaa","bbbbb"]]}')]
INFO:__main__:ID: 4, Extracted JSON: [('attr', '{"attr16":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:ID: 5, Extracted JSON: [('attr', '{"attr_test":[["ccccc"],"aaaaa","bbbbb"]}')]
INFO:__main__:Data selected

Before

(screenshot)

After

(screenshot)

@gsergey418
Author

@yuangezhizao Thanks for the test! Glad to hear it works

@maciej-or

maciej-or commented May 22, 2025

@gsergey418 thanks for the PR. It works smoothly with JSON types, but with Array(JSON) I get an error on insert:

DB::Exception: Array sizes are too large: 7164775599414010632

even if I try to insert empty arrays.

Sample table:

CREATE TABLE gcp_billing_test (
  user_id Int64,
  cost Float64,
  resource JSON DEFAULT '{}',
  service JSON DEFAULT '{}',
  labels Array(JSON) DEFAULT [],
  credits Array(JSON) DEFAULT []
)
ENGINE = MergeTree
PARTITION BY (user_id)
ORDER BY (user_id)

sample insert:

INSERT INTO gcp_billing_test (
  user_id,
  cost,
  resource,
  service,
  labels,
  credits
)
VALUES (
  123456789,
  42.75,
  '{"type": "compute", "location": "us-central1"}',
  '{"name": "Compute Engine", "category": "IaaS"}',
  [
    '{"env": "prod"}',
    '{"team": "infra"}'
  ],
  [
    '{"name": "Sustained Use Discount", "amount": 5.25}',
    '{"name": "Committed Use Discount", "amount": 3.10}'
  ]
);

stack trace:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/dbapi/cursor.py", line 111, in execute
    response = execute(
  File "/usr/local/lib/python3.10/site-packages/clickhouse_backend/driver/client.py", line 44, in execute
    rv = self.process_insert_query(
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 607, in process_insert_query
    rv = self.send_data(sample_block, data,
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 665, in send_data
    self.receive_profile_events()
  File "/usr/local/lib/python3.10/site-packages/clickhouse_driver/client.py", line 747, in receive_profile_events
    raise packet.exception
clickhouse_driver.errors.ServerException: Code: 128.
DB::Exception: Array sizes are too large: 7164775599414010632

@ronaldorcampos

ronaldorcampos commented Aug 4, 2025

Thanks for the PR. We really need this supported officially, as we rely heavily on the JSON type.

@jgb

jgb commented Aug 5, 2025

I would also like to voice my support for this PR (or any other PR that implements JSON support for this library).
clickhouse-connect offers JSON support, but it isn't able to automatically transform null values into default values for non-nullable columns; clickhouse-driver does offer that feature, but doesn't support JSON.

@PierreF

PierreF commented Sep 3, 2025

When I tested this PR, I hit one issue: when the JSON contains dynamic keys (like user-defined keys), the SELECT can fail if the data part contains more than 128 distinct keys.

My use case is JSON fields that store the attributes of a log line. Each log line's attributes can be user-defined (not a fixed set of keys; e.g. one line could have http.response.status = 200, another postgresql.version = 17). As a result, all log lines collectively can have more than 128 keys.

The following test exhibits the bug:

    def test_json_13(self):
        """ Reproduce an issue when reading a JSON column that has more than 127 distinct keys across all rows.
        """
        # The issue doesn't occur with the Memory engine, so use MergeTree.
        create_table_template = 'CREATE TABLE test ({}) ENGINE = MergeTree ORDER BY tuple()'

        with self.create_table("a JSON", template=create_table_template):
            unique_keys = [
                f"unique_key_{idx}"
                for idx in range(128)
            ]

            # There is also an issue when inserting more than 127 distinct keys in a single
            # query, but that isn't what this test shows, so send the unique keys in multiple inserts.
            max_batch_size = 64
            batches = [unique_keys[i:i + max_batch_size] for i in range(0, len(unique_keys), max_batch_size)]
            for batch in batches:
                data = [
                    ({key: 1},)
                    for key in batch
                ]
                self.client.execute("INSERT INTO test (a) VALUES", data)

            # We submitted multiple inserts (which create multiple data parts), each containing
            # only max_batch_size distinct keys. Once merged into a single data part, the result
            # will contain all unique_keys, which exceeds the threshold of ~120 unique keys.
            # Force a merge of the data parts into a single one.
            self.client.execute("OPTIMIZE TABLE test FINAL")

            query = "SELECT * FROM test"
            result = self.client.execute(query)
            result_keys = {key for row in result for key in row[0].keys()}
            self.assertEqual(result_keys, set(unique_keys))

@shachibista

Any updates on this?

@KabirSingh-Marrow

When will this get merged?

@anand-bakerev

Hi team, any update on this?



Development

Successfully merging this pull request may close these issues.

  • JSON Column Type support
  • Support the new JSON type