Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metadata types should be duckdb types #28

Open
tarasglek opened this issue Jan 14, 2025 · 16 comments
Open

Metadata types should be duckdb types #28

tarasglek opened this issue Jan 14, 2025 · 16 comments
Labels
enhancement New feature or request help wanted Extra attention is needed

Comments

@tarasglek
Copy link

I noticed that metadata returned when POSTing to /?default_format=JSONCompact uses non-duckdb types and they are usually strings. Is this to be compatible with clickhouse?

I think it would make more sense to use duckdb types (eg VARCHAR > String), (INT > Int32)
eg

select 1

returns

{
  "meta": [
    {
      "name": "1",
      "type": "Int32"
    }
  ],
  "data": [
    [
      "1"
    ]
  ],
  "rows": 1
}
WITH l33t_uuid AS (
  SELECT 'deadbeef-1337-4b1d-8008-0123456789ab'::UUID AS uuid_value
)
SELECT
  uuid_value::VARCHAR AS uuid_string,
  uuid_value AS uuid_uuid,
  typeof(uuid_value::VARCHAR) AS uuid_string_type,
  typeof(uuid_value) AS uuid_uuid_type
FROM l33t_uuid;

Returns

{
  "meta": [
    {
      "name": "uuid_string",
      "type": "String"
    },
    {
      "name": "uuid_uuid",
      "type": "String"
    },
    {
      "name": "uuid_string_type",
      "type": "String"
    },
    {
      "name": "uuid_uuid_type",
      "type": "String"
    }
  ],
  "data": [
    [
      "deadbeef-1337-4b1d-8008-0123456789ab",
      "deadbeef-1337-4b1d-8008-0123456789ab",
      "VARCHAR",
      "UUID"
    ]
  ],
  "rows": 1
}
@tarasglek
Copy link
Author

tarasglek commented Jan 14, 2025

Eg I was expecting meta for uuid example to be

{
"meta": [
    {
      "name": "uuid_string",
      "type": "VARCHAR"
    },
    {
      "name": "uuid_uuid",
      "type": "UUID"
    },
    {
      "name": "uuid_string_type",
      "type": "VARCHAR"
    },
    {
      "name": "uuid_uuid_type",
      "type": "VARCHAR"
    }
  ]
}

@tarasglek
Copy link
Author

Is representing numbers as strings intentional?

"data": [
    [
      "1"
    ]

@lmangani
Copy link
Collaborator

@tarasglek absolutely those are great points! The initial intention of http_server was to emulate ClickHouse alongside chsql (a reimplementation of quackpipe as a set of extensions from the inside out) but it seems we should steer this back into being a DuckDB native resource for other developers and extensions to build on top of without having to fork or redo the job.

Shall we retain the clickhouse API compatibility? I'd be more than happy introducing a "switch" or new formats to be DuckDB native if there's interest in leveraging this for integrations.

@lmangani lmangani added enhancement New feature or request help wanted Extra attention is needed labels Jan 14, 2025
@tarasglek
Copy link
Author

This is the defacto duckdb rpc interface for now :)

@lmangani
Copy link
Collaborator

Then its time to do right. If there's a proposal, please let me know. My instinct would be to "default" format to DuckDB unless one of the ClickHouse emulation formats is specified as a starting point. WDYT?

@tarasglek
Copy link
Author

Your proposal makes sense.

@mskyttner
Copy link

Dropping in on the conversation (sorry!) just reflecting outside of json on - on the response format mimetype, there is now json and what about "text/csv"? The CSV reader in duckdb is fast and it would be nice to be able to serve csv for various dirty integrations or scenarios where parquet and json are too modern inventions :) ... (ideally supporting streaming with those range requests which I believe the CSV reader in duckdb can deal with).

And at the other end of the spectrum, for stronger typing and more modern applications maybe parquet and Arrow Flight SQL would be nice (or maybe gRPC is not in scope for httpserver)?

@lmangani
Copy link
Collaborator

lmangani commented Feb 6, 2025

And at the other end of the spectrum, for stronger typing and more modern applications maybe parquet and Arrow Flight SQL would be nice (or maybe gRPC is not in scope for httpserver)?

@mskyttner we're building a functional prorotype (quackpy) to add a Flight SQL interface on top of http_server but it'll take a little more time

@lmangani
Copy link
Collaborator

lmangani commented Feb 6, 2025

Is representing numbers as strings intentional?

"data": [
    [
      "1"
    ]

Need to recheck but I believe this might just be there to overcome JSON's limitations for large numbers. will confirm during the next batch of changes for sure

@lmangani
Copy link
Collaborator

lmangani commented Feb 6, 2025

@tarasglek since the serializer is now separate from the main logic (thanks to our awesome contributors @NiclasHaderer and @gropaul) we could think of implementing header based control for types?

We could choose some headers or any other approach, ie:

// Handle both GET and POST requests
void HandleHttpRequest(const duckdb_httplib_openssl::Request& req, duckdb_httplib_openssl::Response& res) {
    // ...
    bool convert_types = true;  // Default to true

    if (req.has_param("convert_types")) {
        convert_types = req.get_param_value("convert_types") == "true";
    } else if (req.has_header("X-Convert-Types")) {
        convert_types = req.get_header_value("X-Convert-Types") == "true";
    }

and then extend the serializer to skip conversion based on parrameters?

void ResultSerializer::SerializeValue(yyjson_mut_val *parent, const Value &value, optional_ptr<string> name, const LogicalType &type, const bool convert_types) {
    yyjson_mut_val *val = nullptr;

    if (value.IsNull()) {
        goto null_handle;
    }

    // If no conversion is needed, serialize directly
    if (!convert_types) {
        val = yyjson_mut_strcpy(doc, value.ToString().c_str());
        if (!name) {
            YY_APPEND_FAIL(yyjson_mut_arr_append(parent, val));
        } else {
            yyjson_mut_val *key = yyjson_mut_strcpy(doc, name->c_str());
            D_ASSERT(key);
            YY_APPEND_FAIL(yyjson_mut_obj_add(parent, key, val));
        }
        return;
    }

    switch (type.id()) {
        // Type conversion cases (as before)
        // ...
    }

Any thoughts?

@NiclasHaderer
Copy link
Collaborator

Tbh. I would never convert number to string. I don't know why clickouse would do that (and if they really do it), but it sounds wrong to me and incentivizes people to do things like store everything in strings. This is something that is already happening, but I really do not want to make bad things worse by adding features like this...

The Serializer (mind you, only the JsonCompact one) tries to convert every viable type to its correct corresponding value.
The JSONEachRow still does things the old way and serializes most things as strings.

I however agree, that the meta description would benefit from some love and polishing. I think Int32 is the correct value to return (even though we might want to format it a bit nicer), but complex types like maps and structs could be represented nicer

@gropaul
Copy link
Collaborator

gropaul commented Feb 6, 2025

I think that the behaviour mentioned in the first message is outdated since #26, in fact it should serialize as a number since this PR. But there was not a release to the duckdb community repo, so the old serialization is still active for everybody installing via the community extensions repo.

@gropaul
Copy link
Collaborator

gropaul commented Feb 6, 2025

Need to recheck but I believe this might just be there to overcome JSON's limitations for large numbers. will confirm during the next batch of changes for sure

Everything larger than a uint64 will be serialized as a string, e.g. HUGEINT or VARINT.

@lmangani
Copy link
Collaborator

lmangani commented Feb 6, 2025

I think that the behaviour mentioned in the first message is outdated since #26, in fact it should serialize as a number since this PR. But there was not a release to the duckdb community repo, so the old serialization is still active for everybody installing via the community extensions repo.

correct, we were waiting for 1.2.0 and its time to release this week!
PR open if anyone has suggestions or docs to add: duckdb/community-extensions#283

@lmangani
Copy link
Collaborator

lmangani commented Feb 6, 2025

PR Merged 0.1.5 is on the way out
@tarasglek any chance to retest with DuckDB 1.2.0 and http_server 0.1.5 once its published?

@tarasglek
Copy link
Author

tarasglek commented Feb 7, 2025

I can but my testing depends on go binding which is not on 1.2 yet

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

5 participants