sqlserver_column.py: Handle string dtype of nvarchar #606
base: master
Conversation
nvarchar is not considered a string type in either the base adapter or the fabric adapter. This leads to problems when dealing with nvarchar columns, for example in https://github.com/dbt-msft/dbt-sqlserver-utils/blob/master/macros/sql/union.sql. In this commit we add nvarchar to the list of string dtypes, return the correct string_size for nvarchar (whose reported char_size is doubled by default), and use the correct dtype in string_type.
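The change described above can be sketched roughly as follows. This is a minimal illustration modeled on dbt's Column interface (`is_string` / `string_size` / `string_type`); the class name, constructor, and the default width of 30 are assumptions for the sketch, not the exact dbt-sqlserver implementation.

```python
from typing import Optional


class SQLServerColumn:
    # nvarchar and nchar added alongside the existing string dtypes
    STRING_DATATYPES = {"char", "varchar", "nchar", "nvarchar"}

    def __init__(self, column: str, dtype: str, char_size: Optional[int] = None):
        self.column = column
        self.dtype = dtype
        self.char_size = char_size

    def is_string(self) -> bool:
        return self.dtype.lower() in self.STRING_DATATYPES

    def string_size(self) -> int:
        if not self.is_string():
            raise ValueError("Called string_size() on non-string field!")
        if self.char_size is None:
            return 30  # dbt's historical default width
        if self.dtype.lower() in ("nvarchar", "nchar"):
            # sys.columns reports max_length in bytes; nvarchar/nchar use
            # two bytes per character, so halve to get the character count
            return self.char_size // 2
        return self.char_size

    def string_type(self, size: int) -> str:
        # use this column's own dtype rather than the base class's
        # hard-coded "character varying"
        return f"{self.dtype}({size})"
```

With this, an `nvarchar` column whose reported `char_size` is 100 would render as `nvarchar(50)` instead of falling back to the 30-character default.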
Apologies for the noise, I renamed the branch, which closed the first pull request.
As you say, nvarchar uses twice the bytes to store the same length. Looking at the macros, we grab the max_length value from the sys.columns view, and I suspect that ends up being reported as the char_size. That means for nvarchar columns it's actually reporting double the size of the string that can be stored, and the code you're adding would handle this. It might instead need to be handled in the macro itself if the rest of the code expects char_size to be the number of characters a column can store, so that it's reported correctly throughout the codebase. I assume this means nchar columns would also be affected.
My reading of the existing code is that char_size should be the declared maximum number of characters allowed, not the number of bytes allocated; everywhere it is used, it is assumed to be a character count. To date I doubt many are using it at all, since nvarchar is not currently interpreted as a string type, so the width of any columns being created falls back to the default (which I think is 30 everywhere?). As per https://stackoverflow.com/questions/37187654/alternative-to-sys-columns-max-length we could potentially use INFORMATION_SCHEMA.COLUMNS instead of sys.columns in https://github.com/microsoft/dbt-fabric/blob/d3b8665f4e116d0fcbd14e544071f5d3cf7eb15e/dbt/include/fabric/macros/adapters/columns.sql#L9. This would probably require us to maintain a fork of that macro in https://github.com/dbt-msft/dbt-sqlserver/blob/master/dbt/include/sqlserver/macros/adapter/columns.sql unless we can get it fixed in the fabric adapter. Fabric doesn't have nvarchar datatypes, so I'm not sure whether it makes sense for them to fix it there.
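For illustration, the metadata query such a forked macro could run might look like the following. INFORMATION_SCHEMA.COLUMNS defines CHARACTER_MAXIMUM_LENGTH in characters rather than bytes, which sidesteps the nvarchar doubling entirely; the exact column list and placeholder names here are assumptions, not the macro's actual contents.

```python
# Sketch of a sys.columns replacement query for a forked
# get_columns_in_relation-style macro. CHARACTER_MAXIMUM_LENGTH is a
# character count, so nvarchar(50) reports 50, not 100.
GET_COLUMNS_IN_RELATION_SQL = """
select
    COLUMN_NAME,
    DATA_TYPE,
    CHARACTER_MAXIMUM_LENGTH,
    NUMERIC_PRECISION,
    NUMERIC_SCALE
from INFORMATION_SCHEMA.COLUMNS
where TABLE_SCHEMA = '{schema}' and TABLE_NAME = '{table}'
order by ORDINAL_POSITION
"""
```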
A SQL Server specific version of the macro that gets the column data seems the safest approach to me; as you say, it will not be implemented at the Fabric level. That should mean the only code change needed is the update to is_string to handle nchar and nvarchar. We will also want a variation of https://github.com/microsoft/dbt-fabric/blob/main/tests/functional/adapter/test_column_types.py to cover the new nvarchar and nchar handling.
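A sketch of the kind of assertion such a test would make, reduced to its core. `rendered_string_type` is a hypothetical helper standing in for the adapter's column rendering, and it assumes sys.columns-style byte lengths as input; the real dbt functional test wires this up through models and schema checks instead.

```python
def rendered_string_type(dtype: str, max_length_bytes: int) -> str:
    """Hypothetical helper: render a column type from the byte length
    sys.columns reports, halving for the two-byte-per-character types."""
    if dtype in ("nvarchar", "nchar"):
        chars = max_length_bytes // 2
    else:
        chars = max_length_bytes
    return f"{dtype}({chars})"


# the cases a dbt-sqlserver variant of test_column_types.py would cover
cases = {
    ("varchar", 20): "varchar(20)",
    ("nvarchar", 40): "nvarchar(20)",
    ("nchar", 40): "nchar(20)",
}
for (dtype, size), expected in cases.items():
    assert rendered_string_type(dtype, size) == expected
```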
I believe this mishandling of nvarchar is the root cause of #446
I ran into this issue when unioning multiple relations which used nvarchar. The generated code had no string length specified for the nvarchar fields, which resulted in the default string length of 30 being used.
I'm not sure my handling of string_type is best: it is a class method in the base class, but I think we want it to be instance-specific, since that is how the dtypes are handled.
Additionally, I'm not sure why char_size is twice the string length for nvarchar, but I noticed it in my testing, which is why I halve it in string_size(). I do understand that nvarchar uses twice the bytes of varchar, so I suspect it is related, but I don't know whether we should fix char_size to be the actual number of characters or just fix it in string_size.
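The factor of two almost certainly comes from storage: SQL Server stores nvarchar as UTF-16, i.e. two bytes per character for characters in the Basic Multilingual Plane, and sys.columns.max_length is a byte count. A quick illustration of the byte-versus-character relationship:

```python
# nvarchar storage is UTF-16: two bytes per (BMP) character, which is
# why a byte-based max_length is double the declared character count.
s = "hello"
assert len(s.encode("utf-16-le")) == 2 * len(s)  # 10 bytes for 5 characters
```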