[Discuss] Doris Roadmap 2025 #47948

morningman · 2025-02-16T14:10:43Z

Roadmap 2024
Roadmap 2023
Roadmap 2022

Apache Doris 2025 Roadmap

In 2025, Apache Doris will focus on lakehouse and semi-structured data analysis while continuing to optimize core areas such as query execution, storage, and the query optimizer. The goal is to further improve performance, stability, and ecosystem compatibility for more complex scenarios and large-scale data processing. Meanwhile, Doris will strengthen its cloud-native capabilities and security and explore AI integration scenarios, including vector search, AI training data management, and the use of AI to assist with system monitoring and operations, giving users a more comprehensive, efficient, and secure modern data analysis platform.

Lakehouse

1. Performance and Stability

IO Optimization
- Parquet/ORC lazy materialization for complex data types: Improve query performance for complex data types.
- Optimize Scan task scheduling: Reduce long-tail latency for small queries.
- Support dynamic partition pruning: Optimize query efficiency for partitioned tables.
- Optimize Data Cache small file issues: Resolve performance problems caused by too many small files.
Metadata Optimization
- Metadata cache sharing within single query: Improve query performance, reduce redundant metadata loading.
- Optimize Hive, Iceberg, Paimon metadata access performance: Improve metadata access and Plan performance.
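
To illustrate the idea behind dynamic partition pruning, here is a minimal sketch. It is not Doris code: the in-memory tables, the `is_holiday` filter, and the function names are all hypothetical. The point is only that the join keys surviving the filter on the small table are collected at runtime and then used to skip whole partitions of the large table.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical stand-ins: a fact table partitioned by `dt` and a small dimension table.
fact_partitions: Dict[str, List[dict]] = {
    "2025-01-01": [{"dt": "2025-01-01", "amount": 10}],
    "2025-01-02": [{"dt": "2025-01-02", "amount": 20}],
    "2025-01-03": [{"dt": "2025-01-03", "amount": 30}],
}
dim_rows: List[dict] = [
    {"dt": "2025-01-02", "is_holiday": True},
    {"dt": "2025-01-03", "is_holiday": False},
]


def runtime_partition_keys(dim: List[dict]) -> Set[str]:
    """Evaluate the dimension-side filter first and collect the surviving join keys."""
    return {row["dt"] for row in dim if row["is_holiday"]}


def scan_with_pruning(
    partitions: Dict[str, List[dict]], keep_keys: Set[str]
) -> Tuple[List[str], List[dict]]:
    """Scan only the partitions whose key appears in the runtime key set."""
    scanned: List[str] = []
    rows: List[dict] = []
    for key, part_rows in partitions.items():
        if key not in keep_keys:
            continue  # pruned: this partition is never read
        scanned.append(key)
        rows.extend(part_rows)
    return scanned, rows


keys = runtime_partition_keys(dim_rows)
scanned, rows = scan_with_pruning(fact_partitions, keys)
total = sum(r["amount"] for r in rows)
```

In this toy example only one of the three partitions is read; the other two are skipped without any I/O.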

2. Open Table Format

Iceberg
- Support Iceberg branch/tag access and management.
- Support more Iceberg system tables.
- Support Iceberg Update/Delete: Enhance write operation support for Iceberg tables.
- Support Iceberg small file compaction and Snapshot management.
- Support AWS S3Tables.
- Support Snowflake Iceberg table engine.
- Support Databricks Uniform DeltaLake table engine.
Paimon
- Support Paimon data write-back: Implement write support for Paimon data.
- Support Paimon snapshot read: Support historical data queries based on snapshots.
- Support more Paimon system tables.
Hive
- Support multi-Kerberos environment.
- Support multiple Hadoop configuration file management.
- Support Hive 4 transaction table.
Doris
- Support Doris Catalog: Provide federated queries across multiple Doris clusters.
Delta Lake/Hudi
- Optimize ecosystem compatibility with Iceberg.
Catalog
- Support Unity Catalog.
- Support Apache Polaris.
- Support Apache Gravitino.

3. Code Refactoring

Optimize and unify data source property names: Improve data source configuration consistency.
JDBC Catalog pluginization: Enhance JDBC Catalog extensibility.
File system pluginization: Improve file system pluggability.

Semi-structured and Log Analysis

1. Inverted Index Enhancement

Support more tokenizers
- Chinese ik tokenizer
- Unicode icu tokenizer
- High-performance simple tokenizer for log scenarios
Support custom dictionary and management for tokenizers.
Support incremental index building in disaggregated storage mode.
Further optimize inverted index space usage.
Enhanced index observability, including write and query performance metrics.

2. VARIANT Data Type Enhancement

Support 10,000 sub-columns in the compute-storage decoupled architecture.
Support more sparse sub-columns in sparse columns.
Support expansion of complex structures such as objects nested in JSON arrays.
Support specifying sub-column types.
Support building indexes on specified fields.

3. Log and Observability Ecosystem Improvement

Query Engine

1. Query Performance Optimization

Dynamic algorithm detection and adjustment for data skew: Optimize query execution, improve performance in big data scenarios.
ARM architecture tuning and optimization: Support more hardware architectures, improve operational efficiency.
Adaptive concurrency: Dynamically adjust parallel task numbers based on system load and resources, improve stability in query queue and spill scenarios.
More general top-n and global lazy-materialization ability.
Global dict.
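
As a rough illustration of the adaptive concurrency item above, the sketch below scales the number of parallel tasks by the remaining resource headroom. The function name, its inputs, and the linear scaling rule are assumptions made for the example, not the algorithm Doris will use.

```python
def choose_parallelism(
    max_parallelism: int, cpu_utilization: float, memory_utilization: float
) -> int:
    """Pick a parallel task count from the scarcest resource's headroom.

    Utilization values are fractions in [0, 1]. This is a hypothetical
    policy: scale linearly with headroom and never drop below one task.
    """
    headroom = 1.0 - max(cpu_utilization, memory_utilization)
    headroom = min(max(headroom, 0.0), 1.0)  # clamp against bad inputs
    return max(1, int(max_parallelism * headroom))


idle = choose_parallelism(16, 0.0, 0.0)       # no load: use full parallelism
half = choose_parallelism(16, 0.25, 0.5)      # memory is the bottleneck
saturated = choose_parallelism(16, 0.95, 0.1) # CPU nearly full: fall back to 1
```

A real implementation would also need smoothing so that parallelism does not oscillate as load changes; that is left out here.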

2. Resource Management

Unified resource management framework providing resource auditing and observability for query, load, compaction, and schema change.
Provide real-time resource monitoring system tables and metrics.
Unify resource control logic such as Workload Group Policy, Spill Disk, and Query Breaker.
Smarter scheduling algorithm to allocate resources among multiple queries in a single workload group, reducing interference between large and small queries.

3. Vector Search

Support vector search

4. Function Compatibility

Enhance function compatibility with ClickHouse.
Enhance function compatibility with Presto.

5. ETL

Combine workload groups with spill to disk to dynamically reduce concurrency and limit per-query resource usage, avoiding query cancellation during resource shortages.
Enhance the stability of spill to disk; for example, support 5 concurrent TPC-DS 10TB jobs on a 48-core, 192 GB memory cluster.
Provide real-time metrics for the spill stage.
Enhance mixed-load memory management.

Storage and Security

1. Compute-Storage Decoupled

Optimize cold reads on object storage: Improve cold data read performance.
More user-friendly Cache strategies: Optimize Cache strategy configuration and usage.
More user-friendly read-write separation.
Support more cloud vendor authentication methods: Enhance security in cloud environments.
- IAM Role authentication

2. Security

Support storage encryption: Enhance data storage security.
Improve HTTP interface security, including HTTPS support and interface authentication.

3. ETL Enhancement

Support temporary tables: Enhance data processing capabilities in ETL scenarios.
Support write-write conflict handling in multi-statement transactions: Improve transaction operation reliability.

4. Disaster Recovery and High Availability

Support backup and recovery in the compute-storage decoupled architecture.
Cross-cluster replication (CCR)
- Feature completeness: Ensure production environment stability through thorough chaos testing.
- Support disaggregated storage: Improve CCR adaptation in cloud-native architecture.
- Support primary-secondary switchover: Enhance high availability capabilities.

5. Real-time Data Streaming

Support Binlog for incremental computation: Support real-time data streaming scenarios.

Query Optimizer

1. Asynchronous Materialized Views

Data lake table format (Iceberg, Paimon, Hudi) partition incremental build: Improve materialized view build efficiency.
Enhance observability using monitoring information and system tables: Improve operational capabilities.
Data lineage information interface: Provide data lineage tracking capabilities.
Logical view and materialized view interconversion: Improve view management flexibility.
Automatic materialized views: Implement intelligent management of materialized views.

2. Feature Enhancement

Recursive CTE: Support recursive queries.
Filter aggregation (FILTER Clause): Improve SQL feature standard compatibility.
Pivot and Unpivot: Support data pivoting and unpivoting operations.
More reasonable implicit type conversion rules: Optimize type conversion logic.
Standard SQL compatibility improvement: Enhance standard SQL support.
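
For readers unfamiliar with recursive CTEs, the sketch below shows the standard SQL form using Python's built-in `sqlite3` module. It runs against SQLite, not Doris, and the `employee` table is hypothetical; the syntax Doris eventually supports may differ.

```python
import sqlite3

# In-memory SQLite database with a hypothetical reporting hierarchy.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER)"
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (1, "ceo", None),
        (2, "vp", 1),
        (3, "engineer", 2),
        (4, "intern", 3),
        (5, "advisor", None),  # not under the ceo, so never reached
    ],
)

# The anchor member selects the root; the recursive member repeatedly joins
# the rows found so far back to the table to walk one level further down.
rows = conn.execute(
    """
    WITH RECURSIVE reports(id, name, depth) AS (
        SELECT id, name, 0 FROM employee WHERE id = 1
        UNION ALL
        SELECT e.id, e.name, r.depth + 1
        FROM employee e JOIN reports r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth
    """
).fetchall()
conn.close()
```

Here `rows` holds every employee reachable from the root together with its depth in the hierarchy, which cannot be expressed with a fixed number of self-joins.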

3. Execution Optimization

Compression materialization: Optimize storage space utilization.
Global lazy materialization: Improve query performance.
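
A minimal sketch of the lazy materialization idea, assuming a toy columnar layout: narrow columns are read first to decide which rows survive, and the wide column is fetched only for those rows. The column names and the `top_n_lazy` helper are hypothetical and do not reflect Doris internals.

```python
from typing import Dict, List, Tuple

# Toy columnar table: `message` stands in for a wide, expensive-to-read column.
columns: Dict[str, list] = {
    "ts": [5, 3, 9, 1, 7],
    "level": ["INFO", "ERROR", "ERROR", "INFO", "ERROR"],
    "message": ["m0", "m1", "m2", "m3", "m4"],
}


def top_n_lazy(cols: Dict[str, list], n: int) -> List[Tuple[int, str]]:
    """Return the n most recent ERROR rows, reading `message` only for them."""
    # Phase 1: touch only the narrow columns needed for filtering and ordering.
    candidates = [i for i, lvl in enumerate(cols["level"]) if lvl == "ERROR"]
    top_ids = sorted(candidates, key=lambda i: cols["ts"][i], reverse=True)[:n]
    # Phase 2: materialize the wide column for the surviving row ids only.
    return [(cols["ts"][i], cols["message"][i]) for i in top_ids]


result = top_n_lazy(columns, 2)
```

Only two of the five `message` values are ever read, which is where the saving comes from when the wide column is large.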

4. Plan Quality Enhancement

5. Plan Management

Execution plan fixing: Support plan controllability.
Execution plan evolution: Improve plan flexibility and intelligence.

6. Framework Optimization

Small query scenario planning performance optimization: Improve small query execution efficiency.
Old optimizer code removal: Simplify code maintenance.

7. Operations Enhancement

Statistics status collection monitoring and system tables: Improve statistics observability.
Planning time monitoring and system tables: Enhance query planning diagnostic capabilities.
Enrich query-related information in audit logs: Improve audit capabilities.
Error message categorization and content optimization: Improve error message readability and diagnostic capabilities.

htyoung · 2025-02-18T09:49:11Z

The upgrade with no impact feature is not in the 2025 Roadmap, but it is mentioned in the 2024 Roadmap. Does the community have plans for this feature or has it been postponed?

dsproten · 2025-02-25T21:56:25Z

What is the idea behind supporting these 3 new features? Does this mean we can federate queries to Snowflake/DB?
If so, why? Is it to help migrate away from Databricks and Snowflake onto VeloDB?
- Support Snowflake Iceberg table engine.
- Support Databricks Uniform DeltaLake table engine.
- Support Unity Catalog.

I am happy to see these 2 features:
- IAM Role authentication
- Query Optimizer: Data lineage information interface: Provide data lineage tracking capabilities

Execution plan tweaking can be a very powerful feature, but in the past I have seen that it can also cause engine crashes.
How is it made safe?
- Plan Management: Execution plan fixing: Support plan controllability.

morningman · 2025-03-01T15:23:22Z

> The upgrade with no impact feature is not in the 2025 Roadmap, but it is mentioned in the 2024 Roadmap. Does the community have plans for this feature or has it been postponed?

What we are trying to do is minimize the impact when users upgrade between bug-fix versions, such as from 2.1.2 to 2.1.3, taking particular care with commits that change behavior or involve large code refactoring, so that the release branch is more stable.
But what you mentioned may be "no impact during a rolling upgrade of the Doris cluster"? If so, there is no plan for that, because it is hard to keep all tasks working well during a rolling upgrade. What we are trying to do is minimize the upgrade time, so that users can simply retry their tasks after a failure.
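
As an illustration of the retry approach, here is a minimal client-side sketch. It is not part of Doris: `run_with_retry`, `flaky_query`, and the choice of `ConnectionError` as the retryable error are assumptions made for the example.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def run_with_retry(
    task: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `task`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Hypothetical task that fails twice, as if a node were restarting mid-upgrade.
calls = {"n": 0}


def flaky_query() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("backend restarting")
    return "ok"


delays: List[float] = []
result = run_with_retry(flaky_query, sleep=delays.append)  # record delays instead of sleeping
```

Retrying like this is only safe for tasks that are idempotent or transactional, so that a partially executed attempt does not leave duplicate data behind.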

morningman · 2025-03-01T15:30:59Z

> What is the idea behind supporting these 3 new features? Does this mean we can federate queries to Snowflake/DB?
> If so, why? Is it to help migrate away from Databricks and Snowflake onto VeloDB?
> - Support Snowflake Iceberg table engine.
> - Support Databricks Uniform DeltaLake table engine.
> - Support Unity Catalog.
>
> I am happy to see these 2 features:
> - IAM Role authentication
> - Query Optimizer: Data lineage information interface: Provide data lineage tracking capabilities
>
> Execution plan tweaking can be a very powerful feature, but in the past I have seen that it can also cause engine crashes.
> How is it made safe?
> - Plan Management: Execution plan fixing: Support plan controllability.

Yes. Since both Snowflake and Databricks support the Iceberg table format and have open-sourced their catalogs (Polaris and Unity), I think we can easily access the data on them and help users either migrate their data from one to the other or simply run federated queries.
If the engine crashes after plan tweaking, I am sure that is a bug, so we need to fix it. But what we are trying to do is make the query plan more stable, to avoid plans changing after unrelated features are introduced.

morningman added Discuss kind/community Issues or PRs related to Doris community labels Feb 16, 2025

morningman pinned this issue Feb 16, 2025
