[Discuss] Doris Roadmap 2025 #47948

Open
morningman opened this issue Feb 16, 2025 · 4 comments
Labels
Discuss kind/community Issues or PRs related to Doris community

Comments

@morningman
Contributor

morningman commented Feb 16, 2025

Roadmap 2024
Roadmap 2023
Roadmap 2022

Apache Doris 2025 Roadmap

In 2025, Apache Doris will focus on lakehouse and semi-structured data analysis while continuing to optimize core areas such as query execution, storage, and the query optimizer. The goal is to further improve performance, stability, and ecosystem compatibility for more complex scenarios and large-scale data processing. Doris will also strengthen its cloud-native capabilities and security, and explore AI integration scenarios, including vector search, AI training data management, and AI-assisted system monitoring and operations, to give users a more comprehensive, efficient, and secure modern data analysis platform.

Lakehouse

1. Performance and Stability

  • IO Optimization
    • Parquet/ORC lazy materialization for complex data types: Improve query performance for complex data types.
    • Optimize Scan task scheduling: Reduce long-tail latency for small queries.
    • Support dynamic partition pruning: Optimize query efficiency for partitioned tables.
    • Optimize Data Cache small file issues: Resolve performance problems caused by too many small files.
  • Metadata Optimization
    • Metadata cache sharing within single query: Improve query performance, reduce redundant metadata loading.
    • Optimize Hive, Iceberg, Paimon metadata access performance: Improve metadata access and Plan performance.
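
To make the dynamic partition pruning item above concrete, here is a minimal conceptual sketch. It is not Doris code: the `Partition` class, the `prune_partitions` helper, and the sample data are all hypothetical, and a single partition column is assumed. The idea shown is only that join-key values produced at runtime by one side of a join can be used to skip partitions on the other side before any files are read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """A hypothetical fact-table partition keyed by a single partition value."""
    key: str      # value of the partition column, e.g. a date string
    files: tuple  # data files that belong to this partition


def prune_partitions(partitions, build_side_keys):
    """Keep only partitions whose key appears among the runtime join keys.

    `build_side_keys` stands for the join-key values produced at runtime by
    the already filtered side of the join (an assumption of this sketch).
    """
    wanted = set(build_side_keys)
    return [p for p in partitions if p.key in wanted]


# Hypothetical fact table partitioned by date.
fact_partitions = [
    Partition("2025-01-01", ("a.parquet",)),
    Partition("2025-01-02", ("b.parquet", "c.parquet")),
    Partition("2025-01-03", ("d.parquet",)),
]

# Suppose the filtered dimension side yields only these dates at runtime.
dim_keys = ["2025-01-02", "2025-01-03", "2025-01-03"]

to_scan = prune_partitions(fact_partitions, dim_keys)
scanned_files = [f for p in to_scan for f in p.files]
```

In this toy setup the partition for 2025-01-01 is never opened, which is where the saving for partitioned tables would come from.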

2. Open Table Format

  • Iceberg

    • Support Iceberg branch/tag access and management.
    • Support more Iceberg system tables.
    • Support Iceberg Update/Delete: Enhance write operation support for Iceberg tables.
    • Support Iceberg small file compaction and Snapshot management.
    • Support AWS S3Tables.
    • Support Snowflake Iceberg table engine.
    • Support Databricks UniForm Delta Lake table engine.
  • Paimon

    • Support Paimon data write-back: Implement write support for Paimon data.
    • Support Paimon snapshot read: Support historical data queries based on snapshots.
    • Support more Paimon system tables.
  • Hive

    • Support multi-Kerberos environment.
    • Support multiple Hadoop configuration file management.
    • Support Hive 4 transaction table.
  • Doris

    • Support Doris Catalog: Provide federated queries across multiple Doris clusters.
  • Delta Lake/Hudi

    • Optimize ecosystem compatibility with Iceberg.
  • Catalog

    • Support Unity Catalog.
    • Support Apache Polaris.
    • Support Apache Gravitino.

3. Code Refactoring

  • Optimize and unify data source property names: Improve data source configuration consistency.
  • JDBC Catalog pluginization: Enhance JDBC Catalog extensibility.
  • File system pluginization: Improve file system pluggability.

Semi-structured and Log Analysis

1. Inverted Index Enhancement

  • Support more tokenizers
    • Chinese ik tokenizer
    • Unicode icu tokenizer
    • High-performance simple tokenizer for log scenarios
  • Support custom dictionary and management for tokenizers.
  • Support incremental index building in disaggregated storage mode.
  • Further optimize inverted index space usage.
  • Enhanced index observability, including write and query performance metrics.

2. VARIANT Data Type Enhancement

  • Support 10,000 sub-columns in the compute-storage decoupled architecture.
  • Support more sparse sub-columns in sparse columns.
  • Support complex structure expansion of JSON arrays with nested objects.
  • Support specifying sub-column types.
  • Support building indexes for specified fields.
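
As a rough illustration of what "sub-columns" and the expansion of JSON arrays with nested objects could mean, the sketch below flattens a JSON document into dotted paths. It is only a conceptual model: the `flatten` helper and the path naming are assumptions made for this example and say nothing about how Doris actually stores or expands VARIANT data.

```python
def flatten(doc, prefix=""):
    """Flatten a JSON object into {dotted.path: value} pairs.

    Nested objects become dotted paths. For an array of objects, each path
    maps to a list with one entry per element (None where a key is missing).
    """
    out = {}
    for key, value in doc.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            out.update(flatten(value, path + "."))
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            # Array of objects: expand every element, then gather values per path.
            expanded = [flatten(v, path + ".") for v in value]
            paths = []
            for row in expanded:
                for p in row:
                    if p not in paths:
                        paths.append(p)
            for p in paths:
                out[p] = [row.get(p) for row in expanded]
        else:
            # Scalars and arrays of scalars are kept as a single value.
            out[path] = value
    return out


# Hypothetical log record with an array of nested objects.
record = {
    "user": {"id": 7},
    "tags": ["a", "b"],
    "events": [{"type": "click", "pos": {"x": 1}}, {"type": "view"}],
}
columns = flatten(record)
```

Each resulting path (for example `events.pos.x`) plays the role of one sub-column in this model, which is also why sparse paths that appear in only a few records matter for storage.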

3. Log and Observability Ecosystem Improvement

  • Output plugin supports writing to multiple tables

    • Filebeat
    • Logstash
  • Observability ecosystem integration

    • OpenTelemetry
    • Jaeger
  • Support more log collector plugins

    • iLogtail
    • Vector

Query Engine

1. Query Performance Optimization

  • Dynamic algorithm detection and adjustment for data skew: Optimize query execution, improve performance in big data scenarios.
  • ARM architecture tuning and optimization: Support more hardware architectures, improve operational efficiency.
  • Adaptive concurrency: Dynamically adjust parallel task numbers based on system load and resources, improve stability in query queue and spill scenarios.
  • More general top-n and global lazy-materialization ability.
  • Global dictionary.

2. Resource Management

  • Unified resource management framework for resource auditing and observability across query, load, compaction, and schema change.
  • Provide real-time resource monitoring system tables and metrics.
  • Unify resource control logic such as Workload Group Policy, Spill Disk, and Query Breaker.
  • Smarter scheduling algorithm for allocating resources among multiple queries in a single workload group, reducing interference between large and small queries.

3. Vector Search

  • Support vector search

4. Function Compatibility

  • Enhance function compatibility with ClickHouse.
  • Enhance function compatibility with Presto.

5. ETL

  • Combine workload groups with spill to disk to dynamically reduce concurrency and limit per-query resource usage, avoiding query cancellation during resource shortages.
  • Enhance the stability of spill to disk; for example, support 5 concurrent TPC-DS 10 TB jobs on a 48-core, 192 GB memory cluster.
  • Provide real-time metrics for the spill stage.
  • Enhance mix-load memory management.

Storage and Security

1. Compute-Storage Decoupled

  • Optimize cold reads on object storage: Improve cold data read performance.
  • More user-friendly Cache strategies: Optimize Cache strategy configuration and usage.
  • More user-friendly read-write separation.
  • Support more cloud vendor authentication methods: Enhance security in cloud environments.
    • IAM Role authentication

2. Security

  • Support storage encryption: Enhance data storage security.
  • Improve HTTP interface security, including HTTPS support and interface authentication.

3. ETL Enhancement

  • Support temporary tables: Enhance data processing capabilities in ETL scenarios.
  • Support write-write conflict handling in multi-statement transactions: Improve transaction operation reliability.

4. Disaster Recovery and High Availability

  • Support backup and recovery in the compute-storage decoupled architecture.
  • Cross-cluster replication (CCR)
    • Feature completeness: Ensure production environment stability through thorough chaos testing.
    • Support disaggregated storage: Improve CCR adaptation in cloud-native architecture.
    • Support primary-secondary switchover: Enhance high availability capabilities.

5. Real-time Data Streaming

  • Support Binlog for incremental computation: Support real-time data streaming scenarios.

Query Optimizer

1. Asynchronous Materialized Views

  • Data lake table format (Iceberg, Paimon, Hudi) partition incremental build: Improve materialized view build efficiency.
  • Enhance observability using monitoring information and system tables: Improve operational capabilities.
  • Data lineage information interface: Provide data lineage tracking capabilities.
  • Logical view and materialized view interconversion: Improve view management flexibility.
  • Automatic materialized views: Implement intelligent management of materialized views.

2. Feature Enhancement

  • Recursive CTE: Support recursive queries.
  • Filter aggregation (FILTER Clause): Improve SQL feature standard compatibility.
  • Pivot and Unpivot: Support data pivoting and unpivoting operations.
  • More reasonable implicit type conversion rules: Optimize type conversion logic.
  • Standard SQL compatibility improvement: Enhance standard SQL support.
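
For readers unfamiliar with two of the SQL features listed above, the following sketch shows the standard semantics of a recursive CTE and of the FILTER clause. It runs against SQLite through Python's built-in `sqlite3` module purely for illustration (FILTER on aggregates needs SQLite 3.30 or newer); the tables and data are made up, and the syntax Doris eventually accepts may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE emp (id INTEGER, name TEXT, manager_id INTEGER);
    INSERT INTO emp VALUES (1, 'root', NULL), (2, 'a', 1), (3, 'b', 2), (4, 'c', 1);
    CREATE TABLE orders (region TEXT, status TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('east', 'paid', 10), ('east', 'refunded', 4),
        ('west', 'paid', 7),  ('west', 'paid', 3);
""")

# Recursive CTE: walk the reporting hierarchy downward from the root employee.
chain = conn.execute("""
    WITH RECURSIVE reports(id, name, depth) AS (
        SELECT id, name, 0 FROM emp WHERE manager_id IS NULL
        UNION ALL
        SELECT e.id, e.name, r.depth + 1
        FROM emp AS e JOIN reports AS r ON e.manager_id = r.id
    )
    SELECT name, depth FROM reports ORDER BY depth, id
""").fetchall()

# FILTER clause: each aggregate sees only the rows matching its own predicate.
totals = conn.execute("""
    SELECT region,
           SUM(amount) FILTER (WHERE status = 'paid')     AS paid_total,
           COUNT(*)    FILTER (WHERE status = 'refunded') AS refunds
    FROM orders
    GROUP BY region
    ORDER BY region
""").fetchall()
```

Without FILTER, the second query would typically be written with `CASE WHEN` expressions inside each aggregate, which is more verbose and easier to get wrong.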

3. Execution Optimization

  • Compression materialization: Optimize storage space utilization.
  • Global lazy materialization: Improve query performance.

4. Plan Quality Enhancement

  • HBO support.
  • Enhance optimization rules like constant propagation, NULL propagation.
  • Enhance optimization rules utilizing data characteristics.
  • Data skew adaptive optimization.
  • Common subplan extraction.
  • Cost-based CTE materialization selection.
  • Cost-based aggregation stage selection.
  • Runtime Filter wait time adaptation.
  • Enhance Shuffle algorithm selection for distributed plans.
  • Adaptive parallelism control.

5. Plan Management

  • Execution plan fixing: Support plan controllability.
  • Execution plan evolution: Improve plan flexibility and intelligence.

6. Framework Optimization

  • Small query scenario planning performance optimization: Improve small query execution efficiency.
  • Old optimizer code removal: Simplify code maintenance.

7. Operations Enhancement

  • Statistics status collection monitoring and system tables: Improve statistics observability.
  • Planning time monitoring and system tables: Enhance query planning diagnostic capabilities.
  • Enrich query-related information in audit logs: Improve audit capabilities.
  • Error message categorization and content optimization: Improve error message readability and diagnostic capabilities.
@morningman morningman added Discuss kind/community Issues or PRs related to Doris community labels Feb 16, 2025
@morningman morningman pinned this issue Feb 16, 2025
@htyoung
Contributor

htyoung commented Feb 18, 2025

The upgrade with no impact feature is not in the 2025 Roadmap, but it is mentioned in the 2024 Roadmap. Does the community have plans for this feature or has it been postponed?

@dsproten

  1. What is the idea behind supporting these 3 new features? Does this mean we can federate queries to Snowflake/DB?
    If so, why? Is it to help migrate away from Databricks and Snowflake onto VeloDB?
    • Support Snowflake Iceberg table engine.
    • Support Databricks Uniform DeltaLake table engine.
    • Support Unity Catalog.

  2. I am happy to see these 2 features:
    • IAM Role authentication
    • Query Optimizer: Data lineage information interface: Provide data lineage tracking capabilities

  3. Execution plan tweaking can be a very powerful feature, but in the past I have seen that it can also cause engine crashes.
    How is it made safe?
    • Plan Management - Execution plan fixing: Support plan controllability.

@morningman
Contributor Author

The upgrade with no impact feature is not in the 2025 Roadmap, but it is mentioned in the 2024 Roadmap. Does the community have plans for this feature or has it been postponed?

What we are trying to do is minimize the impact when users upgrade between bug-fix versions, such as from 2.1.2 to 2.1.3, paying particular attention to commits that change behavior or involve large code refactoring, so that the release branch is more stable.
But perhaps you mean "no impact when doing a rolling upgrade of a Doris cluster"? If so, there is no plan for that, because it is hard to keep all tasks working well during a rolling upgrade. What we are trying to do is minimize the upgrade time, so that users can simply retry their tasks after a failure.
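
On the client side, "simply retry" could look something like the sketch below. This is a generic pattern, not anything shipped with Doris: `run_with_retry`, its parameters, and the backoff schedule are assumptions for illustration, and `task` stands for whatever query or load job the application submits.

```python
import time


def run_with_retry(task, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run `task()`; on failure wait and retry, up to `attempts` tries in total."""
    for attempt in range(1, attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

A real client would normally catch only the specific connection or "node unavailable" errors of its driver rather than every `Exception`, and should retry only tasks that are safe to run more than once.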

@morningman
Contributor Author

  1. What is the idea behind supporting these 3 new features? Does this mean we can federate queries to Snowflake/DB?
    If so, why? Is it to help migrate away from Databricks and Snowflake onto VeloDB?
    • Support Snowflake Iceberg table engine.
    • Support Databricks Uniform DeltaLake table engine.
    • Support Unity Catalog.

  2. I am happy to see these 2 features:
    • IAM Role authentication
    • Query Optimizer: Data lineage information interface: Provide data lineage tracking capabilities

  3. Execution plan tweaking can be a very powerful feature, but in the past I have seen that it can also cause engine crashes.
    How is it made safe?
    • Plan Management - Execution plan fixing: Support plan controllability.

  1. Yes. Since both Snowflake and Databricks support the Iceberg table format and have open-sourced their catalogs (Polaris and Unity), I think we can easily access the data in them, helping users either migrate their data from one to the other or simply run federated queries.

  2. If the engine crashes after plan tweaking, I am sure that is a bug, and we need to fix it. But what we are trying to do is make the query plan more stable, to avoid plans changing after unrelated features are introduced.


3 participants