api,agent,server,engine-schema: scalability improvements #9840
base: main
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #9840 +/- ##
============================================
- Coverage 15.78% 15.60% -0.19%
- Complexity 12564 12624 +60
============================================
Files 5627 5631 +4
Lines 492250 492895 +645
Branches 61405 59709 -1696
============================================
- Hits 77710 76906 -804
- Misses 406066 407490 +1424
- Partials 8474 8499 +25
Flags with carried forward coverage won't be shown.
2c750db to 080e5af
Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write, configurable using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer - existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals

Signed-off-by: Abhishek Kumar <[email protected]>
Co-authored-by: Abhishek Kumar <[email protected]>
Co-authored-by: Rohit Yadav <[email protected]>
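As a rough illustration of the handshake settings described above, the sketch below bounds concurrent handshakes with a thread pool sized by a minimum and maximum worker count, and abandons a handshake that exceeds a timeout. This is not the PR's code: the class `HandshakeWorkers`, its method names, the 60-second idle keep-alive, and the choice to treat a saturated pool as a failed attempt are all assumptions made for illustration, using only `java.util.concurrent`.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; names and policies are illustrative, not taken from the PR. */
class HandshakeWorkers {

    /**
     * Builds a pool that keeps minWorkers threads alive and grows to at most
     * maxWorkers. A SynchronousQueue means tasks are never queued: once all
     * workers are busy, further submissions are rejected.
     */
    static ThreadPoolExecutor newPool(int minWorkers, int maxWorkers) {
        return new ThreadPoolExecutor(minWorkers, maxWorkers,
                60L, TimeUnit.SECONDS, new SynchronousQueue<>());
    }

    /**
     * Runs one handshake on the pool and waits at most timeoutSeconds for it.
     * Returns false on timeout, failure, interruption, or pool saturation.
     */
    static boolean runWithTimeout(ExecutorService pool, Callable<Boolean> handshake,
                                  long timeoutSeconds) {
        Future<Boolean> future;
        try {
            future = pool.submit(handshake);
        } catch (RejectedExecutionException e) {
            return false; // assumption: a saturated pool counts as a failed attempt
        }
        try {
            return Boolean.TRUE.equals(future.get(timeoutSeconds, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it is freed for other connections
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return false;
        } catch (ExecutionException e) {
            return false; // the handshake itself threw
        }
    }
}
```

In this sketch the three parameters play the roles of `agent.ssl.handshake.min.workers`, `agent.ssl.handshake.max.workers` and `agent.ssl.handshake.timeout`. An agent whose attempt fails would presumably retry after its `backoff.seconds` delay.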
080e5af to e3cf7fd
Honestly, I don't like PRs with thousands of lines doing thousands of things. They are hard to review and test. I encourage you to split this into several smaller PRs, each addressing one of the changes you are proposing.
Signed-off-by: Abhishek Kumar <[email protected]>
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11441
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11486
@blueorangutan test
@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11738)
Signed-off-by: Abhishek Kumar <[email protected]>
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Great job @shwstppr.
Overall LGTM.
It seems this will eliminate a large number of database queries and therefore improve performance considerably.
engine/schema/src/main/java/com/cloud/storage/dao/StoragePoolHostDaoImpl.java
server/src/main/java/org/apache/cloudstack/acl/RoleManagerImpl.java
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11496
Signed-off-by: Abhishek Kumar <[email protected]>
Thanks @weizhouapache for the review. I've made the changes as per your suggestions and responded to the queries.
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11505
@blueorangutan test
@shwstppr a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
@blueorangutan test
@weizhouapache a [SL] Trillian-Jenkins test job (ol8 mgmt + kvm-ol8) has been kicked to run smoke tests
[SF] Trillian test result (tid-11749)
[SF] Trillian test result (tid-11754)
minor comment, overall LGTM. Nice work @shwstppr
@@ -58,6 +58,7 @@ public InputStream[] getPrepareScripts() {
@Override
Making a note to change this file after 4.20 release
Signed-off-by: Abhishek Kumar <[email protected]>
Thank you @harikrishna-patnala for the review.
@GutoVeronezi thank you for the feedback and suggestion. I agree with the rationale for smaller PRs, but I would request that you and others consider this one an exception. The work in this PR is part of a larger scalability effort, which I'll also be presenting at CCC. In this PR, the focus is mainly on the persistence layer and infrastructure entities. There are also some changes to the schema and other areas that were quick fixes and yield performance improvements. I'll add more details after the conference, including some benchmarking tests and unit tests.
@blueorangutan package
@shwstppr a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM, XenServer and VMware SystemVM templates. I'll keep you posted as I make progress.
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ debian ✔️ suse15. SL-JID 11549
Description
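Following changes and improvements have been added:

- Improvements in handling of PingRoutingCommand
  1. Added global config - `vm.sync.power.state.transitioning`, default value: true, to control syncing of power states for transitioning VMs. This can be set to false to prevent computation of transitioning state VMs.
  2. Improved VirtualMachinePowerStateSync to allow power state sync for host VMs in a batch
  3. Optimized scanning stalled VMs
- Added option to set worker threads for capacity calculation using config - `capacity.calculate.workers`
- Added caching framework based on Caffeine in-memory caching library, https://github.com/ben-manes/caffeine
- Added caching for account/use role API access with expiration after write, configurable using config - `dynamic.apichecker.cache.period`. If set to zero then there will be no caching. Default is 0.
- Added caching for account/use role API access with expiration after write set to 60 seconds.
- Added caching for some recurring DB retrievals
  1. CapacityManager - listing service offerings - beneficial in host capacity calculation
  2. LibvirtServerDiscoverer - existing host for the cluster - beneficial for host joins
  3. DownloadListener - hypervisors for zone - beneficial for host joins
  4. VirtualMachineManagerImpl - VMs in progress - beneficial for processing stalled VMs during PingRoutingCommands
- Optimized MS list retrieval for agent connect
- Optimized finding ready systemvm template for zone
- Database retrieval optimisations - fix and refactor for cases where only IDs or counts are used, mainly for hosts and other infra entities. Also similar cases for VMs and other entities related to host concerning background tasks
- Changes in agent-agentmanager connection with NIO client-server classes
  1. Optimized the use of the executor service
  2. Refactored Agent class to better handle connections.
  3. Do SSL handshakes within worker threads
  4. Added global configs to control the behaviour depending on the infra. SSL handshake could be a bottleneck during agent connections. Configs - `agent.ssl.handshake.min.workers` and `agent.ssl.handshake.max.workers` can be used to control the number of new connections the management server handles at a time. `agent.ssl.handshake.timeout` can be used to set the number of seconds after which the SSL handshake times out at the MS end.
  5. On the agent side, backoff and SSL handshake timeout can be controlled by agent properties. `backoff.seconds` and `ssl.handshake.timeout` properties can be used.
- Improvements in StatsCollection - minimize DB retrievals.
- Improvements in DeploymentPlanner allow for the retrieval of only desired host fields and fewer retrievals.
- Improvements in hosts connection for a storage pool. Added config - `storage.pool.host.connect.workers` to control the number of worker threads that can be used to connect hosts to a storage pool. Worker thread approach is followed currently only for NFS and ScaleIO pools.
- Minor improvements in resource limit calculations wrt DB retrievals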
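To make the `dynamic.apichecker.cache.period` semantics in the description concrete (entries expire a fixed time after they are written, and a period of zero means no caching), here is a minimal sketch. The PR itself builds on Caffeine. This stand-in uses only the JDK so that it is self-contained, and the class name `ExpireAfterWriteCache`, the injectable clock, and the loader callback are assumptions for illustration rather than the PR's API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical JDK-only stand-in for an expire-after-write cache; not the PR's Caffeine code. */
class ExpireAfterWriteCache<K, V> {

    /** A cached value together with the time it was written. */
    private static final class Entry<V> {
        final V value;
        final long writtenAtMillis;

        Entry(V value, long writtenAtMillis) {
            this.value = value;
            this.writtenAtMillis = writtenAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long periodMillis;
    private final LongSupplier clockMillis;

    /**
     * @param periodSeconds lifetime of an entry after it is written; 0 disables caching
     * @param clockMillis   time source, injectable so expiry can be exercised in tests
     */
    ExpireAfterWriteCache(long periodSeconds, LongSupplier clockMillis) {
        this.periodMillis = periodSeconds * 1000L;
        this.clockMillis = clockMillis;
    }

    /** Returns the cached value if it is still fresh, otherwise reloads and stores it. */
    V get(K key, Function<K, V> loader) {
        if (periodMillis <= 0) {
            return loader.apply(key); // caching disabled: always hit the backing store
        }
        long now = clockMillis.getAsLong();
        Entry<V> cached = entries.get(key);
        if (cached != null && now - cached.writtenAtMillis < periodMillis) {
            return cached.value;
        }
        V fresh = loader.apply(key);
        entries.put(key, new Entry<>(fresh, now));
        return fresh;
    }
}
```

With a period of 60 seconds, repeated access checks for the same key inside that window reach the loader (standing in for the database lookup) only once. With a period of 0, every call goes to the loader, which matches the stated default.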
Types of changes
Feature/Enhancement Scale or Bug Severity
Feature/Enhancement Scale
Bug Severity
Screenshots (if appropriate):
How Has This Been Tested?
How did you try to break this feature and the system with this change?