Update without downtime #6870
-
Thank you, @nikita-sakharin. We prefer GitHub discussions over email lists or Slack because they are easier to access and discover. Regarding rolling upgrades, the main concern is compatibility between different versions and the implications of running in mixed mode. Currently, before upgrading we ensure that all writes in OM/SCM are flushed and replicated and that no new mutating requests are accepted. To enable rolling upgrades, we need to guarantee that any writes accepted during the upgrade are compatible with older versions, both to allow for downgrade and so that requests are processed correctly across all nodes. We also need to correctly share Datanode version information with the client, so that the client does not invoke a newer API against a Datanode that is running an older version. There is a lot more to enabling rolling upgrades than what I have laid out, but take a look at the current upgrade framework. Ref:
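To make the client-side concern concrete, here is a minimal sketch of version-gated API selection. It is not Ozone code: the API names, the version numbers, and the `DatanodeInfo` type are all hypothetical, and it assumes the client has already learned each Datanode's version by whatever means the cluster propagates it.

```python
from dataclasses import dataclass

# Hypothetical API names mapped to the minimum Datanode version that
# supports them. These are illustrative, not real Ozone identifiers.
MIN_VERSION_FOR_API = {
    "read_chunk": 1,
    "read_chunk_v2": 2,
}


@dataclass(frozen=True)
class DatanodeInfo:
    host: str
    version: int  # version the Datanode reported (assumed to be known to the client)


def choose_read_api(datanode: DatanodeInfo) -> str:
    """Pick the newest read API that the target Datanode is known to support."""
    candidates = [
        api
        for api, min_version in MIN_VERSION_FOR_API.items()
        if datanode.version >= min_version
    ]
    if not candidates:
        raise RuntimeError(f"{datanode.host} supports no known read API")
    # Prefer the API with the highest minimum version, i.e. the newest one.
    return max(candidates, key=MIN_VERSION_FOR_API.__getitem__)


if __name__ == "__main__":
    print(choose_read_api(DatanodeInfo("dn-old", 1)))
    print(choose_read_api(DatanodeInfo("dn-new", 2)))
```

The point of the sketch is only that the decision is made per Datanode, so a client can talk to a mixed-version cluster without ever sending a request the receiving node cannot parse.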
-
Hi @nikita-sakharin, first of all, let me thank you for your interest and continued efforts to get this figured out. You may look at the original JIRA for non-rolling upgrades, HDDS-3698, and the spillover items under HDDS-5444, to get a grasp of the problems we needed to think about and solve for non-rolling upgrades to go smoothly. With rolling upgrades, you need to think of an awful lot of things. Request/response payload compatibility is somewhat solved, at least for Ozone Manager, under HDDS-6390. However, that is a half-baked solution: the same ideas are applied in the request processing code of SCM, and maybe the DN as well, but without the proposed annotations and such.
The biggest problem, as far as I remember, is how we handle transactions in Ratis, and afaik that has not changed since then. In Ratis we have the state machine, which applies the transactions in order on the current leader; the transaction data is sent to the other OMs, and they also apply the transaction to their own state separately. Once the majority has applied the transaction, it is committed and acked back to the client. Now, if we have two different versions of the code, we need to make sure that transactions are processed identically on the two OMs or SCMs and are applied to the DB in exactly the same way by the two different codebases.
There are a few other problems, such as Recon. Recon gets the OM DB changes via an API and has to be able to process all the changes that happen within OM, so if the DB format is changing, you need to guarantee that Recon is also upgraded before you finalize the upgrade in the OMs. A similar problem occurs with heartbeats: Datanodes also send heartbeats to the Recon server, so compatibility needs to be taken care of there too before the new features are turned on by finalizing the upgrade.
I am not sure, but maybe @szetszwo can add some thoughts on Ratis version differences and whether you need to consider those as well. Ratis is an embedded Raft implementation, but it has its own protocols, which might need some care when different versions are running within one Raft ring. I am certain that I have forgotten to mention, or am ignorant of, a few things, so I hope @errose28 will have the time to correct and complement this list. At the time we worked on non-rolling upgrades, we had some discussions about the problems with rolling upgrades, but we did not go deep enough to figure out a feasible solution. I hope I have not dampened your ambitions with all the things listed above. I am sure this is a huge undertaking, and I hope we can give you enough support to help you figure out a way and implement it.
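As an illustration of the "applied exactly the same way" requirement, here is a toy sketch of one possible approach: each transaction carries a layout version stamped when it is accepted, and every replica chooses its apply behaviour from that stamp rather than from its own software version. This is an assumption made for illustration, not how Ozone or Ratis implements it; `Transaction`, `Replica`, and the length-prefixed "newer format" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Transaction:
    index: int
    key: str
    value: str
    layout_version: int  # stamped once by the leader (hypothetical field)


@dataclass
class Replica:
    software_version: int
    db: dict = field(default_factory=dict)

    def apply(self, txn: Transaction) -> None:
        """Apply a transaction; behaviour depends only on the transaction's stamp."""
        if txn.layout_version > self.software_version:
            # An old replica must refuse what it cannot interpret instead of
            # silently diverging from the rest of the ring.
            raise RuntimeError(
                f"txn {txn.index} needs layout {txn.layout_version}, "
                f"replica only understands {self.software_version}"
            )
        if txn.layout_version >= 2:
            # Hypothetical newer on-disk format: length-prefixed value.
            self.db[txn.key] = f"{len(txn.value)}:{txn.value}"
        else:
            self.db[txn.key] = txn.value


if __name__ == "__main__":
    log = [Transaction(1, "a", "x", 1), Transaction(2, "b", "y", 1)]
    old, new = Replica(software_version=1), Replica(software_version=2)
    for txn in log:
        old.apply(txn)
        new.apply(txn)
    # Both replicas end up with the same state despite running different code.
    print(old.db == new.db)
```

In this sketch, as long as the leader keeps stamping transactions with the old layout version until every replica has been upgraded, an old and a new replica produce identical DB contents from the same log. An old replica rejects a transaction stamped with a layout it does not understand, which is the situation a finalization step would have to rule out.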
-
Hi @nikita-sakharin, I can add to the explanations already provided here. As you have probably gathered from the responses above, rolling upgrades will be a large cross-cutting feature in Ozone, touching every component in a different way. While supporting them is on the roadmap, no design or code work has started in the community, and there is currently no timeline for when work will start or how long it will take. As an open source project we are open to any ideas or designs you would like to share, but I recommend starting with some smaller Ozone tasks to get familiar with the system before diving into a task as challenging as this. That said, I can fill you in on the current state of upgrades, downgrades, and compatibility within Ozone so you can start exploring this.
Currently Supported by Ozone
I gave a presentation on this back in 2022. Unfortunately it was not recorded, but the slides are here if you want to reference them. The current state of upgrade and cross-compatibility support is still the same:
Current Testing
If you would like to try these tests out locally and are having difficulty running them, I can provide more details to help you get set up. Docker images allow us to test the current version of the code against previous releases by pulling them from Docker Hub exactly as they were released.
Current Challenges
I would summarize the following as the biggest challenges for rolling upgrades in Ozone. Ideally I would say that only "request versioning" would be the issue that needs to be solved, but due to architecture decisions that were made early on in the system there are other issues as well.
Ratis
-
@kerneltime, @errose28, @fapifta It took me some time to get acquainted with the information you gave. I have examined all the issues and the presentation you mentioned, and I now have some understanding of the problem. As far as I understand, updating without downtime is a pretty huge feature in terms of the code changes required. As I mentioned in Monday's meeting, I would like to split this feature into subtasks. Could you kindly point me to some sub-issues or subtasks that this feature depends on? If possible, I would like something small and simple to start with, so that I can implement something tiny and take a step forward.
-
I have already asked my question by email. I also attended the meeting on Monday and was advised to move the conversation to GitHub. The following text partially repeats that email.
I would like to research the possibility of updating an Ozone cluster from one version to a higher one without downtime.
Suppose I have an Ozone cluster configured for High Availability, so there are multiple Ozone Managers and multiple Storage Container Managers.
There are two things to discuss:
By rolling update I mean the following:
I am looking forward to links clarifying the current status of this issue and to any thoughts or ideas on the implementation.
With that information, I will try to draft a concept or implementation plan. Once the community has reviewed the plan, I will try to implement it.