To support Meta's high availability, we need to do some refactoring in the Meta service:
meta_client
Some helpful resources. Please let me know if I forgot something.
Looking forward to working on this task. However, I am still working on a different kernel task and will be at KubeCon next week. I will start on this at the beginning of November at the earliest.
First draft version ready: Failover works, but Hummock crashes: https://github.com/risingwavelabs/risingwave/issues/6534
support connecting to multiple meta-node on compute-node
@yezizp2012 I do not understand this part: My understanding is that the meta HA setup is a single-leader system (see design docs). Therefore we only ever connect to one instance. Is my understanding correct, @fuyufjh?
IIUC, this should refer to the refactoring of the meta client: the meta client should be able to connect to multiple meta nodes and retrieve leader information from them, which requires a get_leader_addr interface in meta.
Yes. After that we can modify risedev to provide multiple meta addresses to compute nodes and frontend nodes. Currently it only picks the first one (see risedev warning)
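For illustration, here is a minimal sketch of the client-side lookup discussed above. It assumes that every reachable meta node, leader or follower, can answer a get_leader_addr call with the current leader's address. The `MetaCluster` trait, the `FakeCluster` type, the `discover_leader` helper, and the addresses are all hypothetical and are not taken from the RisingWave codebase. A real client would issue an RPC instead of a map lookup.

```rust
use std::collections::HashMap;

/// Hypothetical result of asking one meta node for the leader.
#[derive(Clone, Debug, PartialEq)]
enum NodeState {
    /// The node did not answer.
    Unreachable,
    /// The node (leader or follower) reports the leader's address.
    LeaderIs(String),
}

/// Hypothetical stand-in for the get_leader_addr interface.
trait MetaCluster {
    fn get_leader_addr(&self, addr: &str) -> NodeState;
}

/// In-memory fake used instead of real network calls.
struct FakeCluster {
    nodes: HashMap<String, NodeState>,
}

impl MetaCluster for FakeCluster {
    fn get_leader_addr(&self, addr: &str) -> NodeState {
        // Unknown addresses are treated as unreachable.
        self.nodes
            .get(addr)
            .cloned()
            .unwrap_or(NodeState::Unreachable)
    }
}

/// Try each configured meta address in order and return the first
/// leader address that any reachable node reports.
fn discover_leader<C: MetaCluster>(cluster: &C, addrs: &[&str]) -> Option<String> {
    for addr in addrs {
        if let NodeState::LeaderIs(leader) = cluster.get_leader_addr(addr) {
            return Some(leader);
        }
    }
    None
}

fn main() {
    let mut nodes = HashMap::new();
    nodes.insert("meta-0:5690".to_string(), NodeState::Unreachable);
    nodes.insert(
        "meta-1:5690".to_string(),
        NodeState::LeaderIs("meta-2:5690".to_string()),
    );
    let cluster = FakeCluster { nodes };

    // meta-0 is down, so the follower meta-1 answers with the leader's address.
    let leader = discover_leader(&cluster, &["meta-0:5690", "meta-1:5690"]);
    println!("{:?}", leader);
}
```

With this shape, a compute or frontend node only needs the list of meta addresses. It still talks to a single leader, which matches the single-leader design.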
Thank you very much for clarifying. I would suggest that the change in the client and the change in meta, which enables a follower service that supports get_leader_addr, be done in separate PRs.
First version of the failover handling is done. Please have a look: https://github.com/risingwavelabs/risingwave/pull/6466
A few things are still missing from the PR:
added a new task "support connecting to multiple meta-node on compute-node"
Is this still up to date? I do not see the subtask @skyzh
That should be a part of #6755.
Tests are still running, but my guess is that https://github.com/risingwavelabs/risingwave/pull/6771 is ready for review. This is a minor task overall, but merging it would still simplify further steps. Let me know if you have any objections to or suggestions for the PR.
I currently have 2 PRs that are ready for review.
https://github.com/risingwavelabs/risingwave/pull/6771: Introduces graceful shutdown of services. This is only indirectly related to the HA setup. It will simplify further development.
https://github.com/risingwavelabs/risingwave/pull/6792: Introduces the election mechanism, but does not introduce any failover (see features and limitations here). My expectation is that this does not affect our current setup, because we only ever deploy one meta node, which hopefully always becomes leader. The PR reduces code coverage, because I deleted a unit test that is no longer needed in an HA setup (also see #3398).
CC @arkbriar
https://github.com/risingwavelabs/risingwave/pull/6937 is ready for review. Please view the issue for features and limitations.
The fencing PR is ready for review. The CI pipeline is green. Also see https://github.com/risingwavelabs/risingwave/issues/6786.
Maybe we can get this merged before New Year's :)
Fencing is merged. Thank you very much for your guidance and approval @yezizp2012