Opening this issue to collect some scenarios that should be covered by tests. Feel free to add more.
```
This node lost its leadership. New host address is 127.0.0.1:15690. Killing leader
```
I would like to discuss how these tests should be written. What kind of testing framework should we use? Is it possible to use our simulation tests for this? The advantage would be that we may be able to reproduce bugs, because the simulation tests are only pseudo-random. However, I do not think that we are able to control the randomness inside etcd or the randomness of the leader-election part. Do you think it is possible/feasible to write these tests as simulation tests @wangrunji0408?
An alternative would be to use e2e tests: just start clusters, kill nodes, and check which nodes are alive and what the logs say. This would be very close to reality, but it may result in flaky tests.
Please let me know what you think and whether you have better suggestions.
CC @arkbriar @yezizp2012
I think most of the scenarios can be covered by unit tests, just like the original `test_leader_lease`. Once the whole HA setup is ready, we can start multiple meta nodes in the deterministic recovery test and kill any or some of them (just like what we currently do with other components such as FE/CN/Compactor; at the moment we deploy only one meta node and always kill it when the kill option is enabled).
I have another question regarding the unit tests:

I believe that executing `test_that_we_only_have_one_leader_node` once is not enough. Bad situations in the election setup occur only because of some random coincidence. I therefore believe that we should repeat these tests with some pseudo-randomness baked in: we could, e.g., delay the startup times of the different meta nodes or start different numbers of meta nodes. Repeating these tests would give us more certainty, and using pseudo-randomness could make it possible to reproduce errors.
Does this approach make sense to you, or should we instead just rely on the deterministic sim tests? IMHO the repeated unit tests would give us the opportunity to catch election (and related) bugs early.
If it does make sense, here are two follow-up questions:
> I have another question regarding the unit tests:
>
> I believe that executing `test_that_we_only_have_one_leader_node` once is not enough. Bad situations in the election setup occur only because of some random coincidence. I therefore believe that we should repeat these tests with some pseudo-randomness baked in: we could, e.g., delay the startup times of the different meta nodes or start different numbers of meta nodes. Repeating these tests would give us more certainty, and using pseudo-randomness could make it possible to reproduce errors.
I think you can do all of these as unit tests. Feel free to introduce tests for any specific scenarios: delaying the startup times, starting multiple nodes, etc.
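To make such randomized runs reproducible, the test can derive its startup delays and node counts from a logged seed. Below is a minimal self-contained sketch using a tiny LCG (so no external crate is needed); the comments about spawning meta nodes describe hypothetical test logic, not actual RisingWave APIs.

```rust
/// Tiny deterministic LCG so a failing run can be replayed from its seed.
/// (Constants are from Knuth's MMIX generator.)
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }

    /// Roughly uniform value in [lo, hi).
    fn in_range(&mut self, lo: u64, hi: u64) -> u64 {
        lo + self.next() % (hi - lo)
    }
}

fn main() {
    // Log the seed so a failing run can be replayed exactly.
    let seed: u64 = 42;
    let mut rng = Lcg(seed);
    for run in 0..3 {
        let node_count = rng.in_range(1, 6); // 1..=5 meta nodes
        let delays_ms: Vec<u64> = (0..node_count).map(|_| rng.in_range(0, 500)).collect();
        println!("run {run}: nodes={node_count}, startup delays (ms)={delays_ms:?}");
        // In the real test we would spawn `node_count` meta nodes, sleep
        // `delays_ms[i]` before starting node i, and then assert that
        // exactly one node reports itself as leader.
    }
}
```

The key property is that the same seed always yields the same node counts and delays, so a failure report only needs to include the seed.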
- Can we execute these tests in parallel? Running, e.g., 100 tests would take quite some time if we executed them sequentially.
We can run any unit test in parallel; you can define the repetition logic inside the unit test.
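One way to keep the wall-clock time down is to run the repetitions themselves on threads, each with its own seed. This is only a sketch: `exactly_one_leader` is a hypothetical placeholder for the real election check, not an existing function.

```rust
use std::thread;

// Hypothetical stand-in for the real check: the actual test would start
// meta nodes (with seeded startup delays) and count leadership claims.
fn exactly_one_leader(_seed: u64) -> bool {
    true
}

fn main() {
    // Run 8 independent repetitions in parallel, each with its own seed,
    // so a failure message names the seed needed to replay that run.
    let handles: Vec<_> = (0..8u64)
        .map(|seed| {
            thread::spawn(move || {
                assert!(exactly_one_leader(seed), "election check failed, seed={seed}");
            })
        })
        .collect();
    for h in handles {
        h.join().expect("a repetition panicked");
    }
    println!("all repetitions passed");
}
```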
- How do we log this? When we run into an error, we need to make sure that we can differentiate between the different meta nodes.
I guess some panic information is already enough to identify the problem.
And back to the random coincidences you mentioned above: I think we can leave those to the deterministic sim tests. From my previous experience, they will help us find many more cases than we can think of. 😄
Failover tests and single leader setup tests are added in https://github.com/risingwavelabs/risingwave/pull/6937
Closing this issue, since we decided to go without unit tests and instead use simulation tests only (see the PR comment).