Is your feature request related to a problem? Please describe.

Opening this issue to collect some scenarios that should be covered by tests. Feel free to add more.

  • [ ] There should still be only 1 leader after deleting the leader lease info in ETCD
  • [ ] There should still be only 1 leader after deleting the lease info in ETCD
  • [ ] Leader failover should trigger the fencing mechanism in the leader node
  • [ ] One follower node should become leader after leader failover
  • [ ] All follower nodes survive leader failover
  • [ ] Starting one meta node will result in one leader
  • [ ] Starting two meta nodes will result in one leader and one follower
  • [ ] Fencing mechanism: the old leader should always report the correct new leader, e.g. "This node lost its leadership. New host address is 127.0.0.1:15690. Killing leader"
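Most of the scenarios above share one invariant: at any point, the live cluster contains exactly one leader. As a minimal sketch (the `NodeState` and `count_leaders` names are hypothetical, not from the RisingWave codebase), the check a test would assert could look like:

```rust
// Hypothetical model of meta node states, only for illustrating the
// single-leader invariant; these names are not from the actual codebase.
#[derive(Debug, Clone, Copy, PartialEq)]
enum NodeState {
    Leader,
    Follower,
    Down,
}

/// The invariant the scenarios share: count how many live leaders exist.
fn count_leaders(nodes: &[NodeState]) -> usize {
    nodes.iter().filter(|s| **s == NodeState::Leader).count()
}

fn main() {
    // e.g. after a failover: old leader down, one follower promoted
    let cluster = [NodeState::Down, NodeState::Leader, NodeState::Follower];
    assert_eq!(count_leaders(&cluster), 1);
    println!("single-leader invariant holds");
}
```

A real test would populate the node states by querying the cluster (or the etcd lease info) rather than hardcoding them.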

I would like to discuss how these tests should be written. What kind of testing framework should we use? Is it possible to use our simulation tests for this? The advantage would be that we may be able to reproduce bugs, because the simulation tests are only pseudo-random. However, I do not think that we are able to control the randomness inside ETCD or the randomness of the leader election itself. Do you think it is possible/feasible to write these as simulation tests @wangrunji0408?

An alternative would be to use e2e tests. Just start clusters, kill nodes and check which nodes are alive and what the logs say. This would be very close to reality, however may result in flaky tests.

Please let me know what you think and if you got better suggestions.

CC @arkbriar @yezizp2012


I think most of the scenarios can be covered by unit tests, just like the original test_leader_lease test. Once the whole HA setup is ready, we can start multiple meta nodes in the deterministic recovery test and kill any or all of them (just like what we currently do to other components such as FE/CN/Compactor; today we only deploy one meta node and always kill it when the kill option is enabled).


Okay. Thank you very much for the suggestion 👍


Agree with @yezizp2012's idea on the simulation test. 👍


I have another question regarding the unit tests:

I believe that executing test_that_we_only_have_one_leader_node once is not enough. Bad situations in the election setup only occur because of some random coincidence. I thus believe that we should repeat these tests with some pseudo-randomness baked in. We could e.g. delay the startup times of the different meta nodes or start different numbers of meta nodes. Repeating these tests would give us more certainty, and using pseudo-randomness would make it possible to reproduce errors.

Does this approach make sense to you? Or should we instead just rely on the deterministic sim tests? IMHO this would give us the opportunity to catch election (and related) bugs early.
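To make the idea concrete, here is a rough sketch of seed-driven repetition. The `Lcg` stand-in RNG is just so the example needs no crates, and the node-startup part is a placeholder; the point is that logging the seed makes any failing repetition reproducible:

```rust
// Sketch of seed-driven test repetition. `Lcg` is a minimal stand-in RNG;
// a real test would use a proper seedable RNG crate.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // constants from Knuth's MMIX LCG
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn main() {
    let seed: u64 = 42; // print this on failure so the run can be replayed
    let mut rng = Lcg(seed);
    for round in 0..3 {
        // vary both the number of meta nodes and their startup delays
        let node_count = 2 + (rng.next() % 3) as usize; // 2..=4 nodes
        let delays_ms: Vec<u64> = (0..node_count).map(|_| rng.next() % 500).collect();
        println!("round {round}: seed={seed} nodes={node_count} delays={delays_ms:?} ms");
        // a real test would start `node_count` meta nodes with these startup
        // delays and then assert that exactly one of them becomes leader
    }
}
```

Because the generator is fully determined by the seed, re-running with the logged seed replays the exact same node counts and delays.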

If it does make sense, here are two follow-up questions:

  • Can we execute these tests in parallel? Running e.g. 100 tests here would take quite some time if we execute them sequentially.
  • How do we log this? When running into an error, we need to make sure that we can differentiate between the different meta nodes.

> I have another question regarding the unit tests:
>
> I believe that executing test_that_we_only_have_one_leader_node once is not enough. Bad situations in the election setup only occur because of some random coincidence. I thus believe that we should repeat these tests with some pseudo-randomness baked in. We could e.g. delay the startup times of the different meta nodes or start different numbers of meta nodes. Repeating these tests would give us more certainty, and using pseudo-randomness would make it possible to reproduce errors.

I think you can do all of these in unit tests. Feel free to introduce tests for any specific scenario: delaying the startup times, starting multiple nodes, etc.

> Can we execute these tests in parallel? Running e.g. 100 tests here would take quite some time if we execute them sequentially.

We can run any unit test in parallel; you can define the repetition logic inside the unit test.
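One way to keep 100 repetitions fast inside a single test is to fan them out over threads. In this sketch, `one_leader_elected` is a hypothetical placeholder for the real election check:

```rust
use std::thread;

// `one_leader_elected` is a hypothetical placeholder: a real test would
// start a cluster for the given seed and count the leaders.
fn one_leader_elected(_seed: u64) -> bool {
    true
}

fn main() {
    // fan 100 seeded repetitions out over threads
    let handles: Vec<_> = (0..100u64)
        .map(|seed| thread::spawn(move || one_leader_elected(seed)))
        .collect();
    // join them all and count how many repetitions passed
    let passed = handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&ok| ok)
        .count();
    assert_eq!(passed, 100);
    println!("all {passed} repetitions elected exactly one leader");
}
```

In practice the thread count would be capped (e.g. to the number of cores), and each failure message would carry its seed so the offending repetition can be replayed.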

> How do we log this? When running into an error, we need to make sure that we can differentiate between the different meta nodes.

I guess some panic information is already enough to identify the problem.

And coming back to the random coincidences you mentioned above, I think we can leave those to the deterministic sim tests. From my previous experience, they will help us find many more cases than we can think of. 😄


Failover tests and single leader setup tests are added in https://github.com/risingwavelabs/risingwave/pull/6937


Closing this issue, since we decided to go without unit tests and instead rely on simulation tests only (see the PR comment).
