Reproduction

# Launch a multi-node GCP cluster
sky launch -y -d -c test --num-nodes 2 --cloud gcp tests/test_yamls/minimal.yaml

# Try sky status --refresh. This works fine.
sky status test --refresh

# Schedule autostop
sky autostop -y test -i 1

# Wait for autostop
sleep 120

# Try refreshing status - fails
sky status test --refresh
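
Because the failure only shows up for a short window, a fixed sleep can miss it. Below is a minimal sketch of a polling helper that reruns a command and timestamps each attempt; the name poll_cmd and its arguments are hypothetical and not part of the sky CLI.

```shell
# Hypothetical helper (not part of sky): run a command repeatedly and
# print a timestamped header before each attempt.
# Usage: poll_cmd ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
poll_cmd() {
  attempts=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    echo "== attempt $i at $(date +%H:%M:%S) =="
    # Keep polling even if the command fails; report its exit code.
    "$@" 2>&1 || echo "(exit code $?)"
    # Sleep between attempts, but not after the last one.
    [ "$i" -lt "$attempts" ] && sleep "$interval"
    i=$((i + 1))
  done
}

# Example, replacing the fixed sleep above (12 attempts, 10 s apart):
#   poll_cmd 12 10 sky status test --refresh
```

The attempt count and interval in the example are arbitrary; they only need to cover the period in which the worker is stopping.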

Log

[email protected]:/sky_repo$ sky status test --refresh
Refreshing status for 1 cluster ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% -:--:--
W 01-26 08:49:45 backend_utils.py:1346] Expected 1 worker IP(s); found 0: []
W 01-26 08:49:45 backend_utils.py:1346] This could happen if there is extra output from `ray get-worker-ips`, which should be inspected below.
W 01-26 08:49:45 backend_utils.py:1346] == Output ==
W 01-26 08:49:45 backend_utils.py:1346]
W 01-26 08:49:45 backend_utils.py:1346]
W 01-26 08:49:45 backend_utils.py:1346] == Output ends ==
Clusters
NAME  LAUNCHED    RESOURCES              STATUS   AUTOSTOP  COMMAND
test  5 mins ago  2x GCP(n2-standard-8)  STOPPED  1m        sky launch -y -d -c test ...

1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh

Notes

  • This happens only for a short window (~1-2 min?) while the worker is stopping (checked in the cloud console). Running sky status test --refresh afterwards works normally and shows the status as STOPPED.
  • This happens only when using autostop. I could not reproduce it by manually stopping the cluster with sky stop test and then refreshing the status.
  • This bug is also caught by test_smoke.py::test_autostop
  • Not fully sure if this is related to multi-node or GCP - need to check if it also happens for single node and on different clouds.

Just to add: in test_autostop the status is shown as INIT instead of UP or STOPPED, which causes the test to fail. Logs here.
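
To check which status a refresh reports without reading the table by eye, something like the following sketch could be used. It assumes the table layout shown in the log above and only the three status values mentioned in this thread (INIT, UP, STOPPED); cluster_status is a hypothetical helper, not a sky command.

```shell
# Hypothetical helper: read `sky status` output on stdin and print the
# status of the named cluster. Assumes the cluster name is the first
# column and the status is the first INIT/UP/STOPPED token on that row.
cluster_status() {
  awk -v name="$1" '
    $1 == name {
      for (i = 2; i <= NF; i++) {
        if ($i == "INIT" || $i == "UP" || $i == "STOPPED") {
          print $i
          exit
        }
      }
    }
  '
}

# Example:
#   sky status test --refresh | cluster_status test
```

The LAUNCHED column contains spaces ("5 mins ago"), so the sketch scans for a known status token rather than relying on a fixed column index.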


Partially addressed by #1634 and #1638, removing P0 label
