System Info

My code runs inside an NVIDIA docker container nvcr.io/nvidia/pytorch:22.05-py3.

The installed dependencies are listed here: https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html#framework-matrix-2022

I'm using the following versions for transformers and deepspeed:

  • transformers==4.24.0
  • deepspeed==0.7.5

Who can help?

@stas00

Information

  • [ ] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

I want to train a model on multiple GPUs. The server I'm using has 8x A100 GPUs with 40GB each. I'm using DeepSpeed ZeRO-3 to partition the model across GPUs. Unfortunately, the code "hangs" mid-execution and runs forever.

I can run the same code successfully on a different server with V100 GPUs, so I am assuming the issue might be related to the communication between the GPUs? Not sure.

Below are the files I am using. I have also attached the output of the script below.

Thanks for your help!

Deepspeed config file:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto",
            "warmup_type": "linear"
        }
    },
    "zero_optimization": {
        "stage": 3,
        "stage3_gather_16bit_weights_on_model_save": true,
        "reduce_scatter": true,
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 100,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}

Minimal python example:

import os

from transformers import AutoConfig, AutoModelForSequenceClassification, TrainingArguments, HfArgumentParser, Trainer


def main():
    parser = HfArgumentParser(TrainingArguments)
    training_args = parser.parse_args_into_dataclasses()[0]

    config = AutoConfig.from_pretrained(
        "facebook/opt-1.3b",
        cache_dir=os.getenv("HF_MODELS_CACHE"),
    )

    model = AutoModelForSequenceClassification.from_pretrained(
        "facebook/opt-1.3b",
        from_tf=False,
        config=config,
        cache_dir=os.getenv("HF_MODELS_CACHE"),
    )

    trainer = Trainer(
        model=model,
        args=training_args,
    )


if __name__ == "__main__":
    main()

bash script to start the python script:

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
export CUDA_LAUNCH_BLOCKING=1

export HF_MODELS_CACHE=/cache-dir
OUTPUT_DIR=/output-dir

deepspeed \
    --num_gpus 2 \
    --master_port 60000 \
    ./debug.py \
    --output_dir $OUTPUT_DIR \
    --deepspeed ./deepspeed_configs/ds_config_zero3.json

What happens:

  • The code runs forever. No error message is shown.

Expected behavior

The script terminates successfully.


This is the output produced by the minimal example. It keeps running forever and does not produce any new output.

Detected CUDA_VISIBLE_DEVICES=GPU-460af155,GPU-457e4df4,GPU-08f1eba5,GPU-4793f3fd,GPU-cbc5b6ef,GPU-aa661638,GPU-a39d482a,GPU-dc0ceb93 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2022-11-24 14:38:15,640] [INFO] [runner.py:508:main] cmd = /home/mmosbach/miniconda3/envs/llmft/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=60000 /home/mmosbach/projects/llmft/debug.py --output_dir /home/mmosbach/logs/llmft/logfiles --deepspeed /home/mmosbach/projects/llmft/deepspeed_configs/ds_config_zero3.json
[2022-11-24 14:38:18,207] [INFO] [launch.py:135:main] 0 NCCL_VERSION=2.12.10+cuda11.6
[2022-11-24 14:38:18,207] [INFO] [launch.py:135:main] 0 NCCL_DEBUG_SUBSYS=ALL
[2022-11-24 14:38:18,207] [INFO] [launch.py:135:main] 0 NCCL_DEBUG=INFO
[2022-11-24 14:38:18,207] [INFO] [launch.py:142:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2022-11-24 14:38:18,207] [INFO] [launch.py:148:main] nnodes=1, num_local_procs=2, node_rank=0
[2022-11-24 14:38:18,207] [INFO] [launch.py:161:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2022-11-24 14:38:18,208] [INFO] [launch.py:162:main] dist_world_size=2
[2022-11-24 14:38:18,208] [INFO] [launch.py:164:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2022-11-24 14:38:24,319] [INFO] [comm.py:633:init_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
mmosbach-20307:535:535 [0] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
mmosbach-20307:535:535 [0] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
mmosbach-20307:535:535 [0] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
mmosbach-20307:535:535 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
mmosbach-20307:535:535 [0] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
mmosbach-20307:535:535 [0] NCCL INFO cudaDriverVersion 11070
NCCL version 2.14.3+cuda11.7
mmosbach-20307:535:535 [0] NCCL INFO init.cc:1147 Cuda Host Alloc Size 4 pointer 0x7f18dc200000
mmosbach-20307:536:536 [1] NCCL INFO cudaDriverVersion 11070
mmosbach-20307:535:717 [0] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mmosbach-20307:535:717 [0] NCCL INFO P2P plugin IBext
mmosbach-20307:535:717 [0] NCCL INFO NET/IB : No device found.
mmosbach-20307:535:717 [0] NCCL INFO NET/IB : No device found.
mmosbach-20307:535:717 [0] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
mmosbach-20307:535:717 [0] NCCL INFO Using network Socket
mmosbach-20307:536:536 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
mmosbach-20307:536:536 [1] NCCL INFO NET/Plugin: Failed to find ncclNetPlugin_v6 symbol.
mmosbach-20307:536:536 [1] NCCL INFO NET/Plugin: Loaded net plugin NCCL RDMA Plugin (v4)
mmosbach-20307:536:536 [1] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
mmosbach-20307:536:536 [1] NCCL INFO NET/Plugin: Loaded coll plugin SHARP (v4)
mmosbach-20307:536:536 [1] NCCL INFO init.cc:1147 Cuda Host Alloc Size 4 pointer 0x7feb60200000
mmosbach-20307:536:718 [1] NCCL INFO Plugin Path : /opt/hpcx/nccl_rdma_sharp_plugin/lib/libnccl-net.so
mmosbach-20307:536:718 [1] NCCL INFO P2P plugin IBext
mmosbach-20307:536:718 [1] NCCL INFO NET/IB : No device found.
mmosbach-20307:536:718 [1] NCCL INFO NET/IB : No device found.
mmosbach-20307:536:718 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
mmosbach-20307:536:718 [1] NCCL INFO Using network Socket
mmosbach-20307:536:718 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'
mmosbach-20307:535:717 [0] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eth0'
mmosbach-20307:536:718 [1] NCCL INFO transport/p2p.cc:151 Cuda Alloc Size 2097152 pointer 0x7feb60c00000
mmosbach-20307:536:718 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
mmosbach-20307:535:717 [0] NCCL INFO transport/p2p.cc:151 Cuda Alloc Size 2097152 pointer 0x7f18dcc00000
mmosbach-20307:536:718 [1] NCCL INFO CPU/0 (1/2/-1)
mmosbach-20307:536:718 [1] NCCL INFO + PCI[5000.0] - NIC/0
mmosbach-20307:536:718 [1] NCCL INFO + PCI[24.0] - GPU/1000 (0)
mmosbach-20307:536:718 [1] NCCL INFO + PCI[24.0] - GPU/25000 (1)
mmosbach-20307:536:718 [1] NCCL INFO ==========================================
mmosbach-20307:536:718 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/25000 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) 
mmosbach-20307:536:718 [1] NCCL INFO GPU/25000 :GPU/1000 (2/24.000000/PHB) GPU/25000 (0/5000.000000/LOC) CPU/0 (1/24.000000/PHB) 
mmosbach-20307:536:718 [1] NCCL INFO Setting affinity for GPU 1 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
mmosbach-20307:535:717 [0] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
mmosbach-20307:535:717 [0] NCCL INFO CPU/0 (1/2/-1)
mmosbach-20307:535:717 [0] NCCL INFO + PCI[5000.0] - NIC/0
mmosbach-20307:535:717 [0] NCCL INFO + PCI[24.0] - GPU/1000 (0)
mmosbach-20307:535:717 [0] NCCL INFO + PCI[24.0] - GPU/25000 (1)
mmosbach-20307:535:717 [0] NCCL INFO ==========================================
mmosbach-20307:535:717 [0] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/25000 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB) 
mmosbach-20307:535:717 [0] NCCL INFO GPU/25000 :GPU/1000 (2/24.000000/PHB) GPU/25000 (0/5000.000000/LOC) CPU/0 (1/24.000000/PHB) 
mmosbach-20307:535:717 [0] NCCL INFO Setting affinity for GPU 0 to ffffffff,ffffffff,00000000,00000000,ffffffff,ffffffff
mmosbach-20307:536:718 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1
mmosbach-20307:536:718 [1] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:536:718 [1] NCCL INFO  1 : GPU/0 GPU/1
mmosbach-20307:536:718 [1] NCCL INFO Pattern 1, crossNic 0, nChannels 2, bw 22.000000/22.000000, type PHB/PIX, sameChannels 0
mmosbach-20307:536:718 [1] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:536:718 [1] NCCL INFO  1 : GPU/1 GPU/0
mmosbach-20307:536:718 [1] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 22.000000/22.000000, type PHB/PIX, sameChannels 0
mmosbach-20307:536:718 [1] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:536:718 [1] NCCL INFO  1 : GPU/1 GPU/0
mmosbach-20307:535:717 [0] NCCL INFO Pattern 4, crossNic 0, nChannels 2, bw 12.000000/12.000000, type PHB/PIX, sameChannels 1
mmosbach-20307:535:717 [0] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:535:717 [0] NCCL INFO  1 : GPU/0 GPU/1
mmosbach-20307:535:717 [0] NCCL INFO Pattern 1, crossNic 0, nChannels 2, bw 22.000000/22.000000, type PHB/PIX, sameChannels 0
mmosbach-20307:535:717 [0] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:535:717 [0] NCCL INFO  1 : GPU/1 GPU/0
mmosbach-20307:535:717 [0] NCCL INFO Pattern 3, crossNic 0, nChannels 2, bw 22.000000/22.000000, type PHB/PIX, sameChannels 0
mmosbach-20307:535:717 [0] NCCL INFO  0 : GPU/0 GPU/1
mmosbach-20307:535:717 [0] NCCL INFO  1 : GPU/1 GPU/0
mmosbach-20307:536:718 [1] NCCL INFO Tree 0 : 0 -> 1 -> -1/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Tree 2 : 0 -> 1 -> -1/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Tree 1 : -1 -> 1 -> 0/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Tree 3 : -1 -> 1 -> 0/-1/-1
mmosbach-20307:535:717 [0] NCCL INFO Tree 0 : -1 -> 0 -> 1/-1/-1
mmosbach-20307:535:717 [0] NCCL INFO Tree 2 : -1 -> 0 -> 1/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Ring 00 : 0 -> 1 -> 0
mmosbach-20307:535:717 [0] NCCL INFO Tree 1 : 1 -> 0 -> -1/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Ring 01 : 0 -> 1 -> 0
mmosbach-20307:535:717 [0] NCCL INFO Tree 3 : 1 -> 0 -> -1/-1/-1
mmosbach-20307:536:718 [1] NCCL INFO Ring 02 : 0 -> 1 -> 0
mmosbach-20307:536:718 [1] NCCL INFO Ring 03 : 0 -> 1 -> 0
mmosbach-20307:535:717 [0] NCCL INFO Channel 00/04 :    0   1
mmosbach-20307:536:718 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
mmosbach-20307:535:717 [0] NCCL INFO Channel 01/04 :    0   1
mmosbach-20307:535:717 [0] NCCL INFO Channel 02/04 :    0   1
mmosbach-20307:536:718 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mmosbach-20307:535:717 [0] NCCL INFO Channel 03/04 :    0   1
mmosbach-20307:535:717 [0] NCCL INFO Ring 00 : 1 -> 0 -> 1
mmosbach-20307:535:717 [0] NCCL INFO Ring 01 : 1 -> 0 -> 1
mmosbach-20307:535:717 [0] NCCL INFO Ring 02 : 1 -> 0 -> 1
mmosbach-20307:535:717 [0] NCCL INFO Ring 03 : 1 -> 0 -> 1
mmosbach-20307:535:717 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
mmosbach-20307:535:717 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7feb60c00000
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7feb60c00600
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7feb60c00800
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7feb60c00e00
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7f18dcc00000
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7feb60c01000
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7feb60c01600
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7f18dcc00600
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7feb60c01800
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7f18dcc00800
mmosbach-20307:536:718 [1] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7feb60c01e00
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7f18dcc00e00
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7f18dcc01000
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7f18dcc01600
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:23 Cuda Alloc Size 1152 pointer 0x7f18dcc01800
mmosbach-20307:535:717 [0] NCCL INFO channel.cc:27 Cuda Alloc Size 8 pointer 0x7f18dcc01e00
mmosbach-20307:536:719 [1] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7feb48002c70
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002e10
mmosbach-20307:535:720 [0] NCCL INFO Mem Realloc old size 0, new size 8 pointer 0x7f18d0000b60
mmosbach-20307:536:719 [1] NCCL INFO New proxy recv connection 0 from local rank 1, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002ea0
mmosbach-20307:535:720 [0] NCCL INFO New proxy recv connection 0 from local rank 0, transport 0
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7feb60e00000
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7f18dce00000
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002e50
mmosbach-20307:536:719 [1] NCCL INFO New proxy recv connection 1 from local rank 1, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002ee0
mmosbach-20307:535:720 [0] NCCL INFO New proxy recv connection 1 from local rank 0, transport 0
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7feb58000000
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7f18d4000000
mmosbach-20307:536:719 [1] NCCL INFO New proxy recv connection 2 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002e90
mmosbach-20307:535:720 [0] NCCL INFO New proxy recv connection 2 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002f20
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7feb58a00000
mmosbach-20307:536:719 [1] NCCL INFO New proxy recv connection 3 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002ed0
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7f18d4a00000
mmosbach-20307:535:720 [0] NCCL INFO New proxy recv connection 3 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002f60
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7feb59400000
mmosbach-20307:536:718 [1] NCCL INFO Channel 00/0 : 1[25000] -> 0[1000] via P2P/IPC
mmosbach-20307:536:719 [1] NCCL INFO New proxy send connection 4 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002f10
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:449 Cuda Alloc Size 10485760 pointer 0x7f18d5400000
mmosbach-20307:535:717 [0] NCCL INFO Channel 00/0 : 0[1000] -> 1[25000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO New proxy send connection 4 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002fa0
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7feb59e00000
mmosbach-20307:536:718 [1] NCCL INFO Channel 01/0 : 1[25000] -> 0[1000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7f18d5e00000
mmosbach-20307:536:719 [1] NCCL INFO New proxy send connection 5 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002f50
mmosbach-20307:535:717 [0] NCCL INFO Channel 01/0 : 0[1000] -> 1[25000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO New proxy send connection 5 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0002fe0
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7feb61800000
mmosbach-20307:536:718 [1] NCCL INFO Channel 02/0 : 1[25000] -> 0[1000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7f18dd800000
mmosbach-20307:536:719 [1] NCCL INFO New proxy send connection 6 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002f90
mmosbach-20307:535:717 [0] NCCL INFO Channel 02/0 : 0[1000] -> 1[25000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO New proxy send connection 6 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0003020
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7feb61a00000
mmosbach-20307:536:718 [1] NCCL INFO Channel 03/0 : 1[25000] -> 0[1000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7f18dda00000
mmosbach-20307:536:719 [1] NCCL INFO New proxy send connection 7 from local rank 1, transport 0
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48002fd0
mmosbach-20307:535:717 [0] NCCL INFO Channel 03/0 : 0[1000] -> 1[25000] via P2P/IPC
mmosbach-20307:535:720 [0] NCCL INFO New proxy send connection 7 from local rank 0, transport 0
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d0003060
mmosbach-20307:536:719 [1] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7feb61c00000
mmosbach-20307:535:720 [0] NCCL INFO transport/p2p.cc:430 Cuda Alloc Size 2097152 pointer 0x7f18ddc00000
mmosbach-20307:536:718 [1] NCCL INFO Connected all rings
mmosbach-20307:536:718 [1] NCCL INFO Connected all trees
mmosbach-20307:536:718 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mmosbach-20307:536:718 [1] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mmosbach-20307:535:717 [0] NCCL INFO Connected all rings
mmosbach-20307:535:717 [0] NCCL INFO Connected all trees
mmosbach-20307:536:719 [1] NCCL INFO Allocated 4194656 bytes of shared memory in /dev/shm/nccl-JKUXpI

mmosbach-20307:535:717 [0] NCCL INFO Latency/AlgBw |    Tree/    LL |    Tree/ LL128 |    Tree/Simple |    Ring/    LL |    Ring/ LL128 |    Ring/Simple | CollNetDirect/    LL | CollNetDirect/ LL128 | CollNetDirect/Simple | CollNetChain/    LL | CollNetChain/ LL128 | CollNetChain/Simple |
mmosbach-20307:535:717 [0] NCCL INFO  Max NThreads |            512 |            640 |            512 |            512 |            640 |            512 |              0 |              0 |            512 |              0 |              0 |            512 |
mmosbach-20307:535:717 [0] NCCL INFO     Broadcast |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.6/   8.0 |    12.5/   0.0 |    14.1/  24.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
mmosbach-20307:535:717 [0] NCCL INFO        Reduce |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.6/   6.0 |    12.5/   0.0 |    14.1/  24.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
mmosbach-20307:535:717 [0] NCCL INFO     AllGather |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.6/  16.0 |    12.5/   0.0 |    14.1/  48.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
mmosbach-20307:535:717 [0] NCCL INFO ReduceScatter |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     4.6/  16.0 |    12.5/   0.0 |    14.1/  48.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |     0.0/   0.0 |
mmosbach-20307:535:717 [0] NCCL INFO     AllReduce |     6.4/   5.3 |     8.2/   0.0 |    56.0/  20.2 |     5.6/   6.0 |    15.0/   0.0 |    19.8/  24.0 |     5.4/   0.0 |     5.4/   0.0 |    27.7/   0.0 |     4.4/   0.0 |     4.4/   0.0 |    16.0/   0.0 |
mmosbach-20307:535:717 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
mmosbach-20307:535:717 [0] NCCL INFO 4 coll channels, 4 p2p channels, 2 p2p channels per peer
mmosbach-20307:535:720 [0] NCCL INFO Allocated 4194656 bytes of shared memory in /dev/shm/nccl-ihjCjE

mmosbach-20307:536:719 [1] NCCL INFO New proxy send connection 8 from local rank 1, transport 2
mmosbach-20307:536:718 [1] NCCL INFO Connection to proxy localRank 1 -> connection 0x7feb48003010
mmosbach-20307:535:720 [0] NCCL INFO New proxy send connection 8 from local rank 0, transport 2
mmosbach-20307:535:717 [0] NCCL INFO Connection to proxy localRank 0 -> connection 0x7f18d00030a0
mmosbach-20307:536:719 [1] NCCL INFO transport/net.cc:376 Cuda Alloc Size 8388608 pointer 0x7feb47200000
mmosbach-20307:536:718 [1] NCCL INFO init.cc:367 Cuda Alloc Size 5168 pointer 0x7feb60c02000
mmosbach-20307:535:717 [0] NCCL INFO init.cc:367 Cuda Alloc Size 5168 pointer 0x7f18dcc02000
mmosbach-20307:535:720 [0] NCCL INFO transport/net.cc:376 Cuda Alloc Size 8388608 pointer 0x7f18c3200000
mmosbach-20307:535:717 [0] NCCL INFO init.cc:392 Cuda Host Alloc Size 33554432 pointer 0x7f18b6000000
mmosbach-20307:535:717 [0] NCCL INFO init.cc:398 Cuda Host Alloc Size 128 pointer 0x7f18dc200200
mmosbach-20307:535:717 [0] NCCL INFO comm 0x447a30b0 rank 0 nranks 2 cudaDev 0 busId 1000 - Init COMPLETE
mmosbach-20307:535:535 [0] NCCL INFO Broadcast: opCount 0 sendbuff 0x7f190a000000 recvbuff 0x7f190a000000 count 411828224 datatype 0 op 0 root 0 comm 0x447a30b0 [nranks=2] stream 0x447a2580
mmosbach-20307:536:718 [1] NCCL INFO init.cc:392 Cuda Host Alloc Size 33554432 pointer 0x7feb3a000000
mmosbach-20307:535:535 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
mmosbach-20307:536:718 [1] NCCL INFO init.cc:398 Cuda Host Alloc Size 128 pointer 0x7feb60200200
mmosbach-20307:536:718 [1] NCCL INFO comm 0x43bb7070 rank 1 nranks 2 cudaDev 1 busId 25000 - Init COMPLETE
mmosbach-20307:536:536 [1] NCCL INFO Broadcast: opCount 0 sendbuff 0x7feb8a000000 recvbuff 0x7feb8a000000 count 411828224 datatype 0 op 0 root 0 comm 0x43bb7070 [nranks=2] stream 0x43bb63e0
mmosbach-20307:536:536 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)

Thank you for an excellent report, @mmarius

This is almost certainly an issue that you'd need to report to DeepSpeed, since the hanging isn't related to the HF integration. The only hanging that could happen in the integration is in generate, if one doesn't turn the GPU sync flag on, but I don't see you using it. The rest is core DeepSpeed.

But here are some suggestions based on my experience that might help:

  1. This could be a hardware issue. Can you try the same code on a different server with the same setup?

  2. Sometimes these help (try one at a time and see if the hanging goes away):

# do not remove or the training will hang and nodes will be lost w/o this workaround
export CUDA_LAUNCH_BLOCKING=1

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

I see you have already tried the first one; I suppose it didn't help. It solved one huge hang during BLOOM training.

  3. If none of the above helps, it's time to get your hands dirty: run py-spy and see where it hangs.

You can of course run it on the processes directly, since you only have 2.

But you may also want to read some multi-GPU py-spy recipes in:

  • https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles-prequel.md
  • https://github.com/bigscience-workshop/bigscience/blob/master/train/tr11-176B-ml/chronicles.md

and in general you might find some helpful notes in there. We had several hanging issues before we managed to get BLOOM-176B training on 384 A100s. Granted, that was using Megatron-DeepSpeed, which wasn't using ZeRO-3 but rather a sort of ZeRO-1 customized to bf16; still, the code is relatively similar and there is a lot of overlap with ZeRO-3.

When you report to DeepSpeed they will definitely ask you for the output of py-spy.

p.s. pip install py-spy; py-spy dump --pid PID
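
In case it's useful, here is a rough sketch (my own, not an official recipe) for dumping all ranks in one go; it assumes the worker processes can be found by matching the script name debug.py with pgrep and that py-spy is installed:

# Sketch: dump the stack of every rank at once with py-spy.
# Assumes the workers can be found by matching "debug.py" (adjust the pattern)
# and that py-spy is installed (pip install py-spy); it may need sudo
# depending on the ptrace settings of the machine.
import subprocess


def dump_all_ranks(pattern: str = "debug.py") -> None:
    # find the PIDs of all processes whose command line matches the pattern
    pids = subprocess.run(
        ["pgrep", "-f", pattern], capture_output=True, text=True
    ).stdout.split()
    for pid in pids:
        print(f"===== py-spy dump for PID {pid} =====")
        subprocess.run(["py-spy", "dump", "--pid", pid])


if __name__ == "__main__":
    dump_all_ranks()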


Thanks for your detailed reply, @stas00

I tried using

# force crashing on nccl issues like hanging broadcast
export NCCL_ASYNC_ERROR_HANDLING=1

but it didn't help.

Before getting into py-spy, I ran this script (https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#gpu-to-gpu-communication) to check whether GPU-to-GPU communication works correctly on the server I am using, and it seems that there are indeed some problems there. The latency is way too large.

P2P=Enabled Latency (P2P Writes) Matrix (us)
  GPU     0      1      2      3      4      5      6      7  
    0   4.95 49206.88 49206.64 49206.69 49206.75 49206.68 49206.72 49206.72  
    1 49206.62   2.08 49206.51 49206.52 49206.42 49206.42 49206.39 49206.43  
    2 49206.70 49206.45   2.21 49206.45 49206.47 49206.56 49206.43 49206.49  
    3 49206.73 49206.53 49206.55   2.21 49206.59 49206.55 49206.55 49206.52  
    4 49206.77 49206.59 49206.57 49206.61   2.11 49206.60 49206.66 49206.60  
    5 49206.66 49206.47 49206.51 49206.49 49206.51   2.11 49206.46 49206.45  
    6 49206.82 49206.57 49206.61 49206.58 49206.62 49206.59   2.08 49206.60  
    7 49206.67 49206.51 49206.49 49206.46 49206.46 49206.47 49206.50   2.11  

I will get back with more information once we have resolved the problem.

Feel free to close the issue as it's definitely not a transformers problem.


Oh, so it's a hardware issue! Thank you for the update.

Also you can try this diagnostics script: https://github.com/stas00/toolbox/blob/master/pytorch/torch-distributed-gpu-test.py
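
For reference, here is a minimal sketch of the kind of all_reduce/barrier check such a script performs (a simplified version of my own, not the actual script); the file name nccl_check.py is just an example, and it can be launched with e.g. torchrun --nproc_per_node 2 nccl_check.py:

# Minimal NCCL sanity check (simplified sketch, not the linked script).
# If even this hangs, the problem is below the framework level (NCCL/hardware).
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE/LOCAL_RANK
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # each rank contributes its rank id; after all_reduce every rank
    # must hold 0 + 1 + ... + (world_size - 1)
    t = torch.tensor([float(dist.get_rank())], device="cuda")
    dist.all_reduce(t)
    dist.barrier()
    print(f"rank {dist.get_rank()}: all_reduce ok, sum = {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()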


I ran your diagnostic script, and as with my minimal example above, it simply runs forever ...


Yeah, so it's almost certainly a hardware issue then.
