🐛 Describe the bug

Issue caught by @xwang233 in the nightly log:

CUDA NVRTC compile error: __tmp_kernel47.cu(6701): Error: Formal parameter space overflowed (5040 bytes required, max 4096 bytes allowed) in function _ZN11CudaCodeGen8kernel47ENS_6TensorINS_6__halfELi4EEES2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_S2_

...

00:17:26 RuntimeError: false INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/torch/csrc/jit/codegen/cuda/executor_utils.cpp":1170, please report a bug to PyTorch. namespace CudaCodeGen {

Example log: https://gitlab-master.nvidia.com/dl/pytorch/update-scripts/-/jobs/44769317/raw. I think it started in the past few days. I can check the exact dates and PyTorch versions later if needed.

We patched this issue back then with an arbitrarily chosen threshold on the number of IO tensors: https://github.com/pytorch/pytorch/blob/master/torch/csrc/jit/codegen/cuda/graph_fuser.cpp#L356-L360. That old WAR no longer holds because we have since changed the kernel argument layout and increased its memory footprint per tensor. We can lower the threshold to WAR this issue for TS temporarily, but we would need a similar fix for the Python stack.
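For context, the old WAR is essentially a fixed cap on the number of fusion IO tensors, along the lines of the sketch below (the constant and identifiers are illustrative, not the actual graph_fuser.cpp code):

```cpp
#include <cstddef>

// Hypothetical cap on fusion IO tensors; the real value lives in
// graph_fuser.cpp and was chosen arbitrarily to stay under the 4 KB
// kernel parameter limit of the old argument layout.
constexpr std::size_t kMaxFusionIOTensors = 64;

// Refuse to grow a fusion group once merging a node would push the
// combined input/output tensor count over the threshold.
bool canMergeIntoFusion(std::size_t fusion_io_count, std::size_t node_io_count) {
  return fusion_io_count + node_io_count <= kMaxFusionIOTensors;
}
```

A count-based cap like this silently breaks whenever the per-tensor argument footprint changes, which is exactly what happened here.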

However, I don't think that is a reasonable long-term solution. Since we now have fusion segmentation, where we can codegen multiple kernels inside a single Fusion, we should be able to factor the IO buffer limit into segmentation as well. That is a more general solution that would naturally work for both frontends.
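A byte-budget-driven segmentation pass could look roughly like this (a self-contained sketch; OpInfo, segmentByIOBudget, and the flat op list are stand-ins for the real Fusion IR segmenter, which has many more constraints than parameter bytes):

```cpp
#include <cstddef>
#include <vector>

struct OpInfo {
  std::size_t io_param_bytes; // kernel-argument bytes this op's IO tensors add
};

std::vector<std::vector<std::size_t>> segmentByIOBudget(
    const std::vector<OpInfo>& ops, std::size_t budget_bytes = 4096) {
  std::vector<std::vector<std::size_t>> segments;
  std::vector<std::size_t> current;
  std::size_t used = 0;
  for (std::size_t i = 0; i < ops.size(); ++i) {
    // Close the current segment before its accumulated IO parameter
    // bytes would overflow the per-kernel budget.
    if (!current.empty() && used + ops[i].io_param_bytes > budget_bytes) {
      segments.push_back(current);
      current.clear();
      used = 0;
    }
    current.push_back(i);
    used += ops[i].io_param_bytes;
  }
  if (!current.empty()) {
    segments.push_back(current);
  }
  return segments;
}
```

Greedy left-to-right packing is only the simplest illustration; the real segmenter would weigh this budget against its existing heuristics.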

Versions

Repro on upstream master. Although unverified, the devel branch should have the same issue.

Repro steps on a V100:

TIMM_BENCHMARK_ENABLE_AOT_AUTOGRAD=1 python -u benchmark.py --bench train --model seresnet152d --img-size 224 -b 24 --fuser nvfuser --aot-autograd

We could also try to reduce the size requirement. Each tensor parameter consists of a 64-bit pointer plus size and stride arrays, but the size and stride entries are not necessarily all used. I don't have a quick idea for hacking this into our lowering system, but in principle we only need the unique axis sizes and strides, which can be far fewer than what we currently pass from host to device.
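To make the footprint concrete: assuming the generated argument struct is a data pointer plus rank-many size and stride entries (a layout inferred from the Tensor<__half, 4> parameters in the mangled name above, not the exact codegen struct), a rank-4 tensor costs 72 bytes, which lines up exactly with the 5040 bytes reported by NVRTC:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed layout of a generated kernel tensor argument; the real
// codegen struct may differ.
template <int kRank>
struct TensorArg {
  void* data;            // 8 bytes
  int64_t size[kRank];   // 8 * kRank bytes
  int64_t stride[kRank]; // 8 * kRank bytes
};

int main() {
  constexpr std::size_t kPerTensor = sizeof(TensorArg<4>); // 72 bytes
  constexpr std::size_t kParamLimit = 4096;                // NVRTC: max 4096 bytes
  std::printf("rank-4 arg size:       %zu bytes\n", kPerTensor);
  std::printf("args that fit in 4 KB: %zu\n", kParamLimit / kPerTensor); // 56
  std::printf("bytes for 70 args:     %zu\n", 70 * kPerTensor);          // 5040
  return 0;
}
```

If many of those 70 tensors share the same extents and strides, deduplicating them on the host side could reclaim a large fraction of those bytes, which is the idea above.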

> We could also try to reduce the size requirement. Each tensor parameter consists of a 64-bit pointer plus size and stride arrays, but the size and stride entries are not necessarily all used. I don't have a quick idea for hacking this into our lowering system, but in principle we only need the unique axis sizes and strides, which can be far fewer than what we currently pass from host to device.

Sounds great if we can squeeze more parameters into a fusion that way. Let's keep that option open.

Meanwhile, we'd still need a proper mechanism to make sure we don't error out on gigantic fusions. I think it's still necessary to have the segmenter break fusions on IO buffer size.
