MindSpore / mindspore

[ST][ms][2.2] 910A environment, GE flow enabled, PyNative mode, exception dump set to 1: Resnet50 network training fails

TODO
Bug-Report
Created at 2024-03-07 21:22
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior / 问题描述 (Mandatory / 必填)

On a 910A environment with the GE flow enabled, in PyNative mode and with exception dump set to 1, Resnet50 network training fails.
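
For reference, a minimal sketch of the configuration this description implies. MS_ENABLE_GE as the GE-flow switch is an assumption based on MindSpore 2.2 conventions, and the exception-dump variable name below is a hypothetical placeholder, since the report only says the setting was turned to 1 without showing the actual switch:

import os
import mindspore as ms

# Assumption: the GE flow is enabled via the MS_ENABLE_GE environment
# variable, set before the MindSpore process starts.
os.environ["MS_ENABLE_GE"] = "1"
# Hypothetical placeholder for the exception-dump toggle described as
# "set to 1"; the actual switch is not shown in this report.
os.environ["MS_EXCEPTION_DUMP"] = "1"

# PyNative mode on the Ascend (910A) backend, as in the failing case.
ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend")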

Environment / 环境信息 (Mandatory / 必填)

  • Hardware Environment(Ascend/GPU/CPU) / 硬件环境:

Please delete the backend not involved / 请删除不涉及的后端:
/device ascend

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    MindSpore version: commit_id = '[sha1]:a0f42760,[branch]:(HEAD,origin/r2.2,r2.2)'
    Run package version: Milan_C15/20240302

  • Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph):

Please delete the mode not involved / 请删除不涉及的模式:
/mode pynative

Related testcase / 关联用例 (Mandatory / 必填)

test_ms_exception_dump_006

Steps to reproduce the issue / 重现步骤 (Mandatory / 必填)

1. Copy the network script.
2. Set exception dump to 1.
3. Launch the training script (see the sketch after this list).
4. Check whether performance is normal.
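
The launch shape in step 3 can be inferred from the traceback below (sink-mode training via Model.train with sink_size=100). A single-rank sketch with a stand-in network and dataset, since the copied Resnet50 script is not part of this report; the real case ran distributed (note the AllReduce in the trace), which this sketch omits:

import numpy as np
import mindspore as ms
import mindspore.dataset as ds
from mindspore import nn, Model

ms.set_context(mode=ms.PYNATIVE_MODE, device_target="Ascend")

# Stand-ins for the copied Resnet50 script (step 1).
net = nn.Dense(10, 2)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

data = {
    "data": np.random.randn(320, 10).astype(np.float32),
    "label": np.random.randint(0, 2, size=(320,)).astype(np.int32),
}
dataset = ds.NumpySlicesDataset(data, shuffle=False).batch(32)

# Step 3: sink-mode training, matching the call site in the traceback
# (the failing run used sink_size=100; 10 fits this toy dataset).
model = Model(net, loss_fn=loss, optimizer=opt)
model.train(1, dataset, dataset_sink_mode=True, sink_size=10)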

Describe the expected behavior / 预期结果 (Mandatory / 必填)

Training completes normally and performance meets the target.

Related log / screenshot / 日志 / 截图 (Mandatory / 必填)

Error log:
Traceback (most recent call last):
  File "train.py", line 242, in <module>
    train_net()
  File "/home/jenkins/wenli/solution_test/cases/03subject_test/00reliability_availability/02reliability_availability_features/10exception_dump/test_ms_exception_dump_006_pynative_mode/scripts/train_parallel0/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 236, in train_net
    sink_size=100, dataset_sink_mode=dataset_sink_mode)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1073, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 114, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 624, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 708, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 103, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py", line 422, in construct
    grads = self.grad_reducer(grads)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 705, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 701, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 482, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 718, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, input_signature, process_obj, jit_config)(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 121, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 356, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError:

  • Kernel error:

Launch kernel failed: Default/AllReduce-op1536


  • Ascend Error Message:

EI0002: The wait execution of the Notify register times out. Reason: The Notify register has not received the Notify record from remote rank [4].
base information: [streamID:[313], taskID[2345489], taskType[Notify Wait], tag[AllReduce_hccl_world_group].]
task information: [notify id:[0x00000000000000d8], stage:[ffffffff], remote rank:[4].]
Possible Cause:
1. An exception occurs during execution on some NPUs in the cluster. As a result, the collective communication operation failed.
2. The execution speed on some NPUs in the cluster is too slow to complete the communication operation within the timeout interval (default 1800 s; you can set the interval via HCCL_EXEC_TIMEOUT).
3. The number of training samples on each NPU is inconsistent.
4. Packet loss or other connectivity problems occur on the communication link.
Solution:
1. If this error is reported on part of the ranks, check the other ranks to see whether other errors were reported earlier.
2. If this error is reported on all ranks, check whether the error reporting times are consistent (the maximum difference must not exceed 1800 s). If not, locate the cause, or set the HCCL_EXEC_TIMEOUT environment variable to a larger value.
3. Check whether a completion queue element (CQE) error exists in the plog (grep -rn 'error cqe'). If so, check the network connection status (for details, see the TLS command and HCCN connectivity check examples).
4. Ensure that the number of training samples on each NPU is consistent. For details: https://www.hiascend.com/document
TraceBack (most recent call last):
SubmitTask fail for stream full in failure abort, stream_id=313, pendingNum=10239[FUNC:SubmitTask][FILE:engine.cc][LINE:484]
Notify record failed, notifyId=14, retCode=0x7030007[FUNC:NotifyRecord][FILE:api_impl.cc][LINE:3497]
rtNotifyRecord execute failed, reason=[stream task buffer full][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
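
One mitigation named in the Solution text above is to enlarge the HCCL execution timeout before launching the job. A minimal sketch; the 3600-second value is an arbitrary example rather than a recommendation from the issue, and the variable must be set in every rank's environment before the processes start:

import os

# Raise the HCCL execution timeout (default 1800 s) per EI0002's
# solution hint; export before the training processes launch.
os.environ["HCCL_EXEC_TIMEOUT"] = "3600"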

Special notes for this issue/备注 (Optional / 选填)

Forward to tanghuikang.

Comments (3)

Please assign a maintainer to check this issue.
@wenli

Thanks for the feedback. You can comment //mindspore-assistant to get help faster, and more labels are available in the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may want: typical differences from PyTorch / the PyTorch-to-MindSpore API mapping table.
  3. If you hit a PyNative (dynamic graph) problem, set mindspore.set_context(pynative_synchronize=True) to get an error stack that helps locate the failure (see the sketch after this list).
  4. For model accuracy tuning, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm the issue provides the MindSpore version, the backend (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch method for the code that reproduces the error.
  6. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.
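
For tip 3, a minimal sketch of the suggested setting; with synchronous PyNative execution, the Python stack should point at the operator that actually failed (here, the AllReduce launch):

import mindspore as ms

# Run PyNative ops synchronously so errors surface at the real call site.
ms.set_context(pynative_synchronize=True)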