MindSpore / mindspore

[ST][MS][2.3] Transformer network training fails on the 910B3 environment with an AddN operator error

Status: DONE
Type: Bug-Report
Opened: 2024-04-27 17:07
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

On the 910B environment, transformer network training fails with an AddN operator error.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    commit_id = '[sha1]:a87635b6,[branch]:(HEAD,origin/master,origin/HEAD,master)'
    runpkg_version:Milan_C17/20240414

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

test_ms_pynative_memory_optimization_transformer_normal_1p_0001

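Steps to reproduce the issue (Mandatory)

1. Take the transformer network from the MindSpore model_zoo and set it to training mode.
2. Replace the default graph mode with PyNative mode, configured through the default memory interface.
3. Train the network and check whether the result is as expected.

Test case execution steps: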
source /home/miniconda3/bin/activate feature_39
export TRAIN_MODE=GRAPH_MODE
export DEVICE_TYPE=Ascend910B_Arm
export ENV_DEVICE=0
source solution_test/env_set.source -e ascend

cd solution_test/cases/01frame_func/02pynative/memory_optimization
pytest -s test_ms_pynative_memory_optimization_transformer_normal_1p_0001.py

Describe the expected behavior (Mandatory)

The network trains normally and the test case passes.

Related log / screenshot (Mandatory)

Error message:
Traceback (most recent call last):
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/01frame_func/02pynative/memory_optimization/test_ms_pynative_memory_optimization_transformer_normal_1p_0001/run_standalone_train/train.py", line 209, in <module>
run_transformer_train()
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/01frame_func/02pynative/memory_optimization/test_ms_pynative_memory_optimization_transformer_normal_1p_0001/run_standalone_train/src/model_utils/moxing_adapter.py", line 108, in wrapped_func
run_func(*args, **kwargs)
File "/data/jenkins_workspace/TDT_deployment/solution_test/cases/01frame_func/02pynative/memory_optimization/test_ms_pynative_memory_optimization_transformer_normal_1p_0001/run_standalone_train/train.py", line 205, in run_transformer_train
model.train(2, dataset, callbacks=callbacks, dataset_sink_mode=False)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 1082, in train
self._train(epoch,
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 630, in _train
self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/train/model.py", line 932, in _train_process
outputs = self._train_network(*next_element)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 697, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/nn/cell.py", line 1018, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/api.py", line 1672, in __call__
return self.run(obj, *args, phase=phase)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/api.py", line 1711, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/api.py", line 132, in wrapper
results = fn(*arg, **kwargs)
File "/home/miniconda3/envs/feature_39/lib/python3.9/site-packages/mindspore/common/api.py", line 1691, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Exec graph failed


  • Ascend Error Message:

E39999: Inner Error!
E39999: 2024-04-27-15:11:29.794.740 Aicpu kernel execute failed, device_id=1, stream_id=11, task_id=23, errorCode=91.[FUNC:PrintAicpuErrorInfo][FILE:task_info.cc][LINE:1579]
TraceBack (most recent call last):
Op execute failed. origin_op_name [Default/AddN-op1], op_name [Default/AddN-op1/AddN1], error_info: task_id 23, stream_id 11, tid 3959187, device_id 1, retcode 0x7bc83[FUNC:ErrorTrackingCallback][FILE:error_tracking.cc][LINE:105]
Aicpu kernel execute failed, device_id=1, stream_id=11, task_id=23, fault op_name=[FUNC:GetError][FILE:stream.cc][LINE:1512]
rtStreamSynchronizeWithTimeout execute failed, reason=[the model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:53]
Assert ((rt_ret) == 0) failed[FUNC:DoRtStreamSyncWithTimeout][FILE:utils.cc][LINE:52]
Failed to execute rt v2 model for graph kernel_graph0_1, model_id 1.[FUNC:ExecuteWithStreamAsync][FILE:hybrid_model_rt_v2_executor.cc][LINE:791]
GraphManager RunGrapWithStreamhAsync failed,session id = 0, graph id = 1, stream = 0x24978b60.[FUNC:RunGraphWithStreamAsync][FILE:inner_session.cc][LINE:513]
[Run][Graph]Run graph with stream asyn failed, error code = 1343225857, session id = 0,graph id = 1, stream = 0x24978b60.[FUNC:RunGraphWithStreamAsync][FILE:ge_api.cc][LINE:800]
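
When triaging many logs of this kind, the device, stream, task, and operator identifiers in the Ascend error message above can be extracted mechanically. The following is a minimal sketch, assuming the log lines have the same shape as the two lines quoted in this issue. `parse_aicpu_failure` and `parse_failed_op` are hypothetical helper names and are not part of MindSpore or CANN.

```python
import re

# Pattern for the "Aicpu kernel execute failed" line as it appears in this issue.
AICPU_FAIL_RE = re.compile(
    r"Aicpu kernel execute failed, device_id=(\d+), stream_id=(\d+), "
    r"task_id=(\d+), errorCode=(\w+)"
)

# Pattern for the "Op execute failed" line that names the failing operator.
OP_FAIL_RE = re.compile(
    r"origin_op_name \[(?P<origin>[^\]]+)\], op_name \[(?P<op>[^\]]+)\]"
)


def parse_aicpu_failure(line):
    """Return the ids from an AICPU failure line, or None if it does not match."""
    match = AICPU_FAIL_RE.search(line)
    if match is None:
        return None
    device_id, stream_id, task_id, error_code = match.groups()
    return {
        "device_id": int(device_id),
        "stream_id": int(stream_id),
        "task_id": int(task_id),
        "error_code": error_code,  # kept as text; its numeric base is not documented here
    }


def parse_failed_op(line):
    """Return (origin_op_name, op_name) from an 'Op execute failed' line, or None."""
    match = OP_FAIL_RE.search(line)
    if match is None:
        return None
    return match.group("origin"), match.group("op")
```

Applied to the log in this issue, the second helper points at `Default/AddN-op1`, which matches the operator named in the title.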

Special notes for this issue (Optional)

Route to 王睿

Comments (12)

wenli created Bug-Report
wenli added the kind/bug, v2.3.0.rc2, sig/compiler, attr/function, and stage/func-debug labels
wenli assigned collaborator andy_wangrui

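Please assign maintainer to check this issue.
@wenli

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you hit a dynamic graph problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem
  2. For model accuracy tuning problems, refer to the tuning guide on the official website
  3. If you are reporting a framework bug, please confirm that the issue provides the necessary information for locating it: the MindSpore version, the backend type used (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error
  4. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible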
linzhengshu added the device/ascend label
mudongrui assigned collaborator mudongrui
mudongrui changed assignee from mudongrui to andy_wangrui
mudongrui unassigned collaborator andy_wangrui

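No 910B machine is currently available to reproduce this, so the analysis is based on the logs provided by the tester:
the first ERROR log is reported in RUNTIME. We are contacting HiSilicon colleagues to help locate it.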

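The test case passes in PyNative mode; the problem occurs in graph mode.
The test case is meant to force PYNATIVE mode, but the mode switch inside the test case failed, so it ran through the graph mode flow instead.
The tester @慕冬蕊 is on leave; we will align with her on which scenario this test case needs to verify.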

In graph mode with export GRAPH_OP_RUN=1 set, the test case passes.
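
To repeat this check from a script rather than an interactive shell, the variable can be injected into the environment of the test process. This is a minimal sketch, assuming only that `GRAPH_OP_RUN=1` must be present in the process environment before the test starts, as in the `export` above. `env_with_graph_op_run` is a hypothetical helper name.

```python
import os


def env_with_graph_op_run(base_env=None):
    """Return a copy of the environment with GRAPH_OP_RUN=1 set.

    The input mapping is not modified, so the caller's own environment
    stays untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["GRAPH_OP_RUN"] = "1"
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching the `pytest -s` command from the reproduction steps.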

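PyNative: no problem
KBK (kernel-by-kernel): no problem
GE: error reported in the backend

Help from GE backend colleagues is needed to locate the problem.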

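After discussion, the preliminary judgment is that this is a problem in the HiSilicon operator processing flow. We plan to contact the HiSilicon owner right after the holiday to analyze and locate it.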

The problem has now been narrowed down to a custom AICPU operator failing to execute. We are dumping data to locate it.

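Following the guide on writing custom AICPU operator logs to disk:
https://wiki.huawei.com/domains/21427/wiki/40193/WIKI202305091134974?title=%E7%A1%AE%E5%AE%9A%E6%98%AF%E5%90%A6%E8%B0%83%E7%94%A8AICPU%E7%AE%97%E5%AD%90
The custom AICPU operator logs obtained from the dump alone are not enough to locate the problem; we need to add our own logging and build a package to locate it.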

i-robot added the wiki label

The dumped file shows no log data. We are consulting @潘智辉 from the operator team.

i-robot added the foruda label

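Aligned with testers @慕冬蕊 and @文理: this test case only verifies PyNative mode.
The test case switches modes by a forced string match, and that match did not succeed, so the run fell into graph mode and failed. The test case needs to be adapted.
Verified that the test case has no problem in PyNative mode and runs to completion.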
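
The failure mode described above, a string-match substitution that silently matches nothing, can be guarded against by making the substitution fail loudly. The issue does not show the test case's actual replacement code, so this is only a minimal sketch under stated assumptions: that the harness switches modes by rewriting a literal in the training script, and that the literals appear as `context.GRAPH_MODE` and `context.PYNATIVE_MODE`. `switch_mode_literal` is a hypothetical function name.

```python
def switch_mode_literal(source,
                        old="context.GRAPH_MODE",
                        new="context.PYNATIVE_MODE"):
    """Replace the execution-mode literal in a script's source text.

    Returns the rewritten source and the number of replacements made.
    Raises ValueError when the literal is absent, so a script that spells
    the mode differently cannot silently stay in its default mode.
    """
    count = source.count(old)
    if count == 0:
        raise ValueError(
            f"mode literal {old!r} not found; the script would keep its default mode"
        )
    return source.replace(old, new), count
```

With a guard of this kind, a script that spells the mode differently, for example through an alias such as `ms.GRAPH_MODE`, raises an error at setup time instead of training in the wrong mode.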

Margaret_wangrui changed issue state from TODO to VALIDATION
Margaret_wangrui assigned collaborator Margaret_wangrui
Margaret_wangrui changed assignee from Margaret_wangrui to wenli
Margaret_wangrui added the rct/oldrelease, rca/others, and ctl/solutiontest labels

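Regression version:
commit_id = '[sha1]:d8802c69,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
Regression steps: see the reproduction steps in this issue
Basic functionality: after adapting the test case, the run is normal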
INFO 2024-05-07 12:06:12 - test_ms_pynative_memory_optimization_transformer_normal_1p_0001 - base.py:teardown:140 - The base teardown is running

=== 1 passed, 4 warnings in 234.04s (0:03:54) ====

Test conclusion: regression passed

wenli changed issue state from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
