name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
Model repository: https://gitee.com/mindspore/models/tree/master/official/nlp/Pangu_alpha
pangu-alpha 2.6B with dp=1, mp=2, pp=4: after freezing all parameters of one intermediate stage and enabling pipeline parallelism, 8p training on 910B3 fails.
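For readers unfamiliar with the layout, here is a minimal sketch of how the 8 devices divide across the 4 pipeline stages. The dp/mp/pp values come from the report. The rank-to-stage rule (consecutive ranks grouped per stage) is an assumption about the usual MindSpore convention, and `device_num` and `stage_of_rank` are hypothetical helper names, not MindSpore APIs.

```python
# Sketch only: the rank-to-stage rule below is an assumption, not taken from the issue.
DP, MP, PP = 1, 2, 4  # data, model and pipeline parallel sizes from the report


def device_num(dp, mp, pp):
    """Total number of devices required by the parallel configuration."""
    return dp * mp * pp


def stage_of_rank(rank, total, pp):
    """Assumed mapping: consecutive ranks are grouped into the same stage."""
    per_stage = total // pp
    return rank // per_stage


total = device_num(DP, MP, PP)
layout = {rank: stage_of_rank(rank, total, PP) for rank in range(total)}

# Ranks that would hold the frozen stage (stage 1) under this assumption.
frozen_ranks = [rank for rank, stage in layout.items() if stage == 1]

print("devices:", total)
print("rank -> stage:", layout)
print("ranks in stage 1:", frozen_ranks)
```

Under this assumption, each stage holds two ranks, and the stage whose parameters are frozen is an interior stage rather than the first or last one.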
Hardware Environment (Ascend/GPU/CPU):
/device Ascend910B3
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
run package: Milan_C17/20240321
MindSpore version: r2.3.q1_20240329061516_c99698ba
mindformer: r1.1.tr5_20240329061516_6cd5b33a72
Execute Mode (Mandatory) (PyNative/Graph):
/mode graph
Test repository path: solution_test/cases/03subject_test/07frame_large_granularity/00distributed_parallelism/01pipeline
Test case: test_ms_pipeline_parallel_gradient_freezing_001
src/callbacks.py::class LossCallback:step_end:
Change
    if self._dataset_size > 0 and self.local_rank % 8 == 0:
to
    if self._dataset_size > 0:
3) src/pangu_alpha.py::set_parallel_configure_for_layer
Insert the following code below print(f"pipeline stage id is {pp_id}", flush=True):
    if pp_id == 1:
        for param in network.trainable_params():
            param.requires_grad = False
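To make the effect of this change concrete, the following is a MindSpore-free sketch of the freezing logic. `Param` and `Layer` are hypothetical stand-ins for `mindspore.Parameter` and a transformer block, and the layer count (32) and the stage-assignment formula are assumptions modeled on what `set_parallel_configure_for_layer` appears to do. It is not the repository's actual code.

```python
from dataclasses import dataclass


@dataclass
class Param:
    """Hypothetical stand-in for mindspore.Parameter."""
    name: str
    requires_grad: bool = True


class Layer:
    """Hypothetical stand-in for one transformer block."""

    def __init__(self, layer_id):
        self.params = [Param(f"layer{layer_id}.weight"), Param(f"layer{layer_id}.bias")]
        self.pipeline_stage = None

    def trainable_params(self):
        return [p for p in self.params if p.requires_grad]


def set_stage_and_freeze(network, layer_id, num_layers, pipeline_stage, frozen_stage=1):
    """Assign a pipeline stage to the layer, then freeze it if it is in frozen_stage."""
    pp_dis = max(num_layers // pipeline_stage, 1)        # layers per stage (assumed formula)
    pp_id = min(layer_id // pp_dis, pipeline_stage - 1)  # clamp to the last stage
    network.pipeline_stage = pp_id
    if pp_id == frozen_stage:
        for param in network.trainable_params():
            param.requires_grad = False
    return pp_id


NUM_LAYERS = 32  # assumption: layer count of the 2.6B configuration
layers = [Layer(i) for i in range(NUM_LAYERS)]
for i, layer in enumerate(layers):
    set_stage_and_freeze(layer, i, NUM_LAYERS, pipeline_stage=4)

# Layers left with no trainable parameters, and trainable counts per stage.
frozen_layers = [i for i, layer in enumerate(layers) if not layer.trainable_params()]
trainable_per_stage = {}
for layer in layers:
    count = trainable_per_stage.get(layer.pipeline_stage, 0)
    trainable_per_stage[layer.pipeline_stage] = count + len(layer.trainable_params())

print("frozen layers:", frozen_layers)
print("trainable params per stage:", trainable_per_stage)
```

In this sketch, stage 1 is left with no trainable parameters at all while the other stages are untouched, which is the configuration the test case exercises.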
Expected result: network training succeeds.
Traceback (most recent call last):
  File "/home/jenkins0/zjc/solution_test/cases/02network/02nlp/pangu_alpha/train/test_ms_pangu_2_6b_param_freezing_train_910_8p_0001/train.py", line 558, in <module>
    run_train_pipeline(opt)
  File "/home/jenkins0/zjc/solution_test/cases/02network/02nlp/pangu_alpha/train/test_ms_pangu_2_6b_param_freezing_train_910_8p_0001/train.py", line 543, in run_train_pipeline
    sink_size=callback_size, dataset_sink_mode=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1085, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 635, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 662, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 981, in compile_and_run
    self.compile(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 965, in compile
    jit_config_dict=self._jit_config_dict, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1590, in compile
    result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Compile graph kernel_graph1 failed.
----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999 Verifying Default/MakeTuple-op3 failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:132]
TraceBack (most recent call last):
Call InferShapeAndType for node:Default/MakeTuple-op3(IdentityN) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:120]
process pass InferShapePass on node:Default/MakeTuple-op3 failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:570]
[Call][PreRun] Failed, graph_id:2, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4408]
[Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:2.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1159]
(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:969 CompileGraph
Route to Li Chen (李晨).
Please assign maintainer to check this issue.
@zhongjicheng