
[ST][MS][Distributed Parallel][pangu-alpha 2.6B][910B3 8p] With dp=1, mp=2, pp=4 and all parameters of one intermediate stage frozen, pipeline training on 910B3 8p fails: RuntimeError: Compile graph kernel_graph1 failed

Status: TODO
Type: Bug-Report
Created: 2024-04-03 15:40

Describe the current behavior (Mandatory)

Model repository: https://gitee.com/mindspore/models/tree/master/official/nlp/Pangu_alpha
pangu-alpha 2.6B with dp=1, mp=2, pp=4: after freezing all parameters of one intermediate pipeline stage, pipeline training on 910B3 8p fails.
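
For orientation, the device layout implied by dp=1, mp=2, pp=4 can be sketched as below. This is a minimal sketch, assuming ranks are assigned to pipeline stages contiguously (stage = rank // devices per stage). The helper names are hypothetical and are not part of the Pangu_alpha scripts.

```python
def devices_per_stage(device_num: int, pipeline_stages: int) -> int:
    """Number of devices that share one pipeline stage."""
    if device_num % pipeline_stages != 0:
        raise ValueError("device_num must be divisible by pipeline_stages")
    return device_num // pipeline_stages


def stage_of_rank(rank: int, device_num: int, pipeline_stages: int) -> int:
    """Pipeline stage of a rank, assuming contiguous rank-to-stage assignment."""
    return rank // devices_per_stage(device_num, pipeline_stages)


def ranks_in_stage(stage: int, device_num: int, pipeline_stages: int) -> list:
    """All ranks that belong to the given pipeline stage."""
    per_stage = devices_per_stage(device_num, pipeline_stages)
    return list(range(stage * per_stage, (stage + 1) * per_stage))


dp, mp, pp = 1, 2, 4          # values from this report
device_num = dp * mp * pp     # total devices required by the configuration
# Stage 1 is the stage whose parameters are frozen in the reproduction steps.
frozen_stage_ranks = ranks_in_stage(1, device_num, pp)
```

Under that assumption, each stage spans two devices (the mp=2 group), so freezing stage 1 affects two of the eight ranks.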

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device Ascend910B3

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    run package: Milan_C17/20240321
    MindSpore version: r2.3.q1_20240329061516_c99698ba
    mindformer: r1.1.tr5_20240329061516_6cd5b33a72

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test repository path: solution_test/cases/03subject_test/07frame_large_granularity/00distributed_parallelism/01pipeline
Test case:
test_ms_pipeline_parallel_gradient_freezing_001

Steps to reproduce the issue (Mandatory)

  1. Get the code from the models repository.
  2. cd models/official/nlp/Pangu_alpha
    Configuration changes:
    1) src/pangu_alpha_config.py::set_parse_2_6B: set args_opt.op_level_model_parallel_num = 2
    2) src/callbacks.py::class LossCallback::step_end:
       change
           if self._dataset_size > 0 and self.local_rank % 8 == 0:
       to
           if self._dataset_size > 0:
    3) src/pangu_alpha.py::set_parallel_configure_for_layer:
       insert the following code below print(f"pipeline stage id is {pp_id}", flush=True):
           if pp_id == 1:
               for param in network.trainable_params():
                   param.requires_grad = False
  3. sh scripts/run_distribute_train.sh ./pangu-data/pangu_30_step_bs64 ./hccl_8p.json 8 fp32 2.6B 4 4 2 0 8
  4. Verify that the network trains successfully.
  5. Check that the training log is normal.
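
The effect of the freezing snippet can be illustrated without the framework. The following is a minimal sketch, assuming one layer per pipeline stage. FakeParam and FakeLayer are hypothetical stand-ins that only mimic the requires_grad flag and trainable_params() used in the reproduction. It does not reproduce the compile failure; it only shows that stage 1 ends up with no trainable parameters.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Hypothetical stand-in for a framework parameter."""
    name: str
    requires_grad: bool = True


class FakeLayer:
    """Hypothetical stand-in for a layer exposing trainable_params()."""

    def __init__(self, name: str, n_params: int = 2):
        self.params = [FakeParam(f"{name}.p{i}") for i in range(n_params)]

    def trainable_params(self):
        # Returns a new list, so flags can be changed while iterating over it.
        return [p for p in self.params if p.requires_grad]


def freeze_if_stage(network, pp_id: int, frozen_stage: int = 1) -> int:
    """Mirror the inserted snippet: freeze all parameters when pp_id matches."""
    if pp_id != frozen_stage:
        return 0
    frozen = 0
    for param in network.trainable_params():
        param.requires_grad = False
        frozen += 1
    return frozen


# Four layers, one per pipeline stage (an illustrative assumption, pp=4).
layers = [FakeLayer(f"block{i}") for i in range(4)]
frozen_counts = [freeze_if_stage(layer, pp_id) for pp_id, layer in enumerate(layers)]
trainable_per_stage = [len(layer.trainable_params()) for layer in layers]
```

After this runs, only the entry for stage 1 in trainable_per_stage is zero, which is the condition under which the failure in this report occurs.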

Describe the expected behavior (Mandatory)

The network trains successfully.

Related log / screenshot (Mandatory)

Traceback (most recent call last):
 File "/home/jenkins0/zjc/solution_test/cases/02network/02nlp/pangu_alpha/train/test_ms_pangu_2_6b_param_freezing_train_910_8p_0001/train.py", line 558, in <module>
   run_train_pipeline(opt)
 File "/home/jenkins0/zjc/solution_test/cases/02network/02nlp/pangu_alpha/train/test_ms_pangu_2_6b_param_freezing_train_910_8p_0001/train.py", line 543, in run_train_pipeline
   sink_size=callback_size, dataset_sink_mode=True)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1085, in train
   initial_epoch=initial_epoch)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
   func(self, *args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 635, in _train
   cb_params, sink_size, initial_epoch, valid_infos)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
   outputs = train_network(*inputs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 662, in __call__
   out = self.compile_and_run(*args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 981, in compile_and_run
   self.compile(*args, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 965, in compile
   jit_config_dict=self._jit_config_dict, **kwargs)
 File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1590, in compile
   result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
RuntimeError: Compile graph kernel_graph1 failed.

----------------------------------------------------
- Ascend Error Message:
----------------------------------------------------
E19999: Inner Error!
E19999  Verifying Default/MakeTuple-op3 failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:132]
       TraceBack (most recent call last):
       Call InferShapeAndType for node:Default/MakeTuple-op3(IdentityN) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:120]
       process pass InferShapePass on node:Default/MakeTuple-op3 failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:570]
       [Call][PreRun] Failed, graph_id:2, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4408]
       [Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:2.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1159]

(Please search "CANN Common Error Analysis" at https://www.mindspore.cn for error code description)

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/plugin/device/ascend/hal/hardware/ge_graph_executor.cc:969 CompileGraph
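
When triaging logs like the one above, it can help to pull out the [FUNC:...][FILE:...][LINE:...] markers from the Ascend error message. The following is a minimal sketch, assuming each trace line ends with exactly that marker triple, as in this log. The Frame type and parse_frames helper are hypothetical and not part of MindSpore or CANN.

```python
import re
from typing import List, NamedTuple


class Frame(NamedTuple):
    func: str
    file: str
    line: int
    message: str


# Matches a line that ends with [FUNC:...][FILE:...][LINE:...].
_FRAME_RE = re.compile(
    r"^(?P<message>.*?)"
    r"\[FUNC:(?P<func>[^\]]+)\]"
    r"\[FILE:(?P<file>[^\]]+)\]"
    r"\[LINE:(?P<line>\d+)\]\s*$"
)


def parse_frames(log: str) -> List[Frame]:
    """Extract one Frame per log line that carries the marker triple."""
    frames = []
    for raw in log.splitlines():
        match = _FRAME_RE.match(raw.strip())
        if match:
            frames.append(
                Frame(
                    func=match["func"],
                    file=match["file"],
                    line=int(match["line"]),
                    message=match["message"].strip(),
                )
            )
    return frames


# Lines copied from the Ascend error message in this report.
SAMPLE = """\
E19999: Inner Error!
E19999  Verifying Default/MakeTuple-op3 failed.[FUNC:InferShapeAndType][FILE:infershape_pass.cc][LINE:132]
       TraceBack (most recent call last):
       Call InferShapeAndType for node:Default/MakeTuple-op3(IdentityN) failed[FUNC:Infer][FILE:infershape_pass.cc][LINE:120]
       process pass InferShapePass on node:Default/MakeTuple-op3 failed, ret:4294967295[FUNC:RunPassesOnNode][FILE:base_pass.cc][LINE:570]
       [Call][PreRun] Failed, graph_id:2, session_id:0.[FUNC:CompileGraph][FILE:graph_manager.cc][LINE:4408]
       [Compile][Graph]Compile graph failed, error code:1343225857, session_id:0, graph_id:2.[FUNC:CompileGraph][FILE:ge_api.cc][LINE:1159]
"""

frames = parse_frames(SAMPLE)
failing_files = sorted({frame.file for frame in frames})
```

For this log, the extracted frames point at shape inference on the node Default/MakeTuple-op3 (IdentityN) during graph compilation.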

Special notes for this issue (Optional)

Please route this issue to Li Chen (李晨).

Comments (2)

zhongjicheng created the Bug-Report
zhongjicheng added the label sig/parallel
zhongjicheng added the label device/ascend
zhongjicheng added the label attr/function
zhongjicheng added the label stage/func-debug
zhongjicheng added the label kind/bug
zhongjicheng added the label v2.3.0.rc2

Please assign maintainer to check this issue.
@zhongjicheng

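Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  3. If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch command that reproduces the error.
  6. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.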
zhongjicheng edited the description
zhongjicheng changed the assignee from lichen to duanjiali
duanjiali added the collaborator duanjiali
duanjiali changed the assignee from duanjiali to lichen
fangwenyi removed the label v2.3.0.rc2
fangwenyi added the label v2.3.0
