MindSpore / mindspore

[ST][MS] dit+vae network reports an error in PyNative mode when using ZeRO2 optimizer parallelism

DONE
Bug-Report Member
Created on 2024-04-27 10:44
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

The dit+vae network reports an error in PyNative mode when ZeRO2 optimizer parallelism is used.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) : master branch package from the 26th
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

test_ms_parallel_pynative_optimpara_zero2

Steps to reproduce the issue (Mandatory)

  1. Use the dit+vae network and configure optimizer = AdamWeightDecayZeRO2(group_params, learning_rate=lr, beta1=betas[0], beta2=betas[1], eps=eps, use_parallel=True, opt_parallel_group=GlobalComm.WORLD_COMM_GROUP, cpu_offload=False)
  2. Train on eight devices in PyNative mode.

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

Traceback (most recent call last):
  File "train_vae_dit_v1.py", line 395, in <module>
    main(args)
  File "train_vae_dit_v1.py", line 388, in main
    initial_epoch=start_epoch,
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 1087, in train
    initial_epoch=initial_epoch)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 637, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 715, in __call__
    raise err
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 711, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 109, in construct
    return self.network(*outputs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 715, in __call__
    raise err
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 711, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins0/bmz/solution_test/cases/03subject_test/08frame_large_granularity/00distributed_parallelism/05pynative_parallel/test_ms_dit_vae_0001/DIT_2D/mindone/trainers/train_step.py", line 106, in construct
    grads = self.grad(self.network, weights)(*inputs, scaling_sens_filled)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 389, in after_grad
    return grad_(fn, weights)(*args, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/common/api.py", line 132, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, self.grad_position, *args, **kwargs)
  File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/common/api.py", line 1336, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError: Get input type Tensor(shape=[1, 16, 32, 32], dtype=Float32, value=[...]), but want to get St10shared_ptrIN9mindspore10ValueTupleEE

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/op_function/value_converter.h:35 Convert

Special notes for this issue (Optional)

Comments (4)

duanjiali created the Bug-Report
duanjiali added the v2.3.0.rc2 label
duanjiali added the kind/bug label
duanjiali added the attr/function label
duanjiali added the sig/pynative label

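Please assign maintainer to check this issue.
@duanjiali

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph (PyNative) problem, you can call set_context(pynative_synchronize=True) to see the error stack and help locate the problem.
  2. For model accuracy tuning problems, refer to the tuning guide on the official website.
  3. If you are reporting a framework bug, please make sure the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible.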
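
The set_context(pynative_synchronize=True) suggestion in the comment above can be sketched as follows. This is a minimal sketch, not part of the failing script: the DEBUG_CONTEXT name is made up for illustration, and the import guard is only there so the snippet also loads on a machine without MindSpore. With synchronous execution enabled, operators run one at a time, so an error is reported at the Python line that launched the failing operator, at the cost of slower training.

```python
# DEBUG_CONTEXT is a hypothetical name used only in this sketch.
# pynative_synchronize is the option named in the assistant comment.
DEBUG_CONTEXT = {"pynative_synchronize": True}

try:
    import mindspore as ms
except ImportError:
    ms = None  # MindSpore is not installed; nothing to configure.

if ms is not None:
    # Run PyNative operators synchronously so the Python traceback
    # points at the call that actually failed.
    ms.set_context(mode=ms.PYNATIVE_MODE, **DEBUG_CONTEXT)
```

This would normally be placed at the top of the training script, before the network and optimizer are built, and removed again once the error has been located.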
baimz edited the description
duanjiali added the device/ascend label

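#Appearance & Root Cause
bprop_graph_run_by_single_op_ is currently assigned directly, which causes a bug: if it is first set to true, a later assignment may change it back to false.
#Fix Solution
Use the OR operation inside the set_bprop_graph_run_by_single_op() function to avoid this problem.
relevant pr:
!68762:fix bprop_graph_run_by_single_op bug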
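
The root cause and fix described in this comment can be illustrated with a small Python model. This is only a sketch of the pattern: the real flag is a C++ member inside MindSpore, and the GradFlags class, the assign() method, and the requests list below are hypothetical names introduced for illustration. Only the names bprop_graph_run_by_single_op and set_bprop_graph_run_by_single_op come from the comment.

```python
class GradFlags:
    """Hypothetical stand-in for the object that owns the flag."""

    def __init__(self):
        self.bprop_graph_run_by_single_op = False

    def assign(self, value: bool) -> None:
        # Buggy pattern: direct assignment, so the last writer wins
        # and an earlier True can be overwritten by a later False.
        self.bprop_graph_run_by_single_op = value

    def set_bprop_graph_run_by_single_op(self, value: bool) -> None:
        # Fixed pattern: OR with the current value, so the flag
        # stays True once any caller has requested it.
        self.bprop_graph_run_by_single_op = (
            self.bprop_graph_run_by_single_op or value
        )


# One caller asks for single-op execution, a later one does not.
requests = [True, False]

buggy = GradFlags()
for requested in requests:
    buggy.assign(requested)

fixed = GradFlags()
for requested in requests:
    fixed.set_bprop_graph_run_by_single_op(requested)

print(buggy.bprop_graph_run_by_single_op)  # False: the request was lost
print(fixed.bprop_graph_run_by_single_op)  # True: the request is kept
```

The OR-based setter makes the flag order-independent: the result is True if any caller requested it, regardless of which caller ran last.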

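Self-test Report & DT Review
Need to add ST/UT: No. The error was found in generalization testing and requires a specially constructed scenario.

Verification commit: fb35b3faa9553f00c5fb273977e42b7151c7138e
Verification result: the test case passes.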

i-robot added the gitee label
zjun changed the task status from TODO to VALIDATION
zjun added zjun as a collaborator
zjun changed the assignee from zjun to duanjiali
zjun added the rca/algorithm label
zjun added the rct/newfeature label
zjun added the ctl/solutiontest label
zjun changed the milestone from B-SIG-PYNATIVE to B-SolutionTest

Regression version:
commit_id = '[sha1]:39ac2284,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
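Regression steps: follow the reproduction steps in this issue
Basic functionality: after adapting the test case, the run completes normally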
[2024-05-05 15:52:06] INFO: epoch: 3 step: 10, lr: 0.0000807, loss: 1.014917, loss scale: 1.
Train epoch time: 37207.433 ms, per step time: 3720.743 ms

Test conclusion: regression passed
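
As a quick consistency check on the timing line in the regression log above, the epoch time and per-step time can be parsed and compared. This is a minimal sketch: the regular expression and variable names are assumptions made for illustration, and only the log text itself is taken from the comment.

```python
import re

# Timing line copied from the regression log.
log_line = "Train epoch time: 37207.433 ms, per step time: 3720.743 ms"

match = re.search(
    r"epoch time: ([\d.]+) ms, per step time: ([\d.]+) ms", log_line
)
epoch_ms, step_ms = (float(value) for value in match.groups())

# The ratio of the two timings gives the number of steps in the epoch.
steps_per_epoch = round(epoch_ms / step_ms)
print(steps_per_epoch)
```

The ratio agrees with the "step: 10" field in the preceding log line.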

wenli changed the task status from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
