name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
DiT+VAE network reports an error in PyNative mode when using ZeRO-2 optimizer parallelism
Hardware Environment (Ascend/GPU/CPU):
/device ascend
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) : master branch package from the 26th
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Execute Mode (Mandatory) (PyNative/Graph):
/mode pynative
test_ms_parallel_pynative_optimpara_zero2
Traceback (most recent call last):
File "train_vae_dit_v1.py", line 395, in <module>
main(args)
File "train_vae_dit_v1.py", line 388, in main
initial_epoch=start_epoch,
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 1087, in train
initial_epoch=initial_epoch)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 637, in _train
cb_params, sink_size, initial_epoch, valid_infos)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 715, in __call__
raise err
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 711, in __call__
output = self._run_construct(args, kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 109, in construct
return self.network(*outputs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 715, in __call__
raise err
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 711, in __call__
output = self._run_construct(args, kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/nn/cell.py", line 483, in _run_construct
output = self.construct(*cast_inputs, **kwargs)
File "/home/jenkins0/bmz/solution_test/cases/03subject_test/08frame_large_granularity/00distributed_parallelism/05pynative_parallel/test_ms_dit_vae_0001/DIT_2D/mindone/trainers/train_step.py", line 106, in construct
grads = self.grad(self.network, weights)(*inputs, scaling_sens_filled)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 389, in after_grad
return grad_(fn, weights)(*args, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/common/api.py", line 132, in wrapper
results = fn(*arg, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
out = _pynative_executor.grad(fn, grad_, weights, self.grad_position, *args, **kwargs)
File "/home/jenkins0/.local/lib/python3.7/site-packages/mindspore/common/api.py", line 1336, in grad
return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError: Get input type Tensor(shape=[1, 16, 32, 32], dtype=Float32, value=[...]), but want to get St10shared_ptrIN9mindspore10ValueTupleEE
----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/pynative/op_function/value_converter.h:35 Convert
Please assign a maintainer to check this issue.
@duanjiali
#Appearance & Root Cause
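Currently, bprop_graph_run_by_single_op_ is assigned directly, which introduces a bug: if the flag is first set to true, a later assignment may change it back to false.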
#Fix Solution
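Set the flag through set_bprop_graph_run_by_single_op(), which ORs the new value with the current one, so that a value of true is never overwritten by a later false.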
Relevant PR:
!68762:fix bprop_graph_run_by_single_op bug
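The difference between direct assignment and the OR-based setter can be sketched as follows. This is only an illustration of the pattern described above, written in Python for brevity: the class `GradFlags` and the helper `final_flag` are hypothetical, and the real flag lives in MindSpore's C++ code, which this sketch does not reproduce.

```python
class GradFlags:
    """Hypothetical stand-in for the object that owns the flag."""

    def __init__(self):
        self.bprop_graph_run_by_single_op = False

    def assign_directly(self, value: bool) -> None:
        # Buggy pattern: a later False overwrites an earlier True.
        self.bprop_graph_run_by_single_op = value

    def set_bprop_graph_run_by_single_op(self, value: bool) -> None:
        # Sticky pattern: OR with the current value, so True is never lost.
        self.bprop_graph_run_by_single_op = (
            self.bprop_graph_run_by_single_op or value
        )


def final_flag(requests, sticky: bool) -> bool:
    """Apply a sequence of requested values and return the resulting flag."""
    flags = GradFlags()
    setter = (
        flags.set_bprop_graph_run_by_single_op
        if sticky
        else flags.assign_directly
    )
    for requested in requests:
        setter(requested)
    return flags.bprop_graph_run_by_single_op
```

With the request sequence `[True, False]`, direct assignment ends with the flag cleared, while the OR-based setter keeps it set, which is the behavior the fix relies on.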
Self-test Report & DT Review
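ST/UT needs to be added: No. The failure was found during generalization testing, and reproducing it requires constructing a specific scenario.
Verification commit: fb35b3faa9553f00c5fb273977e42b7151c7138e
Verification result: test case passes.
Regression version: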
commit_id = '[sha1]:39ac2284,[branch]:(HEAD,origin/master,origin/HEAD,master)'
runpkg_version:Milan_C17/20240414
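Regression steps: follow the reproduction steps in the issue
Basic functionality: after adapting the test case, the run completes normally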
[2024-05-05 15:52:06] INFO: epoch: 3 step: 10, lr: 0.0000807, loss: 1.014917, loss scale: 1.
Train epoch time: 37207.433 ms, per step time: 3720.743 ms
Test conclusion: regression passed