[ST][MS] Bert-base segfaults during 8-GPU training in PyNative mode

WIP
Bug-Report · Member
Created 2024-04-28 18:15

Describe the current behavior (Mandatory)

Bert-base training on 8 GPUs in PyNative mode fails with a segmentation fault.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device GPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): commit_id = '[sha1]:a87635b6,[branch]:(HEAD,origin/master,origin/HEAD,master)'
    -- Python version (e.g., Python 3.7.5):
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

test_ms_long_stability_nlp_bert_base_8p_0001

Steps to reproduce the issue (Mandatory)

  1. export DEVICE_TYPE=GPU_PCIE; export TRAIN_MODE=PYNATIVE_MODE;unset CUDA_VISIBLE_DEVICES; export CaseBVersion='MindSpore 2.3.X_Daily';export CaseCVersion='MindSpore 2.3.X';export VersionBranch=master;export VersionCommitId=a87635b6f65bc6225173fec53883351ebf449b66;export VersionBuildDate=20240426230938; source solution_test/env_set.source -e cuda11.1
  2. cd solution_test/cases/03subject_test/00reliability_availability/00long_stability/bert
  3. pytest -s -v test_ms_long_stability_nlp_bert_base_8p_0001.py

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

Core was generated by `python run_pretrain.py --device_target=GPU --distribute=true --epoch_size=1 --e'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007fc765de0040 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
[Current thread is 1 (Thread 0x7fc792c19740 (LWP 68124))]
(gdb) bt
#0  0x00007fc765de0040 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#1  0x00007fc765d9226a in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#2  0x00007fc765b318bf in mindspore::opt::PassManager::RunPass(std::shared_ptr<mindspore::FuncGraph> const&, unsigned long, std::shared_ptr<mindspore::opt::Pass> const&) const ()
   from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#3  0x00007fc765dc69a1 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#4  0x00007fc765aeb205 in mindspore::opt::GraphOptimizer::Optimize(std::shared_ptr<mindspore::FuncGraph> const&, bool) ()
   from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#5  0x00007fc765d44984 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#6  0x00007fc765d44e0a in mindspore::graphkernel::GraphKernelOptimize(std::shared_ptr<mindspore::session::KernelGraph> const&) ()
   from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#7  0x00007fc7569488cf in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_gpu.so.11.1
#8  0x00007fc76630ad6e in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#9  0x00007fc766016d20 in mindspore::compile::MindRTBackend::RealCompileGraphBeforeRunActor(mindspore::runtime::GraphCompilerInfo const&, mindspore::VectorRef const&, bool) ()
   from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#10 0x00007fc766018154 in mindspore::compile::MindRTBackend::RunGraphByActors(std::string const&, mindspore::runtime::GraphCompilerInfo const&, mindspore::VectorRef const&, mindspore::VectorRef*) () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#11 0x00007fc7660208fe in mindspore::compile::MindRTBackend::RunGraphByCondition(std::string const&, mindspore::runtime::GraphCompilerInfo const&, mindspore::VectorRef const&, mindspore::VectorRef*) () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#12 0x00007fc76603a055 in mindspore::compile::MindRTBackendBase::RunGraph(std::string const&, mindspore::VectorRef const&, mindspore::VectorRef*) ()
   from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#13 0x00007fc76c7c42de in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#14 0x00007fc76ccb89b3 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#15 0x00007fc76ccba3f9 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#16 0x00007fc76ceceb24 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#17 0x00007fc76cec7499 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#18 0x00007fc76c949fdd in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#19 0x00007fc76cecb58f in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#20 0x00007fc76cecb8d1 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#21 0x00007fc76ceda0d1 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#22 0x00007fc76bd85d75 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so
#23 0x000055f83b8c1f1d in _PyMethodDef_RawFastCallDict () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:515
#24 0x000055f83b8c20a1 in _PyCFunction_FastCallDict (func=0x7fc78890f730, args=<optimized out>, nargs=<optimized out>, kwargs=<optimized out>)
    at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:586
#25 0x000055f83b8bfeae in _PyObject_Call_Prepend () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:908
#26 0x000055f83b8b2a3e in PyObject_Call () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:245
#27 0x000055f83b95a12a in do_call_core (kwdict=0x0, callargs=0x7fc6b34c4910, func=0x7fc77c19fbe0) at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:4645
#28 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3191
#29 0x000055f83b8a01b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3930
#30 0x000055f83b8a12a5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:376
#31 0x000055f83b8bfeae in _PyObject_Call_Prepend () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:908
#32 0x000055f83b8b2a3e in PyObject_Call () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:245
#33 0x000055f83b95a12a in do_call_core (kwdict=0x7fc6981afbe0, callargs=0x7fc6b34c42d0, func=0x7fc792bd0320) at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:4645
#34 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3191
#35 0x000055f83b8a0978 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3930
#36 0x000055f83b8a12a5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:376
#37 0x000055f83b95a12a in do_call_core (kwdict=0x7fc69826a5a0, callargs=0x7fc6982c4450, func=0x7fc69836e710) at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:4645
#38 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3191
#39 0x000055f83b8a049a in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3930
#40 0x000055f83b8a12a5 in _PyFunction_FastCallDict () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:376
#41 0x000055f83b95a12a in do_call_core (kwdict=0x7fc6981b6dc0, callargs=0x7fc6982c32d0, func=0x7fc6982b5ef0) at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:4645
#42 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3191
#43 0x000055f83b8a0978 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3930
#44 0x000055f83b8f0437 in _PyFunction_FastCallKeywords () at /tmp/build/80754af9/python_1572016129546/work/Objects/call.c:433
#45 0x000055f83b958606 in call_function (kwnames=0x0, oparg=<optimized out>, pp_stack=<synthetic pointer>) at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:4616
#46 _PyEval_EvalFrameDefault () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3124
#47 0x000055f83b8a01b9 in _PyEval_EvalCodeWithName () at /tmp/build/80754af9/python_1572016129546/work/Python/ceval.c:3930
[WARNING] ME(68112,7fe4ff54d740,python):2024-04-28-15:56:16.277.210 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:193] LoadDynamicLib] Load dynamic library: libmindspore_ascend.so.2 failed. libge_runner.so: cannot open shared object file: No such file or directory
[WARNING] DISTRIBUTED(68112,7fe4ff54d740,python):2024-04-28-15:56:24.511.830 [mindspore/ccsrc/distributed/collective/collective_manager.cc:259] CreateCommunicationGroup] Start to create communication group: nccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}
[WARNING] DISTRIBUTED(68112,7fe4ff54d740,python):2024-04-28-15:56:25.593.742 [mindspore/ccsrc/distributed/collective/collective_manager.cc:335] CreateCommunicationGroup] Begin initialize communication group on the device side: nccl_world_group
[WARNING] DISTRIBUTED(68112,7fe4ff54d740,python):2024-04-28-15:56:26.880.992 [mindspore/ccsrc/distributed/collective/collective_manager.cc:345] CreateCommunicationGroup] End initialize communication group on the device side: nccl_world_group
origin dataset size:  410184
[WARNING] ME(68112:140621513021248,MainProcess):2024-04-28-15:56:29.413.098 [mindspore/train/model.py:1122] For LossCallBack callback, {'step_end'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
[10-90-66-121:68112] *** Process received signal ***
[10-90-66-121:68112] Signal: Segmentation fault (11)
[10-90-66-121:68112] Signal code: Address not mapped (1)
[10-90-66-121:68112] Failing at address: (nil)
[10-90-66-121:68112] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x12980)[0x7fe4ff133980]
[10-90-66-121:68112] [ 1] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(+0x3b6c040)[0x7fe4d2714040]
[10-90-66-121:68112] [ 2] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(+0x3b1e26a)[0x7fe4d26c626a]
[10-90-66-121:68112] [ 3] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZNK9mindspore3opt11PassManager7RunPassERKSt10shared_ptrINS_9FuncGraphEEmRKS2_INS0_4PassEE+0x3f)[0x7fe4d24658bf]
[10-90-66-121:68112] [ 4] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(+0x3b529a1)[0x7fe4d26fa9a1]
[10-90-66-121:68112] [ 5] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore3opt14GraphOptimizer8OptimizeERKSt10shared_ptrINS_9FuncGraphEEb+0xd5)[0x7fe4d241f205]
[10-90-66-121:68112] [ 6] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(+0x3ad0984)[0x7fe4d2678984]
[10-90-66-121:68112] [ 7] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore11graphkernel19GraphKernelOptimizeERKSt10shared_ptrINS_7session11KernelGraphEE+0x3a)[0x7fe4d2678e0a]
[10-90-66-121:68112] [ 8] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/plugin/libmindspore_gpu.so.11.1(+0x46168cf)[0x7fe4c327c8cf]
[10-90-66-121:68112] [ 9] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(+0x4096d6e)[0x7fe4d2c3ed6e]
[10-90-66-121:68112] [10] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7compile13MindRTBackend30RealCompileGraphBeforeRunActorERKNS_7runtime17GraphCompilerInfoERKNS_9VectorRefEb+0x610)[0x7fe4d294ad20]
[10-90-66-121:68112] [11] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7compile13MindRTBackend16RunGraphByActorsERKSsRKNS_7runtime17GraphCompilerInfoERKNS_9VectorRefEPS8_+0x884)[0x7fe4d294c154]
[10-90-66-121:68112] [12] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7compile13MindRTBackend19RunGraphByConditionERKSsRKNS_7runtime17GraphCompilerInfoERKNS_9VectorRefEPS8_+0x21e)[0x7fe4d29548fe]
[10-90-66-121:68112] [13] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so(_ZN9mindspore7compile17MindRTBackendBase8RunGraphERKSsRKNS_9VectorRefEPS4_+0x1095)[0x7fe4d296e055]
[10-90-66-121:68112] [14] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x21e92de)[0x7fe4d90f82de]
[10-90-66-121:68112] [15] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x26dd9b3)[0x7fe4d95ec9b3]
[10-90-66-121:68112] [16] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x26df3f9)[0x7fe4d95ee3f9]
[10-90-66-121:68112] [17] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x28f3b24)[0x7fe4d9802b24]
[10-90-66-121:68112] [18] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x28ec499)[0x7fe4d97fb499]
[10-90-66-121:68112] [19] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x236efdd)[0x7fe4d927dfdd]
[10-90-66-121:68112] [20] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x28f058f)[0x7fe4d97ff58f]
[10-90-66-121:68112] [21] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x28f08d1)[0x7fe4d97ff8d1]
[10-90-66-121:68112] [22] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x28ff0d1)[0x7fe4d980e0d1]
[10-90-66-121:68112] [23] /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_c_expression.cpython-37m-x86_64-linux-gnu.so(+0x17aad75)[0x7fe4d86b9d75]
[10-90-66-121:68112] [24] python(_PyMethodDef_RawFastCallDict+0x24d)[0x5586c6519f1d]
[10-90-66-121:68112] [25] python(_PyCFunction_FastCallDict+0x21)[0x5586c651a0a1]
[10-90-66-121:68112] [26] python(_PyObject_Call_Prepend+0xde)[0x5586c6517eae]
[10-90-66-121:68112] [27] python(PyObject_Call+0x6e)[0x5586c650aa3e]
[10-90-66-121:68112] [28] python(_PyEval_EvalFrameDefault+0x1f3a)[0x5586c65b212a]
[10-90-66-121:68112] [29] python(_PyEval_EvalCodeWithName+0x2f9)[0x5586c64f81b9]
[10-90-66-121:68112] *** End of error message ***
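
Many frames in the log above are unresolved: gdb prints `??` and the signal handler prints only a bare `+0x...` offset. The following is a minimal sketch, assuming the frame layout seen in this log (`[host:pid] [ n] module(symbol+0xoffset)[0xaddress]`), of how the module and offset of each frame could be extracted so they can be passed to a symbolizer such as `addr2line` against a build that has debug info. The names `FRAME_RE`, `Frame`, and `parse_frame` are illustrative and are not part of MindSpore.

```python
import re
from typing import NamedTuple, Optional

# Layout assumed from the log above:
#   [host:pid] [ n] /path/to/module(symbol+0xoffset)[0xaddress]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                            # frame index, e.g. "[ 3]"
    r"(?P<module>[^\s(]+)"                                 # module path or executable name
    r"\((?P<symbol>[^+)]*)\+(?P<offset>0x[0-9a-fA-F]+)\)"  # symbol may be empty
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"                     # absolute address
)


class Frame(NamedTuple):
    index: int
    module: str
    symbol: Optional[str]  # None when the frame has no symbol name
    offset: int            # offset printed after '+', as an integer


def parse_frame(line: str) -> Optional[Frame]:
    """Return the parsed frame, or None if the line is not a frame line."""
    m = FRAME_RE.search(line)
    if m is None:
        return None
    return Frame(
        index=int(m["index"]),
        module=m["module"],
        symbol=m["symbol"] or None,
        offset=int(m["offset"], 16),
    )


# Frame [ 1] from the log above: no symbol, only a module-relative offset.
line = (
    "[10-90-66-121:68112] [ 1] /home/miniconda3/envs/ci/lib/python3.7/"
    "site-packages/mindspore/lib/libmindspore_backend.so(+0x3b6c040)[0x7fe4d2714040]"
)
frame = parse_frame(line)
if frame is not None and frame.symbol is None:
    # Only builds the command string; nothing is executed here.
    print(f"addr2line -f -C -e {frame.module} {frame.offset:#x}")
```

For frames that do carry a symbol, the offset is relative to that symbol rather than to the module, so only the symbol-less frames are turned into an `addr2line` command here.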

Special notes for this issue (Optional)

Comments (10)

duanjiali created the Bug-Report
duanjiali added the label v2.3.0.rc2
duanjiali added the label kind/bug
duanjiali added the label attr/function
duanjiali added the label sig/akg

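Please assign maintainer to check this issue.
@duanjiali

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to see the error stack and help locate it
  2. For model accuracy tuning problems, see the tuning guide on the official website
  3. If you are reporting a framework bug, make sure the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error
  4. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open source community; we will review it as soon as possible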

Stack trace

#0  0x00007fffd4471c6a in mindspore::graphkernel::inner::ConstScalarNode::ConstScalarNode (this=0x5555e847aed0,
    data=std::shared_ptr<mindspore::Value> (use count 3, weak count 1) = {...})
    at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/graph_kernel/model/node.h:125
#1  0x00007fffd447ee54 in __gnu_cxx::new_allocator<mindspore::graphkernel::inner::ConstScalarNode>::construct<mindspore::graphkernel::inner::ConstScalarNode, std::shared_ptr<mindspore::Value>&> (this=0x7fffffff91d7, __p=0x5555e847aed0) at /usr/include/c++/7/ext/new_allocator.h:136
#2  0x00007fffd447e698 in std::allocator_traits<std::allocator<mindspore::graphkernel::inner::ConstScalarNode> >::construct<mindspore::graphkernel::inner::ConstScalarNode, std::shared_ptr<mindspore::Value>&> (__a=..., __p=0x5555e847aed0) at /usr/include/c++/7/bits/alloc_traits.h:475
#3  0x00007fffd447da02 in std::_Sp_counted_ptr_inplace<mindspore::graphkernel::inner::ConstScalarNode, std::allocator<mindspore::graphkernel::inner::ConstScalarNode>, (__gnu_cxx::_Lock_policy)2>::_Sp_counted_ptr_inplace<std::shared_ptr<mindspore::Value>&> (this=0x5555e847aec0, __a=...) at /usr/include/c++/7/bits/shared_ptr_base.h:526
#4  0x00007fffd447c349 in std::__shared_count<(__gnu_cxx::_Lock_policy)2>::__shared_count<mindspore::graphkernel::inner::ConstScalarNode, std::allocator<mindspore::graphkernel::inner::ConstScalarNode>, std::shared_ptr<mindspore::Value>&> (this=0x7fffffff94a8, __a=...) at /usr/include/c++/7/bits/shared_ptr_base.h:637
#5  0x00007fffd447a946 in std::__shared_ptr<mindspore::graphkernel::inner::ConstScalarNode, (__gnu_cxx::_Lock_policy)2>::__shared_ptr<std::allocator<mindspore::graphkernel::inner::ConstScalarNode>, std::shared_ptr<mindspore::Value>&> (this=0x7fffffff94a0, __tag=..., __a=...) at /usr/include/c++/7/bits/shared_ptr_base.h:1295
#6  0x00007fffd4478a1f in std::shared_ptr<mindspore::graphkernel::inner::ConstScalarNode>::shared_ptr<std::allocator<mindspore::graphkernel::inner::ConstScalarNode>, std::shared_ptr<mindspore::Value>&> (this=0x7fffffff94a0, __tag=..., __a=...) at /usr/include/c++/7/bits/shared_ptr.h:344
#7  0x00007fffd44761c1 in std::allocate_shared<mindspore::graphkernel::inner::ConstScalarNode, std::allocator<mindspore::graphkernel::inner::ConstScalarNode>, std::shared_ptr<mindspore::Value>&> (__a=...) at /usr/include/c++/7/bits/shared_ptr.h:691
#8  0x00007fffd4473d87 in std::make_shared<mindspore::graphkernel::inner::ConstScalarNode, std::shared_ptr<mindspore::Value>&> ()
    at /usr/include/c++/7/bits/shared_ptr.h:707

#9  0x00007fffd445eef9 in mindspore::graphkernel::GkUtils::AnfGraph2LiteGraph (func_graph=std::shared_ptr<mindspore::FuncGraph> (use count 9, weak count 27) = {...},
    op_node_map=0x0) at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_utils.cc:354
#10 0x00007fffd4358383 in mindspore::graphkernel::ArithmeticSimplify::Run (this=0x5555e7a126d0, func_graph=
    std::shared_ptr<mindspore::FuncGraph> (use count 18, weak count 4727) = {...})
    at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/graph_kernel/core/arithmetic_simplify.cc:1091

#11 0x00007fffd3a9bac9 in mindspore::opt::PassManager::RunPass (this=0x5555e45151e0,
    func_graph=std::shared_ptr<mindspore::FuncGraph> (use count 18, weak count 4727) = {...}, pass_id=5, pass=
    std::shared_ptr<mindspore::opt::Pass> (use count 1, weak count 0) = {...})
    at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/optimizer/pass_manager.cc:36
#12 0x00007fffd44368f9 in mindspore::graphkernel::GraphKernelPassManager::Run (this=0x5555e45151e0,
    func_graph=std::shared_ptr<mindspore::FuncGraph> (use count 18, weak count 4727) = {...})
    at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_pass_manager.cc:53
#13 0x00007fffd39ff5c8 in mindspore::opt::GraphOptimizer::Optimize (this=0x5555e4571e20,
    func_graph=std::shared_ptr<mindspore::FuncGraph> (use count 18, weak count 4727) = {...}, run_only_once=true)
    at /data/hhf/code/ms_tmp/mindspore/mindspore/ccsrc/backend/common/optimizer/graph_optimizer.cc:40
#14 0x00007fffd4211db4 in mindspore::graphkernel::GraphKernelOptimizer::Run (this=0x7fffffff9ce3,

This was probably introduced by the following recent change from @NaCN:

if ((IsPrimitiveCNode(cnode, prim::kPrimCast) || IsPrimitiveCNode(cnode, prim::kPrimTupleGetItem)) && i == idx) {
        input_node = std::make_shared<inner::ConstScalarNode>(input_value);
NaCN changed the task status from TODO to VALIDATION
NaCN changed the task status from VALIDATION to TODO

Root cause analysis: the Value's Type() is a null pointer.
Fix: get the type from the Abstract instead.
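
The root cause and fix described above can be illustrated with a language-neutral sketch. The snippet is written in Python rather than in the C++ of the actual backend, and every name in it (`Value`, `Abstract`, `Node`, `scalar_type_unsafe`, `scalar_type_of`) is a hypothetical stand-in, not MindSpore's API and not the merged patch. It only shows the pattern: a value whose own type record may be missing, and a lookup that reads the type from the node's abstract instead of dereferencing the value's type.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Value:
    data: Any
    type_name: Optional[str] = None  # may be unset, like a null Type()


@dataclass
class Abstract:
    type_name: str  # inferred type attached to the node


@dataclass
class Node:
    value: Value
    abstract: Optional[Abstract] = None


def scalar_type_unsafe(node: Node) -> str:
    # Mirrors the crash: uses the value's type without checking that it exists.
    return node.value.type_name.lower()


def scalar_type_of(node: Node) -> str:
    # Mirrors the described fix: read the type from the abstract, and raise a
    # clear error instead of dereferencing something that is missing.
    if node.abstract is None:
        raise ValueError("node has no abstract; cannot determine its type")
    return node.abstract.type_name.lower()


# A constant whose value carries no type, but whose node has an abstract.
node = Node(Value(2.0), Abstract("Float32"))
print(scalar_type_of(node))
```

In this sketch `scalar_type_unsafe(node)` raises `AttributeError`, the Python analogue of the null dereference, while `scalar_type_of(node)` succeeds because it never touches the value's own type.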

The fix has been merged; please run the regression test.

NaCN changed the assignee from hanhuifeng to duanjiali
NaCN added collaborator hanhuifeng
NaCN changed the task status from TODO to VALIDATION
NaCN changed the milestone from B-SIG-AKG to B-SolutionTest

The ticket does not follow the process: the rca/, rct/, and ctl/ labels are missing. Sent back.

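wenli changed the assignee from duanjiali to NaCN
wenli added collaborator duanjiali
wenli changed the task status from VALIDATION to TODO
wenli changed the milestone from B-SolutionTest to B-SIG-AKG
NaCN added the label rca/algorithm
NaCN added the label rct/bugfix
NaCN added the label ctl/codereview
NaCN changed the milestone from B-SIG-AKG to B-SolutionTest
NaCN changed the task status from TODO to VALIDATION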

The labels have been added; please run the regression test.

linzhengshu changed the assignee from NaCN to duanjiali
linzhengshu removed collaborator duanjiali
linzhengshu added collaborator NaCN
linzhengshu added collaborator wenli

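Regression date: 2024.5.7
Regression version: commit_id = '[sha1]:6fc7650c,[branch]:(HEAD,origin/master,origin/HEAD,master)'
Regression steps: run the test case above
Regression result: the segmentation fault no longer occurs, but an error is still reported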

[WARNING] DISTRIBUTED(45940,7fd148b1f740,python):2024-05-07-15:50:50.005.371 [mindspore/ccsrc/distributed/collective/collective_manager.cc:259] CreateCommunicationGroup] Start to create communication group: nccl_world_group [const vector]{0, 1, 2, 3, 4, 5, 6, 7}
[WARNING] DISTRIBUTED(45940,7fd148b1f740,python):2024-05-07-15:50:53.234.228 [mindspore/ccsrc/distributed/collective/collective_manager.cc:335] CreateCommunicationGroup] Begin initialize communication group on the device side: nccl_world_group
[WARNING] DISTRIBUTED(45940,7fd148b1f740,python):2024-05-07-15:50:54.611.406 [mindspore/ccsrc/distributed/collective/collective_manager.cc:345] CreateCommunicationGroup] End initialize communication group on the device side: nccl_world_group
origin dataset size:  410184
[WARNING] ME(45940:140536844515136,MainProcess):2024-05-07-15:50:57.921.662 [mindspore/train/model.py:1122] For LossCallBack callback, {'step_end'} methods may not be supported in later version, Use methods prefixed with 'on_train' or 'on_eval' instead when using customized callbacks.
[CRITICAL] ME(45940,7fd148b1f740,python):2024-05-07-15:51:01.605.155 [mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_utils.cc:298] InputValue2Tensor] Unsupported Type in InputValue2Tensor
[WARNING] MD(45940,7fd148b1f740,python):2024-05-07-15:51:01.730.699 [mindspore/ccsrc/minddata/dataset/engine/datasetops/data_queue_op.cc:163] ~DataQueueOp]
preprocess_batch: 4;
batch_queue: 0, 3, 16, 15, 16, 15, 16, 15, 16;
            push_start_time -> push_end_time
2024-05-07-15:50:59.123.569 -> 2024-05-07-15:50:59.663.181
2024-05-07-15:50:59.663.189 -> 2024-05-07-15:50:59.664.361
2024-05-07-15:50:59.664.368 -> 2024-05-07-15:50:59.664.535
2024-05-07-15:50:59.664.541 -> 2024-05-07-15:51:01.648.847
For more details, please refer to the FAQ at https://www.mindspore.cn/docs/en/master/faq/data_processing.html.
Traceback (most recent call last):
  File "run_pretrain.py", line 288, in <module>
    run_pretrain()
  File "/home/jenkins/workspace/TDT_deployment/solution_test/cases/03subject_test/00reliability_availability/00long_stability/bert/test_ms_long_stability_nlp_bert_base_8p_0001/src/model_utils/moxing_adapter.py", line 109, in wrapped_func
    run_func(*args, **kwargs)
  File "run_pretrain.py", line 283, in run_pretrain
    dataset_sink_mode=(cfg.enable_data_sink == "true"), sink_size=cfg.data_sink_steps)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1087, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 637, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 721, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 713, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 709, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 481, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 109, in construct
    return self.network(*outputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 713, in __call__
    raise err
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 709, in __call__
    output = self._run_construct(args, kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 481, in _run_construct
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/jenkins/workspace/TDT_deployment/solution_test/cases/03subject_test/00reliability_availability/00long_stability/bert/test_ms_long_stability_nlp_bert_base_8p_0001/src/bert_for_pre_training.py", line 492, in construct
    mstype.float32))
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 389, in after_grad
    return grad_(fn, weights)(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 132, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/ops/composite/base.py", line 378, in after_grad
    out = _pynative_executor.grad(fn, grad_, weights, self.grad_position, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1337, in grad
    return self._executor.grad(grad, obj, weights, grad_position, *args, *(kwargs.values()))
RuntimeError: Unsupported Type in InputValue2Tensor

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/backend/common/graph_kernel/core/graph_kernel_utils.cc:298 InputValue2Tensor

Regression conclusion: failed

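i-robot added the label www
duanjiali changed the task status from VALIDATION to WIP
duanjiali added collaborator duanjiali
duanjiali changed the assignee from duanjiali to NaCN
duanjiali removed collaborator NaCN
duanjiali changed the milestone from B-SolutionTest to B-SIG-AKG
NaCN added collaborator NaCN
NaCN changed the assignee from NaCN to 邹文祥
fangwenyi removed the label v2.3.0.rc2
fangwenyi added the label master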
