name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
Abnormal scenario with cuDNN not installed: launching ResNet50 training produces a core dump
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device GPU
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
commit_id = '[sha1]:647c6623,[branch]:(HEAD,origin/r2.3,r2.3)'
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode pynative
/mode graph
test_ms_fmea_env_cudnn_notmatch_0003
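1. Copy the network script and make the required script and configuration changes.
2. Remove the cuDNN-related directories.
3. Run the network training and check the processes.
4. Check the error messages.
5. Restore cuDNN.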
No core file is produced, and the error message is as expected.
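The process check in step 3 can be sketched as follows. This is a minimal illustration, not part of the test case itself: the helper name `classify_exit` is hypothetical, and it only relies on the POSIX convention that `subprocess` reports a child killed by a signal as a negative return code.

```python
import signal
import subprocess
import sys


def classify_exit(cmd):
    """Run cmd and report how the child process ended.

    Returns "ok" for exit code 0, "error:<code>" for a non-zero exit,
    and "signal:<NAME>" when the child was killed by a signal.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode < 0:
        # On POSIX, a negative return code is the number of the signal
        # that terminated the child (SIGABRT is what std::terminate raises).
        try:
            name = signal.Signals(-proc.returncode).name
        except ValueError:
            name = str(-proc.returncode)
        return "signal:" + name
    if proc.returncode == 0:
        return "ok"
    return "error:%d" % proc.returncode
```

For this issue, a call such as `classify_exit([sys.executable, "-c", "import mindspore"])` would separate the expected outcome (a normal exit or an ordinary Python error) from the reported one (termination by `SIGABRT`, which is what leaves a core file when core dumps are enabled).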
Core file stack:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fd6c0a017f1 in __GI_abort () at abort.c:79
#2 0x00007fd6b22cc84a in __gnu_cxx::__verbose_terminate_handler ()
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fd6b22caf47 in __cxxabiv1::__terminate (handler=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007fd6b22caf7d in std::terminate () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5 0x00007fd6b22cb15a in __cxxabiv1::__cxa_throw (obj=0x557259ddd1e0, tinfo=0x7fd6b23847c0 , dest=0x7fd6b22d71c8 std::runtime_error::~runtime_error())
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6 0x00007fd68d028c67 in mindspore::LogWriter::operator^(mindspore::LogStream const&) const () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_core.so
#7 0x00007fd6951a59e0 in mindspore::device::DeviceContextManager::SelectGpuPlugin(std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&) ()
from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#8 0x00007fd6951a7c6e in mindspore::device::DeviceContextManager::LoadPlugin() () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#9 0x00007fd6951a7fdd in mindspore::device::DeviceContextManager::GetInstance() () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#10 0x00007fd692f73b95 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#11 0x00007fd6c0fe18d3 in call_init (env=0x557258e2d0c0, argv=0x7ffc8f0bbbf8, argc=9, l=) at dl-init.c:72
#12 _dl_init (main_map=main_map@entry=0x557259159d20, argc=9, argv=0x7ffc8f0bbbf8, env=0x557258e2d0c0) at dl-init.c:119
#13 0x00007fd6c0fe639f in dl_open_worker (a=a@entry=0x7ffc8f0b7a30) at dl-open.c:522
#14 0x00007fd6c0b2816f in __GI__dl_catch_exception (exception=exception@entry=0x7ffc8f0b7a10, operate=operate@entry=0x7fd6c0fe5f60 <dl_open_worker>, args=args@entry=0x7ffc8f0b7a30)
at dl-error-skeleton.c:196
Training log:
[CRITICAL] ME(6594,7fd6c11e2740,python):2024-03-20-17:31:21.645.727 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:645] SelectGpuPlugin] Env CUDA_HOME is /usr/local/cuda-11.6, but dlopen file_name failed, reason: Load dynamic library: libmindspore_ascend.so.2 failed. libge_runner.so: cannot open shared object file: No such file or directory
Load dynamic library: libmindspore_gpu.so.11.6 failed.
[ERROR] CORE(6594,7fd6c11e2740,python):2024-03-20-17:31:21.645.859 [mindspore/core/utils/log_adapter.cc:394] operator^] Runtime error for null exception handler.
terminate called after throwing an instance of 'std::runtime_error'
what(): Env CUDA_HOME is /usr/local/cuda-11.6, but dlopen file_name failed, reason: Load dynamic library: libmindspore_ascend.so.2 failed. libge_runner.so: cannot open shared object file: No such file or directory
Load dynamic library: libmindspore_gpu.so.11.6 failed.
mindspore/ccsrc/runtime/hardware/device_context_manager.cc:645 SelectGpuPlugin
Assigned to 黄勇
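Reproducible on a single machine with a single card: after cuDNN is removed, the three-in-one package produces a core dump at `import mindspore`.
The framework does not allow an exception to be thrown at this point; throwing one causes the core dump.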
Root Cause:
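The log level in SelectGpuPlugin is set inappropriately: it is set to the exception level, which produces the core dump. The framework must not throw an exception at this point. Even if some GPU plugins fail to load, that should not affect functionality on the CPU.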
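The principle behind the root cause (a failed optional plugin load is logged, not raised, so CPU execution stays available) can be sketched as below. This is an illustration under stated assumptions and not MindSpore's actual fix: the function `select_backend`, the logger name, and the `plugin_libs` mapping are all hypothetical, and the real logic lives in C++ in `device_context_manager.cc`.

```python
import ctypes
import logging

logger = logging.getLogger("plugin_loader")  # hypothetical logger name


def select_backend(plugin_libs):
    """Try each optional plugin library and fall back to CPU if none loads.

    plugin_libs maps a backend name to a shared-library file name.
    Returns (backend, failures), where failures maps each backend that
    could not be loaded to the loader's error message.
    """
    failures = {}
    for backend, lib_name in plugin_libs.items():
        try:
            ctypes.CDLL(lib_name)
        except OSError as err:
            # A missing dependency (for example cuDNN) is recorded and
            # logged as a warning instead of being raised, so the failure
            # cannot terminate the process.
            failures[backend] = str(err)
            logger.warning("Load dynamic library %s failed: %s", lib_name, err)
            continue
        return backend, failures
    # No plugin could be loaded: CPU functionality remains usable.
    return "CPU", failures
```

With this shape, the caller can still report the collected `failures` later, at the moment the user explicitly requests the GPU backend, where an error can be handled normally.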
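Self-test result: the error message is as expected.
ST/UT cases added:
Do ST/UT cases need to be added: No.
Reason: existing test cases already cover this scenario, so no additional cases are needed.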