MindSpore / mindspore

[ST][ms][2.3] With CuDNN not installed (abnormal scenario), launching resnet50 training produces a core dump

TODO
Bug-Report
Created: 2024-03-20 18:34
name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

With CuDNN not installed (abnormal scenario), launching resnet50 training produces a core dump.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device GPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    commit_id = '[sha1]:647c6623,[branch]:(HEAD,origin/r2.3,r2.3)'

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

test_ms_fmea_env_cudnn_notmatch_0003

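Steps to reproduce the issue (Mandatory)

1. Copy the network scripts and adjust the script configuration.
2. Remove the CuDNN-related directories.
3. Run the network training and check the processes.
4. Check the error messages.
5. Restore CuDNN.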
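
The process check in steps 3 and 4 can be sketched as follows. This is a minimal illustration, not the actual test case: it assumes a POSIX system, and the helper names `describe_exit` and `run_and_check` are hypothetical. It distinguishes a clean failure (non-zero exit status) from a process killed by a signal such as SIGABRT, which is the case that produces a core file.

```python
import signal
import subprocess
import sys
from typing import Sequence


def describe_exit(returncode: int) -> str:
    """Classify a subprocess return code using POSIX semantics."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # subprocess reports death by signal N as the return code -N.
        return "killed by " + signal.Signals(-returncode).name
    return "exited with status {}".format(returncode)


def run_and_check(cmd: Sequence[str]) -> str:
    """Run cmd, capture its output, and report how the process ended."""
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return describe_exit(proc.returncode)


# Hypothetical usage for this issue (requires MindSpore to be installed):
#   run_and_check([sys.executable, "-c", "import mindspore"])
# The expected behavior is a normal error exit, not "killed by SIGABRT".
```

A result of "killed by SIGABRT" would correspond to the `abort()` frame at the top of the backtrace in the log section; any "exited with status" result means the process reported the error without crashing.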

Describe the expected behavior (Mandatory)

No core file is produced, and the reported error matches expectations.

Related log / screenshot (Mandatory)

Core file stack:
(gdb) bt
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:51
#1 0x00007fd6c0a017f1 in __GI_abort () at abort.c:79
#2 0x00007fd6b22cc84a in __gnu_cxx::__verbose_terminate_handler ()
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007fd6b22caf47 in __cxxabiv1::__terminate (handler=)
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007fd6b22caf7d in std::terminate () at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57
#5 0x00007fd6b22cb15a in __cxxabiv1::__cxa_throw (obj=0x557259ddd1e0, tinfo=0x7fd6b23847c0 , dest=0x7fd6b22d71c8 std::runtime_error::~runtime_error())
at /home/nwani/m3/conda-bld/compilers_linux-64_1560109574129/work/.build/x86_64-conda_cos6-linux-gnu/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:95
#6 0x00007fd68d028c67 in mindspore::LogWriter::operator^(mindspore::LogStream const&) const () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_core.so
#7 0x00007fd6951a59e0 in mindspore::device::DeviceContextManager::SelectGpuPlugin(std::string const&, std::set<std::string, std::less<std::string>, std::allocator<std::string> > const&) ()
from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#8 0x00007fd6951a7c6e in mindspore::device::DeviceContextManager::LoadPlugin() () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#9 0x00007fd6951a7fdd in mindspore::device::DeviceContextManager::GetInstance() () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#10 0x00007fd692f73b95 in ?? () from /home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/lib/libmindspore_backend.so
#11 0x00007fd6c0fe18d3 in call_init (env=0x557258e2d0c0, argv=0x7ffc8f0bbbf8, argc=9, l=) at dl-init.c:72
#12 _dl_init (main_map=main_map@entry=0x557259159d20, argc=9, argv=0x7ffc8f0bbbf8, env=0x557258e2d0c0) at dl-init.c:119
#13 0x00007fd6c0fe639f in dl_open_worker (a=a@entry=0x7ffc8f0b7a30) at dl-open.c:522
#14 0x00007fd6c0b2816f in __GI__dl_catch_exception (exception=exception@entry=0x7ffc8f0b7a10, operate=operate@entry=0x7fd6c0fe5f60 <dl_open_worker>, args=args@entry=0x7ffc8f0b7a30)
at dl-error-skeleton.c:196

Training log:
[CRITICAL] ME(6594,7fd6c11e2740,python):2024-03-20-17:31:21.645.727 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:645] SelectGpuPlugin] Env CUDA_HOME is /usr/local/cuda-11.6, but dlopen file_name failed, reason: Load dynamic library: libmindspore_ascend.so.2 failed. libge_runner.so: cannot open shared object file: No such file or directory
Load dynamic library: libmindspore_gpu.so.11.6 failed.

[ERROR] CORE(6594,7fd6c11e2740,python):2024-03-20-17:31:21.645.859 [mindspore/core/utils/log_adapter.cc:394] operator^] Runtime error for null exception handler.
terminate called after throwing an instance of 'std::runtime_error'
what(): Env CUDA_HOME is /usr/local/cuda-11.6, but dlopen file_name failed, reason: Load dynamic library: libmindspore_ascend.so.2 failed. libge_runner.so: cannot open shared object file: No such file or directory
Load dynamic library: libmindspore_gpu.so.11.6 failed.


  • C++ Call Stack: (For framework developers)

mindspore/ccsrc/runtime/hardware/device_context_manager.cc:645 SelectGpuPlugin

Special notes for this issue (Optional)

Assign to 黄勇

Comments (4)

wenli created the Bug-Report
wenli added the kind/bug label
wenli added the v2.3.0 label
wenli added the sig/runtime label
wenli added the attr/function label
wenli added the stage/func-debug label
wenli added 黄勇 as a collaborator

Please assign a maintainer to check this issue.
@wenli

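Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it
  2. For model accuracy tuning problems, refer to the tuning guide on the official website
  3. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error
  4. If you have already identified the root cause, you are welcome to submit a PR and take part in the MindSpore open-source community; we will review it as soon as possible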
duanjiali added duanjiali as a collaborator
duanjiali changed the assignee from duanjiali to 黄勇
duanjiali removed collaborator 黄勇

Reproducible on a single machine with a single card: [image]. After cudnn is removed, the three-in-one package causes a core dump at import mindspore.

i-robot added the foruda label

[image]
The framework is not allowed to throw exceptions; throwing one causes a core dump.

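Root Cause:
The log level in selectgpu is set inappropriately: using the exception level causes the core dump. The framework must not throw an exception at this point. Even if some plugins fail to load on GPU, that should not affect functionality on CPU.
Self-test result below: the reported error matches expectations.
[image]
[image]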
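
The principle in the root cause (a failed plugin load is reported but does not take down the process, and CPU stays usable) can be illustrated with a small Python analogy. This is a sketch under stated assumptions, not MindSpore's implementation: the real logic lives in C++ in `SelectGpuPlugin`, and the function name `available_devices`, the logger name, and the device-to-library mapping here are hypothetical.

```python
import ctypes
import logging
from typing import Dict, List

logger = logging.getLogger("plugin_loader_sketch")


def available_devices(plugins: Dict[str, str]) -> List[str]:
    """Return usable devices; CPU is always kept, others need their plugin.

    plugins maps a device name to the shared library that backs it.
    """
    devices = ["CPU"]
    for device, library in plugins.items():
        try:
            # ctypes.CDLL wraps dlopen on POSIX and raises OSError on failure.
            ctypes.CDLL(library)
        except OSError as err:
            # Log at warning level and continue instead of raising, so a
            # missing dependency cannot abort the whole process.
            logger.warning("Load dynamic library: %s failed. %s", library, err)
            continue
        devices.append(device)
    return devices
```

With the library names from the training log, a call such as `available_devices({"GPU": "libmindspore_gpu.so.11.6"})` would return only `["CPU"]` on a machine where that library or one of its dependencies cannot be loaded, while logging the reason.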

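Whether ST/UT cases were added:
Do ST/UT cases need to be added: No.
Reason: existing test cases already guard this scenario, so no additional cases are needed.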
