MindSpore / mindspore

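[ST][ms][2.3] ResNet50 with randomly added ms_function: after exporting MindIR and comparing MindIR inference against ckpt inference, the process exits abnormally with no error message. The stack trace shows a Segmentation fault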

TODO
Bug-Report
Created 2024-02-23 18:50

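Describe the current behavior (Mandatory)

ms_function was added at random to the ResNet50 network, the model was exported to MindIR, and MindIR inference was compared against ckpt inference. The process exits abnormally with no error message. The stack trace shows a Segmentation fault.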

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/GPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    MindSpore version: commit_id = '[sha1]:8abeb488,[branch]:(HEAD,origin/r2.3,r2.3)'
    run package version: Milan_C17/20240206

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

test_ms_jit_network_001_mindir_infer

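Steps to reproduce the issue (Mandatory)

1. Initialize the parameters and copy the network script.
2. In the ResNet50 network definition script, with dynamic LossScale in use, randomly select 3 to 5 functions and construct methods and decorate them with @ms_function.
3. Train the ResNet50 network on the cifar10 dataset on a single device.
4. Run the full-network training script, save the ckpt, and export the ckpt to MindIR.
5. Call the load_ckpt_mindir_compare script to load the ckpt and the MindIR separately and compare the results.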
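
Step 2 can be pictured with the sketch below. It is not the test suite's actual code: the helper name `add_decorator_randomly`, its parameters, and the choice to skip `__init__` are assumptions made for illustration. It only edits the script text, so the target script is assumed to already import the decorator it uses.

```python
import random
import re

# Matches a function definition line; captures its indentation and its name.
DEF_RE = re.compile(r"^([ \t]*)def\s+(\w+)\s*\(")


def add_decorator_randomly(source, decorator="@ms_function",
                           k_min=3, k_max=5, seed=None):
    """Insert `decorator` above k randomly chosen functions (k_min <= k <= k_max).

    Hypothetical helper: `construct` methods are ordinary candidates,
    `__init__` is skipped (an assumption), and functions that already
    carry the decorator on the preceding line are left alone.
    """
    rng = random.Random(seed)
    lines = source.splitlines()

    candidates = []
    for index, line in enumerate(lines):
        match = DEF_RE.match(line)
        if match is None or match.group(2) == "__init__":
            continue
        if index > 0 and lines[index - 1].strip() == decorator:
            continue
        candidates.append((index, match.group(1)))

    # Never ask for more functions than the script contains.
    count = min(rng.randint(k_min, k_max), len(candidates))
    chosen = rng.sample(candidates, count)

    # Insert from the bottom up so the remaining line indices stay valid.
    for index, indent in sorted(chosen, reverse=True):
        lines.insert(index, indent + decorator)
    return "\n".join(lines) + "\n"
```

Passing a fixed `seed` makes the selection repeatable, which helps when a crash like this one needs to be reproduced with the same set of decorated functions.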

Commands to run the test case:
source /home/miniconda3/bin/activate ci
export TRAIN_MODE=PYNATIVE_MODE
export DEVICE_TYPE=GPU_PCIE
export ENV_DEVICE=1
source solution_test/env_set.source -e cuda11.6
cd solution_test/cases/01frame_func/04model_save_load/ms_function/ms_function_mindir_infer
pytest -s test_ms_jit_network_001_mindir_infer.py

Describe the expected behavior (Mandatory)

Training completes normally and the test case passes.

Related log / screenshot (Mandatory)

The last INFO log lines before the process exited:
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.761 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_prepare_actor.cc:1016] PrepareDataForSequenceAndScalarValue] Prepare device data for value node: ValueNode 0.0001
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.784 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_prepare_actor.cc:845] PrepareDataForValueNodeTensor] Prepare device data for value node: ValueNode Tensor(shape=[4], dtype=Int64, value=[256 256 3 3]), output index: 0 device address:0x55a4061e88d0
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.805 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_prepare_actor.cc:1016] PrepareDataForSequenceAndScalarValue] Prepare device data for value node: ValueNode 0.0001
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.821 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_prepare_actor.cc:1016] PrepareDataForSequenceAndScalarValue] Prepare device data for value node: ValueNode 0
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.974 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_prepare_actor.cc:695] PrepareDataForHostTensorQueue] Prepare host data, input tensor size: 2, arg size: 0
[INFO] RUNTIME_FRAMEWORK(108500,7f51bd67b740,python):2024-02-23-18:19:09.741.993 [mindspore/ccsrc/runtime/graph_scheduler/actor/data_source_actor.cc:39] FetchData] Data source actor(kernel_graph_4_HostDSActor) fetches data.
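
To find the last entry a crashed process logged, lines in the format above can be parsed mechanically. The sketch below is only an illustration: the regular expression is inferred from the six lines quoted in this issue rather than from any MindSpore specification, and `parse_log_line` and `last_record` are hypothetical helper names.

```python
import re

# Layout inferred from the log lines quoted in this issue:
# [LEVEL] MODULE(pid,thread,process):timestamp [file:line] Function] message
LOG_RE = re.compile(
    r"^\[(?P<level>\w+)\]\s+(?P<module>\w+)"
    r"\((?P<pid>\d+),(?P<thread>[0-9a-f]+),(?P<process>[^)]*)\):"
    r"(?P<timestamp>\S+)\s+"
    r"\[(?P<file>[^:\]]+):(?P<line>\d+)\]\s+"
    r"(?P<function>\w+)\]\s+(?P<message>.*)$"
)


def parse_log_line(text):
    """Return the fields of one log line as a dict, or None if it does not match."""
    match = LOG_RE.match(text.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["pid"] = int(record["pid"])
    record["line"] = int(record["line"])
    return record


def last_record(lines, pid=None):
    """Return the last parsable record, optionally restricted to one process id."""
    last = None
    for text in lines:
        record = parse_log_line(text)
        if record is not None and (pid is None or record["pid"] == pid):
            last = record
    return last
```

Applied to the lines above with `pid=108500`, `last_record` returns the `FetchData` entry from `data_source_actor.cc`, the final thing the process logged before it exited.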

Stack trace:
$ gdb python -c core-108500
GNU gdb (Ubuntu 8.1.1-0ubuntu1) 8.1.1
Copyright (C) 2018 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from python...done.
...
...
...
Core was generated by `python train.py --device_target=GPU --data_path=/home/workspace/mindspore_datas'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f51911d5006 in ?? ()
[Current thread is 1 (LWP 108500)]
(gdb) bt
#0 0x00007f51911d5006 in ?? ()
#1 0x00007f51ae9a1dd0 in ?? ()
#2 0x00007f51ae9a1c18 in ?? ()
#3 0x000055a3ff6b25c9 in ?? ()
#4 0x000055a3ff6b25c9 in ?? ()
#5 0x000055a3ff6b25c9 in ?? ()
#6 0x000055a3ff6b25c8 in ?? ()
#7 0x000055a3ff6b25d9 in ?? ()
#8 0x000055a3ff6b27c8 in ?? ()
#9 0x00007f51ae9ad3c0 in ?? ()
#10 0x00007f5100000010 in ?? ()
#11 0x000055a3ff6b25c8 in ?? ()
#12 0x00007f51ae9a1df8 in ?? ()
#13 0x0000000000000006 in ?? ()
#14 0x0000000000000000 in ?? ()
(gdb)
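
Every frame in the backtrace above is unresolved (`?? ()`), so it does not show which Python call was active when the signal arrived. One way to get that information is Python's standard `faulthandler` module, which dumps the Python-level traceback of each thread when the process receives SIGSEGV. The sketch below is a suggestion, not part of the original test case; the helper name `enable_crash_traceback` and the idea of calling it at the top of train.py are assumptions.

```python
import faulthandler


def enable_crash_traceback(log_path):
    """Dump Python tracebacks for all threads to `log_path` on a fatal signal.

    Returns the open file object. The caller must keep it open for the
    lifetime of the process, because faulthandler writes to its file
    descriptor at crash time.
    """
    log_file = open(log_path, "w")
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file


# Example call near the top of the training script (the path is illustrative):
#     crash_log = enable_crash_traceback("segfault_traceback.log")
```

The same handler can be switched on without editing the script by running `python -X faulthandler train.py ...` or by setting the `PYTHONFAULTHANDLER` environment variable, in which case the traceback goes to stderr.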

Special notes for this issue (Optional)

Please route this to 周培晨.

Comments (2)

wenli created the Bug-Report
wenli added the kind/bug label
wenli added the sig/runtime label
wenli added the v2.3.0 label
wenli added the attr/function label
wenli added the stage/func-debug label
wenli added ZPaC as a collaborator

Please assign maintainer to check this issue.
@wenli

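Thank you for your feedback. You can comment //mindspore-assistant to get help faster. For more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you hit a dynamic graph problem, you can set mindspore.set_context(pynative_synchronize=True) to see the error stack and help locate the problem.
  4. For model accuracy tuning problems, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community. We will review it as soon as possible.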
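wenli edited the description
mudongrui added mudongrui as a collaborator
mudongrui changed the assignee from mudongrui to ZPaC
mudongrui removed ZPaC as a collaborator
wenli added the v2.3.0.alpha label
wenli removed the v2.3.0 label
wenli removed the v2.3.0 label
wenli added the v2.3.0 label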
