MindSpore / mindspore

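[ST][MS][PYNATIVE][dyn] In PyNative mode on Ascend, training a custom loop control-flow network and the bert-base network in dynamic-shape scenarios produces a large number of DRV out-of-memory errors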

Status: TODO
Type: Bug-Report
Created: 2022-12-13 17:29

Describe the current behavior (Mandatory)

[ST][MS][PYNATIVE] In PyNative mode on Ascend, training the bert-base network in a dynamic-shape scenario produces a large number of DRV out-of-memory errors.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): 2.0.0-alpha x86_64 commit_id = '[sha1]:375750d4,[branch]:(HEAD,origin/r2.0.0-alpha,r2.0.0-alpha)'
    -- Python version (e.g., Python 3.7.5): Python 3.7.5
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): x86_64 GNU/Linux
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

test_ms_dynamic_shape_nc_dy_bert_base_cn_news_0001

Steps to reproduce the issue (Mandatory)

  1. Prepare the cn-news-128 dataset, download the models repository code to the Ascend environment, and place it in the same path as solution_test
  2. cd /home/jenkins0/workspace/TDT_deployment/solution_test/cases/03subject_test/02usability/model_develop/dynamic_shape/
  3. export TRAIN_MODE=PYNATIVE_MODE
  4. pytest -s test_ms_dynamic_shape_nc_dy_bert_base_cn_news_0001.py
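
Steps 2 to 4 can also be driven from a small script. The following is a minimal sketch, not part of the test suite: the directory and test file name are taken from the steps above, while the helper names `build_repro_command` and `run_repro` are hypothetical and `pytest` is assumed to be on `PATH`.

```python
import os
import subprocess

# Directory and test file as given in steps 2 and 4.
CASE_DIR = (
    "/home/jenkins0/workspace/TDT_deployment/solution_test/cases/"
    "03subject_test/02usability/model_develop/dynamic_shape/"
)
TEST_FILE = "test_ms_dynamic_shape_nc_dy_bert_base_cn_news_0001.py"


def build_repro_command(train_mode="PYNATIVE_MODE"):
    """Return (argv, env) equivalent to steps 3 and 4 (hypothetical helper)."""
    env = dict(os.environ)          # copy, so the caller's environment is untouched
    env["TRAIN_MODE"] = train_mode  # step 3: export TRAIN_MODE=PYNATIVE_MODE
    return ["pytest", "-s", TEST_FILE], env


def run_repro():
    """Run the test case from CASE_DIR (step 2) and return pytest's exit code."""
    argv, env = build_repro_command()
    return subprocess.run(argv, cwd=CASE_DIR, env=env).returncode
```

Calling `run_repro()` on the Ascend machine should behave like the manual steps; a non-zero return value means the case did not pass.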

Describe the expected behavior (Mandatory)

The test case passes.

Related log / screenshot (Mandatory)

[Screenshot]

Special notes for this issue (Optional)

Comments (5)

Mr.Zhou created the Bug-Report
Mr.Zhou added labels: kind/bug, v2.0.0.rc1, sig/pynative, attr/function, usability
Mr.Zhou added collaborator chujinjin
Mr.Zhou added collaborator leiwei2

Please assign a maintainer to check this issue.
@Mr.Zhou

Please add labels (comp or sig); you can also visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get the code reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to your Pull Request. Labeled PRs are pushed directly to the responsible reviewers.
Taking a component-related commit as an example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
In addition, you can mark the type of the PR, for example a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Now add your labels in the comments below!

Mr.Zhou edited the description
Mr.Zhou edited the description
Mr.Zhou added label: v2.0.0.alpha
Mr.Zhou removed label: v2.0.0.alpha
Mr.Zhou edited the title

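In the latest code, Ascend dynamic shape has been switched to ACL. The HiSilicon operators currently have problems with the MinimumGrad and MaximumGrad operators.
[Screenshot]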

caifubi changed the task status from TODO to ACCEPTED
caifubi changed the task status from ACCEPTED to WIP
caifubi added label: rct/cann

With the C29 B0207 run package in PyNative mode, running resnet50 on the cifar10 dataset with dynamic batch size also produces similar out-of-memory errors:
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.665 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974726)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.826 [ascend][curpid: 34091, 34091][drv][devmm][devmm_alloc_from_base_heap 167]<errno:12, 6> Alloc physical memory from base heap error. (ret_ptr=0x1; va=0x124900000000; alloc_size=812515328; alloc_size=812515328)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.839 [ascend][curpid: 34091, 34091][drv][devmm][devmm_rbtree_get_alloced_node 55]<errno:12, 6> Get node failed with key. (key=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.847 [ascend][curpid: 34091, 34091][drv][devmm][devmm_get_and_erase_alloced_mem_node 1062]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.874 [ascend][curpid: 34091, 34091][drv][devmm][devmm_free_mem 1148]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.146.544 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974722)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.733 [ascend][curpid: 34091, 34091][drv][devmm][devmm_alloc_from_base_heap 167]<errno:12, 6> Alloc physical memory from base heap error. (ret_ptr=0x1; va=0x124900000000; alloc_size=812515328; alloc_size=812515328)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.752 [ascend][curpid: 34091, 34091][drv][devmm][devmm_rbtree_get_alloced_node 55]<errno:12, 6> Get node failed with key. (key=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.759 [ascend][curpid: 34091, 34091][drv][devmm][devmm_get_and_erase_alloced_mem_node 1062]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.765 [ascend][curpid: 34091, 34091][drv][devmm][devmm_free_mem 1148]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.800 [logger.cc:419]34091 DevMalloc:[FINAL][FINAL]Device malloc failed, size=812515328, type=2.
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.841 [api_c.cc:1003]34091 rtMalloc:[FINAL][FINAL]ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.849 [error_message_manage.cc:49]34091 FuncErrorReason:[FINAL][FINAL]report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.861 [error_message_manage.cc:49]34091 FuncErrorReason:[FINAL][FINAL]rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] GE(34091,python):2023-02-22-16:03:51.171.966 [device_allocator.cc:59]34091 Alloc: ErrorNo: 1343225860(Internal errors) [FINAL][FINAL][Malloc][Memory] failed, rt_ret:207001, device_id:0, size:812515328
[ERROR] GE(34091,python):2023-02-22-16:03:51.171.985 [scalable_allocator.cc:133]34091 FetchNewSpan: ErrorNo: 1343225856(Failed to allocate memory!) [FINAL][FINAL][allocator_1] Failed to apply for memory. We will try to free memory from memory pool, the above error log can be ignored. Try to free cached memory...
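
The log above repeats one failed request across the DRV, RUNTIME and GE layers. As a rough aid for triaging similar logs, the sketch below counts the requested sizes per reporting module. It is only an illustration: the helper names `summarize_failed_allocations` and `to_mib` are hypothetical, the regular expressions are assumptions fitted to the lines shown here, and the two sample lines are copied from the log above.

```python
import re
from collections import Counter

# Two lines copied verbatim from the log above.
SAMPLE_LOG = """\
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.665 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974726)
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.800 [logger.cc:419]34091 DevMalloc:[FINAL][FINAL]Device malloc failed, size=812515328, type=2.
"""

# Assumed patterns: "alloc_size=N" in DRV lines, "size=N" or "size:N" elsewhere.
SIZE_RE = re.compile(r"\b(?:alloc_)?size[=:](\d+)")
# Module name right after the "[ERROR] " prefix, e.g. DRV, RUNTIME, GE.
MODULE_RE = re.compile(r"^\[ERROR\] (\w+)\(")


def summarize_failed_allocations(log_text):
    """Count (module, requested_bytes) pairs found in ERROR lines."""
    counts = Counter()
    for line in log_text.splitlines():
        module = MODULE_RE.match(line)
        if not module:
            continue  # skip lines that are not "[ERROR] MODULE(...)" records
        for size in SIZE_RE.findall(line):
            counts[(module.group(1), int(size))] += 1
    return counts


def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / 2**20


summary = summarize_failed_allocations(SAMPLE_LOG)
for (module, size), hits in sorted(summary.items()):
    print(f"{module}: {size} bytes ({to_mib(size)} MiB) x{hits}")
```

Every failed request in this log asks for 812515328 bytes, which is 774.875 MiB.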

田桐 edited the title
fangwenyi removed label: v2.0.0.rc1
fangwenyi added labels: v2.0.0, v2.1.0, v2.2.0
linzhengshu added label: v2.2.10
chujinjin changed the task status from WIP to ACCEPTED
zhunaipan added label: v2.2.12
chujinjin changed the task status from ACCEPTED to TODO
