name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
[ST][MS][PYNATIVE] In PyNative mode on Ascend, training the bert-base network in a dynamic-shape scenario produces a large number of DRV out-of-memory errors
Hardware Environment (Ascend/GPU/CPU):
Please delete the backend not involved:
/device ascend
Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :2.0.0-alpha x86_64 commit_id = '[sha1]:375750d4,[branch]:(HEAD,origin/r2.0.0-alpha,r2.0.0-alpha)'
-- Python version (e.g., Python 3.7.5) : Python 3.7.5
-- OS platform and distribution (e.g., Linux Ubuntu 16.04): x86_64 GNU/Linux
-- GCC/Compiler version (if compiled from source):
Execute Mode (Mandatory) (PyNative/Graph):
Please delete the mode not involved:
/mode pynative
test_ms_dynamic_shape_nc_dy_bert_base_cn_news_0001
case pass
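In the latest code, Ascend dynamic shape has been switched to ACL, and the HiSilicon operators currently have problems with the MinimumGrad and MaximumGrad operators.
With the C29 B0207 run package in PyNative mode, running resnet50 on the cifar10 dataset with a dynamic batch size
also produces similar out-of-memory errors: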
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.665 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974726)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.826 [ascend][curpid: 34091, 34091][drv][devmm][devmm_alloc_from_base_heap 167]<errno:12, 6> Alloc physical memory from base heap error. (ret_ptr=0x1; va=0x124900000000; alloc_size=812515328; alloc_size=812515328)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.839 [ascend][curpid: 34091, 34091][drv][devmm][devmm_rbtree_get_alloced_node 55]<errno:12, 6> Get node failed with key. (key=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.847 [ascend][curpid: 34091, 34091][drv][devmm][devmm_get_and_erase_alloced_mem_node 1062]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.874 [ascend][curpid: 34091, 34091][drv][devmm][devmm_free_mem 1148]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.146.544 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974722)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.733 [ascend][curpid: 34091, 34091][drv][devmm][devmm_alloc_from_base_heap 167]<errno:12, 6> Alloc physical memory from base heap error. (ret_ptr=0x1; va=0x124900000000; alloc_size=812515328; alloc_size=812515328)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.752 [ascend][curpid: 34091, 34091][drv][devmm][devmm_rbtree_get_alloced_node 55]<errno:12, 6> Get node failed with key. (key=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.759 [ascend][curpid: 34091, 34091][drv][devmm][devmm_get_and_erase_alloced_mem_node 1062]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] DRV(34091,python):2023-02-22-16:03:51.171.765 [ascend][curpid: 34091, 34091][drv][devmm][devmm_free_mem 1148]<errno:12, 6> Virtual address is not alloced, please check. (va=0x124900000000)
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.800 [logger.cc:419]34091 DevMalloc:[FINAL][FINAL]Device malloc failed, size=812515328, type=2.
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.841 [api_c.cc:1003]34091 rtMalloc:[FINAL][FINAL]ErrCode=207001, desc=[driver error:out of memory], InnerCode=0x7020016
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.849 [error_message_manage.cc:49]34091 FuncErrorReason:[FINAL][FINAL]report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.861 [error_message_manage.cc:49]34091 FuncErrorReason:[FINAL][FINAL]rtMalloc execute failed, reason=[driver error:out of memory]
[ERROR] GE(34091,python):2023-02-22-16:03:51.171.966 [device_allocator.cc:59]34091 Alloc: ErrorNo: 1343225860(Internal errors) [FINAL][FINAL][Malloc][Memory] failed, rt_ret:207001, device_id:0, size:812515328
[ERROR] GE(34091,python):2023-02-22-16:03:51.171.985 [scalable_allocator.cc:133]34091 FetchNewSpan: ErrorNo: 1343225856(Failed to allocate memory!) [FINAL][FINAL][allocator_1] Failed to apply for memory. We will try to free memory from memory pool, the above error log can be ignored. Try to free cached memory...
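For triage, it can help to condense a log like the one above into a count of errors per module and the distinct allocation sizes requested. The following is a minimal sketch in Python, not part of MindSpore or the Ascend toolkit. The helper names (`summarize`, `to_mib`) and the regular expressions are illustrative assumptions based only on the line format visible in this excerpt, so they may need adjusting for other log versions.

```python
import re
from collections import Counter

# Assumed line prefix, as seen in the excerpt: "[ERROR] DRV(34091,python):..."
MODULE_RE = re.compile(r"^\[ERROR\]\s+(\w+)\(\d+,\w+\):")
# Assumed size fields, as seen in the excerpt: alloc_size=N, size=N, size:N
SIZE_RE = re.compile(r"\b(?:alloc_size|size)[=:](\d+)")


def summarize(lines):
    """Return (error count per module, set of requested sizes in bytes)."""
    modules = Counter()
    sizes = set()
    for line in lines:
        match = MODULE_RE.match(line)
        if match is None:
            continue  # skip anything that is not an [ERROR] log line
        modules[match.group(1)] += 1
        sizes.update(int(value) for value in SIZE_RE.findall(line))
    return modules, sizes


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20


# Three lines copied verbatim from the log in this report.
SAMPLE_LINES = [
    "[ERROR] DRV(34091,python):2023-02-22-16:03:50.179.665 [ascend][curpid: 34091, 34091][drv][devmm][devmm_virt_set_alloced_mem_struct 134]<errno:12, 6> Alloc ptr error. (ret_ptr=0x1; alloc_ptr=0x124900000000; alloc_size=812515328; advise=754974726)",
    "[ERROR] RUNTIME(34091,python):2023-02-22-16:03:51.171.800 [logger.cc:419]34091 DevMalloc:[FINAL][FINAL]Device malloc failed, size=812515328, type=2.",
    "[ERROR] GE(34091,python):2023-02-22-16:03:51.171.966 [device_allocator.cc:59]34091 Alloc: ErrorNo: 1343225860(Internal errors) [FINAL][FINAL][Malloc][Memory] failed, rt_ret:207001, device_id:0, size:812515328",
]

modules, sizes = summarize(SAMPLE_LINES)
for name, count in sorted(modules.items()):
    print(f"{name}: {count} error line(s)")
for size in sorted(sizes):
    print(f"requested {size} bytes = {to_mib(size):.3f} MiB")
```

On the sample lines, every failing request is for the same 812515328 bytes (774.875 MiB), which matches the repeated `alloc_size` in the DRV entries and the `size` reported by RUNTIME and GE.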