[CT][MS][ascend910B] MatMul operator network inference through the GE flow fails when operators are fused into NZ format

Status: DONE
Type: Bug-Report
Created: 2024-04-28 11:27

name: Bug Report
about: Use this template for reporting a bug
labels: kind/bug

Describe the current behavior (Mandatory)

On the 910B backend, inference of a MatMul operator network through the GE flow fails when the operators are fused into NZ format.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

../prototxt/akg/test_mindir_akg_general_ascend_910_matmul_cpp_func_nz_ge_001.prototxt

Steps to reproduce the issue (Mandatory)

1. Install akg and the custom operators, and enable MS_DEV_DUMP_GRAPH_KERNEL_IR so that the fused subgraphs can be inspected:
    export MS_DEV_DUMP_GRAPH_KERNEL_IR=on
    pip uninstall akg -y
    pip install akg/akg*.whl
    bash custom_kernels/ascend/tbe_and_aicpu/install.sh
2. Set the GE backend in the context.
3. Before building the model, load the config file akg_matmul_nz_ge.config:

[acl_init_options]
ge.exec.formatMode=0
[ascend_context]
provider=ge

[graph_kernel_param]
opt_level=2

4. Run inference on the model, compare accuracy, and check whether the operators were fused.
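
The config file loaded in step 3 can be sanity-checked before the model is built. The following is a minimal sketch, not part of the test framework: it assumes the three sections shown above are required and that `provider` is the expected key under `[ascend_context]`. `EXPECTED_KEYS` and `find_config_problems` are hypothetical names introduced only for this illustration.

```python
import configparser

# Hypothetical expectations for this test case (an assumption, not an official schema).
EXPECTED_KEYS = {
    "acl_init_options": {"ge.exec.formatMode"},
    "ascend_context": {"provider"},
    "graph_kernel_param": {"opt_level"},
}


def find_config_problems(text, expected=EXPECTED_KEYS):
    """Return a sorted list of human-readable problems found in the config text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep key case (e.g. ge.exec.formatMode) as written
    parser.read_string(text)

    problems = []
    for section, keys in expected.items():
        if not parser.has_section(section):
            problems.append(f"missing section [{section}]")
            continue
        present = set(parser.options(section))
        for key in keys - present:
            problems.append(f"[{section}] missing key '{key}'")
        for key in present - keys:
            problems.append(f"[{section}] unexpected key '{key}'")
    return sorted(problems)
```

Calling `find_config_problems(open("akg_matmul_nz_ge.config").read())` returns an empty list when every expected key is present, so a misspelled key shows up as one "missing" and one "unexpected" entry.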

Describe the expected behavior (Mandatory)

The operators are fused, model inference succeeds, and accuracy is reasonable.
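
The issue does not state what counts as reasonable accuracy. As an illustration only, an element-wise tolerance check between the model output and a reference result could look like the sketch below. The function name `outputs_match` and the default tolerances are assumptions, not values taken from the test framework.

```python
def outputs_match(actual, expected, rtol=1e-3, atol=1e-3):
    """Element-wise check that |actual - expected| <= atol + rtol * |expected|.

    Both arguments are flat sequences of floats of the same length.
    The tolerances are hypothetical defaults; float16 results may need looser ones.
    """
    if len(actual) != len(expected):
        raise ValueError("outputs have different lengths")
    return all(
        abs(a - e) <= atol + rtol * abs(e)
        for a, e in zip(actual, expected)
    )
```

A reference result would typically come from running the same inputs without operator fusion or on another backend.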

Related log / screenshot (Mandatory)

INFO: Step [cpp_predict], cmd :  cd /data/local/tmp/; source env.sh; ./test_basic_predict ms_predict /data/local/tmp/ /data/local/tmp/test_mindir_akg_general_ascend_910_matmul_cpp_func_nz_ge_001.prototxt > tmp.log 2>&1 && echo Success || echo Failed; cat tmp.log

==================== WARNING: Skipping akg as it is not installed.
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.1.1 -> 24.0
[notice] To update, run: pip install --upgrade pip


INFO: Step [cpp_predict], res:  Looking in indexes: https://mirrors.tools.huawei.com/pypi/simple
Processing ./akg/akg-2.2-cp37-cp37m-linux_aarch64.whl
Requirement already satisfied: scipy>=1.5.2 in /home/ci/miniconda3/envs/torch_1.12_ci3.7/lib/python3.7/site-packages (from akg==2.2) (1.7.0)
Requirement already satisfied: numpy>=1.17.0 in /home/ci/miniconda3/envs/torch_1.12_ci3.7/lib/python3.7/site-packages (from akg==2.2) (1.21.2)
Requirement already satisfied: decorator>=4.4.0 in /home/ci/miniconda3/envs/torch_1.12_ci3.7/lib/python3.7/site-packages (from akg==2.2) (5.1.1)
Installing collected packages: akg
Successfully installed akg-2.2
[runtime] [2024-04-28 09:30:02] [INFO] install package to /usr/local/Ascend/latest/opp/vendors
[runtime] [2024-04-28 09:30:02] [INFO] [ops_custom] process the framework
[runtime] [2024-04-28 09:30:02] [INFO] create /usr/local/Ascend/latest/opp/vendors/mslite_tbe_and_aicpu/framework.
[runtime] [2024-04-28 09:30:02] [INFO] copy new ops framework files ......
[runtime] [2024-04-28 09:30:02] [INFO] [ops_custom] process the op proto
[runtime] [2024-04-28 09:30:02] [INFO] create /usr/local/Ascend/latest/opp/vendors/mslite_tbe_and_aicpu/op_proto.
[runtime] [2024-04-28 09:30:02] [INFO] copy new ops op_proto files ......
[runtime] [2024-04-28 09:30:02] [INFO] [ops_custom] process the op impl
[runtime] [2024-04-28 09:30:02] [INFO] create /usr/local/Ascend/latest/opp/vendors/mslite_tbe_and_aicpu/op_impl.
[runtime] [2024-04-28 09:30:02] [INFO] copy new ops op_impl files ......
[runtime] [2024-04-28 09:30:02] [INFO] no need to upgrade custom.proto files
[runtime] [2024-04-28 09:30:02] [INFO] SUCCESS
Failed
[ptest]Running ms_predictLoadPrototxtConfig 
load prototxt file success
Load context info: cpu_bind_mode = 0
SetDeviceID: 0
SetProvider: ge
Config file path is: /data/local/tmp/data/akg_matmul_nz_ge.config
LoadConfig StatusCode:0
Model file path is: /data/local/tmp/data/akg_matmul.mindir
x:resize = false
y0:resize = false
y1:resize = false
[common.cpp] Loading data from: /data/local/tmp//data/akg_matmul_0.bin
[common.cpp]Read Binary Data Over, get tensorSize as: 16384
inputs[i].DataSize() VS size: 16384:16384
Loading data from /data/local/tmp//data/akg_matmul_0.bin to model tensor: x
x:transpose = none
x ---- 0.828125 0.600098 0.157349 0.816406 0.752441 0.419189 0.487793 0.68457 0.876953 0.612793 
shape(16,512,)
[common.cpp] Loading data from: /data/local/tmp//data/akg_matmul_1.bin
[common.cpp]Read Binary Data Over, get tensorSize as: 65536
inputs[i].DataSize() VS size: 65536:65536
Loading data from /data/local/tmp//data/akg_matmul_1.bin to model tensor: y0
y0:transpose = none
y0 ---- 0.106445 0.558105 0.843262 0.338623 0.550781 0.361816 0.750977 1.00098 0.727051 0.187622 
shape(512,64,)
[common.cpp] Loading data from: /data/local/tmp//data/akg_matmul_2.bin
[common.cpp]Read Binary Data Over, get tensorSize as: 65536
inputs[i].DataSize() VS size: 65536:65536
Loading data from /data/local/tmp//data/akg_matmul_2.bin to model tensor: y1
y1:transpose = none
y1 ---- 0.0227814 0.529297 0.774902 0.550293 0.0733643 0.89502 0.211914 0.0552673 0.292725 0.995605 
shape(512,64,)
[ERROR] ME(750599,ffff11452810,test_basic_predict):2024-04-28-09:30:33.943.409 [mindspore/lite/src/extendrt/delegate/ascend_ge/ge_graph_executor.cc:1544] operator()] RunAsync failed.E40021: 2024-04-28-09:30:33.937.537 Failed to compile Op [Default/GraphKernel_MatMul_split-op1Fused_x2_y1]. (oppath: [Compile /usr/local/Ascend/CANN-7.2/opp/vendors/mslite_tbe_and_aicpu/op_impl/ai_core/tbe/mslite_tbe_and_aicpu_impl/custom.py failed with errormsg/stack: File "/home/ci/miniconda3/envs/torch_1.12_ci3.7/lib/python3.7/site-packages/akg/utils/tbe_codegen_utils.py", line 97, in build_npu_for_akg
    from tbe.tvm.driver.cce_build_module import _count_time, generate_cce_code
ImportError: cannot import name 'generate_cce_code' from 'tvm.driver.cce_build_module' (/home/ci/miniconda3/envs/torch_1.12_ci3.7/lib/python3.7/site-packages/tbe/tvm/driver/cce_build_module.py), 
], optype: [Fused_x2_y1])
        Solution: See the host log for details, and then check the Python stack where the error log is reported.
        TraceBack (most recent call last):
        Compile op[Default/GraphKernel_MatMul_split-op1Fused_x2_y1] failed, oppath[/usr/local/Ascend/CANN-7.2/opp/vendors/mslite_tbe_and_aicpu/op_impl/ai_core/tbe/mslite_tbe_and_aicpu_impl/custom.py], optype[Fused_x2_y1], taskID[14]. Please check op's compilation error message.[FUNC:ReportBuildErrMessage][FILE:fusion_manager.cc][LINE:748]
        [SubGraphOpt][Compile][ProcFailedCompTask] Thread[281469532465168] recompile single op[Default/GraphKernel_MatMul_split-op1Fused_x2_y1] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:962]
        [SubGraphOpt][Compile][ProcFailedCompTask] Thread[281469532465168] recompile single op[Default/GraphKernel_MatMul_split-op0Fused_x2_y1] failed[FUNC:ProcessAllFailedCompileTasks][FILE:tbe_op_store_adapter.cc][LINE:962]
        [SubGraphOpt][Compile][ParalCompOp] Thread[281469532465168] process fail task failed[FUNC:ParallelCompileOp][FILE:tbe_op_store_adapter.cc][LINE:1010]
        [SubGraphOpt][Compile][CompOpOnly] CompileOp failed.[FUNC:CompileOpOnly][FILE:op_compiler.cc][LINE:1119]
        [GraphOpt][FusedGraph][RunCompile] Failed to compile graph with compiler Normal mode Op Compiler[FUNC:SubGraphCompile][FILE:fe_graph_optimizer.cc][LINE:1385]
        Call OptimizeFusedGraph failed, ret:-1, engine_name:AIcoreEngine, graph_name:partition1_rank1_new_sub_graph1[FUNC:OptimizeSubGraph][FILE:graph_optimize.cc][LINE:126]
        subgraph 0 optimize failed[FUNC:OptimizeSubGraphWithMultiThreads][FILE:graph_manager.cc][LINE:1021]

[ERROR] ME(750599,ffff7812e010,test_basic_predict):2024-04-28-09:30:33.943.533 [mindspore/lite/src/extendrt/delegate/ascend_ge/ge_graph_executor.cc:1814] RunGraph] Exec compute graph failed, graph id 0
[ERROR] ME(750599,ffff7812e010,test_basic_predict):2024-04-28-09:30:33.943.607 [mindspore/lite/src/extendrt/session/delegate_session.cc:262] RunGraph] GraphSinkSession::RunGraph run graph failed
[ERROR] ME(750599,ffff7812e010,test_basic_predict):2024-04-28-09:30:33.943.639 [mindspore/lite/src/extendrt/cxx_api/model/model_impl.cc:653] Predict] ModelImpl::Predict RunGraph failed with Common error code.
((predict_ret)==(kSuccess))Expectation Failed
Testcase Name: ms_predict

Special notes for this issue (Optional)

Comments (5)

fengyue25 created the Bug-Report
fengyue25 added the labels kind/bug, attr/function, sig/mslite, v2.3.0.rc2, device/ascend
fengyue25 added collaborators 邹丹音, zsj_mind, HidyLi, fengyue25

Please assign maintainer to check this issue.
@fengyue25

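Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  3. If you hit a dynamic graph problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning, refer to the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible.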

CI testcase:
MatMul operator through the GE flow with akg enabled (NZ format); inference succeeds and accuracy is reasonable
test_mindir_akg_general_ascend_910_matmul_cpp_func_nz_ge_001

wangtongyu6 added the label ccb/rfc
wangtongyu6 removed the label ccb/rfc
i-robot added the label gitee
wangtongyu6 changed the task status from TODO to VALIDATION
wangtongyu6 added collaborator wangtongyu6
wangtongyu6 changed the assignee from wangtongyu6 to fengyue25
wangtongyu6 removed collaborator fengyue25
wangtongyu6 added the labels rca/others, ctl/componenttest, rct/cann

Version: 2.3 B230
910b

CI run result: pass
(screenshot)

i-robot added the label foruda
zhang_lin66 changed the task status from VALIDATION to DONE
fangwenyi removed the label v2.3.0.rc2
fangwenyi added the label master
