MindSpore / mindspore

[ST][MS][r2.1][vgg16][graph][x86-910 8p] FPS: 4560 < 4600, network performance is degraded on X86-910

TODO
Bug-Report
Created 2023-07-26 11:28

Describe the current behavior (Mandatory)

[vgg16][graph][x86-910 8p] FPS: 4560 < 4600, network performance is degraded on X86-910
Model repository: https://gitee.com/mindspore/models/tree/master/official/cv/VGG/vgg16
Network performance is degraded on X86-Ascend910 but meets the target on ARM-Ascend910.
vgg16:
X86-Ascend910: 4560 < 4600
ARM-Ascend910: 4868 > 4600

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend x86/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

run package: HiAI/HISI_C30/20230720/
mindspore: r2.1_20230725161523_15377429

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository: solution_test/case/02network/00cv/vgg16/train
Test case: test_ms_vgg16_cifar10_train_infer_910_8p_0001

Steps to reproduce the issue (Mandatory)

1. Get the code from models
2. cd models/official/cv/VGG/vgg16
3. Usage: bash run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH] [cifar10|imagenet2012]
4. Verify that the network trains successfully

Describe the expected behavior (Mandatory)

The network trains successfully and performance reaches 4600 FPS.

Related log / screenshot (Mandatory)

Train epoch time: 88755.073 ms, per step time: 915.001 ms
Train epoch time: 10426.437 ms, per step time: 107.489 ms
Train epoch time: 10416.589 ms, per step time: 107.388 ms
Train epoch time: 10422.537 ms, per step time: 107.449 ms
Train epoch time: 12669.396 ms, per step time: 130.612 ms
Train epoch time: 10413.061 ms, per step time: 107.351 ms
Train epoch time: 10419.519 ms, per step time: 107.418 ms
Train epoch time: 10429.868 ms, per step time: 107.524 ms
Train epoch time: 10420.229 ms, per step time: 107.425 ms
Train epoch time: 12760.309 ms, per step time: 131.550 ms
Train epoch time: 10408.571 ms, per step time: 107.305 ms
Train epoch time: 10438.841 ms, per step time: 107.617 ms
Train epoch time: 10424.552 ms, per step time: 107.470 ms
Train epoch time: 10422.712 ms, per step time: 107.451 ms
Train epoch time: 12748.803 ms, per step time: 131.431 ms
Train epoch time: 10416.548 ms, per step time: 107.387 ms
Train epoch time: 10420.336 ms, per step time: 107.426 ms
Train epoch time: 10415.631 ms, per step time: 107.378 ms
Train epoch time: 10424.064 ms, per step time: 107.465 ms
Train epoch time: 12739.141 ms, per step time: 131.331 ms
Train epoch time: 10411.635 ms, per step time: 107.336 ms
Train epoch time: 10429.574 ms, per step time: 107.521 ms
Train epoch time: 10410.773 ms, per step time: 107.328 ms
Train epoch time: 10423.095 ms, per step time: 107.455 ms
Train epoch time: 12755.915 ms, per step time: 131.504 ms
Train epoch time: 10414.946 ms, per step time: 107.371 ms
Train epoch time: 10414.219 ms, per step time: 107.363 ms
Train epoch time: 10410.813 ms, per step time: 107.328 ms
Train epoch time: 10409.670 ms, per step time: 107.316 ms
Train epoch time: 12731.053 ms, per step time: 131.248 ms
Train epoch time: 10416.225 ms, per step time: 107.384 ms
Train epoch time: 10419.394 ms, per step time: 107.416 ms
Train epoch time: 10421.844 ms, per step time: 107.442 ms
Train epoch time: 10423.710 ms, per step time: 107.461 ms
Train epoch time: 12872.563 ms, per step time: 132.707 ms
Train epoch time: 10424.155 ms, per step time: 107.466 ms
Train epoch time: 10421.570 ms, per step time: 107.439 ms
Train epoch time: 10437.652 ms, per step time: 107.605 ms
Train epoch time: 10407.323 ms, per step time: 107.292 ms
Train epoch time: 12741.421 ms, per step time: 131.355 ms
Train epoch time: 10407.239 ms, per step time: 107.291 ms
Train epoch time: 10418.948 ms, per step time: 107.412 ms
Train epoch time: 10404.158 ms, per step time: 107.259 ms
Train epoch time: 10413.049 ms, per step time: 107.351 ms
Train epoch time: 12725.214 ms, per step time: 131.188 ms
Train epoch time: 10396.183 ms, per step time: 107.177 ms
Train epoch time: 10411.662 ms, per step time: 107.337 ms
Train epoch time: 10416.259 ms, per step time: 107.384 ms
Train epoch time: 10418.939 ms, per step time: 107.412 ms
Train epoch time: 12768.704 ms, per step time: 131.636 ms
Train epoch time: 10415.551 ms, per step time: 107.377 ms
Train epoch time: 10399.587 ms, per step time: 107.212 ms
Train epoch time: 10426.311 ms, per step time: 107.488 ms
Train epoch time: 10414.772 ms, per step time: 107.369 ms
Train epoch time: 12765.006 ms, per step time: 131.598 ms
Train epoch time: 10415.831 ms, per step time: 107.380 ms
Train epoch time: 10401.095 ms, per step time: 107.228 ms
Train epoch time: 10418.307 ms, per step time: 107.405 ms
Train epoch time: 10411.587 ms, per step time: 107.336 ms
Train epoch time: 12744.037 ms, per step time: 131.382 ms
Train epoch time: 10395.420 ms, per step time: 107.169 ms
Train epoch time: 10407.376 ms, per step time: 107.293 ms
Train epoch time: 10422.577 ms, per step time: 107.449 ms
Train epoch time: 10420.676 ms, per step time: 107.430 ms
Train epoch time: 12731.580 ms, per step time: 131.253 ms
Train epoch time: 10427.068 ms, per step time: 107.496 ms
Train epoch time: 10435.553 ms, per step time: 107.583 ms
Train epoch time: 10423.365 ms, per step time: 107.457 ms
Train epoch time: 10410.719 ms, per step time: 107.327 ms
Train epoch time: 12686.112 ms, per step time: 130.785 ms
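
The log above can be summarized with a short script. The following is a minimal sketch and not part of the original report: the helper names (`parse_step_times`, `summarize`), the one-epoch warm-up skip, and the 10% spike threshold are assumptions chosen for illustration. It extracts the per-step times and flags epochs that are noticeably slower than the median, such as the roughly 131 ms epochs that recur in the log.

```python
import re
from statistics import mean, median

# Matches lines of the form shown in the log above.
LINE_RE = re.compile(r"Train epoch time: ([\d.]+) ms, per step time: ([\d.]+) ms")


def parse_step_times(log_text):
    """Return the per-step times (ms) found in the log, in order."""
    return [float(m.group(2)) for m in LINE_RE.finditer(log_text)]


def summarize(step_times, warmup=1, spike_ratio=1.1):
    """Summarize step times after skipping the first `warmup` epochs.

    `warmup` and `spike_ratio` are illustrative assumptions, not values
    taken from the issue.
    """
    steady = step_times[warmup:]
    base = median(steady)
    # Report spike positions as 0-based indices into the original list.
    spikes = [i + warmup for i, t in enumerate(steady) if t > base * spike_ratio]
    return {"median_ms": base, "mean_ms": mean(steady), "spike_epochs": spikes}


# First six lines of the log in this issue.
sample = """\
Train epoch time: 88755.073 ms, per step time: 915.001 ms
Train epoch time: 10426.437 ms, per step time: 107.489 ms
Train epoch time: 10416.589 ms, per step time: 107.388 ms
Train epoch time: 10422.537 ms, per step time: 107.449 ms
Train epoch time: 12669.396 ms, per step time: 130.612 ms
Train epoch time: 10413.061 ms, per step time: 107.351 ms
"""

summary = summarize(parse_step_times(sample))
print(summary)
```

Comparing the median with the mean shows how much the periodic slow epochs pull the average step time up.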

Special notes for this issue (Optional)

Assign to 代宇鑫

Comments (7)

sunjiawei999 created the Bug-Report
sunjiawei999 added the kind/bug label
sunjiawei999 added the v2.2.0 label
sunjiawei999 added the attr/function label
sunjiawei999 added the stage/func-debug label
sunjiawei999 added the sig/modelzoo label

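Please assign maintainer to check this issue.
@sunjiawei999

Thank you for your feedback. You can comment //mindspore-assistant to get help faster; for more labels, see the label list.

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
    Typical differences from PyTorch / PyTorch and MindSpore API mapping table
  3. If you run into a dynamic graph (PyNative) problem, you can set mindspore.set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  4. For model accuracy tuning problems, see the tuning guide on the official website.
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type used (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error.
  6. If you have already identified the root cause, you are welcome to submit a PR and take part in the MindSpore open-source community; we will review it as soon as possible.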
sunjiawei999 edited the description
sunjiawei999 edited the description
sunjiawei999 removed the v2.2.0 label
sunjiawei999 added the v2.1.0 label
sunjiawei999 edited the title
sunjiawei999 edited the title
sunjiawei999 changed the linked branch from master to r2.1
fangwenyi changed the assignee from zhongjicheng to 代宇鑫
fangwenyi added zhongjicheng as a collaborator

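Operator durations differ very little, by only 1 ms. I ran it twice and the results were essentially the same. According to the profiling data, the main gap is in the tail phase, a difference of about 7 ms. We need the AllReduce owners to help analyze the environment differences.

arm
[image]
x86
[image]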

i-robot added the foruda label

Single-card performance differs very little. The per step times are:
x86: 69.127750 ms
arm: 69.805500 ms
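
To put the two single-card numbers in proportion, the relative difference can be computed directly. This is a minimal sketch using only the two values quoted above; `relative_diff_pct` is a hypothetical helper name introduced for illustration.

```python
def relative_diff_pct(a_ms, b_ms):
    """Percentage by which a_ms differs from b_ms (negative means a is faster)."""
    return (a_ms - b_ms) / b_ms * 100.0


# Single-card per step times quoted in the comment above.
x86_ms, arm_ms = 69.127750, 69.805500

diff = relative_diff_pct(x86_ms, arm_ms)
print(f"x86 vs arm single-card step time: {diff:+.2f}%")
```

A difference of under 1% on a single card fits the earlier observation that operator time is nearly the same on both platforms and that the gap appears in the multi-card tail phase.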

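2023/7/27 CCB:
Reason for deferral: performance is normal in the ARM environment, and X86 performance is 0.9% below the baseline. Most external deployments currently use ARM, the degradation is small, and the impact is controllable. By CCB decision, the issue is deferred.
Impact: vgg16 graph mode performance is 0.9% lower in the X86 environment.
Workaround: if users have questions, reply through the community with the performance optimization plan, making clear that performance is currently degraded on X86 but normal on ARM.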
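
The 0.9% figure is consistent with the FPS numbers in the issue description, assuming the baseline referred to here is the 4600 FPS threshold (the source does not say so explicitly). A quick check, with `shortfall_pct` as a hypothetical helper name:

```python
def shortfall_pct(measured_fps, baseline_fps):
    """Percentage by which measured_fps falls below baseline_fps.

    A negative result means the measurement exceeds the baseline.
    """
    return (baseline_fps - measured_fps) / baseline_fps * 100.0


BASELINE_FPS = 4600  # assumed to be the baseline the CCB note refers to

x86_shortfall = shortfall_pct(4560, BASELINE_FPS)
arm_shortfall = shortfall_pct(4868, BASELINE_FPS)
print(f"x86: {x86_shortfall:.1f}%  arm: {arm_shortfall:.1f}%")
```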

fangwenyi added the ccb/bug label
代宇鑫 added the v2.2.0 label

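2023.08.04 CCB:
For now, stop monitoring this on X86 and monitor it in the ARM environment instead.
Additional experiment: run the single-card test case with 8 processes to see whether the performance jitter across cards is consistent between ARM and X86.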

代宇鑫 changed the milestone from B-SIG-ModelZoo to B-SIG-Parallel
代宇鑫 added 代宇鑫 as a collaborator
代宇鑫 changed the assignee from 代宇鑫 to wangshengnan123
yao_yf removed the ccb/bug label
yao_yf added the 待CCB label
wangshengnan123 added the rct/cann label
linzhengshu added the v2.2.10 label
wuweikang added the ccb/bug label
wuweikang removed the 待CCB label
tanghuikang changed the priority from Major to Minor
zhunaipan added the v2.2.12 label
zhunaipan added the v2.2.13 label
zhunaipan added the v2.2.14 label
yuchaojie removed the rct/cann label
zhunaipan added the r2.2 label
