[ST][MS][NET][resnet50-imagenet][910 8p][pynative] Performance differs too much between arm [353 ms/step] and x86 [428 ms/step]

ACCEPTED
Bug-Report
Created: 2023-03-15 14:36

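Describe the current behavior (Mandatory)

resnet50 model location: https://gitee.com/mindspore/models/tree/master/official/cv/ResNet

When the resnet50-imagenet network is trained in PyNative mode on a 910 environment with 8 devices (8p), training performance differs too much between arm and x86. Please locate the root cause.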

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device Ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    run package: HiAI/HISI_C29/20230301
    MindSpore version: r2.0.0.B180_master_20230309002957_8b868e8a

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative

Related testcase (Mandatory)

Test repository path: solution_test/cases/02network/00cv/resnet50/pynative
Test case:
test_ms_resnet50_imagenet_pynative_train_infer_910_8p_0001.py

Steps to reproduce the issue (Mandatory)

Run the following steps on both the arm and the x86 environment:

  1. get code from models
  2. cd models/official/cv/ResNet
  3. set pynative mode in train.py
  4. cd scripts;sh run_distribute_train.sh ./hccl_8p.json ./ImageNet2012/train ../config/resnet50_imagenet2012_config.yaml
  5. Verify that network training succeeds
  6. Compare the network training performance on arm and x86

Describe the expected behavior (Mandatory)

The resnet50-ImageNet network trains successfully on the 910 environment, and performance differs little between arm and x86.

Related log / screenshot (Mandatory)

Network   Dataset       Arch  Perf (ms/step)  Perf (FPS)  Mode
resnet50  ImageNet2012  arm   353.053689      5800        pynative
resnet50  ImageNet2012  x86   428.799385      4776        pynative
resnet50  ImageNet2012  arm   225.601191      18155       graph
resnet50  ImageNet2012  x86   326.748022      12535       graph
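
To put the gap reported in the table into numbers, the short sketch below computes how much slower x86 is than arm per step in each mode. It uses only the ms/step values from the table; the dictionary layout and the helper name `slowdown` are illustrative choices and are not part of the test suite.

```python
# Per-step times (ms/step) copied from the table in this report.
STEP_MS = {
    "pynative": {"arm": 353.053689, "x86": 428.799385},
    "graph": {"arm": 225.601191, "x86": 326.748022},
}


def slowdown(mode: str) -> float:
    """Return the x86 step time divided by the arm step time for one mode."""
    times = STEP_MS[mode]
    return times["x86"] / times["arm"]


if __name__ == "__main__":
    for mode in STEP_MS:
        ratio = slowdown(mode)
        print(f"{mode}: x86 is {(ratio - 1) * 100:.1f}% slower per step than arm")
```

By this measure x86 is about 21% slower per step in PyNative mode and about 45% slower in graph mode, so the gap in this table is not limited to PyNative.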

Special notes for this issue (Optional)

Route to Xiao Tianci (xiaotianci)

Comments (13)

zhongjicheng created the Bug-Report
zhongjicheng added the label sig/minddata
zhongjicheng added the label attr/performance
zhongjicheng added the label stage/perf-tuning
zhongjicheng added the label kind/bug
zhongjicheng added the label v2.0.0.rc1

Please assign maintainer to check this issue.
@zhongjicheng

Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
So that the code can be reviewed as soon as possible, please add a component (comp) or special interest group (sig) label to the Pull Request; a labeled PR is pushed directly to the responsible reviewer.
For example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
You can additionally mark the type of this PR, for example as a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Now add a label in the comments below!

xiangjiawei007 changed the assignee from xiangjiawei007 to xiaotianci
zhongjicheng edited the description

X86:

2023-03-16 17:13:46,777:INFO:epoch: [1/90] loss: 5.355258, epoch time: 207.561 s, per step time: 665.260 ms
epoch time: 207561.62285804749, per step time: 665.2616117245112
2023-03-16 17:15:31,480:INFO:epoch: [2/90] loss: 4.376903, epoch time: 104.702 s, per step time: 335.584 ms
epoch time: 104702.94141769409, per step time: 335.5863506977375
2023-03-16 17:17:15,987:INFO:epoch: [3/90] loss: 3.895827, epoch time: 104.507 s, per step time: 334.958 ms
epoch time: 104507.56478309631, per step time: 334.9601435355651
2023-03-16 17:19:00,174:INFO:epoch: [4/90] loss: 3.552855, epoch time: 104.186 s, per step time: 333.931 ms
epoch time: 104187.00623512268, per step time: 333.9327122920599
2023-03-16 17:21:10,958:INFO:epoch: [5/90] loss: 3.402023, epoch time: 130.783 s, per step time: 419.175 ms
epoch time: 130782.88888931274, per step time: 419.1759259272844
2023-03-16 17:22:55,375:INFO:epoch: [6/90] loss: 3.243053, epoch time: 104.417 s, per step time: 334.669 ms
epoch time: 104417.30642318726, per step time: 334.670853920472
2023-03-16 17:24:39,220:INFO:epoch: [7/90] loss: 3.097744, epoch time: 103.844 s, per step time: 332.835 ms
epoch time: 103844.88248825073, per step time: 332.83616182131647
2023-03-16 17:26:23,350:INFO:epoch: [8/90] loss: 3.023839, epoch time: 104.129 s, per step time: 333.748 ms
epoch time: 104130.01346588135, per step time: 333.7500431598761
2023-03-16 17:28:07,850:INFO:epoch: [9/90] loss: 2.981352, epoch time: 104.500 s, per step time: 334.935 ms
epoch time: 104500.08916854858, per step time: 334.9361832325275
2023-03-16 17:29:52,694:INFO:epoch: [10/90] loss: 2.861949, epoch time: 104.843 s, per step time: 336.036 ms
epoch time: 104843.44387054443, per step time: 336.0366790722578
2023-03-16 17:31:36,638:INFO:epoch: [11/90] loss: 2.832540, epoch time: 103.944 s, per step time: 333.155 ms
epoch time: 103944.82946395874, per step time: 333.1565046921755
2023-03-16 17:33:20,641:INFO:epoch: [12/90] loss: 2.981315, epoch time: 104.002 s, per step time: 333.340 ms
epoch time: 104002.53438949585, per step time: 333.34145637658924
2023-03-16 17:35:04,616:INFO:epoch: [13/90] loss: 3.020468, epoch time: 103.974 s, per step time: 333.251 ms
epoch time: 103974.94506835938, per step time: 333.2530290652544
2023-03-16 17:36:48,442:INFO:epoch: [14/90] loss: 2.818917, epoch time: 103.825 s, per step time: 332.774 ms
epoch time: 103825.89030265808, per step time: 332.77528943159643

ARM:

2023-03-16 17:21:01,044:INFO:epoch: [1/90] loss: 5.336744, epoch time: 221.602 s, per step time: 710.264 ms
epoch time: 221603.19828987122, per step time: 710.2666611854846
2023-03-16 17:22:56,450:INFO:epoch: [2/90] loss: 4.481668, epoch time: 115.405 s, per step time: 369.887 ms
epoch time: 115405.68733215332, per step time: 369.8900235004914
2023-03-16 17:24:47,316:INFO:epoch: [3/90] loss: 3.887445, epoch time: 110.865 s, per step time: 355.338 ms
epoch time: 110866.11890792847, per step time: 355.3401247048989
2023-03-16 17:26:37,867:INFO:epoch: [4/90] loss: 3.555289, epoch time: 110.550 s, per step time: 354.327 ms
epoch time: 110550.67467689514, per step time: 354.329085502869
2023-03-16 17:29:20,074:INFO:epoch: [5/90] loss: 3.385780, epoch time: 162.206 s, per step time: 519.890 ms
epoch time: 162206.50339126587, per step time: 519.8926390745701
2023-03-16 17:31:11,138:INFO:epoch: [6/90] loss: 3.249865, epoch time: 111.064 s, per step time: 355.973 ms
epoch time: 111064.30864334106, per step time: 355.97534821583673
2023-03-16 17:33:00,693:INFO:epoch: [7/90] loss: 3.122056, epoch time: 109.554 s, per step time: 351.136 ms
epoch time: 109555.08518218994, per step time: 351.1380935326601
2023-03-16 17:34:52,209:INFO:epoch: [8/90] loss: 3.040946, epoch time: 111.515 s, per step time: 357.421 ms
epoch time: 111516.05272293091, per step time: 357.42324590682983
2023-03-16 17:36:41,979:INFO:epoch: [9/90] loss: 3.053561, epoch time: 109.769 s, per step time: 351.824 ms
epoch time: 109769.82259750366, per step time: 351.8263544791784
2023-03-16 17:38:31,735:INFO:epoch: [10/90] loss: 2.841896, epoch time: 109.755 s, per step time: 351.778 ms
epoch time: 109755.4886341095, per step time: 351.7804122888125
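
For comparing logs like the two above, a small parser can help. The sketch below extracts the per-epoch figures and takes the median per-step time after the first epoch, which is roughly twice as slow in both logs. Treating that first epoch as one-off start-up cost is an assumption, and the names `parse_epochs`, `steady_state_step_ms` and `steps_per_epoch` are hypothetical helpers, not part of the ResNet scripts. The sample lines are copied from the x86 log.

```python
import re
from statistics import median

# Epochs 1 to 5 copied verbatim from the x86 log above.
SAMPLE_LOG = """\
2023-03-16 17:13:46,777:INFO:epoch: [1/90] loss: 5.355258, epoch time: 207.561 s, per step time: 665.260 ms
2023-03-16 17:15:31,480:INFO:epoch: [2/90] loss: 4.376903, epoch time: 104.702 s, per step time: 335.584 ms
2023-03-16 17:17:15,987:INFO:epoch: [3/90] loss: 3.895827, epoch time: 104.507 s, per step time: 334.958 ms
2023-03-16 17:19:00,174:INFO:epoch: [4/90] loss: 3.552855, epoch time: 104.186 s, per step time: 333.931 ms
2023-03-16 17:21:10,958:INFO:epoch: [5/90] loss: 3.402023, epoch time: 130.783 s, per step time: 419.175 ms
"""

# Matches only the timestamped INFO lines, not the raw "epoch time: ..." lines.
LINE_RE = re.compile(
    r"epoch: \[(\d+)/\d+\] loss: [\d.]+, "
    r"epoch time: ([\d.]+) s, per step time: ([\d.]+) ms"
)


def parse_epochs(log_text):
    """Return a list of (epoch, epoch_time_s, step_time_ms) tuples."""
    return [
        (int(m.group(1)), float(m.group(2)), float(m.group(3)))
        for m in LINE_RE.finditer(log_text)
    ]


def steady_state_step_ms(epochs, skip=1):
    """Median per-step time, ignoring the first `skip` epochs."""
    return median(step for epoch, _, step in epochs if epoch > skip)


def steps_per_epoch(epoch_time_s, step_time_ms):
    """Infer the step count from the epoch time and the per-step time."""
    return round(epoch_time_s * 1000 / step_time_ms)


if __name__ == "__main__":
    epochs = parse_epochs(SAMPLE_LOG)
    print("steady-state ms/step:", steady_state_step_ms(epochs))
    print("steps per epoch:", steps_per_epoch(epochs[0][1], epochs[0][2]))
```

The median is used instead of the mean so that a single slow epoch, such as epoch 5 in both logs, does not skew the comparison.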

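The performance problem appears only on the 111.101 environment. Comparing the performance of running data processing alone:
111.101:
[screenshot]
112.32:
[screenshot]
The average per-step performance shows no difference, so the problem is not caused by data processing.

On the 111.101 environment, setting the num_samples parameter of ImageFolderDataset to reduce the number of steps brings performance up to the target:
[screenshot]
This issue is therefore identified as an environment problem.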

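In PYNATIVE mode, data iteration takes under 200 ms per step and network computation takes about 300 ms per step, so the data processing time is hidden asynchronously in the gaps between iterations. The difference in overall end-to-end time is therefore not caused by data processing. Transferring to the dynamic graph (PyNative) component for further investigation.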
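
The reasoning in this comment can be illustrated with a deliberately simplified model: if the data pipeline prepares the next batch while the current one is being computed, the steady-state step time is bounded by the slower of the two stages, not by their sum. The function below is only a sketch of that idea under ideal overlap; it is not how MindSpore measures or schedules anything, and the 200 ms and 300 ms inputs are the rounded figures quoted in the comment.

```python
def step_time_ms(data_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Idealised steady-state time per training step.

    overlapped=True models a data pipeline that prefetches the next batch
    while the current one is being computed, so the slower stage dominates.
    overlapped=False models fetching and computing strictly in sequence.
    """
    if overlapped:
        return max(data_ms, compute_ms)
    return data_ms + compute_ms


if __name__ == "__main__":
    data_ms, compute_ms = 200.0, 300.0  # rounded figures from the comment
    print(step_time_ms(data_ms, compute_ms, overlapped=True))   # 300.0
    print(step_time_ms(data_ms, compute_ms, overlapped=False))  # 500.0
```

Under this model, a data stage of up to 200 ms never shows up in the step time as long as computation takes about 300 ms, which is why the investigation moves on to the compute side.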

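Building on Tianci's conclusion, I ran profiler experiments.

101 profiler
[screenshot]
32 profiler
[screenshot]

Operators execute slowly on 101; perhaps the environment is affecting something.

101, single device, graph mode
[screenshot]

101, single device, graph mode
[screenshot]

101, single device, pynative
[screenshot]

32, single device, pynative
[screenshot]

Graph mode shows no difference; most likely operators execute slowly on 101 under PyNative.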

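zhongjicheng changed the assignee from xiaotianci to chujinjin
zhongjicheng added collaborator xiaotianci
luoyang removed the label sig/minddata
luoyang removed the label sig/minddata
luoyang added the label sig/minddata
luoyang removed the label sig/minddata
luoyang added the label sig/pynative
luoyang changed the assignee from chujinjin to xiaotianci
luoyang removed collaborator xiaotianci
luoyang added collaborator chujinjin
luoyang changed the assignee from xiaotianci to chujinjin
luoyang removed collaborator chujinjin
luoyang added collaborator xiaotianci
luoyang changed the milestone from B-SIG-Data to B-SIG-PYNATIVE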

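Other CV networks on this machine show the same behavior, consistent with the profiling results above: BN and Conv operators run slower.

chujinjin changed the task status from TODO to WIP

2023/4/3 CCB conclusion: this issue is downgraded to an advisory ticket; performance will be tracked after the system is fully reinstalled.

xiangminshan changed the priority from Major to Not important
chujinjin added the label ltnr
chujinjin added the label ltnr
chujinjin changed the task status from WIP to ACCEPTED
xiaotianci removed collaborator xiaotianci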
