[MT][NET][910B][8p][vit_b_32_224] Training accuracy does not improve normally

DONE
Bug-Report
Created on 2024-04-28 15:31

Describe the current behavior (Mandatory)

[vit_b_32_224] Training accuracy does not improve normally.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx): a87635b6f65bc6225
    -- Python version (e.g., Python 3.7.5): Python 3.7.6
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04): eulerosv2r10.aarch64
    -- GCC/Compiler version (if compiled from source): gcc (GCC) 7.3.0
    run package: Milan_C17/20240414
    MindSpore package: master_20240426230938_a87635b6f65bc6225173fec53883351ebf449b66

In the previous CI run of this test case, accuracy rose normally:
run package: Milan_C17/20240414
MindSpore package: master_20240417142516_74e1f3ea86a988bceb71bd7f02b674af4c97c89a/

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

test_ms_lab_vit_b32_224_acc3_ascend_train_infer_8p_0005

Steps to reproduce the issue (Mandatory)

1. Get the code from solution_test
2. cd solution_test/cases/02network/00cv/vit_b/train/
3. pytest -s test_ms_lab_vit_b32_224_acc3_ascend_train_infer_8p_0005.py
(The final training command is: cd /home/jenkins/workspace/TDT_deployment/solution_test/cases/02network/00cv/vit_b/train/test_ms_lab_vit_b32_224_acc3_ascend_train_infer_8p_0005;bash run_distribute_train.sh /home/workspace/config/hccl_8p.json --config configs/vit/vit_b32_224_ascend.yaml --data_dir /home/workspace/mindspore_dataset/ImageNet2012 --distribute True --ckpt_path /home/workspace/mindspore_ckpt/mindcv_models_ckpt/accuracy_cropping_ckpt_ascend/vit/vit_b_32_224-166_312.ckpt --val_while_train True --resume_opt False --ckpt_save_interval 1 --val_interval 1 > train.log 2>&1 &)
4. Check whether the training accuracy is normal

Describe the expected behavior (Mandatory)

[vit_b_32_224] Training accuracy improves normally.

Related log / screenshot (Mandatory)

acc test_data_list : [1.788, 3.822, 6.178, 7.638, 9.61, 11.136, 12.892, 14.756, 17.314, 19.976, 20.296, 23.756, 26.674, 29.262, 31.196, 32.538, 34.088, 35.936, 36.102, 36.536, 38.788, 40.084, 40.056, 41.454, 42.144, 41.38, 41.296, 43.15, 43.41, 42.902, 44.878, 44.332, 45.474, 46.144, 46.604, 47.598]
acc standard_data : [
66.362, 66.632, 66.738, 67.084, 67.222, 67.036, 67.16, 67.29, 67.106, 67.324, 67.788, 67.732, 67.72, 67.826,
67.862, 67.904, 68.36, 68.526, 68.312, 68.654, 68.564, 68.91, 68.73, 69.14, 69.106, 69.176, 69.572, 69.408,
69.562, 69.474, 69.698, 69.64, 69.976, 70.286, 70.61, 70.174
]
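
To make the size of the regression concrete, the two lists above can be compared epoch by epoch. The following is only a minimal sketch and not part of the test case itself; the variable names `test_acc`, `standard_acc`, and `gaps` are illustrative, and the values are copied from the lists in this report.

```python
# Per-epoch Top-1 accuracy (%) copied from "acc test_data_list" in this report.
test_acc = [
    1.788, 3.822, 6.178, 7.638, 9.61, 11.136, 12.892, 14.756, 17.314, 19.976,
    20.296, 23.756, 26.674, 29.262, 31.196, 32.538, 34.088, 35.936, 36.102,
    36.536, 38.788, 40.084, 40.056, 41.454, 42.144, 41.38, 41.296, 43.15,
    43.41, 42.902, 44.878, 44.332, 45.474, 46.144, 46.604, 47.598,
]

# Per-epoch Top-1 accuracy (%) copied from "acc standard_data" in this report.
standard_acc = [
    66.362, 66.632, 66.738, 67.084, 67.222, 67.036, 67.16, 67.29, 67.106,
    67.324, 67.788, 67.732, 67.72, 67.826, 67.862, 67.904, 68.36, 68.526,
    68.312, 68.654, 68.564, 68.91, 68.73, 69.14, 69.106, 69.176, 69.572,
    69.408, 69.562, 69.474, 69.698, 69.64, 69.976, 70.286, 70.61, 70.174,
]

# Gap (in percentage points) between the baseline and the failing run.
gaps = [round(s - t, 3) for s, t in zip(standard_acc, test_acc)]

print(f"epochs compared: {len(gaps)}")
print(f"largest gap:  {max(gaps):.3f} points at index {gaps.index(max(gaps))}")
print(f"smallest gap: {min(gaps):.3f} points at index {gaps.index(min(gaps))}")
```

The failing run starts near zero instead of near the accuracy of the resumed checkpoint, so the gap is largest at the first validation and never closes within the 36 epochs.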

[2024-04-27 21:45:31] mindcv.utils.callbacks INFO - Epoch: [168/36], batch: [312/312], loss: 6.361636, lr: 0.000121, time: 186.347856s
[2024-04-27 21:45:39] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 3.8220%, Top_5_Accuracy: 11.5860%, time: 7.969699s
[2024-04-27 21:45:39] mindcv.utils.callbacks INFO - => New best val acc: 3.8220%
[2024-04-27 21:45:53] mindcv.utils.callbacks INFO - Saving model to ./ckpt/vit_b_32_224-168_312.ckpt
[2024-04-27 21:46:03] mindcv.utils.checkpoint_manager INFO - Top-k accuracy checkpoints:
./ckpt/vit_b_32_224-168_312.ckpt 0.03821999952197075
./ckpt/vit_b_32_224-167_312.ckpt 0.017880000174045563
[2024-04-27 21:46:03] mindcv.utils.callbacks INFO - Total time since last epoch: 217.913440(train: 186.399469, val: 7.969699)s, ETA: -28764.574142s
[2024-04-27 21:46:03] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[INFO] RUNTIME(3206661,python):2024-04-27-21:49:07.835.208 [engine.cc:1709] 3211067 ReportTimeoutProc: report timeout! streamId=2, taskId=174, execId=65535, pendingNum=2, reportCount=122, parseTaskCount=122, msec=65535, curSec=816459488
[2024-04-27 21:50:33] mindcv.utils.callbacks INFO - Epoch: [169/36], batch: [312/312], loss: 6.413715, lr: 0.000174, time: 269.775203s
[2024-04-27 21:50:40] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 6.1780%, Top_5_Accuracy: 16.7240%, time: 7.595529s
[2024-04-27 21:50:40] mindcv.utils.callbacks INFO - => New best val acc: 6.1780%
[2024-04-27 21:50:52] mindcv.utils.callbacks INFO - Saving model to ./ckpt/vit_b_32_224-169_312.ckpt
[2024-04-27 21:51:02] mindcv.utils.checkpoint_manager INFO - Top-k accuracy checkpoints:
./ckpt/vit_b_32_224-169_312.ckpt 0.06178000196814537
./ckpt/vit_b_32_224-168_312.ckpt 0.03821999952197075
./ckpt/vit_b_32_224-167_312.ckpt 0.017880000174045563
[2024-04-27 21:51:02] mindcv.utils.callbacks INFO - Total time since last epoch: 299.642142(train: 269.843112, val: 7.595529)s, ETA: -39852.404830s
[2024-04-27 21:51:02] mindcv.utils.callbacks INFO - --------------------------------------------------------------------------------
[INFO] RUNTIME(3206661,python):2024-04-27-21:54:07.607.217 [engine.cc:1709] 3211067 ReportTimeoutProc: report timeout! streamId=2, taskId=254, execId=65535, pendingNum=2, reportCount=170, parseTaskCount=170, msec=65535, curSec=816759260
[2024-04-27 21:55:59] mindcv.utils.callbacks INFO - Epoch: [170/36], batch: [312/312], loss: 6.344756, lr: 0.000227, time: 296.020940s
[2024-04-27 21:56:08] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 7.6380%, Top_5_Accuracy: 19.7800%, time: 9.011860s
[2024-04-27 21:56:08] mindcv.utils.callbacks INFO - => New best val acc: 7.6380%
[2024-04-27 21:56:21] mindcv.utils.callbacks INFO - Saving model to ./ckpt/vit_b_32_224-170_312.ckpt
[2024-04-27 21:56:29] mindcv.utils.checkpoint_manager INFO - Top-k accuracy checkpoints:
./ckpt/vit_b_32_224-170_312.ckpt 0.07637999951839447
./ckpt/vit_b_32_224-169_312.ckpt 0.06178000196814537
./ckpt/vit_b_32_224-168_312.ckpt 0.03821999952197075
./ckpt/vit_b_32_224-167_312.ckpt 0.017880000174045563
[2024-04-27 21:56:29] mindcv.utils.callbacks INFO - Total time since last epoch: 326.168938(train: 296.084333, val: 9.011860)s, ETA: -43706.637681s
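
When checking a run like this one, the validation accuracies can be pulled out of the log rather than read by eye. The sketch below is an assumption about how one might do that, not a tool shipped with MindCV or solution_test; the helper name `extract_val_acc` and the constant `VAL_LINE` are hypothetical, and the pattern is written only against the line format visible in the excerpt above.

```python
import re

# Matches the "Validation Top_1_Accuracy" lines as they appear in the excerpt.
VAL_LINE = re.compile(
    r"Validation Top_1_Accuracy: ([\d.]+)%, Top_5_Accuracy: ([\d.]+)%"
)


def extract_val_acc(log_text):
    """Return a list of (top1, top5) percentages found in the log text."""
    return [(float(t1), float(t5)) for t1, t5 in VAL_LINE.findall(log_text)]


# Three validation lines copied from the log excerpt in this report.
sample_log = """\
[2024-04-27 21:45:39] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 3.8220%, Top_5_Accuracy: 11.5860%, time: 7.969699s
[2024-04-27 21:50:40] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 6.1780%, Top_5_Accuracy: 16.7240%, time: 7.595529s
[2024-04-27 21:56:08] mindcv.utils.callbacks INFO - Validation Top_1_Accuracy: 7.6380%, Top_5_Accuracy: 19.7800%, time: 9.011860s
"""

for top1, top5 in extract_val_acc(sample_log):
    print(f"top1={top1:.3f}%  top5={top5:.3f}%")
```

For a full run, the same function could be applied to the contents of the `train.log` file that the training command writes to.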

Special notes for this issue (Optional)

Comments (4)

chentangyu created the Bug-Report
chentangyu added the kind/bug label
chentangyu added the attr/accuracy label
chentangyu added the device/ascend label
chentangyu added the v2.3.0.rc2 label
chentangyu added collaborator chentangyu
chentangyu added collaborator PingqiLi

Please assign maintainer to check this issue.
@chentangyu

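Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it
  2. For model accuracy tuning problems, refer to the tuning guide on the official website
  3. If you are reporting a framework bug, please make sure the issue provides the necessary information for locating it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch command that reproduces the error
  4. If you have already located the root cause, you are welcome to submit a PR and join the MindSpore open-source community; we will review it as soon as possible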
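Shawny added collaborator Shawny
Shawny changed the assignee from Shawny to 代宇鑫
代宇鑫 changed the task status from TODO to VALIDATION
代宇鑫 added collaborator 代宇鑫
代宇鑫 changed the assignee from 代宇鑫 to chentangyu
代宇鑫 removed collaborator chentangyu
Shawny changed the milestone from B-SIG-Kit to B-MDTest
Shawny added the rca/others label
Shawny added the rct/oldrelease label
Shawny added the ctl/solutiontest label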

Not reproduced with this MindSpore package: master_20240426230938_a87635b6f65bc6225173fec53883351ebf449b66
Validation Top_1_Accuracy: 66.1080%, Top_5_Accuracy: 86.8300%
Validation Top_1_Accuracy: 66.4900%, Top_5_Accuracy: 87.0200%
Validation Top_1_Accuracy: 67.0460%, Top_5_Accuracy: 87.3060%

run package: Milan_C17/20240414
MindSpore package: master_20240426230938_a87635b6f65bc6225173fec53883351ebf449b66
Passed after rerun.

i-robot added the foruda label
chentangyu changed the task status from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
