[ST][MS][MF][r2.3][qwen_14b 8K long sequence][Fine-tuning][910B3 8P] Network fine-tuning performance degraded, 880 < 901

DONE
Bug-Report
Opened this issue  
2024-04-28 20:22

Describe the current behavior (Mandatory)

[ST][MS][MF][r2.3][qwen_14b 8K long sequence][Fine-tuning][910B3 8P] Network fine-tuning performance degraded: 880 < 901 (tokens/s, measured vs. baseline)
Model repository: https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/qwen.md

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

CANN version: MILAN-Florence-ASL/ABL V100R001C17SPC001B240 Alpha
MindSpore version: MindSpore_r2.3_d51c17c7 (MindSporeDaily)
MindFormers version: MindFormers_dev_a4fc9e6d (MindFormersDaily)

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository: MindFormers_Test/cases/qwen/14b/train/
Test case:
Not applicable

Steps to reproduce the issue (Mandatory)

  1. Get the code from mindformers.
  2. cd mindformers/research
  3. Update the weight and dataset paths in the configuration file, and set bs under runner_config to 1.
  4. bash run_singlenode.sh "python qwen/run_qwen.py
    --config qwen/run_qwen_14b.yaml
    --use_parallel True
    --run_mode finetune
    --auto_trans_ckpt True
    --load_checkpoint /home/workspace/large_model_ckpt//qwen/14b/rank_0/qwen_14b_base.ckpt
    --seq_length 8192
    --vocab_file /home/workspace/large_model_dataset/qwen/qwen.tiktoken
    --train_data /home/jenkins0/sjw/alpaca_8192.mindrecord " /home/workspace/config/hccl_8p.json [0,8] 8
  5. Verify that network inference succeeds.
  6. Verify that the network compile time meets the target.

Describe the expected behavior (Mandatory)

Network training and inference succeed, and both compile time and performance meet their targets.

Related log / screenshot (Mandatory)

8192 * 0.1075 = 880.64 < 901

2024-04-24 21:58:00,225 - mindformers[mindformers/core/callback/callback.py:327] - INFO -    0.4% |                                                  | 0.10743 samples/s/p  3 days, 11:42:54 }
2024-04-24 21:58:18,837 - mindformers[mindformers/core/callback/callback.py:256] - WARNING - micro_batch_interleave_num: %s > 1, multiple copies in parallel is open.
2024-04-24 21:58:18,838 - mindformers[mindformers/core/callback/callback.py:319] - INFO - { Epoch:[  1/  5], step:[  126/ 6500], loss: 0.927, per_step_time: 9302ms, lr: 1e-05, overflow cond: False, loss_scale: 64.0
2024-04-24 21:58:18,838 - mindformers[mindformers/core/callback/callback.py:327] - INFO -    0.4% |                                                  | 0.10750 samples/s/p  3 days, 11:39:07 }
2024-04-24 21:58:37,455 - mindformers[mindformers/core/callback/callback.py:256] - WARNING - micro_batch_interleave_num: %s > 1, multiple copies in parallel is open.
2024-04-24 21:58:37,455 - mindformers[mindformers/core/callback/callback.py:319] - INFO - { Epoch:[  1/  5], step:[  128/ 6500], loss: 0.760, per_step_time: 9304ms, lr: 1e-05, overflow cond: False, loss_scale: 64.0
2024-04-24 21:58:37,456 - mindformers[mindformers/core/callback/callback.py:327] - INFO -    0.4% |                                                  | 0.10748 samples/s/p  3 days, 11:39:55 }
2024-04-24 21:58:56,080 - mindformers[mindformers/core/callback/callback.py:256] - WARNING - micro_batch_interleave_num: %s > 1, multiple copies in parallel is open.
2024-04-24 21:58:56,081 - mindformers[mindformers/core/callback/callback.py:319] - INFO - { Epoch:[  1/  5], step:[  130/ 6500], loss: 1.167, per_step_time: 9307ms, lr: 1e-05, overflow cond: False, loss_scale: 64.0
2024-04-24 21:58:56,081 - mindformers[mindformers/core/callback/callback.py:327] - INFO -    0.4% |                                                  | 0.10744 samples/s/p  3 days, 11:41:23 }
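
For anyone re-checking the figure, the arithmetic above can be reproduced from the log with a short script. This is only a sketch: the log lines are abbreviated copies of the excerpt, the helper names are made up for illustration, and it assumes that tokens/s is simply the logged samples/s/p value multiplied by the 8192 sequence length, as in the calculation quoted in this issue.

```python
import re

SEQ_LENGTH = 8192       # --seq_length from the launch command
BASELINE_TOKENS = 901   # baseline figure quoted in the issue title

# Abbreviated copies of the throughput lines in the log excerpt
# (timestamp, source location and progress bar trimmed).
LOG_LINES = [
    "INFO -    0.4% | | 0.10743 samples/s/p  3 days, 11:42:54 }",
    "INFO -    0.4% | | 0.10750 samples/s/p  3 days, 11:39:07 }",
    "INFO -    0.4% | | 0.10748 samples/s/p  3 days, 11:39:55 }",
    "INFO -    0.4% | | 0.10744 samples/s/p  3 days, 11:41:23 }",
]

# Matches the number printed directly before the "samples/s/p" unit.
THROUGHPUT_RE = re.compile(r"(\d+\.\d+)\s+samples/s/p")


def extract_sample_rates(lines):
    """Return every samples/s/p reading found in the given log lines."""
    return [float(m.group(1)) for line in lines for m in THROUGHPUT_RE.finditer(line)]


def to_tokens_per_second(sample_rate, seq_length=SEQ_LENGTH):
    """Convert samples/s/p to tokens/s, assuming seq_length tokens per sample."""
    return sample_rate * seq_length


rates = extract_sample_rates(LOG_LINES)
mean_rate = sum(rates) / len(rates)
tokens = to_tokens_per_second(mean_rate)

print(f"mean {mean_rate:.5f} samples/s/p -> {tokens:.2f} tokens/s")
print("meets baseline" if tokens >= BASELINE_TOKENS else "below baseline")
```

Averaging the four readings gives a value slightly under the rounded 0.1075 used above, so the result lands near 880 tokens/s either way and stays below 901.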

Special notes for this issue (Optional)

Route to 杨贵龙 (Yang Guilong)

Comments (6)

sunjiawei999 created Bug-Report
sunjiawei999 Copied from issue I9K1E9
sunjiawei999 added kind/bug label
sunjiawei999 added attr/function label
sunjiawei999 added stage/func-debug label
sunjiawei999 added sig/mindformers label
sunjiawei999 added device/ascend label
sunjiawei999 added v2.3.0.rc2 label
sunjiawei999 assigned collaborator xiangminshan
sunjiawei999 assigned collaborator wangxingyan
sunjiawei999 assigned collaborator sunjiawei999
sunjiawei999 assigned collaborator liyang

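Please assign a maintainer to check this issue.
@sunjiawei999

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials.
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem.
  2. For model accuracy tuning problems, refer to the tuning guide on the official website.
  3. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch command for code that reproduces the error.
  4. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community, and we will review it as soon as possible.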
sunjiawei999 changed title
sunjiawei999 changed description
hsshuai changed assignee from Yang Guilong to 刘群
hsshuai assigned collaborator Yang Guilong
hsshuai changed assignee from 刘群 to 吴昊天
hsshuai assigned collaborator 刘群
sunjiawei999 changed title
sunjiawei999 changed description

An email has been sent to adjust the baseline. Please retest against the baseline given in that email.

hsshuai changed milestone from B-SIG-MindFormers to B-SolutionTest
hsshuai changed assignee from 吴昊天 to sunjiawei999
hsshuai unassigned collaborator sunjiawei999
hsshuai assigned collaborator 吴昊天
hsshuai added rca/others label
hsshuai added rct/newfeature label
hsshuai added ctl/solutiontest label
hsshuai changed issue state from TODO to VALIDATION

Adjusting the baseline requires going through CCB.

zhongjicheng changed issue state from VALIDATION to TODO
zhongjicheng changed assignee from sunjiawei999 to 吴昊天
zhongjicheng unassigned collaborator 吴昊天
zhongjicheng assigned collaborator sunjiawei999

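This is the first acceptance of a new model, so no CCB is needed. The baseline change is recorded by email, and the performance data from the latest version serves as the baseline.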

Lin changed milestone from B-SIG-MindFormers to B-SolutionTest
Lin changed assignee from 吴昊天 to zhongjicheng
Lin assigned collaborator 吴昊天
Lin changed issue state from TODO to VALIDATION

Performance is now tracked against the 880 tokens/s baseline from the latest test-handover email. Closing this issue.

zhongjicheng changed issue state from VALIDATION to DONE
fangwenyi removed v2.3.0.rc2 label
fangwenyi added master label
