
[ST][MS][NET][glm3-6b-32k][910B3 8p] Finetuning fails: Memory pool not enough; after changing the launch method, the measured performance is 778, below the baseline of 809

Status: TODO
Type: Bug-Report
Created: 2024-03-28 10:08

Template: Bug Report (use this template for reporting a bug; labels: kind/bug)

Describe the current behavior (Mandatory)

Model repository: https://gitee.com/mindspore/mindformers
[glm3-6b-32k][910B3 1p] Finetuning fails: Memory pool not enough, graph: kernel_graph43, max_static_memory_size: 45738800640, feature_memory_size: 25568905728, max_hbm_memory_size: 63350767616
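For reference, the three sizes in the error message already account for the failure: the graph needs more memory than the device pool offers. A quick sanity check in plain Python (values copied from the log above; the roughly 7.4 GiB shortfall is the gap any fix has to close):

GIB = 1 << 30
max_static_memory_size = 45_738_800_640   # ~42.6 GiB of static allocations
feature_memory_size = 25_568_905_728      # ~23.8 GiB of activation/feature memory
max_hbm_memory_size = 63_350_767_616      # 59.0 GiB usable HBM pool on the device

required = max_static_memory_size + feature_memory_size
print(f"required:  {required / GIB:.1f} GiB")                          # ~66.4 GiB
print(f"available: {max_hbm_memory_size / GIB:.1f} GiB")               # 59.0 GiB
print(f"shortfall: {(required - max_hbm_memory_size) / GIB:.1f} GiB")  # ~7.4 GiB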

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):
/device 910B3

  • Software Environment (Mandatory):
    -- CANN: Milan_C17/20240321
    -- MindSpore: r2.3.q1_20240327061515_474c16da70
    -- MindFormers: r1.1.tr5_20240327061514_b14891090daf

  • Execute Mode (Mandatory) (PyNative/Graph):
/mode graph

Related testcase (Mandatory)

Test repository path: /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/glm3/32k/train
Test case: test_mf_chatglm3_6b_32k_train_longbench_8p_0001

Steps to reproduce the issue (Mandatory)

  1. Get the code from mindformers
  2. cd research/qwen
  3. bash run_singlenode.sh '''python glm32k/run_glm32k.py --config /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/glm3/32k/train/test_mf_chatglm3_6b_32k_train_longbench_8p_0001/research/glm32k/run_glm32k.yaml --run_mode finetune --train_dataset /home/workspace/large_model_dataset//glm32K/longbench.jsonl --use_parallel True ''' /home/workspace/config/hccl_8p.json [0,8] 8
  4. The network finetunes successfully
  5. The inference results meet expectations

Describe the expected behavior (Mandatory)

The network finetunes successfully.

Related log / screenshot (Mandatory)

Traceback (most recent call last):
  File "glm32k/run_glm32k.py", line 159, in <module>
    vocab_file=args.vocab_file)
  File "glm32k/run_glm32k.py", line 96, in main
    resume_training=resume)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_checkparam.py", line 1371, in wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/trainer.py", line 514, in finetune
    is_full_config=True)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 120, in train
    **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/base_trainer.py", line 788, in training_process
    initial_epoch=config.runner_config.initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1085, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 635, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
    outputs = train_network(*inputs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 662, in __call__
    out = self.compile_and_run(*args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 984, in compile_and_run
    return _cell_graph_executor(self, *new_args, phase=self.phase)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1631, in __call__
    return self.run(obj, *args, phase=phase)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1670, in run
    return self._exec_pip(obj, *args, phase=phase_real)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 130, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1650, in _exec_pip
    return self._graph_executor(args, phase)
RuntimeError: Memory pool not enough, graph: kernel_graph43, max_static_memory_size: 45738800640, feature_memory_size: 25568905728, max_hbm_memory_size: 63350767616
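Not from the original report: a minimal sketch of the first knobs usually tried for this class of OOM, assuming MindSpore r2.3-era APIs. set_context(max_device_memory=...) is a documented MindSpore option; the remaining headroom typically has to come from the MindFormers YAML (recompute, parallel degree, sequence length), whose exact keys vary by version.

# Hedged sketch; this is not the fix that was eventually merged for this issue.
import mindspore as ms

# Raise the usable HBM pool. The log shows a 59 GiB pool on a 910B3 card; the
# safe ceiling depends on driver/firmware reservations, so treat "62GB" as an
# assumption to verify, not a recommendation.
ms.set_context(max_device_memory="62GB")

# Even at the ceiling, the ~7.4 GiB shortfall computed above usually has to
# come out of the model config instead: enable (selective) recompute, raise
# the model-parallel degree, or reduce the 32k sequence length in
# run_glm32k.yaml.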

Special notes for this issue (Optional)

Route this to Zhang Youwen.

Comments (5)

sunjiawei999 created the Bug-Report
sunjiawei999 copied this from task I9BXFK
sunjiawei999 added labels: attr/function, stage/func-debug, kind/bug, sig/mindformers, v2.2.13, device/ascend, foruda, rca/codelogic, rct/newfeature, ctl/solutiontest

Please assign a maintainer to check this issue.
@sunjiawei999

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need the following:
     1. For PyNative (dynamic graph) issues, set set_context(pynative_synchronize=True) to get an error stack that helps locate the problem
     2. For model accuracy tuning, refer to the tuning guide on the official website
  3. If you are reporting a framework bug, please confirm that the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch method that reproduces the error
  4. If you have already found the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible
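Point 2.1 above refers to a real MindSpore context flag; a minimal sketch of its use (not part of this issue's Graph-mode run, where it would not apply):

# Make PyNative execution synchronous so the Python stack trace points at the
# operator that actually raised the device-side error.
import mindspore as ms
ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)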
sunjiawei999 set the assignee to xiangminshan
sunjiawei999 set the planned start date to 2024-03-28
sunjiawei999 set the planned due date to 2024-04-04
sunjiawei999 removed labels: rca/codelogic, rct/newfeature, ctl/solutiontest
sunjiawei999 changed the associated branch from r2.2 to master
sunjiawei999 edited the title and description
sunjiawei999 removed labels: v2.2.13, foruda
sunjiawei999 added label: v2.3.0
sunjiawei999 set the milestone to B-SIG-MindFormers
xiangminshan changed the assignee from xiangminshan to zyw_hw
hsshuai changed the assignee from zyw_hw to xdnjust
hsshuai added zyw_hw as a collaborator
sunjiawei999 copied this to task I9CO0Q
CCB conclusion:

  1. Submit a parallel configuration that can launch the task; run the tests through regression and re-establish the baseline.
  2. Convert long-sequence optimization into a requirement.

CCB conclusion:

  1. For the glm32k finetuning OOM problem, a workaround parallel strategy has been merged. The function is usable but performance drops; the memory and performance issues are deferred for now.
  2. The in-repo evaluation function is implemented on top of Lite. After moving to the unified training/inference stack, the 430 release will remove the evaluation part and provide only an online inference example.
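The workaround strategy itself was merged as a MindFormers YAML change that is not shown in this issue. As a hedged illustration only, the MindSpore-level switches such a strategy maps onto look roughly like this (set_auto_parallel_context is a real API; the concrete values are assumptions, not the merged config):

# Hedged illustration of parallel switches that trade throughput for memory.
import mindspore as ms
from mindspore.communication import init

init()  # requires a distributed launch (rank table file or msrun)
ms.set_auto_parallel_context(
    parallel_mode="semi_auto_parallel",
    full_batch=True,
    enable_parallel_optimizer=True,  # shards optimizer states across the 8 ranks
)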
sunjiawei999 copied this to task I9O2HQ
sunjiawei999 changed the title

After switching to the ms_run launch method, the measured performance is 778, below the baseline of 809.

sunjiawei999 changed the title
fangwenyi removed label: v2.3.0.rc2
fangwenyi added label: master
sunjiawei999 changed the title
fangwenyi set the associated branch option to master
fangwenyi set the issue backend type option to Ascend
