name | about | labels |
---|---|---|
Bug Report | Use this template for reporting a bug | kind/bug |
Model repository: https://gitee.com/mindspore/mindformers
[glm3-6b-32k][910B3 1p] Fine-tuning fails: Memory pool not enough, graph: kernel_graph43, max_static_memory_size: 45738800640, feature_memory_size: 25568905728, max_hbm_memory_size: 63350767616
Hardware environment: Ascend
/device 910B3
CANN:Milan_C17/20240321
MS:r2.3.q1_20240327061515_474c16da70
MF:r1.1.tr5_20240327061514_b14891090daf
Execution mode: Graph
/mode graph
Test repository path: /home/jenkins/workspace/TDT_deployment/MindFormers_Test/cases/glm3/32k/train
Test case:
test_mf_chatglm3_6b_32k_train_longbench_8p_0001
Expected result: network fine-tuning succeeds.
Traceback (most recent call last):
File "glm32k/run_glm32k.py", line 159, in <module>
vocab_file=args.vocab_file)
File "glm32k/run_glm32k.py", line 96, in main
resume_training=resume)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/_checkparam.py", line 1371, in wrapper
return func(*args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/trainer.py", line 514, in finetune
is_full_config=True)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/causal_language_modeling/causal_language_modeling.py", line 120, in train
**kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindformers/trainer/base_trainer.py", line 788, in training_process
initial_epoch=config.runner_config.initial_epoch)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1085, in train
initial_epoch=initial_epoch)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 115, in wrapper
func(self, *args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 635, in _train
cb_params, sink_size, initial_epoch, valid_infos)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 719, in _train_dataset_sink_process
outputs = train_network(*inputs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 662, in __call__
out = self.compile_and_run(*args, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/nn/cell.py", line 984, in compile_and_run
return _cell_graph_executor(self, *new_args, phase=self.phase)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1631, in __call__
return self.run(obj, *args, phase=phase)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1670, in run
return self._exec_pip(obj, *args, phase=phase_real)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 130, in wrapper
results = fn(*arg, **kwargs)
File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1650, in _exec_pip
return self._graph_executor(args, phase)
RuntimeError: Memory pool not enough, graph: kernel_graph43, max_static_memory_size: 45738800640, feature_memory_size: 25568905728, max_hbm_memory_size: 63350767616
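The three byte counts in the error can be compared directly. The sketch below parses them from the message and computes how far the request exceeds the device limit. It assumes that the static and feature sizes are additive against `max_hbm_memory_size`; that reading is an interpretation of the message, not something the log states. The helper names `parse_sizes` and `shortfall_bytes` are hypothetical.

```python
import re

# Error text copied verbatim from the traceback above.
ERROR = ("Memory pool not enough, graph: kernel_graph43, "
         "max_static_memory_size: 45738800640, "
         "feature_memory_size: 25568905728, "
         "max_hbm_memory_size: 63350767616")

GIB = 1024 ** 3


def parse_sizes(message):
    """Extract every `<name>_size: <bytes>` pair from the error message."""
    pairs = re.findall(r"(\w+_size): (\d+)", message)
    return {name: int(value) for name, value in pairs}


def shortfall_bytes(sizes):
    """Bytes requested beyond the HBM limit.

    Assumption: static and feature memory add up against the limit.
    """
    required = sizes["max_static_memory_size"] + sizes["feature_memory_size"]
    return required - sizes["max_hbm_memory_size"]


sizes = parse_sizes(ERROR)
shortfall = shortfall_bytes(sizes)

for name, value in sizes.items():
    print(f"{name}: {value / GIB:.2f} GiB")
print(f"shortfall: {shortfall / GIB:.2f} GiB")
```

Under that assumption the graph asks for roughly 42.60 GiB of static memory plus 23.81 GiB of feature memory against a 59 GiB limit, leaving a shortfall of about 7.41 GiB.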
Assign to 张又文.
Please assign maintainer to check this issue.
@sunjiawei999
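CCB conclusion:
1. Submit a parallel configuration that can launch the task; testing runs it as a regression and re-establishes the baseline.
2. Long-sequence optimization is converted into a feature requirement.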
CCB conclusion:
After switching to launching with ms_run, the measured performance is 778, below the baseline of 809.
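To put the gap between the measured value and the baseline in relative terms, a minimal sketch follows. It assumes that a higher number is better and that both figures use the same unit, which the thread does not state; `relative_drop` is a hypothetical helper.

```python
def relative_drop(measured, baseline):
    """Fraction by which `measured` falls below `baseline`."""
    return (baseline - measured) / baseline


# Values reported in the comment above: measured 778, baseline 809.
drop = relative_drop(778, 809)
print(f"{drop:.2%}")
```

With these figures the drop is a little under 4% of the baseline.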