MindSpore / mindspore

[ST][MS][MF][r2.3][qwen_7b 8K long sequence][inference][910B3 8P] Network inference fails: the log stops printing and the process does not exit.

DONE
Bug-Report
Created 2024-04-22 20:39

Describe the current behavior (Mandatory)

[ST][MS][MF][r2.3][qwen_7b 8K long sequence][inference][910B3 8P] Network inference fails: the log stops printing and the process does not exit.
Model repository: https://gitee.com/mindspore/mindformers/blob/dev/research/qwen/qwen.md

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

CANN version: MILAN-Florence-ASL/ABL V100R001C17SPC001B240 Alpha
MindSpore version: MindSpore_r2.3_d51c17c7 (MindSporeDaily)
MindFormers version: MindFormers_dev_a4fc9e6d (MindFormersDaily)

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph
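
The Python and OS fields of the template above were left empty. A minimal sketch for collecting them is shown here. The distribution names `mindspore` and `mindformers` are assumptions (daily builds may be installed under other names), and the CANN version cannot be read this way.

```python
import platform
from importlib import metadata


def collect_env(packages=("mindspore", "mindformers")):
    """Gather the version fields requested by the issue template."""
    info = {
        "python": platform.python_version(),
        "os": platform.platform(),
    }
    for name in packages:
        try:
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed under this distribution name.
            info[name] = "not installed"
    return info
```

Printing each key and value of `collect_env()` gives lines that can be pasted into the Software Environment fields.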

Related testcase (Mandatory)

Test case repository path: MindFormers_Test/cases/qwen/14b/train/
Test case:
Not applicable

Steps to reproduce the issue (Mandatory)

  1. get code from mindformers
  2. cd mindformers/research
  3. Update the weight and dataset paths in the config file, and set bs (batch size) under runner_config to 1
  4. export PYTHONPATH=/home/jenkins0/sjw/mindformers/ && python qwen_train_after_infer.py --predict_data /home/jenkins0/0419/mindformers/qwen_7b_questions.txt --config_file /home/jenkins0/0419/mindformers/research/qwen/run_qwen_7b.yaml --ckpt_path /home/jenkins0/0419/mindformers/research/qwen_7b_output/target_checkpoint/rank_0/qwen_7b0.ckpt --lora_generate_value False > sh_eval.log 2>&1
  5. Verify that network inference succeeds
  6. Verify that the network compile time meets the target
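
Because the reported failure is a process that never exits, the check in step 5 can block forever when run by hand. One way to guard against that is sketched below. The helper name `run_with_timeout` and the timeout value are hypothetical and are not part of the test suite.

```python
import subprocess


def run_with_timeout(cmd, log_path, timeout_s):
    """Run cmd with stdout/stderr redirected to log_path.

    Returns ("exited", returncode) if the process finishes in time,
    or ("hung", None) if it had to be killed after timeout_s seconds.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        try:
            return ("exited", proc.wait(timeout=timeout_s))
        except subprocess.TimeoutExpired:
            proc.kill()  # the process did not exit on its own
            proc.wait()  # reap it so no zombie is left behind
            return ("hung", None)
```

Passing the argument list from step 4 as `cmd` and `sh_eval.log` as `log_path` reproduces the original redirect, while a `"hung"` result marks the run as failed instead of blocking the job.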

Describe the expected behavior (Mandatory)

Network inference succeeds, and compile time and performance meet their targets.

Related log / screenshot (Mandatory)

After printing the lines below, the log produces no further output, and the process does not exit.


[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.776.952 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2449] ClearResPart2] End clear AnalysisResultCacheMgr.
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.776.964 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2451] ClearResPart2] Start clear AnalysisContext...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.776.978 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2453] ClearResPart2] End clear AnalysisContext...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.776.989 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2455] ClearResPart2] Start clear AnalysisSchedule...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.284 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2457] ClearResPart2] End clear AnalysisSchedule...
[INFO] DEBUG(126543,ffff81c15020,python):2024-04-18-16:06:24.784.333 [mindspore/ccsrc/debug/debugger/debugger.cc:101] Debugger] Debugger got device_target: Ascend
[INFO] DEBUG(126543,ffff81c15020,python):2024-04-18-16:06:24.784.353 [mindspore/ccsrc/debug/debugger/debugger.cc:305] Reset] Release Debugger resource.
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.381 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2470] ClearResPart3] Start clear ClearObjectCache...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.393 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2472] ClearResPart3] End clear ClearObjectCache...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.405 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2474] ClearResPart3] Start clear Parser...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.421 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2476] ClearResPart3] End clear Parser...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.433 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2478] ClearResPart3] Start ClearTraceStack...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.450 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2480] ClearResPart3] End ClearTraceStack...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.462 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2482] ClearResPart3] Start clear InterpretNodeRecorder...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.475 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2484] ClearResPart3] End clear InterpretNodeRecorder...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.486 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2486] ClearResPart3] Start clear parallel::entire_costgraph...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.499 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2488] ClearResPart3] End clear parallel::entire_costgraph...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.510 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2490] ClearResPart3] Start clear ProtobufLibrary...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.704 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2492] ClearResPart3] End clear ProtobufLibrary...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.720 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2494] ClearResPart3] Start clear python_adapter...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.733 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2496] ClearResPart3] End clear python_adapter.
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.745 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2500] ClearSingleton] Start clear singleton...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.797 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2517] ClearSingleton] End clear singleton.
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.811 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2533] ClearResAtexit] Start unload dynamic lib...
[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.857 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2535] ClearResAtexit] End unload dynamic lib...
[INFO] GE_ADPT(126543,ffff81c15020,python):2024-04-18-16:06:25.493.708 [mindspore/ccsrc/transform/graph_ir/df_graph_manager.cc:260] DeleteGraphRunner] GraphRunner is not exist
[INFO] GE_ADPT(126543,ffff81c15020,python):2024-04-18-16:06:25.493.790 [mindspore/ccsrc/transform/graph_ir/df_graph_manager.cc:224] DeleteGeSession] Ge Session is not exist
[INFO] GE_ADPT(126543,ffff81c15020,python):2024-04-18-16:06:25.493.810 [mindspore/ccsrc/transform/graph_ir/df_graph_manager.cc:178] ClearGraph] Remove all graphs in GraphManager
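
To see where the output stops, the lines above can be split into fields. The sketch below assumes the layout seen in this excerpt (level, module, process id, timestamp, source file and line, function, message). The field names are labels chosen for illustration, not MindSpore terminology.

```python
import re
from datetime import datetime

LOG_RE = re.compile(
    r"\[(?P<level>\w+)\] (?P<module>\w+)\((?P<pid>\d+),\w+,\w+\):"
    r"(?P<ts>\d{4}-\d{2}-\d{2}-\d{2}:\d{2}:\d{2}\.\d{3}\.\d{3}) "
    r"\[(?P<file>[^\]:]+):(?P<line>\d+)\] (?P<func>\w+)\] (?P<msg>.*)"
)


def parse_line(line):
    """Split one log line into its fields, or return None if it does not match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    # "24.784.857" is milliseconds then microseconds; join them for %f.
    head, micros = rec["ts"].rsplit(".", 1)
    rec["time"] = datetime.strptime(head + micros, "%Y-%m-%d-%H:%M:%S.%f")
    return rec


# Two consecutive lines copied from the log above.
sample = [
    "[INFO] PIPELINE(126543,ffff81c15020,python):2024-04-18-16:06:24.784.857 [mindspore/ccsrc/pipeline/jit/ps/pipeline.cc:2535] ClearResAtexit] End unload dynamic lib...",
    "[INFO] GE_ADPT(126543,ffff81c15020,python):2024-04-18-16:06:25.493.708 [mindspore/ccsrc/transform/graph_ir/df_graph_manager.cc:260] DeleteGraphRunner] GraphRunner is not exist",
]
records = [parse_line(text) for text in sample]
gap = (records[1]["time"] - records[0]["time"]).total_seconds()
print(records[0]["func"], "->", records[1]["func"], f"gap: {gap:.6f} s")
```

Applied to the whole log, the last parsed record gives the function and timestamp after which the process went silent.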

Special notes for this issue (Optional)

Assign to Yang Guilong (杨贵龙)

Comments (4)

sunjiawei999 created the Bug-Report
sunjiawei999 copied this from task I9IR9V
sunjiawei999 added the labels kind/bug, attr/function, stage/func-debug, sig/mindformers, device/ascend, v2.3.0.rc2, v2.3.0
sunjiawei999 added collaborator liyang

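Please assign maintainer to check this issue.
@sunjiawei999

Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  3. If you hit a problem in dynamic graph (PyNative) mode, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem
  4. For model accuracy tuning, see the tuning guide on the official website
  5. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and the launch command that reproduces the error
  6. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open source community, and we will review it as soon as possible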
sunjiawei999 edited the description
sunjiawei999 copied task I9J6CZ

Appearance & Root Cause

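The inference path on the dev branch was switched to the unified training-inference flow. That flow is not yet stable, and Qwen has not been adapted to it.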

Fix Solution

Related PRs:

https://gitee.com/mindspore/mindformers/pulls/2784
https://gitee.com/mindspore/mindformers/pulls/2809

Self-test Result:

With the unified training-inference flow enabled, Qwen-14B was tested on 910B3 with seq_length=2048 and batch_size=4

 python run_qwen.py --config run_qwen_14b.yaml --load_checkpoint /opt/qwen/ckpt-14b/qwen_14b_base-fp16.ckpt --seq_length 8192 --batch_size 4 --predict_data 帮助我制定一份去上海的旅游攻略 用python写一段快排代码 你是谁 'I love Shenzhen, because'

Test results (a few line breaks were added manually when pasting here)

2024-04-24 15:47:03,868 - mindformers[mindformers/modules/block_tables.py:62] - INFO - init cache engine success.
2024-04-24 15:47:20,879 - mindformers[mindformers/generation/text_generator.py:867] - INFO - total time: 17.010602474212646 s; generated tokens: 1523 tokens; generate speed: 89.5323961810761 tokens/s
2024-04-24 15:47:20,884 - mindformers[mindformers/modules/block_tables.py:125] - INFO - Clear block table cache engines.
2024-04-24 15:47:20,885 - mindformers[mindformers/trainer/base_trainer.py:951] - INFO - output result is: [{'text_generation_text': [
'帮助我制定一份去上海的旅游攻略,包括景点、美食和住宿建议。\n好的,以下是一份去上海的旅游攻略:\n\n景点:\n1. 上海博物馆:了解上海的历史和文化。\n2. 外滩:欣赏浦江两岸的美景,晚上还有灯光秀。\n3. 上海城隍庙:体验传统的中国文化和宗教信仰。\n4. 上海科技馆:适合家庭游玩,有各种互动展览。\n5. 上海迪士尼乐园:适合亲子游玩,有各种主题游乐设施。\n\n美食:\n1. 小笼包:上海特色美食,可以去南翔馒头店或老正兴尝试。\n2. 红烧肉:上海传统菜肴,可以去南翔馒头店或老正兴尝试。\n3. 生煎包:上海特色美食,可以去南翔馒头店或老正兴尝试。\n4. 糖醋排骨:上海传统菜肴,可以去南翔馒头店或老正兴尝试。\n5. 上海菜:可以去南翔馒头店或老正兴尝试。\n\n住宿建议:\n1. 上海外滩茂悦大酒店:位于外滩,可以欣赏浦江两岸的美景。\n2. 上海浦东嘉里大酒店:位于浦东,交通便利,设施齐全。\n3. 上海瑞吉酒店:位于南京路步行街,购物方便,设施豪华。\n4. 上海四季酒店:位于陆家嘴,可以欣赏东方明珠塔的美景。\n5. 上海和平饭店:位于南京路步行街,历史悠久,设施完善。\n\n希望这份攻略能对您有所帮助,祝您旅途愉快!', 
'用python写一段快排代码\n\n好的,以下是Python实现的快速排序代码:\n\n```python\ndef quick_sort(arr):\n    if len(arr) <= 1:\n        return arr\n    else:\n        pivot = arr[0]\n        left = []\n        right = []\n        for i in range(1, len(arr)):\n            if arr[i] < pivot:\n                left.append(arr[i])\n            else:\n                right.append(arr[i])\n        return quick_sort(left) + [pivot] + quick_sort(right)\n```\n\n这个函数接受一个列表作为参数,如果列表长度小于等于1,则直接返回该列表。否则,选择列表的第一个元素作为基准值(pivot),将列表中所有小于基准值的元素放入一个新列表left中,将所有大于等于基准值的元素放入另一个新列表right中。然后递归地对left和right进行快速排序,并将结果合并起来返回。', 
'你是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“我是谁?我是谁?我是谁?”\n\n“', 
'I love Shenzhen, because it is a modern city. It is in the south of China. It is a very big city. It is a new city. It is a beautiful city. It is a busy city. It is a clean city. It is a city with many tall buildings. It is a city with many big parks. It is a city with many big shopping malls. It is a city with many big supermarkets. It is a city with many big restaurants. It is a city with many big hotels. It is a city with many big cinemas. It is a city with many big theatres. It is a city with many big museums. It is a city with many big libraries. It is a city with many big hospitals. It is a city with many big schools. It is a city with many big colleges. It is a city with many big universities. It is a city with many big parks. It is a city with many big gardens. It is a city with many big lakes. It is a city with many big rivers. It is a city with many big mountains. It is a city with many big forests. It is a city with many big zoos. It is a city with many big botanical gardens. It is a city with many big amusement parks. It is a city with many big stadiums. It is a city with many big swimming pools. It is a city with many big gyms. It is a city with many big libraries. It is a city with many big museums. It is a city with many big theatres. It is a city with many big cinemas. It is a city with many big shopping malls. It is a city with many big supermarkets. It is a city with many big restaurants. It is a city with many big hotels. It is a city with many big parks. It is a city with many big gardens. It is a city with many big lakes. It is a city with many big rivers. It is a city with many big mountains. It is a city with many big forests. It is a city with many big zoos. It is a city with many big botanical gardens. It is a city with many big amusement parks. It is a city with many big stadiums. It is a city with many big swimming pools. It is a city with many big gyms. It is a city with many big libraries. It is a city with many big museums. 
It is a city with many big theatres. It is a city']}]
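
The reported generation speed can be cross-checked against the other two numbers on the same log line. The sketch below parses the `text_generator.py` line quoted above and recomputes tokens divided by seconds. The regular expression assumes the message format shown in this log.

```python
import re

line = (
    "2024-04-24 15:47:20,879 - mindformers[mindformers/generation/text_generator.py:867] - INFO - "
    "total time: 17.010602474212646 s; generated tokens: 1523 tokens; "
    "generate speed: 89.5323961810761 tokens/s"
)

STATS_RE = re.compile(
    r"total time: (?P<seconds>[\d.]+) s; "
    r"generated tokens: (?P<tokens>\d+) tokens; "
    r"generate speed: (?P<speed>[\d.]+) tokens/s"
)


def parse_generation_stats(text):
    """Return (seconds, tokens, logged_speed) from a generation summary line."""
    m = STATS_RE.search(text)
    if m is None:
        raise ValueError("no generation statistics found")
    return float(m["seconds"]), int(m["tokens"]), float(m["speed"])


seconds, tokens, logged_speed = parse_generation_stats(line)
recomputed = tokens / seconds
print(f"{recomputed:.2f} tokens/s")  # prints "89.53 tokens/s"
```

The recomputed value agrees with the logged speed, so the figure is the total token count divided by the total wall time of the call.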

Self-test Report & DT Review

Additional ST/UT needed: No
Reason:
Test case file:
Test case name:

i-robot added the gitee label
Yang Guilong changed the assignee from Yang Guilong to sunjiawei999
Yang Guilong changed the task status from TODO to VALIDATION
Yang Guilong added the labels rca/codespec, rct/newfeature, ctl/solutiontest
sunjiawei999 copied task I9K1E9
sunjiawei999 changed the task status from VALIDATION to DONE
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
