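[ST][MS][Large-cluster special test][Dynamic Cluster] 13,000-worker cluster: after the scheduler finishes initialization, it times out against some workers and exits abnormally (known issue); after the scheduler exits, a large number of workers do not exit for a long time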

name	about	labels
Bug Report	Use this template for reporting a bug	kind/bug

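Describe the current behavior (Mandatory)

13,000-worker cluster: after the scheduler finishes initialization, it times out against some of the workers and exits abnormally (known issue). After the scheduler exits, a large number of workers keep running for a long time instead of exiting.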

Environment (Mandatory)

Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device ascend/GPU/CPU/kirin/other chips

Software Environment (Mandatory):
-- MindSpore version (e.g., 1.7.0.Bxxx) :
-- Python version (e.g., Python 3.7.5) :
-- OS platform and distribution (e.g., Linux Ubuntu 16.04):
-- GCC/Compiler version (if compiled from source):
Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

test_ms_msrun_simulation_cluster_10k_node

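Steps to reproduce the issue (Mandatory)

1. Confirm that no leftover python processes remain on any host
2. Upload the training script to all hosts
3. On the designated hosts, launch a number of workers in a loop
4. On the designated host, launch the scheduler
5. Check the worker process status on all hosts; check the scheduler process status and logs, and record the time to full registration and the average processing time of one heartbeat
6. Monitor the scheduler process status over a long period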
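The process checks in steps 1, 5 and 6 can be scripted. Below is a minimal Python sketch, not the script used by the test case. It assumes the check runs locally on each host, that `ps` is available, and that worker processes can be recognized by a substring of their command line. The helper names `count_matching`, `local_process_count` and `wait_until_gone` are hypothetical.

```python
import subprocess
import time
from typing import Callable


def count_matching(ps_output: str, pattern: str) -> int:
    """Count command lines in `ps` output that contain `pattern`."""
    return sum(1 for line in ps_output.splitlines() if pattern in line)


def local_process_count(pattern: str) -> int:
    """Count local processes whose command line contains `pattern`.

    Assumption: `ps -A -o args=` is available on the host.
    """
    result = subprocess.run(
        ["ps", "-A", "-o", "args="], capture_output=True, text=True, check=True
    )
    return count_matching(result.stdout, pattern)


def wait_until_gone(
    count_fn: Callable[[], int],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Poll `count_fn` until it returns 0 or `timeout_s` elapses.

    Returns the last observed count. A non-zero result means some
    processes were still alive when the deadline was reached.
    """
    deadline = clock() + timeout_s
    remaining = count_fn()
    while remaining > 0 and clock() < deadline:
        sleep(interval_s)
        remaining = count_fn()
    return remaining
```

For example, `wait_until_gone(lambda: local_process_count("train.py"), timeout_s=600)` would report how many workers are still alive ten minutes after the scheduler exits. The pattern `train.py` is a placeholder for the real training script name.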

Describe the expected behavior (Mandatory)

Related log / screenshot (Mandatory)

[screenshot]

Special notes for this issue (Optional)

Assign to Zhou Peichen (周培晨)

Please assign a maintainer to check this issue.
@baimz

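Thank you for your question. You can comment //mindspore-assistant to get help faster:

If you are new to MindSpore, you may find the answer in the tutorials
If you are an experienced PyTorch user, you may need:

If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate the problem
For model accuracy tuning problems, refer to the tuning guide on the official website
If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type used (CPU, GPU, Ascend), the environment, an official link to the training code, and how to launch the code that reproduces the error
If you have already located the root cause, you are welcome to submit a PR and take part in the MindSpore open-source community; we will review it as soon as possible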

test_excute_parallel_node_rank_one_2x8_with_msrun abnormal scenario: some workers did not exit

[screenshot]

GVP MindSpore / mindspore