
[CT][MS][Redistribution after layout extension] Saving the parallel sharding strategy file is not supported after setting a layout, which breaks pipeline inference

TODO
Bug-Report
Created on 2024-03-19 16:28

Describe the current behavior / 问题描述 (Mandatory / 必填)

After setting a layout, saving the parallel sharding strategy file is not supported, which prevents the pipeline from running inference normally.
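For illustration, a minimal hedged sketch of the configuration that triggers the failure (the class name, axis names, device matrix, and shapes are assumptions, not taken from the test source): an operator whose sharding strategy is expressed through the extended Layout API rather than a plain strategy tuple.

    import numpy as np
    import mindspore as ms
    from mindspore import nn, ops, Tensor, Layout

    class MatMulNet(nn.Cell):
        """Hypothetical cell: one MatMul sharded via a Layout-based strategy."""
        def __init__(self):
            super().__init__()
            self.matmul = ops.MatMul()
            self.weight = ms.Parameter(Tensor(np.ones((96, 96)), ms.float32))
            # 2x2x2 device matrix over 8 devices, with named axes.
            layout = Layout((2, 2, 2), ("dp", "sp", "mp"))
            # The strategy is given as layouts instead of plain tuples; it is
            # this form that the strategy-checkpoint serializer cannot handle.
            self.matmul.shard(in_strategy=(layout("dp", "sp"), layout("sp", "mp")))

        def construct(self, x):
            return self.matmul(x, self.weight)

When such a net is compiled with strategy-file saving enabled (see the context sketch under Steps to reproduce), serializing the per-operator strategies fails in to_protobuf with the traceback below.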

 fact.mindspore_semi_parallel_impl(parallel_net, dataset=parallel_dataset, epoch=1,
>                                         device_num=8)

../test_parallel_shard_layout.py:2695:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
../../pipeline_split/test_pipeline.py:297: in mindspore_semi_parallel_impl
    dataset_strategy=dataset_strategy, **kwargs)
../../pipeline_split/test_pipeline.py:278: in __mindspore_impl
    eval_network=eval_network)
../../pipeline_split/test_pipeline.py:238: in _model_train_and_save_ckpt
    dataset_sink_mode=self.dataset_sink_mode)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/train/model.py:1074: in train
    initial_epoch=initial_epoch)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/train/model.py:114: in wrapper
    func(self, *args, **kwargs)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/train/model.py:617: in _train
    self._train_process(epoch, train_dataset, list_callback, cb_params, initial_epoch, valid_infos)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/train/model.py:919: in _train_process
    outputs = self._train_network(*next_element)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/cell.py:662: in __call__
    out = self.compile_and_run(*args, **kwargs)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/cell.py:980: in compile_and_run
    self.compile(*args, **kwargs)
/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/cell.py:964: in compile
    jit_config_dict=self._jit_config_dict, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <mindspore.common.api._CellGraphExecutor object at 0xfffea3397610>
obj = TrainOneStepCell<
  (network): WithLossCell<
    (_backbone): TrainNet<
      (block): CellList<
        (0): MatMulNe...    (_loss_fn): SoftmaxCrossEntropyWithLogits<>
    >
  (optimizer): CustomOptimizer<>
  (grad_reducer): Identity<>
  >
phase = 'train.1710836617257453824.281467729116912.0', do_convert = True
jit_config_dict = {'exc_mode': 'auto', 'jit_level': 'O1', 'jit_syntax_level': ''}
args = (Tensor(shape=[128, 96], dtype=Float32, value=
[[ 1.76405239e+00,  4.00157213e-01,  9.78738010e-01 ...  9.76639032e-01...000e+00],
 [ 0.00000000e+00,  0.00000000e+00,  0.00000000e+00 ...  0.00000000e+00,  0.00000000e+00,  0.00000000e+00]]))
kwargs = {}, key_id = '2814677291169121710836617257453824', key = 0

    def compile(self, obj, *args, phase='predict', do_convert=True, jit_config_dict=None, **kwargs):
        """
        Compiles graph.

        Args:
            obj (Function/Cell): The function or cell instance need compile.
            phase (str): The name of compile phase. Default: 'predict'.
            do_convert (bool): When set to True, convert ME graph to GE graph after compiling graph.
            jit_config_dict (dict): Jit config for compile. Default: ``None``.
            args (tuple): Args of the Cell object.
            kwargs (dict): Kwargs of the Cell object.

        Return:
            Str, the full phase of the cell.
            Bool, if the graph has been compiled before, return False, else return True.
        """
        obj.__parse_method__ = 'construct'
        if not hasattr(obj, obj.__parse_method__):
            raise AttributeError(
                'The class {} dose not have method {}'.format(obj.__class__.__name__, obj.__parse_method__))
        key_id = str(id(obj)) + str(obj.create_time)
        args = get_auto_dynamic_shape_args(args, key_id)

        self.enable_tuple_broaden = False
        if hasattr(obj, "enable_tuple_broaden"):
            self.enable_tuple_broaden = obj.enable_tuple_broaden
        logger.debug(f"Convert the network: {do_convert}.")
        self._graph_executor.set_enable_tuple_broaden(self.enable_tuple_broaden)
        key = self._graph_executor.generate_arguments_key(obj, args, kwargs, self.enable_tuple_broaden)
        obj.arguments_key = str(key)
        phase = phase + '.' + str(obj.create_time) + '.' + str(id(obj)) + '.' + obj.arguments_key
        update_auto_dynamic_shape_phase(args, key_id, phase)

        if phase in obj.compile_cache and self.has_compiled(phase):
            logger.debug("%r graph has existed.", phase)
            # Release resource should be released when CompileInner won't be executed, such as cur_convert_input_
            # generated in generate_arguments_key.
            self._graph_executor.clear_compile_arguments_resource()
            return phase, False

        obj.check_names()
        _check_full_batch()
        self._set_dataset_mode(obj)
        self._set_compile_cache_dep_files(phase)

        self._graph_executor.set_weights_values(obj.parameters_dict())
        if jit_config_dict:
            self._graph_executor.set_jit_config(jit_config_dict)
        else:
            jit_config_dict = JitConfig().jit_config_dict
            self._graph_executor.set_jit_config(jit_config_dict)
>       result = self._graph_executor.compile(obj, args, kwargs, phase, self._use_vm_mode())
E       RuntimeError: The pointer[node_stra.second] is null.
E
E       ----------------------------------------------------
E       - Framework Unexpected Exception Raised:
E       ----------------------------------------------------
E       This exception is caused by framework's unexpected error. Please create an issue at https://gitee.com/mindspore/mindspore/issues to get help.
E
E       ----------------------------------------------------
E       - C++ Call Stack: (For framework developers)
E       ----------------------------------------------------
E       mindspore/ccsrc/frontend/parallel/strategy_checkpoint/strategy_checkpoint_info.cc:135 to_protobuf
E
E       ----------------------------------------------------
E       - The Traceback of Net Construct Code:
E       ----------------------------------------------------
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:416
E           def construct(self, *inputs):
E           ^
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:418
E                   return self._no_sens_impl(*inputs)
E                          ^
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:431
E           def _no_sens_impl(self, *inputs):
E           ^
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:433
E               loss = self.network(*inputs)
E                      ^
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:120
E           def construct(self, data, label):
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:121
E               out = self._backbone(data)
E                     ^
E
E       # In file /root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/nn/wrap/cell_wrapper.py:121
E               out = self._backbone(data)
E                     ^
E
E       # In file /home/zhanglin/MindSporeTest/parallel/pipeline_split/test_pipeline.py:87
E               for i in range(self.micro_size):
E
E       # In file /home/zhanglin/MindSporeTest/parallel/pipeline_split/test_pipeline.py:88
E                   x = self.block[i](x)
E                       ^
E
E       # In file /home/zhanglin/MindSporeTest/parallel/pipeline_split/test_pipeline.py:64
E               x = self.matmul1(inputs, self.matmul1_weight)
E                   ^

/root/archiconda3/envs/zhanglin3.7/lib/python3.7/site-packages/mindspore/common/api.py:1584: RuntimeError

Environment / 环境信息 (Mandatory / 必填)

  • Hardware Environment(Ascend/GPU/CPU) / 硬件环境:
    device ascend

  • Software Environment / 软件环境 (Mandatory / 必填):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):

  • Execute Mode / 执行模式 (Mandatory / 必填)(PyNative/Graph):
    mode graph

Related testcase / 关联用例 (Mandatory / 必填)

test_parallel_shard_layout_and_pipeline

Steps to reproduce the issue / 重现步骤 (Mandatory / 必填)

Run the related test case; a hedged sketch of the parallel-context setup it presumably performs is shown below.
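The following sketch is illustrative only; the stage count and file path are assumptions, not taken from the test source. Saving the strategy file is the step that fails once any operator strategy was given as a Layout.

    import mindspore as ms

    # Illustrative setup: semi-auto parallel across 8 devices with pipeline
    # stages, saving the per-operator sharding strategies to a file.
    ms.set_auto_parallel_context(parallel_mode="semi_auto_parallel",
                                 device_num=8,
                                 pipeline_stages=2,
                                 strategy_ckpt_save_file="./strategy.ckpt")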

Describe the expected behavior / 预期结果 (Mandatory / 必填)

After setting a layout, the parallel sharding strategy file should be saved successfully, and pipeline inference should run normally.

Related log / screenshot / 日志 / 截图 (Mandatory / 必填)

Special notes for this issue/备注 (Optional / 选填)

Comments (4)

zhang_lin66 created the Bug-Report
zhang_lin66 added the kind/bug label
zhang_lin66 added the attr/function label
zhang_lin66 added the sig/parallel label
zhang_lin66 added the v2.3.0 label

Please assign a maintainer to check this issue.
@zhang_lin66

Thanks for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
    1. If you hit a PyNative (dynamic graph) problem, set set_context(pynative_synchronize=True) to see the error stack and help locate the issue (see the sketch after this list)
    2. For model accuracy tuning, refer to the tuning guide on the official website
    3. If you are reporting a framework bug, please confirm that the issue provides the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, an official link to the training code, and the launch method for code that reproduces the error
    4. If you have already located the root cause, you are welcome to submit a PR to the MindSpore open-source community, and we will review it as soon as possible
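As a concrete example of tip 1 above (a minimal sketch, not specific to this issue):

    import mindspore as ms

    # Run PyNative ops synchronously so the reported stack points at the
    # operator that actually failed rather than a later asynchronous flush.
    ms.set_context(mode=ms.PYNATIVE_MODE, pynative_synchronize=True)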
zhang_lin66 added the device/ascend label

The feature is not yet implemented; converting this issue to a feature request.

tanghuikang added the ccb/rfc label

CCB conclusion: convert to a feature request.
