MindSpore / mindspore

[ST][MS][NET][ssd/bert][gpu 8p]TypeError: For 'set_context', the parameter device_id can not be set repeatedly, origin value [0] has been in effect

DONE
Bug-Report
Created on 2023-02-14 10:00

Describe the current behavior (Mandatory)

ssd-resnet50-fpn network repository: https://gitee.com/mindspore/models/tree/master/official/cv/SSD
The ssd/bert networks fail to train on GPU with 8 devices (8p). The failure reproduces every time.

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device gpu

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    run package: HiAI/HISI_C84/20230117
    MindSpore version: r1.10.1 B010 r1.10_20230211191515_d55b6d4e

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode graph

Related testcase (Mandatory)

Test case repository path: solution_test/cases/02network/00cv/ssd_resnet50_fpn/train
Test case: test_ms_ssd_resnet50_fpn_coco2017_train_check_loss_gpu_8p_0003.py

Test case repository path: solution_test/cases/02network/00cv/ssd_mobilenetv1_fpn/train
Test case: test_ms_ssd_mobilenetv1_fpn_coco2017_train_check_loss_gpu_8p_0003.py

Test case repository path: solution_test/cases/02network/00cv/ssd_vgg16/train
Test case: test_ms_ssd_vgg16_coco2017_train_check_loss_gpu_8p_0003.py

Test case repository path: solution_test/remaining/test_scriptes/mindspore/net/ssd/network
Test case: test_ms_model_zoo_ssd_coco_check_loss_8p_gpu.py

Test case repository path: solution_test/cases/02network/02nlp/bert/train
Test cases:
test_ms_bert_large_cn_news_train_check_loss_gpu_8p_0001.py
test_ms_bert_base_cn_news_train_check_loss_gpu_8p_0001.py

Steps to reproduce the issue (Mandatory)

  1. get code from models
  2. cd models/official/cv/SSD
  3. sh scripts/run_distribute_train_gpu.sh 8 5 0.04 coco config_path
  4. Verify whether network training succeeds

Describe the expected behavior (Mandatory)

Network training succeeds.

Related log / screenshot (Mandatory)

[CRITICAL] CORE(24355,7f518b374740,python):2023-02-13-21:13:41.511.987 [mindspore/core/utils/ms_context.cc:196] CheckReadStatus] For 'set_context', the parameter device_id can not be set repeatedly, origin value [0] has been in effect.
[CRITICAL] ME(24355:139988204734272,MainProcess):2023-02-13-21:13:41.512.494 [mindspore/dataset/engine/datasets.py:2903] Uncaught exception:
Traceback (most recent call last):
  File "train.py", line 204, in <module>
    train_net()
  File "/home/jenkins/workspace/TDT_deployment/solution_test/cases/02network/00cv/ssd_vgg16/train/test_ms_ssd_vgg16_coco2017_train_check_loss_gpu_8p_0003/LOG/src/model_utils/moxing_adapter.py", line 104, in wrapped_func
    run_func(*args, **kwargs)
  File "train.py", line 201, in train_net
    model.train(5, dataset, callbacks=callback, dataset_sink_mode=dataset_sink_mode, sink_size=100)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 1051, in train
    initial_epoch=initial_epoch)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 98, in wrapper
    func(self, *args, **kwargs)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 625, in _train
    cb_params, sink_size, initial_epoch, valid_infos)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 684, in _train_dataset_sink_process
    dataset_helper=dataset_helper)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/model.py", line 442, in _exec_preprocess
    dataset_helper = DatasetHelper(dataset, dataset_sink_mode, sink_size, epoch_num)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 350, in __init__
    self.iter = iterclass(dataset, sink_size, epoch_num)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 568, in __init__
    super().__init__(dataset, sink_size, epoch_num)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 466, in __init__
    create_data_info_queue=create_data_info_queue)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/train/_utils.py", line 74, in _exec_datagraph
    phase=phase)
  File "/home/miniconda3/envs/ci/lib/python3.7/site-packages/mindspore/common/api.py", line 1047, in init_dataset
    need_run=need_run):
TypeError: For 'set_context', the parameter device_id can not be set repeatedly, origin value [0] has been in effect.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/core/utils/ms_context.cc:196 CheckReadStatus

[WARNING] DEBUG(24355,7f518b374740,python):2023-02-13-21:13:44.515.326 [mindspore/ccsrc/common/debug/rdr/recorder_manager.cc:113] TriggerAll] There is no recorder to export.
[ERROR] ME(24355,7f518b374740,python):2023-02-13-21:13:44.515.352 [mindspore/ccsrc/runtime/hardware/device_context_manager.cc:231] WaitTaskFinishOnDevice] SyncStream failed

Special notes for this issue (Optional)

Assign to zhoufeng (周峰)

Comments (5)

zhongjicheng created the Bug-Report
zhongjicheng added the label sig/ascend
zhongjicheng added the label attr/function
zhongjicheng added the label stage/func-debug
zhongjicheng added the label kind/bug
zhongjicheng added the label v1.10

Please assign maintainer to check this issue.
@zhongjicheng

Please add labels (comp or sig), also you can visit https://gitee.com/mindspore/community/blob/master/sigs/dx/docs/labels.md to find more.
To get your code reviewed sooner, please add a component (comp) or special interest group (sig) label to the Pull Request. A labeled PR is pushed directly to the responsible reviewer.
Taking a component code submission as an example, if you are submitting code for the data component, you can comment:
//comp/data
You can also invite the data SIG to review the code by writing:
//sig/data
In addition, you can mark the type of the PR, for example a bugfix or a feature request:
//kind/bug or //kind/feature
Congratulations, you have learned how to add labels with commands. Go ahead and add labels in the comments below!

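xiangjiawei007 edited the description
xiangjiawei007 changed the assignee from xiangjiawei007 to zhoufeng
  1. After zhoufeng's commit was submitted, the following error was still reported:
    [image]
  2. The PR that resolves it further:
    !48914:fix: map multi prcess error
    Ran 3 times, all succeeded.

Root cause analysis: in multi-device (multi-p) runs, the framework obtains the rank from MPI and sets the device id again, which triggers the error about setting the device id repeatedly.
Fix: only check the device id set by the user; the device id set internally by the framework is not checked.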
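
The root cause and fix described above can be illustrated with a small, self-contained Python sketch. It is not MindSpore's actual implementation: the class `DeviceContext`, the `_in_effect` flag, and the `from_user` parameter are hypothetical names chosen for illustration, and the sketch assumes the check fires only once the value has been read and a different value is then set.

```python
class DeviceContext:
    """Toy stand-in for a framework context (hypothetical, not MindSpore code)."""

    def __init__(self, device_id=0):
        self._device_id = device_id
        self._in_effect = False  # True once some component has read the value

    def get_device_id(self):
        # Reading the value marks it as "in effect".
        self._in_effect = True
        return self._device_id

    def set_device_id(self, value, from_user=True):
        # Only user-facing calls are checked; framework-internal updates
        # (for example, after a rank is obtained from MPI) skip the check.
        if from_user and self._in_effect and value != self._device_id:
            raise TypeError(
                "device_id can not be set repeatedly, "
                f"origin value [{self._device_id}] has been in effect"
            )
        self._device_id = value


ctx = DeviceContext()
ctx.get_device_id()                    # the default value 0 is now in effect
ctx.set_device_id(3, from_user=False)  # internal reassignment is allowed
try:
    ctx.set_device_id(5)               # user reassignment is rejected
except TypeError as err:
    print(err)
```

Before the fix, the internal reassignment in this sketch would have gone through the same check as a user call and raised the error seen in the log.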

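zhoufeng added the label rct/newfeature
zhoufeng added the label rca/others
zhoufeng added the label ctl/solutiontest
zhoufeng changed the task status from TODO to VALIDATION
zhoufeng changed the milestone from B-SIG-ASCEND to B-SolutionTest
zhoufeng added collaborator zhoufeng
zhoufeng changed the assignee from zhoufeng to zhongjicheng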

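Regression version: r1.10_20230215214139_9b65b907
Regression steps: see the reproduction steps in this issue
Basic functionality: the problem is resolved
[image]
[image]
Test conclusion: regression passed
Regression tester: zhongjicheng
Regression date: 2023-02-21

zhongjicheng changed the task status from VALIDATION to DONE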
