[ST][MS][Tutorial][mac] SSD tutorial training fails on macOS

TODO
Bug-Report
Created: 2024-03-30 19:04

Describe the current behavior (Mandatory)

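Training in the SSD tutorial fails on macOS. On the test machine, `ulimit -a` reports a file descriptor limit of 256, which is the macOS default. Training still fails after the machine is rebooted.
Tutorial URL: https://www.mindspore.cn/tutorials/application/zh-CN/master/cv/ssd.html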
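
For reference, the per-process descriptor limit reported by `ulimit` can also be inspected and raised from inside Python with the standard `resource` module. The following is a minimal sketch of a possible workaround, not a confirmed fix for this issue; the helper name `raise_nofile_limit` and the target value 4096 are assumptions chosen for illustration.

```python
import resource


def raise_nofile_limit(target=4096):
    """Try to raise the soft RLIMIT_NOFILE to `target` (hypothetical helper).

    Returns the soft limit in effect after the attempt.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    unlimited = resource.RLIM_INFINITY

    # Nothing to do if the soft limit is already large enough.
    if soft == unlimited or soft >= target:
        return soft

    # The soft limit may never exceed the hard limit.
    new_soft = target if hard == unlimited else min(target, hard)
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    except (ValueError, OSError):
        # The OS rejected the value; keep the existing limit.
        return soft
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


if __name__ == "__main__":
    print("soft file descriptor limit:", raise_nofile_limit())
```

Calling such a helper before the dataset pipeline starts would only hide the symptom; it does not explain why the descriptor usage differs between the failing and the last passing version.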

Environment (Mandatory)

  • Hardware Environment (Ascend/GPU/CPU):

Please delete the backend not involved:
/device CPU

  • Software Environment (Mandatory):
    -- MindSpore version (e.g., 1.7.0.Bxxx) :
    -- Python version (e.g., Python 3.7.5) :
    -- OS platform and distribution (e.g., Linux Ubuntu 16.04):
    -- GCC/Compiler version (if compiled from source):
    Failing version: r2.3.q1_20240329061516_c99698ba26
    Last passing version: r2.3_20240315121520_a24a055ea90a9

  • Execute Mode (Mandatory) (PyNative/Graph):

Please delete the mode not involved:
/mode pynative
/mode graph

Related testcase (Mandatory)

solution_test/cases/03subject_test/06document/02network_cases/test_ms_tutorial_cv_ssd_0001.py

Steps to reproduce the issue (Mandatory)

  1. get code from docs
  2. cd docs/tutorials/application/source_zh_cn/cv;python ssd.py
  3. Verify whether the tutorial network trains successfully

Describe the expected behavior (Mandatory)

The network trains successfully.

Related log / screenshot (Mandatory)

Traceback (most recent call last):
  File "/Users/jenkins/solution_test/cases/03subject_test/06document/02network_cases/test_ms_tutorial_cv_ssd_0001_PYNATIVE_MODE/ssd.py", line 1004, in <module>
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 152, in __next__
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/iterators.py", line 301, in _get_next
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3458, in launch
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3488, in create_pool
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3569, in _launch_watch_dog
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/process.py", line 121, in start
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/context.py", line 224, in _Popen
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/popen_fork.py", line 64, in _launch
OSError: [Errno 24] Too many open files
Traceback (most recent call last):
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/util.py", line 300, in _run_finalizers
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/util.py", line 224, in __call__
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/multiprocessing/util.py", line 133, in _remove_temp_dir
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/shutil.py", line 711, in rmtree
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/shutil.py", line 709, in rmtree
OSError: [Errno 24] Too many open files: '/var/folders/yj/z1vw4dv17bx3y4yvlh0dtml40000gp/T/pymp-ag8kn2jf'
Exception ignored in: <function MapDataset.__del__ at 0x122613af0>
Traceback (most recent call last):
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3737, in __del__
    self.process_pool.terminate()
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3494, in terminate
    self.abort_watchdog()
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3589, in abort_watchdog
    _PythonMultiprocessing._terminate_processes([self.cleaning_process])
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3363, in _terminate_processes
    p._popen.wait()  # pylint: disable=W0212
AttributeError: 'NoneType' object has no attribute 'wait'
Exception ignored in: <function _PythonMultiprocessing.__del__ at 0x12260fdc0>
Traceback (most recent call last):
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3293, in __del__
    self.terminate()
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3494, in terminate
    self.abort_watchdog()
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3589, in abort_watchdog
    _PythonMultiprocessing._terminate_processes([self.cleaning_process])
  File "/Users/jenkins/miniconda3/envs/ci/lib/python3.9/site-packages/mindspore/dataset/engine/datasets.py", line 3363, in _terminate_processes
    p._popen.wait()  # pylint: disable=W0212
AttributeError: 'NoneType' object has no attribute 'wait'
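
The trailing `AttributeError` in the log appears to be a follow-on effect of the first failure: in CPython, a `multiprocessing.Process` whose `start()` raised never gets its internal `_popen` handle, so cleanup code that touches it unconditionally fails. The sketch below uses only the standard library to show one defensive cleanup pattern; `stop_processes` is a hypothetical helper, not MindSpore's implementation or its fix.

```python
import multiprocessing


def stop_processes(procs, timeout=5.0):
    """Terminate and join only the processes that were actually started.

    Returns the number of processes that were joined.
    """
    stopped = 0
    for p in procs:
        # `pid` is None for a Process that never started, for example
        # because start() raised OSError: [Errno 24] Too many open files.
        if p is None or p.pid is None:
            continue
        if p.is_alive():
            p.terminate()
        p.join(timeout)
        stopped += 1
    return stopped


if __name__ == "__main__":
    never_started = multiprocessing.Process(target=print)
    print(stop_processes([never_started]))
```

Checking the public `pid` property avoids reaching into the private `_popen` attribute at all.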

Special notes for this issue (Optional)

Assign to Guo Zhijian (郭志建).

Comments (6)

baimz created the Bug-Report
baimz added the kind/bug label
baimz added the attr/function label
baimz added the stage/coding label
baimz added the v2.3.0.alpha label
baimz added the v2.3.0 label
baimz added the sig/minddata label

Please assign a maintainer to check this issue.
@baimz

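Thank you for your question. You can comment //mindspore-assistant to get help faster:

  1. If you are new to MindSpore, you may find the answer in the tutorials
  2. If you are an experienced PyTorch user, you may need:
  1. If you run into a dynamic graph (PyNative) problem, you can set set_context(pynative_synchronize=True) to view the error stack and help locate it
  2. For model accuracy tuning problems, refer to the tuning guide on the official website
  3. If you are reporting a framework bug, please confirm that the issue provides the information needed to locate it: the MindSpore version, the backend type (CPU, GPU, Ascend), the environment, the official link to the training code, and how to launch the code that reproduces the error
  4. If you have already identified the root cause, you are welcome to submit a PR to the MindSpore open-source community; we will review it as soon as possible
xiangminshan changed the assignee from xiangminshan to guozhijian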

We have confirmed that the dataset module does not increase the number of file handles, and the dataset module currently has unit tests that guard against handle growth.
Unit tests: https://gitee.com/mindspore/mindspore/blob/r2.3/tests/st/dataset/test_dataset_with_multiprocessing.py
https://gitee.com/mindspore/mindspore/blob/r2.3/tests/ut/python/dataset/test_datasets_generator.py : test_generator_multiprocessing_with_fixed_handle
https://gitee.com/mindspore/mindspore/blob/r2.3/tests/ut/python/dataset/test_map.py : test_map_multiprocessing_with_fixed_handle
https://gitee.com/mindspore/mindspore/blob/r2.3/tests/ut/python/dataset/test_var_batch_map.py : test_batch_multiprocessing_with_in_out_rowsize

i-robot added the gitee label

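However, it is not yet clear which module causes the number of handles used by the whole training process to increase. We suggest that the test team bisect to pinpoint the exact module.

Based on the above, we are handing the issue back to test for now so that the exact module can be identified and fixed.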
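
To support that kind of bisection, the number of descriptors held by the training process can be sampled at checkpoints in the script (for example before and after dataset creation, network construction, and the first training step). This is a rough sketch under the assumption that `/proc/self/fd` (Linux) or `/dev/fd` (macOS) is available; `count_open_fds` is a hypothetical helper and is not part of MindSpore.

```python
import os


def count_open_fds():
    """Return the approximate number of descriptors open in this process."""
    for path in ("/proc/self/fd", "/dev/fd"):
        try:
            names = os.listdir(path)
        except OSError:
            continue
        # Listing the directory typically holds one descriptor itself,
        # so it is subtracted from the count.
        return len(names) - 1
    raise RuntimeError("no per-process descriptor directory found")


if __name__ == "__main__":
    before = count_open_fds()
    fd = os.open(os.devnull, os.O_RDONLY)  # stand-in for a module under test
    print("descriptors added:", count_open_fds() - before)
    os.close(fd)
```

Comparing the deltas between checkpoints on the failing and the last passing build would show which stage accounts for the extra descriptors.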

guozhijian added collaborator guozhijian
guozhijian changed the assignee from guozhijian to xiangminshan
guozhijian removed collaborator xiangminshan
guozhijian changed the milestone from B-SIG-Data to B-SolutionTest

An issue cannot be handed back to test before it is resolved.

xiangminshan changed the assignee from xiangminshan to guozhijian
xiangminshan removed collaborator guozhijian
xiangminshan changed the milestone from B-SolutionTest to B-SIG-Data
baimz added the v2.3.0.rc2 label
fangwenyi removed the v2.3.0.rc2 label
fangwenyi added the master label
fangwenyi added the linked branch option: master
fangwenyi added the issue backend type option: CPU
