
Large variation in running time when using SecureBoost

Status: To do
Created on 2024-05-10 14:15

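I am running the SecureBoost test example in cluster simulation mode. When I run the same test script several times on the same dataset, the running time is sometimes 130 s, sometimes about 400 s, and sometimes about 2500 s. What could cause such a large difference in running time?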
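To put numbers on the spread before looking for a cause, one option is to time the same step several times and compare the results. The following is a minimal sketch only: `time_runs` and `summarize` are hypothetical helper names, and `workload` stands in for whatever call is being measured (for example the `sgb.train(...)` call in the script below); none of this is part of SecretFlow.

```python
import statistics
import time


def time_runs(workload, repeats=5):
    """Call `workload` `repeats` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        workload()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return min, max, mean and the max/min ratio of the measured durations."""
    return {
        "min": min(durations),
        "max": max(durations),
        "mean": statistics.mean(durations),
        "max_over_min": max(durations) / min(durations),
    }


if __name__ == "__main__":
    # Placeholder workload; replace it with the step under test.
    print(summarize(time_runs(lambda: sum(range(100_000)), repeats=3)))
```

A large `max_over_min` on an otherwise idle machine would suggest the variation comes from the environment (scheduling, resource contention, networking) rather than from the data.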

(1) Create two containers on the server to simulate the machines of the two parties.
docker run -it --name sf_tc_4 --mount type=bind,source="$(pwd)",target=/home/admin/dev/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=NET_ADMIN --privileged=true -p 192.168.131.53:1696:46 -p 192.168.131.53:1697:47 -p 192.168.131.53:1698:48 --cpus=16 secretflow:code202404 /bin/bash

docker run -it --name sf_tc_5 --mount type=bind,source="$(pwd)",target=/home/admin/dev/ --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --cap-add=NET_ADMIN --privileged=true -p 192.168.131.53:1700:50 -p 192.168.131.53:1701:51 -p 192.168.131.53:1702:52 --cpus=16 secretflow:code202404 /bin/bash

The IP addresses of the two containers are 172.17.0.6 and 172.17.0.8.

(2) Start the Ray head node in the first container to simulate party alice.
ray start --head --node-ip-address="172.17.0.6" --port="1696" --resources='{"alice": 8}' --include-dashboard=False --disable-usage-stats

(3) Start the Ray worker node in the second container to simulate party bob.
ray start --address="172.17.0.6:1696" --resources='{"bob": 8}'

(4) Then run the SecureBoost test code in the first container:
import logging
import socket
import sys
import time

import spu
from sklearn.metrics import mean_squared_error, roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import (
    Sgb,
    get_classic_XGB_params,
    get_classic_lightGBM_params,
)

from secretflow.ml.boost.sgb_v.model import load_model

from secretflow.utils.simulation.datasets import create_df
from secretflow.data.vertical import read_csv as v_read_csv

import numpy as np
import pprint

pp = pprint.PrettyPrinter(depth=4)

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()

cluster_config = {
    'parties': {
        'alice': {
            # replace with alice's real address.
            'address': '172.17.0.6:1697',
            'listen_addr': '0.0.0.0:1697'
        },
        'bob': {
            # replace with bob's real address.
            'address': '172.17.0.8:1701',
            'listen_addr': '0.0.0.0:1701'
        },
    },
    'self_party': 'alice'
}

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'address': '172.17.0.6:1698', 'listen_addr': '0.0.0.0:1698'},
        {'party': 'bob', 'address': '172.17.0.8:1702', 'listen_addr': '0.0.0.0:1702'},
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
        'sigmoid_mode': spu.spu_pb2.RuntimeConfig.SIGMOID_REAL
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',  # modify the homomorphic encryption settings here
    'he_parameters': {
        'schema': 'paillier',
        'key_pair': {
            'generate': {
                'bit_size': 2048,
            },
        }
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

sf.init(parties=['alice', 'bob'], address='172.17.0.6:1696')
alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])  # HEUSkKeeper, HEUEvaluator

vdf = v_read_csv(
    {alice: "./datasets/independent_linear1.csv", bob: "./datasets/independent_linear2.csv"},
    keys='id',
    drop_keys='id',
)

label_data = vdf["label"]

v_data = vdf.drop(columns="label")

wait([p.data for p in v_data.partitions.values()])

# Create the Sgb model
sgb = Sgb(heu)

params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
pp.pprint(params)

model = sgb.train(params, v_data, label_data)

sf.shutdown()

Comments (22)

admin_g created the task

@admin_g Please run in production mode. Simulation mode is only meant for debugging code and cannot be used to measure performance.

Hello, when running in production mode, alice and bob cannot ping each other.
2024-05-10 07:02:10,345 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.17.0.9:1703...
2024-05-10 07:02:10,363 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-10 07:02:11.348 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '0.0.0.0:1704', 'bob': '172.17.0.14:1707'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
2024-05-10 07:02:12.342 INFO barriers.py:284 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
(ReceiverProxyActor pid=57924) 2024-05-10 07:02:12.336 INFO grpc_proxy.py:359 [alice] -- [Anonymous_job] ReceiverProxy binding port 1704, options: (('grpc.enable_retries', 1), ('grpc.so_reuseport', 0), ('grpc.max_send_message_length', 524288000), ('grpc.max_receive_message_length', 524288000), ('grpc.service_config', '{"methodConfig": [{"name": [{"service": "GrpcService"}], "retryPolicy": {"maxAttempts": 5, "initialBackoff": "5s", "maxBackoff": "30s", "backoffMultiplier": 2, "retryableStatusCodes": ["UNAVAILABLE"]}}]}'))...
(ReceiverProxyActor pid=57924) 2024-05-10 07:02:12.340 INFO grpc_proxy.py:379 [alice] -- [Anonymous_job] Successfully start Grpc service without credentials.
2024-05-10 07:02:13.231 INFO barriers.py:333 [alice] -- [Anonymous_job] SenderProxyActor has successfully created.
2024-05-10 07:02:13.232 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-05-10 07:02:16.240 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 1 attemp, up to 3600 attemps.
2024-05-10 07:02:19.243 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 2 attemp, up to 3600 attemps.
2024-05-10 07:02:22.247 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 3 attemp, up to 3600 attemps.
2024-05-10 07:02:25.249 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 4 attemp, up to 3600 attemps.
2024-05-10 07:02:28.253 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 5 attemp, up to 3600 attemps.
2024-05-10 07:02:31.258 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 6 attemp, up to 3600 attemps.
2024-05-10 07:02:34.260 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 7 attemp, up to 3600 attemps.
2024-05-10 07:02:37.262 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 8 attemp, up to 3600 attemps.
2024-05-10 07:02:40.265 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 9 attemp, up to 3600 attemps.
2024-05-10 07:02:43.270 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 10 attemp, up to 3600 attemps

The test code is as follows.
Run the following code in the first container:
sf.shutdown()

cluster_config = {
    'parties': {
        'alice': {
            'address': '172.17.0.9:1704',
            'listen_addr': '0.0.0.0:1704'
        },
        'bob': {
            'address': '172.17.0.14:1707',
            'listen_addr': '0.0.0.0:1707'
        },
    },
    'self_party': 'alice'
}

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'address': '172.17.0.9:1705', 'listen_addr': '0.0.0.0:1705'},
        {'party': 'bob', 'address': '172.17.0.14:1708', 'listen_addr': '0.0.0.0:1708'},
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
        'sigmoid_mode': spu.spu_pb2.RuntimeConfig.SIGMOID_REAL
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',  # modify the homomorphic encryption settings here
    'he_parameters': {
        'schema': 'paillier',
        'key_pair': {
            'generate': {
                'bit_size': 2048,
            },
        }
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

sf.init(
    address='172.17.0.9:1703',
    cluster_config=cluster_config,
)
alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

vdf = v_read_csv(
    {alice: "./datasets/independent_linear1.csv", bob: "./datasets/independent_linear2.csv"},
    keys='id',
    drop_keys='id',
)

label_data = vdf["label"]

v_data = vdf.drop(columns="label")

wait([p.data for p in v_data.partitions.values()])

sgb = Sgb(heu)
params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
model = sgb.train(params, v_data, label_data)

In the second container, the code is changed to:
cluster_config = {
    'parties': {
        'alice': {
            'address': '172.17.0.9:1704',
            'listen_addr': '0.0.0.0:1704'
        },
        'bob': {
            'address': '172.17.0.14:1707',
            'listen_addr': '0.0.0.0:1707'
        },
    },
    'self_party': 'bob'
}

sf.init(
    address='172.17.0.14:1706',
    cluster_config=cluster_config,
)
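Because the two scripts are meant to differ only in `self_party` and the local Ray address, the shared part of `cluster_config` could be generated from a single table so the two copies cannot drift apart. This is a minimal sketch; `PARTIES` and `make_cluster_config` are hypothetical names, and the addresses are simply copied from the configuration above.

```python
import copy

# Addresses copied from the configuration above; adjust to your containers.
PARTIES = {
    "alice": {"address": "172.17.0.9:1704", "listen_addr": "0.0.0.0:1704"},
    "bob": {"address": "172.17.0.14:1707", "listen_addr": "0.0.0.0:1707"},
}


def make_cluster_config(self_party, parties=PARTIES):
    """Build the cluster_config dict for one party; only self_party differs."""
    if self_party not in parties:
        raise ValueError(f"unknown party: {self_party!r}")
    return {"parties": copy.deepcopy(parties), "self_party": self_party}


if __name__ == "__main__":
    print(make_cluster_config("alice"))
    print(make_cluster_config("bob"))
```

Each party would then pass its own result as `cluster_config=...` to `sf.init`, as in the scripts above.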

Please post the commands both parties used to start Ray (ray start).

ray start --head --node-ip-address="172.17.0.9" --port="1703" --resources='{"alice": 16}' --include-dashboard=False --disable-usage-stats

ray start --head --node-ip-address="172.17.0.14" --port="1706" --resources='{"bob": 16}' --include-dashboard=False --disable-usage-stats

I checked the following and found that the ports are not occupied:
Ray startup:
alice: 172.17.0.9:1703
bob: 172.17.0.14:1706
rayfed:
alice: 172.17.0.9:1704
bob: 172.17.0.14:1707
SPU:
alice: 172.17.0.9:1705
bob: 172.17.0.14:1708
sf.init:
alice: 172.17.0.9:1703
bob: 172.17.0.14:1706
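A port plan like the one above can also be checked mechanically for endpoints that are claimed twice. The sketch below assumes each service needs its own host:port endpoint; `PORT_PLAN` and `find_conflicts` are hypothetical names used only for illustration.

```python
from collections import defaultdict

# Endpoints as listed above. sf.init connects to the local Ray head, so it
# reuses the Ray endpoint on purpose and is not listed separately.
PORT_PLAN = {
    ("ray", "alice"): "172.17.0.9:1703",
    ("ray", "bob"): "172.17.0.14:1706",
    ("rayfed", "alice"): "172.17.0.9:1704",
    ("rayfed", "bob"): "172.17.0.14:1707",
    ("spu", "alice"): "172.17.0.9:1705",
    ("spu", "bob"): "172.17.0.14:1708",
}


def find_conflicts(plan):
    """Return {endpoint: [users]} for every endpoint claimed more than once."""
    users = defaultdict(list)
    for user, endpoint in plan.items():
        users[endpoint].append(user)
    return {ep: us for ep, us in users.items() if len(us) > 1}


if __name__ == "__main__":
    print(find_conflicts(PORT_PLAN) or "no conflicts")
```

This only checks the plan for internal consistency; it does not tell whether another process on the host already holds one of the ports.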
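Judging from the error, Ray failed to connect to bob's machine. Please check the following two points:
1. Production mode requires both parties to run the code at the same time, and the code must be identical except for the part that identifies the party.
2. The port may be occupied; try a different port.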
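One way to test the second point from inside a container is to try binding the port before the script is started. This is a sketch using only the Python standard library; `port_is_free` is a hypothetical helper name, and the ports in the example are the rayfed and SPU ports from alice's side of the configuration above.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP socket can bind to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Run this before starting the SecretFlow script; while the script is
    # running, its own listeners will make these ports show as in use.
    for port in (1704, 1705):
        print(port, "free" if port_is_free(port) else "in use")
```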

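  1. Both parties are already running the code at the same time. bob's machine also reports that it cannot ping alice.

  2. The ports are not occupied either.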

Both alice and bob eventually report the following error:
2024-05-10 08:13:38.275 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 169 attemp, up to 3600 attemps.
(SenderProxyActor pid=58810) 2024-05-10 08:13:41.213 ERROR barriers.py:171 [alice] -- [Anonymous_job] Failed to send data to seq_id ping of bob from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=58810) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=58810) details = "Deadline Exceeded"
(SenderProxyActor pid=58810) debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-05-10T08:13:41.212494506+00:00", grpc_status:4}"

2024-05-10 08:13:15.640 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 158 attemp, up to 3600 attemps.
(SenderProxyActor pid=27673) 2024-05-10 08:13:15.569 ERROR barriers.py:171 [bob] -- [Anonymous_job] Failed to send data to seq_id ping of alice from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=27673) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=27673) details = "Deadline Exceeded"
(SenderProxyActor pid=27673) debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-05-10T08:13:15.568493351+00:00"}"
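When comparing long logs from both parties, it can help to reduce them to the last ping attempt per direction and the number of timeouts. The sketch below assumes the log line format shown above (including the spelling `attemp` that the log itself uses); `summarize_pings` is a hypothetical helper, not a rayfed or SecretFlow function.

```python
import re

# Matches lines such as:
#   ... [alice] -- [Anonymous_job] Try ping ['bob'] at 169 attemp, up to 3600 attemps.
PING_RE = re.compile(r"\[(\w+)\] -- \[[^\]]*\] Try ping \['(\w+)'\] at (\d+) attemp")


def summarize_pings(log_text):
    """Return (last attempt per (source, target), number of DEADLINE_EXCEEDED lines)."""
    last_attempt = {}
    for source, target, attempt in PING_RE.findall(log_text):
        key = (source, target)
        last_attempt[key] = max(last_attempt.get(key, -1), int(attempt))
    timeouts = log_text.count("StatusCode.DEADLINE_EXCEEDED")
    return last_attempt, timeouts


if __name__ == "__main__":
    sample = (
        "2024-05-10 08:13:38.275 INFO barriers.py:520 [alice] -- [Anonymous_job] "
        "Try ping ['bob'] at 169 attemp, up to 3600 attemps.\n"
        "(SenderProxyActor pid=58810) status = StatusCode.DEADLINE_EXCEEDED\n"
    )
    print(summarize_pings(sample))
```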

The current Python version is 3.10.13. Could the Python version have an effect?

In addition, bob's address is reachable from alice's machine, and likewise alice's address is reachable from bob's machine.
(base) root@87575df2193f:/home/admin/dev/secretflow_2024/secretflow# telnet 172.17.0.14 1706
Trying 172.17.0.14...
Connected to 172.17.0.14.
Escape character is '^]'.
@@@?

  1. Please post the logs of alice and bob.
  2. The port used with telnet is wrong; it should be the port in the cluster config.

The IP of one of the containers changed, so I restarted the Ray nodes:
ray start --head --node-ip-address="172.17.0.9" --port="1703" --resources='{"alice": 16}' --include-dashboard=False --disable-usage-stats

ray start --head --node-ip-address="172.17.0.11" --port="1706" --resources='{"bob": 16}' --include-dashboard=False --disable-usage-stats

(1) alice's log is as follows:
2024-05-11 01:10:12,240 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.17.0.9:1703...
2024-05-11 01:10:12,252 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-11 01:10:12.807 INFO api.py:233 [alice] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '0.0.0.0:1704', 'bob': '172.17.0.11:1707'}, 'CURRENT_PARTY_NAME': 'alice', 'TLS_CONFIG': {}}
2024-05-11 01:10:13.753 INFO barriers.py:284 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
(ReceiverProxyActor pid=1402) 2024-05-11 01:10:13.749 INFO grpc_proxy.py:359 [alice] -- [Anonymous_job] ReceiverProxy binding port 1704, options: (('grpc.enable_retries', 1), ('grpc.so_reuseport', 0), ('grpc.max_send_message_length', 524288000), ('grpc.max_receive_message_length', 524288000), ('grpc.service_config', '{"methodConfig": [{"name": [{"service": "GrpcService"}], "retryPolicy": {"maxAttempts": 5, "initialBackoff": "5s", "maxBackoff": "30s", "backoffMultiplier": 2, "retryableStatusCodes": ["UNAVAILABLE"]}}]}'))...
(ReceiverProxyActor pid=1402) 2024-05-11 01:10:13.752 INFO grpc_proxy.py:379 [alice] -- [Anonymous_job] Successfully start Grpc service without credentials.
2024-05-11 01:10:14.633 INFO barriers.py:333 [alice] -- [Anonymous_job] SenderProxyActor has successfully created.
2024-05-11 01:10:14.633 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-05-11 01:10:17.637 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 1 attemp, up to 3600 attemps.
2024-05-11 01:10:20.640 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 2 attemp, up to 3600 attemps.
2024-05-11 01:10:23.641 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 3 attemp, up to 3600 attemps.
2024-05-11 01:10:26.645 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 4 attemp, up to 3600 attemps.
2024-05-11 01:10:29.648 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 5 attemp, up to 3600 attemps.
2024-05-11 01:10:32.650 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 6 attemp, up to 3600 attemps.
2024-05-11 01:10:35.654 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 7 attemp, up to 3600 attemps.
2024-05-11 01:10:38.657 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 8 attemp, up to 3600 attemps.
2024-05-11 01:10:41.661 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 9 attemp, up to 3600 attemps.
2024-05-11 01:10:44.664 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 10 attemp, up to 3600 attemps.
2024-05-11 01:10:47.665 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 11 attemp, up to 3600 attemps.
2024-05-11 01:10:50.669 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 12 attemp, up to 3600 attemps.
2024-05-11 01:10:53.673 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 13 attemp, up to 3600 attemps.
2024-05-11 01:10:56.676 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 14 attemp, up to 3600 attemps.
2024-05-11 01:10:59.677 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 15 attemp, up to 3600 attemps.
2024-05-11 01:11:02.680 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 16 attemp, up to 3600 attemps.
2024-05-11 01:11:05.684 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 17 attemp, up to 3600 attemps.
2024-05-11 01:11:08.685 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 18 attemp, up to 3600 attemps.
2024-05-11 01:11:11.689 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 19 attemp, up to 3600 attemps.
2024-05-11 01:11:14.692 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 20 attemp, up to 3600 attemps.
(SenderProxyActor pid=1500) 2024-05-11 01:11:14.642 ERROR barriers.py:171 [alice] -- [Anonymous_job] Failed to send data to seq_id ping of bob from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=1500) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=1500) details = "Deadline Exceeded"
(SenderProxyActor pid=1500) debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-05-11T01:11:14.641539153+00:00", grpc_status:4}"
(SenderProxyActor pid=1500) >
2024-05-11 01:11:17.694 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 21 attemp, up to 3600 attemps.
(SenderProxyActor pid=1500) 2024-05-11 01:11:17.642 ERROR barriers.py:171 [alice] -- [Anonymous_job] Failed to send data to seq_id ping of bob from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=1500) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=1500) details = "Deadline Exceeded"
(SenderProxyActor pid=1500) debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-05-11T01:11:17.641523465+00:00", grpc_status:4}"
(SenderProxyActor pid=1500) >
(SenderProxyActor pid=1500) 2024-05-11 01:11:20.644 ERROR barriers.py:171 [alice] -- [Anonymous_job] Failed to send data to seq_id ping of bob from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=1500) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=1500) details = "Deadline Exceeded"
(SenderProxyActor pid=1500) debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-05-11T01:11:20.643464909+00:00", grpc_status:4}"

(2) bob's log is as follows:
2024-05-11 01:10:12,046 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 172.17.0.11:1706...
2024-05-11 01:10:12,056 INFO worker.py:1724 -- Connected to Ray cluster.
2024-05-11 01:10:12.627 INFO api.py:233 [bob] -- [Anonymous_job] Started rayfed with {'CLUSTER_ADDRESSES': {'alice': '172.17.0.9:1704', 'bob': '0.0.0.0:1707'}, 'CURRENT_PARTY_NAME': 'bob', 'TLS_CONFIG': {}}
(ReceiverProxyActor pid=1428) 2024-05-11 01:10:13.545 INFO grpc_proxy.py:359 [bob] -- [Anonymous_job] ReceiverProxy binding port 1707, options: (('grpc.enable_retries', 1), ('grpc.so_reuseport', 0), ('grpc.max_send_message_length', 524288000), ('grpc.max_receive_message_length', 524288000), ('grpc.service_config', '{"methodConfig": [{"name": [{"service": "GrpcService"}], "retryPolicy": {"maxAttempts": 5, "initialBackoff": "5s", "maxBackoff": "30s", "backoffMultiplier": 2, "retryableStatusCodes": ["UNAVAILABLE"]}}]}'))...
2024-05-11 01:10:13.552 INFO barriers.py:284 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
(ReceiverProxyActor pid=1428) 2024-05-11 01:10:13.550 INFO grpc_proxy.py:379 [bob] -- [Anonymous_job] Successfully start Grpc service without credentials.
2024-05-11 01:10:14.441 INFO barriers.py:333 [bob] -- [Anonymous_job] SenderProxyActor has successfully created.
2024-05-11 01:10:14.441 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
2024-05-11 01:10:17.445 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 1 attemp, up to 3600 attemps.
2024-05-11 01:10:20.449 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 2 attemp, up to 3600 attemps.
2024-05-11 01:10:23.452 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 3 attemp, up to 3600 attemps.
2024-05-11 01:10:26.453 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 4 attemp, up to 3600 attemps.
2024-05-11 01:10:29.457 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 5 attemp, up to 3600 attemps.
2024-05-11 01:10:32.460 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 6 attemp, up to 3600 attemps.
2024-05-11 01:10:35.462 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 7 attemp, up to 3600 attemps.
2024-05-11 01:10:38.465 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 8 attemp, up to 3600 attemps.
2024-05-11 01:10:41.469 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 9 attemp, up to 3600 attemps.
2024-05-11 01:10:44.472 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 10 attemp, up to 3600 attemps.
2024-05-11 01:10:47.476 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 11 attemp, up to 3600 attemps.
2024-05-11 01:10:50.477 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 12 attemp, up to 3600 attemps.
2024-05-11 01:10:53.481 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 13 attemp, up to 3600 attemps.
2024-05-11 01:10:56.482 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 14 attemp, up to 3600 attemps.
2024-05-11 01:10:59.485 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 15 attemp, up to 3600 attemps.
2024-05-11 01:11:02.489 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 16 attemp, up to 3600 attemps.
2024-05-11 01:11:05.492 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 17 attemp, up to 3600 attemps.
2024-05-11 01:11:08.496 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 18 attemp, up to 3600 attemps.
2024-05-11 01:11:11.500 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 19 attemp, up to 3600 attemps.
(SenderProxyActor pid=1526) 2024-05-11 01:11:14.450 ERROR barriers.py:171 [bob] -- [Anonymous_job] Failed to send data to seq_id ping of alice from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=1526) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=1526) details = "Deadline Exceeded"
(SenderProxyActor pid=1526) debug_error_string = "UNKNOWN:Deadline Exceeded {created_time:"2024-05-11T01:11:14.449452093+00:00", grpc_status:4}"
(SenderProxyActor pid=1526) >
2024-05-11 01:11:14.504 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 20 attemp, up to 3600 attemps.
2024-05-11 01:11:17.505 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 21 attemp, up to 3600 attemps.
(SenderProxyActor pid=1526) 2024-05-11 01:11:17.452 ERROR barriers.py:171 [bob] -- [Anonymous_job] Failed to send data to seq_id ping of alice from ping, error: <AioRpcError of RPC that terminated with:
(SenderProxyActor pid=1526) status = StatusCode.DEADLINE_EXCEEDED
(SenderProxyActor pid=1526) details = "Deadline Exceeded"
(SenderProxyActor pid=1526) debug_error_string = "UNKNOWN:Deadline Exceeded {grpc_status:4, created_time:"2024-05-11T01:11:17.451485228+00:00"}"

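Please verify with telnet whether the new IP and port are reachable.
If telnet succeeds, I suspect the network may not support HTTP/2; the traffic may be intercepted somewhere along the network path.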

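From container 1 to container 2: the port used in sf.init is reachable, but the port in cluster_config returns Connection refused. The result is similar from container 2 to container 1.
The container ports are all visible with the docker ps command, and the ports in cluster_config are not occupied.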

The log is as follows:
(base) root@87575df2193f:/home/admin/dev/secretflow_2024/secretflow# telnet 172.17.0.11 1706
Trying 172.17.0.11...
Connected to 172.17.0.11.
Escape character is '^]'.
@@@? ^CConnection closed by foreign host.
(base) root@87575df2193f:/home/admin/dev/secretflow_2024/secretflow#
(base) root@87575df2193f:/home/admin/dev/secretflow_2024/secretflow# telnet 172.17.0.11 1707
Trying 172.17.0.11...
telnet: Unable to connect to remote host: Connection refused

The code must be running while you verify; otherwise the service is never started and the check is bound to fail.
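Since the listener only exists while the script is running, a probe that keeps retrying for a while is easier to time correctly than a single telnet attempt. The following is a sketch using the standard library; `wait_for_port` is a hypothetical helper, and the address in the example is bob's rayfed endpoint from the cluster_config in this thread.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Run from alice's container while bob's script is running.
    print(wait_for_port("172.17.0.11", 1707, timeout=10.0))
```

A successful TCP connect only shows that something is listening on the port. It does not show that HTTP/2 (gRPC) traffic gets through, so the ping can still fail after this check passes.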

When I verify while the code is running, the ports in cluster_config are reachable, but the ping still fails.

Please upload the code for alice and bob as files with the .py extension.

I cannot upload files. The code run by alice and bob is as follows.
alice.py
import logging
import socket
import sys
import time

import spu
from sklearn.metrics import mean_squared_error, roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import (
    Sgb,
    get_classic_XGB_params,
    get_classic_lightGBM_params,
)

from secretflow.ml.boost.sgb_v.model import load_model

from secretflow.utils.simulation.datasets import create_df
from secretflow.data.vertical import read_csv as v_read_csv

import numpy as np
import pprint

pp = pprint.PrettyPrinter(depth=4)

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()

cluster_config = {
    'parties': {
        'alice': {
            'address': '172.17.0.9:1704',
            'listen_addr': '0.0.0.0:1704'
        },
        'bob': {
            'address': '172.17.0.11:1707',
            'listen_addr': '0.0.0.0:1707'
        },
    },
    'self_party': 'alice'
}

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'address': '172.17.0.9:1705', 'listen_addr': '0.0.0.0:1705'},
        {'party': 'bob', 'address': '172.17.0.11:1708', 'listen_addr': '0.0.0.0:1708'},
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
        'sigmoid_mode': spu.spu_pb2.RuntimeConfig.SIGMOID_REAL
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',  # modify the homomorphic encryption settings here
    'he_parameters': {
        'schema': 'paillier',
        #'schema': 'ou',
        'key_pair': {
            'generate': {
                'bit_size': 2048,
            },
        }
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

sf.init(
    address='172.17.0.9:1703',
    cluster_config=cluster_config,
)

alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

vdf = v_read_csv(
    {alice: "./datasets/independent_linear1.csv", bob: "./datasets/independent_linear2.csv"},
    keys='id',
    drop_keys='id',
)

label_data = vdf["label"]

v_data = vdf.drop(columns="label")
print('\033[0;32mbefore wait mv_data.type : \033[0m',v_data.shape)

y = reveal(label_data.partitions[alice].data)
wait([p.data for p in v_data.partitions.values()])

print('\033[0;32mv_data.type : \033[0m',v_data.shape)
print('\033[0;32my.shape : \033[0m',y.shape)

sgb = Sgb(heu)
start = time.time()

params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
pp.pprint(params)
model = sgb.train(params, v_data, label_data)

print(f"\033[0;32mtrain time: {time.time() - start}\033[0m")
start = time.time()
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"\033[0;32mpredict time: {time.time() - start}\033[0m")
print(f"auc: {roc_auc_score(y, yhat)}")

sf.shutdown()

bob.py

import logging
import socket
import sys
import time

import spu
from sklearn.metrics import mean_squared_error, roc_auc_score

import secretflow as sf
from secretflow.data import FedNdarray, PartitionWay
from secretflow.device.driver import reveal, wait
from secretflow.ml.boost.sgb_v import (
    Sgb,
    get_classic_XGB_params,
    get_classic_lightGBM_params,
)

from secretflow.ml.boost.sgb_v.model import load_model

from secretflow.utils.simulation.datasets import create_df
from secretflow.data.vertical import read_csv as v_read_csv

import numpy as np
import pprint

pp = pprint.PrettyPrinter(depth=4)

# Check the version of your SecretFlow
print('The version of SecretFlow: {}'.format(sf.__version__))

_system_config = {'lineage_pinning_enabled': False}
sf.shutdown()

cluster_config = {
    'parties': {
        'alice': {
            'address': '172.17.0.9:1704',
            'listen_addr': '0.0.0.0:1704'
        },
        'bob': {
            'address': '172.17.0.11:1707',
            'listen_addr': '0.0.0.0:1707'
        },
    },
    'self_party': 'bob'
}

# SPU settings
cluster_def = {
    'nodes': [
        {'party': 'alice', 'address': '172.17.0.9:1705', 'listen_addr': '0.0.0.0:1705'},
        {'party': 'bob', 'address': '172.17.0.11:1708', 'listen_addr': '0.0.0.0:1708'},
    ],
    'runtime_config': {
        'protocol': spu.spu_pb2.SEMI2K,
        'field': spu.spu_pb2.FM128,
        'sigmoid_mode': spu.spu_pb2.RuntimeConfig.SIGMOID_REAL
    },
}

# HEU settings
heu_config = {
    'sk_keeper': {'party': 'alice'},
    'evaluators': [{'party': 'bob'}],
    'mode': 'PHEU',  # modify the homomorphic encryption settings here
    'he_parameters': {
        'schema': 'paillier',
        #'schema': 'ou',
        'key_pair': {
            'generate': {
                'bit_size': 2048,
            },
        }
    },
    'encoding': {
        'cleartext_type': 'DT_I32',
        'encoder': "IntegerEncoder",
        'encoder_args': {"scale": 1},
    },
}

sf.init(
    address='172.17.0.11:1706',
    cluster_config=cluster_config,
)

alice = sf.PYU('alice')
bob = sf.PYU('bob')
heu = sf.HEU(heu_config, cluster_def['runtime_config']['field'])

vdf = v_read_csv(
    {alice: "./datasets/independent_linear1.csv", bob: "./datasets/independent_linear2.csv"},
    keys='id',
    drop_keys='id',
)

label_data = vdf["label"]

v_data = vdf.drop(columns="label")
print('\033[0;32mbefore wait mv_data.type : \033[0m',v_data.shape)

y = reveal(label_data.partitions[alice].data)
wait([p.data for p in v_data.partitions.values()])
io_end = time.perf_counter()

print('\033[0;32mv_data.type : \033[0m',v_data.shape)
print('\033[0;32my.shape : \033[0m',y.shape)

sgb = Sgb(heu)
start = time.time()

params = get_classic_XGB_params()
params['num_boost_round'] = 3
params['max_depth'] = 3
pp.pprint(params)
model = sgb.train(params, v_data, label_data)

print(f"\033[0;32mtrain time: {time.time() - start}\033[0m")
start = time.time()
yhat = model.predict(v_data)
yhat = reveal(yhat)
print(f"\033[0;32mpredict time: {time.time() - start}\033[0m")
print(f"auc: {roc_auc_score(y, yhat)}")

sf.shutdown()

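@admin_g
We have traced your problem to the use of gRPC, which depends heavily on HTTP/2; some networks do not support it.
Please switch to brpc as described in the reference below. It only needs HTTP, so problems are less likely.

For the parameter to change, see cross_silo_comm_backend at:
https://www.secretflow.org.cn/zh-CN/docs/secretflow/v1.5.0b0/source/secretflow#secretflow.init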
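As a rough illustration of the suggested change, the sketch below collects the `sf.init` keyword arguments with the backend set explicitly. `build_init_kwargs` is a hypothetical helper, and the string `'brpc_link'` is an assumption about the value this SecretFlow release accepts for the brpc backend; please confirm the exact value in the linked documentation before relying on it.

```python
def build_init_kwargs(ray_address, cluster_config, backend="brpc_link"):
    """Collect keyword arguments for sf.init with an explicit cross-silo backend.

    The default backend string is an assumption; check the sf.init docs.
    """
    return {
        "address": ray_address,
        "cluster_config": cluster_config,
        "cross_silo_comm_backend": backend,
    }


if __name__ == "__main__":
    # Placeholder config; use the same cluster_config as in alice.py / bob.py.
    cluster_config = {"parties": {}, "self_party": "alice"}
    kwargs = build_init_kwargs("172.17.0.9:1703", cluster_config)
    print(kwargs)
    # With SecretFlow installed, the call would then look like:
    #   import secretflow as sf
    #   sf.init(**kwargs)
```

Both parties would need to use the same backend, since the setting governs how they talk to each other.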

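Thank you.
After changing the cross_silo_comm_backend parameter to brpc, the connection works.
When schema in heu_config is set to 'ou', the train function runs normally, but when it is set to paillier, the following error is reported: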

alice:
2024-05-14 02:20:41.602 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
2024-05-14 02:20:41.747 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
2024-05-14 02:20:41.766 INFO global_ordermap_booster.py:190 [alice] -- [Anonymous_job] training the first tree with label holder only.
2024-05-14 02:20:41.766 INFO level_wise_tree_trainer.py:112 [alice] -- [Anonymous_job] train tree context set up.
2024-05-14 02:20:41.789 INFO level_wise_tree_trainer.py:187 [alice] -- [Anonymous_job] begin train tree.
(_run pid=46340) [2024-05-14 02:21:04.185] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95
2024-05-14 02:21:04.550 INFO global_ordermap_booster.py:208 [alice] -- [Anonymous_job] epoch 0 time 22.7843796312809s
2024-05-14 02:21:04.551 INFO level_wise_tree_trainer.py:112 [alice] -- [Anonymous_job] train tree context set up.
2024-05-14 02:21:04.569 INFO level_wise_tree_trainer.py:187 [alice] -- [Anonymous_job] begin train tree.
(SenderReceiverProxyActor pid=48293) I0514 02:22:35.268082 48463 external/com_github_brpc_brpc/src/brpc/socket.cpp:2506] Checking Socket{id=0 addr=172.17.0.11:1713} (0x278d340)
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:36.567] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.492] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.493] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.493] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.493] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.493] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.493] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.495] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.495] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
2024-05-14 02:22:39.691 WARNING api.py:607 [alice] -- [Anonymous_job] Encounter RemoteError happend in other parties, error message: FedRemoteError occurred at bob
2024-05-14 02:22:39.692 WARNING cleanup.py:154 [alice] -- [Anonymous_job] Failed to send ObjectRef(7bb866669492e80fd7010a44f7f2fad06d8027da0c00000001000000) to bob, error: ray::SenderReceiverProxyActor.send() (pid=48293, ip=172.17.0.9, actor_id=d7010a44f7f2fad06d8027da0c000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7f8cfc72a740>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=46340, ip=172.17.0.9)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SGBActor.invoke_class_method_two_ret() (pid=48862, ip=172.17.0.9, actor_id=0ce67fa206fcf4dc13a177490c000000, repr=<secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor object at 0x7f6b88203b80>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=46340, ip=172.17.0.9)
File "/home/admin/dev/secretflow_20240509/secretflow/secretflow/device/device/pyu.py", line 151, in _run
actual_vals = ray.get(list(refs.values()))
ray.exceptions.RayTaskError(FedRemoteError): ray::HEUSkKeeper.decrypt_and_decode() (pid=48470, ip=172.17.0.9, actor_id=e302aa5e632d0ec42b6f70e10c000000, repr=HEUSkKeeper(heu_id=139650334253328, party=alice))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=48293, ip=172.17.0.9, actor_id=d7010a44f7f2fad06d8027da0c000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7f8cfc72a740>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 127, in get_data
raise data
fed.exceptions.FedRemoteError: FedRemoteError occurred at bob,upstream_seq_id: 252#1, downstream_seq_id: 253.
2024-05-14 02:22:39.693 INFO cleanup.py:161 [alice] -- [Anonymous_job] Sending error FedRemoteError occurred at bob to bob.
2024-05-14 02:22:39.695 WARNING cleanup.py:127 [alice] -- [Anonymous_job] Signal SIGINT to exit.
2024-05-14 02:22:39.695 WARNING api.py:60 [alice] -- [Anonymous_job] Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
2024-05-14 02:22:39.696 WARNING api.py:325 [alice] -- [Anonymous_job] Shutdowning rayfed unintendedly...
2024-05-14 02:22:39.696 ERROR api.py:330 [alice] -- [Anonymous_job] Cross-silo sending error occured. ray::SenderReceiverProxyActor.send() (pid=48293, ip=172.17.0.9, actor_id=d7010a44f7f2fad06d8027da0c000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7f8cfc72a740>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=46340, ip=172.17.0.9)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SGBActor.invoke_class_method_two_ret() (pid=48862, ip=172.17.0.9, actor_id=0ce67fa206fcf4dc13a177490c000000, repr=<secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor object at 0x7f6b88203b80>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=46340, ip=172.17.0.9)
File "/home/admin/dev/secretflow_20240509/secretflow/secretflow/device/device/pyu.py", line 151, in _run
actual_vals = ray.get(list(refs.values()))
ray.exceptions.RayTaskError(FedRemoteError): ray::HEUSkKeeper.decrypt_and_decode() (pid=48470, ip=172.17.0.9, actor_id=e302aa5e632d0ec42b6f70e10c000000, repr=HEUSkKeeper(heu_id=139650334253328, party=alice))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=48293, ip=172.17.0.9, actor_id=d7010a44f7f2fad06d8027da0c000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7f8cfc72a740>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 127, in get_data
raise data
fed.exceptions.FedRemoteError: FedRemoteError occurred at bob
2024-05-14 02:22:39.696 INFO api.py:337 [alice] -- [Anonymous_job] No wait for data sending.
2024-05-14 02:22:39.697 INFO message_queue.py:70 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-05-14 02:22:39.697 INFO api.py:352 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-05-14 02:22:39.697 CRITICAL api.py:356 [alice] -- [Anonymous_job] Exit now due to the previous error.
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.667] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.668] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) [2024-05-14 02:22:39.695] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=48293) 2024-05-14 02:22:39.665 WARNING link.py:122 [alice] -- [Anonymous_job] Receiving exception: <class 'fed.exceptions.FedRemoteError'>, FedRemoteError occurred at bob from bob, upstream_seq_id: 248#0, curr_seq_id: 249. Re-raise it.

bob:
2024-05-14 02:20:41.599 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
2024-05-14 02:20:41.599 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
2024-05-14 02:20:41.765 INFO global_ordermap_booster.py:190 [bob] -- [Anonymous_job] training the first tree with label holder only.
2024-05-14 02:20:41.765 INFO level_wise_tree_trainer.py:112 [bob] -- [Anonymous_job] train tree context set up.
2024-05-14 02:20:41.781 INFO level_wise_tree_trainer.py:187 [bob] -- [Anonymous_job] begin train tree.
(_run pid=41919) [2024-05-14 02:21:04.427] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95
2024-05-14 02:21:04.553 INFO global_ordermap_booster.py:208 [bob] -- [Anonymous_job] epoch 0 time 22.788510467857122s
2024-05-14 02:21:04.553 INFO level_wise_tree_trainer.py:112 [bob] -- [Anonymous_job] train tree context set up.
2024-05-14 02:21:04.557 INFO level_wise_tree_trainer.py:187 [bob] -- [Anonymous_job] begin train tree.
2024-05-14 02:22:04.616 WARNING cleanup.py:154 [bob] -- [Anonymous_job] Failed to send ObjectRef(b6dfc1926d9c78c8f6b0a8b353eb3c8c1650f2ab0b00000001000000) to alice, error: ray::SenderReceiverProxyActor.send() (pid=41515, ip=172.17.0.11, actor_id=f6b0a8b353eb3c8c1650f2ab0b000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fae1434e6e0>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.batch_feature_wise_bucket_sum() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=41515, ip=172.17.0.11, actor_id=f6b0a8b353eb3c8c1650f2ab0b000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fae1434e6e0>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-36:0->1
stacktrace:
#0 yacl::link::Context::RecvInternal()+0x7faddee3e277
#1 yacl::link::Context::Recv()+0x7faddee3f952
#2 spu::BindLink()::{lambda()#16}::operator()()+0x7faddd384f10
#3 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7faddd3c869d
#4 pybind11::cpp_function::dispatcher()+0x7faddd3967eb
#5 cfunction_call+0x4fc697,upstream_seq_id: 248#0, downstream_seq_id: 249.
2024-05-14 02:22:04.616 INFO cleanup.py:161 [bob] -- [Anonymous_job] Sending error what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-36:0->1
stacktrace:
#0 yacl::link::Context::RecvInternal()+0x7faddee3e277
#1 yacl::link::Context::Recv()+0x7faddee3f952
#2 spu::BindLink()::{lambda()#16}::operator()()+0x7faddd384f10
#3 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7faddd3c869d
#4 pybind11::cpp_function::dispatcher()+0x7faddd3967eb
#5 cfunction_call+0x4fc697

to alice.
2024-05-14 02:22:04.617 WARNING cleanup.py:127 [bob] -- [Anonymous_job] Signal SIGINT to exit.
2024-05-14 02:22:05.593 WARNING api.py:60 [bob] -- [Anonymous_job] Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
2024-05-14 02:22:05.593 WARNING api.py:325 [bob] -- [Anonymous_job] Shutdowning rayfed unintendedly...
2024-05-14 02:22:05.593 ERROR api.py:330 [bob] -- [Anonymous_job] Cross-silo sending error occured. ray::SenderReceiverProxyActor.send() (pid=41515, ip=172.17.0.11, actor_id=f6b0a8b353eb3c8c1650f2ab0b000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fae1434e6e0>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.batch_feature_wise_bucket_sum() (pid=41693, ip=172.17.0.11, actor_id=7e87ee80ca67f6180bc1c84a0b000000, repr=HEUEvaluator(heu_id=140576711903936, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=41515, ip=172.17.0.11, actor_id=f6b0a8b353eb3c8c1650f2ab0b000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fae1434e6e0>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-36:0->1
stacktrace:
#0 yacl::link::Context::RecvInternal()+0x7faddee3e277
#1 yacl::link::Context::Recv()+0x7faddee3f952
#2 spu::BindLink()::{lambda()#16}::operator()()+0x7faddd384f10
#3 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7faddd3c869d
#4 pybind11::cpp_function::dispatcher()+0x7faddd3967eb
#5 cfunction_call+0x4fc697
2024-05-14 02:22:05.593 INFO api.py:337 [bob] -- [Anonymous_job] No wait for data sending.
2024-05-14 02:22:05.595 INFO message_queue.py:70 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-05-14 02:22:05.595 INFO api.py:352 [bob] -- [Anonymous_job] Shutdowned rayfed.
2024-05-14 02:22:05.595 CRITICAL api.py:356 [bob] -- [Anonymous_job] Exit now due to the previous error.
2024-05-14 02:22:34,618 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::SenderReceiverProxyActor.get_data() (pid=41515, ip=172.17.0.11, actor_id=f6b0a8b353eb3c8c1650f2ab0b000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fae1434e6e0>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-38:0->1
stacktrace:
#0 yacl::link::Context::RecvInternal()+0x7faddee3e277
#1 yacl::link::Context::Recv()+0x7faddee3f952
#2 spu::BindLink()::{lambda()#16}::operator()()+0x7faddd384f10
#3 pybind11::cpp_function::initialize<>()::{lambda()#3}::_FUN()+0x7faddd3c869d
#4 pybind11::cpp_function::dispatcher()+0x7faddd3967eb
#5 cfunction_call+0x4fc697
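
A quick way to read bob's log above is to measure how long bob waited between "begin train tree" for the second tree and the first "Failed to send" warning. The sketch below does only that arithmetic on two timestamps copied from the log; `elapsed_seconds` is a hypothetical helper written for this note, not part of SecretFlow.

```python
from datetime import datetime

# Timestamp layout used by the SecretFlow log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"


def elapsed_seconds(start: str, end: str) -> float:
    """Return the number of seconds between two log timestamps."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return delta.total_seconds()


# Copied from bob's log: start of the second tree, then the first failure.
begin_second_tree = "2024-05-14 02:21:04.557"
send_failure = "2024-05-14 02:22:04.616"

gap = elapsed_seconds(begin_second_tree, send_failure)
print(f"{gap:.3f} s")
```

The gap is about 60 seconds, which would be consistent with a fixed receive timeout of roughly one minute expiring while bob waits for data from alice ("Get data timeout"). That is an inference from the timestamps only. Note also that alice's "Not connected to 172.17.0.11:1713" messages are stamped 02:22:39, after bob had already logged "Shutdowned rayfed" at 02:22:05, so they may be a consequence of bob exiting rather than the original cause.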

Do I need to change any other parameters?

Judging from the error, this looks like an SPU dependency problem. Which versions of secretflow and spu are you currently running?
pip list | grep secretflow
pip list | grep spu
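
If `grep` is not convenient inside the container, the same information can be read from Python. This is a minimal sketch using only the standard library; the package list is an assumption based on the distribution names that appear in this thread, and `installed_versions` is a hypothetical helper.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from this thread; extend the list as needed.
PACKAGES = ["secretflow", "secretflow-ray", "secretflow-rayfed", "spu"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in names:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions(PACKAGES).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it in both containers makes it easy to confirm that alice and bob have the same versions installed.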

You could try updating to 1.5.0b0.

secretflow 1.5.0.dev20240509
secretflow-ray 2.2.0
secretflow-rayfed 0.2.1a1
secretflow-serving-lib 0.3.0.dev20240320
spu 0.9.0.dev20240320

The earlier log seems to have been garbled, so I have posted the error log again.

alice's terminal:
2024-05-14 07:09:57.583 INFO barriers.py:465 [alice] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-14 07:09:57.583 INFO barriers.py:520 [alice] -- [Anonymous_job] Try ping ['bob'] at 0 attemp, up to 3600 attemps.
2024-05-14 07:10:03.992 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party alice.
2024-05-14 07:10:04.097 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party bob.
{'audit_paths': {},
'base_score': 0.0,
'batch_encoding_enabled': True,
'bottom_rate': 0.5,
'colsample_by_tree': 1.0,
'enable_early_stop': False,
'enable_goss': False,
'enable_monitor': False,
'enable_packbits': False,
'enable_quantization': False,
'eval_metric': 'roc_auc',
'first_tree_with_label_holder_feature': True,
'fixed_point_parameter': 20,
'gamma': 0.0,
'learning_rate': 0.3,
'max_depth': 3,
'max_leaf': 15,
'num_boost_round': 3,
'objective': 'logistic',
'quantization_scale': 10000.0,
'reg_lambda': 1.0,
'rowsample_by_tree': 1.0,
'save_best_model': False,
'seed': 1212,
'sketch_eps': 0.1,
'stopping_rounds': 1,
'stopping_tolerance': 0.0,
'top_rate': 0.3,
'tree_growing_method': 'level',
'validation_fraction': 0.1}
2024-05-14 07:10:33.347 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
2024-05-14 07:10:33.364 INFO proxy.py:180 [alice] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
2024-05-14 07:10:33.512 INFO global_ordermap_booster.py:190 [alice] -- [Anonymous_job] training the first tree with label holder only.
2024-05-14 07:10:33.512 INFO level_wise_tree_trainer.py:112 [alice] -- [Anonymous_job] train tree context set up.
2024-05-14 07:10:33.530 INFO level_wise_tree_trainer.py:187 [alice] -- [Anonymous_job] begin train tree.
(_run pid=635) [2024-05-14 07:11:06.440] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95
2024-05-14 07:11:07.274 INFO global_ordermap_booster.py:208 [alice] -- [Anonymous_job] epoch 0 time 33.76193536631763s
2024-05-14 07:11:07.274 INFO level_wise_tree_trainer.py:112 [alice] -- [Anonymous_job] train tree context set up.
2024-05-14 07:11:07.309 INFO level_wise_tree_trainer.py:187 [alice] -- [Anonymous_job] begin train tree.
(SenderReceiverProxyActor pid=1454) I0514 07:12:38.021386 1601 external/com_github_brpc_brpc/src/brpc/socket.cpp:2506] Checking Socket{id=0 addr=172.17.0.11:1713} (0x27f7f80)
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:55.689] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95 [repeated 2x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.200] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.200] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.200] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.200] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.201] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.201] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.201] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.202] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) 2024-05-14 07:12:59.365 WARNING link.py:122 [alice] -- [Anonymous_job] Receiving exception: <class 'fed.exceptions.FedRemoteError'>, FedRemoteError occurred at bob from bob, upstream_seq_id: 238#0, curr_seq_id: 239. Re-raise it.
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.369] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
(SenderReceiverProxyActor pid=1454) [2024-05-14 07:12:59.369] [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
2024-05-14 07:12:59.394 WARNING api.py:607 [alice] -- [Anonymous_job] Encounter RemoteError happend in other parties, error message: FedRemoteError occurred at bob
2024-05-14 07:12:59.398 WARNING cleanup.py:154 [alice] -- [Anonymous_job] Failed to send ObjectRef(3712fc1866a0a000602b1132cde31a15e3a0fff80100000001000000) to bob, error: ray::SenderReceiverProxyActor.send() (pid=1454, ip=172.17.0.9, actor_id=602b1132cde31a15e3a0fff801000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fc55c742770>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=640, ip=172.17.0.9)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SGBActor.invoke_class_method_two_ret() (pid=1907, ip=172.17.0.9, actor_id=7735c0547b0e2bf62428453b01000000, repr=<secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor object at 0x7fa580177b50>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=640, ip=172.17.0.9)
File "/home/admin/dev/secretflow_20240509/secretflow/secretflow/device/device/pyu.py", line 151, in _run
actual_vals = ray.get(list(refs.values()))
ray.exceptions.RayTaskError(FedRemoteError): ray::HEUSkKeeper.decrypt_and_decode() (pid=1647, ip=172.17.0.9, actor_id=e9a2bb65a5303e801242253301000000, repr=HEUSkKeeper(heu_id=140104188708656, party=alice))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=1454, ip=172.17.0.9, actor_id=602b1132cde31a15e3a0fff801000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fc55c742770>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 127, in get_data
raise data
fed.exceptions.FedRemoteError: FedRemoteError occurred at bob,upstream_seq_id: 242#1, downstream_seq_id: 243.
2024-05-14 07:12:59.398 INFO cleanup.py:161 [alice] -- [Anonymous_job] Sending error FedRemoteError occurred at bob to bob.
2024-05-14 07:12:59.400 WARNING cleanup.py:127 [alice] -- [Anonymous_job] Signal SIGINT to exit.
2024-05-14 07:12:59.400 WARNING api.py:60 [alice] -- [Anonymous_job] Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
2024-05-14 07:12:59.400 WARNING api.py:325 [alice] -- [Anonymous_job] Shutdowning rayfed unintendedly...
2024-05-14 07:12:59.400 ERROR api.py:330 [alice] -- [Anonymous_job] Cross-silo sending error occured. ray::SenderReceiverProxyActor.send() (pid=1454, ip=172.17.0.9, actor_id=602b1132cde31a15e3a0fff801000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fc55c742770>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=640, ip=172.17.0.9)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SGBActor.invoke_class_method_two_ret() (pid=1907, ip=172.17.0.9, actor_id=7735c0547b0e2bf62428453b01000000, repr=<secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor object at 0x7fa580177b50>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::_run() (pid=640, ip=172.17.0.9)
File "/home/admin/dev/secretflow_20240509/secretflow/secretflow/device/device/pyu.py", line 151, in _run
actual_vals = ray.get(list(refs.values()))
ray.exceptions.RayTaskError(FedRemoteError): ray::HEUSkKeeper.decrypt_and_decode() (pid=1647, ip=172.17.0.9, actor_id=e9a2bb65a5303e801242253301000000, repr=HEUSkKeeper(heu_id=140104188708656, party=alice))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=1454, ip=172.17.0.9, actor_id=602b1132cde31a15e3a0fff801000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fc55c742770>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 127, in get_data
raise data
fed.exceptions.FedRemoteError: FedRemoteError occurred at bob
2024-05-14 07:12:59.401 INFO api.py:337 [alice] -- [Anonymous_job] No wait for data sending.
2024-05-14 07:12:59.402 INFO message_queue.py:70 [alice] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-05-14 07:12:59.402 INFO api.py:352 [alice] -- [Anonymous_job] Shutdowned rayfed.
2024-05-14 07:12:59.402 CRITICAL api.py:356 [alice] -- [Anonymous_job] Exit now due to the previous error.
(SenderReceiverProxyActor pid=1454) [info] [channel.cc:352] send request failed and retry, retry_count=1, max_retry=3, interval_ms=1000, message=[external/yacl/yacl/link/transport/interconnection_link.cc:56] cntl ErrorCode '112', http status code '0', response header '', response body '', error msg '[E112]Not connected to 172.17.0.11:1713 yet, server_id=0'
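
alice's log repeatedly reports `[E112]Not connected to 172.17.0.11:1713 yet`. One way to narrow this down is to check, from alice's container and while bob's process is still running, whether anything accepts TCP connections on that address. The sketch below is a generic reachability probe using the standard library; `can_connect` is a hypothetical helper and not part of SecretFlow, and it says nothing about whether the brpc link itself is healthy.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

For the address in the log the call would be `can_connect("172.17.0.11", 1713)`. If it fails while bob is running, container networking or port configuration is worth checking; if it succeeds, the E112 messages may simply reflect bob having already exited.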

bob's terminal:

2024-05-14 07:09:57.582 INFO barriers.py:465 [bob] -- [Anonymous_job] Succeeded to create receiver proxy actor.
2024-05-14 07:09:57.583 INFO barriers.py:520 [bob] -- [Anonymous_job] Try ping ['alice'] at 0 attemp, up to 3600 attemps.
2024-05-14 07:10:06.496 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party alice.
2024-05-14 07:10:06.497 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.data.core.agent.PartitionAgent'> with party bob.
{'audit_paths': {},
'base_score': 0.0,
'batch_encoding_enabled': True,
'bottom_rate': 0.5,
'colsample_by_tree': 1.0,
'enable_early_stop': False,
'enable_goss': False,
'enable_monitor': False,
'enable_packbits': False,
'enable_quantization': False,
'eval_metric': 'roc_auc',
'first_tree_with_label_holder_feature': True,
'fixed_point_parameter': 20,
'gamma': 0.0,
'learning_rate': 0.3,
'max_depth': 3,
'max_leaf': 15,
'num_boost_round': 3,
'objective': 'logistic',
'quantization_scale': 10000.0,
'reg_lambda': 1.0,
'rowsample_by_tree': 1.0,
'save_best_model': False,
'seed': 1212,
'sketch_eps': 0.1,
'stopping_rounds': 1,
'stopping_tolerance': 0.0,
'top_rate': 0.3,
'tree_growing_method': 'level',
'validation_fraction': 0.1}
2024-05-14 07:10:33.345 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party alice.
2024-05-14 07:10:33.346 INFO proxy.py:180 [bob] -- [Anonymous_job] Create proxy actor <class 'secretflow.ml.boost.sgb_v.factory.sgb_actor.SGBActor'> with party bob.
2024-05-14 07:10:33.511 INFO global_ordermap_booster.py:190 [bob] -- [Anonymous_job] training the first tree with label holder only.
2024-05-14 07:10:33.512 INFO level_wise_tree_trainer.py:112 [bob] -- [Anonymous_job] train tree context set up.
2024-05-14 07:10:33.525 INFO level_wise_tree_trainer.py:187 [bob] -- [Anonymous_job] begin train tree.
(_run pid=637) [2024-05-14 07:11:07.081] [info] [thread_pool.cc:30] Create a fixed thread pool with size 95
2024-05-14 07:11:07.278 INFO global_ordermap_booster.py:208 [bob] -- [Anonymous_job] epoch 0 time 33.76806065067649s
2024-05-14 07:11:07.278 INFO level_wise_tree_trainer.py:112 [bob] -- [Anonymous_job] train tree context set up.
2024-05-14 07:11:07.288 INFO level_wise_tree_trainer.py:187 [bob] -- [Anonymous_job] begin train tree.
2024-05-14 07:12:07.367 WARNING cleanup.py:154 [bob] -- [Anonymous_job] Failed to send ObjectRef(d99fa8d5c923d7c8187944f821436ede108d8e320100000001000000) to alice, error: ray::SenderReceiverProxyActor.send() (pid=1458, ip=172.17.0.11, actor_id=187944f821436ede108d8e3201000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fbf04136710>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.batch_feature_wise_bucket_sum() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=1458, ip=172.17.0.11, actor_id=187944f821436ede108d8e3201000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fbf04136710>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-33:0->1
stacktrace:
0 yacl::link ::Context::RecvInternal()+0x7fbc1ee3e277
1 yacl::link ::Context::Recv()+0x7fbc1ee3f952
2 spu::BindLink()::{lambda() 16}::operator()()+0x7fbc1d384f10
3 pybind11::cpp_function::initialize<>()::{lambda() 3}::_FUN()+0x7fbc1d3c869d
4 pybind11::cpp_function::dispatcher()+0x7fbc1d3967eb
5 cfunction_call+0x4fc697,upstream_seq_id: 238 0, downstream_seq_id: 239.
2024-05-14 07:12:07.368 INFO cleanup.py:161 [bob] -- [Anonymous_job] Sending error what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-33:0->1
stacktrace:
0 yacl::link ::Context::RecvInternal()+0x7fbc1ee3e277
1 yacl::link ::Context::Recv()+0x7fbc1ee3f952
2 spu::BindLink()::{lambda() 16}::operator()()+0x7fbc1d384f10
3 pybind11::cpp_function::initialize<>()::{lambda() 3}::_FUN()+0x7fbc1d3c869d
4 pybind11::cpp_function::dispatcher()+0x7fbc1d3967eb
5 cfunction_call+0x4fc697

to alice.
2024-05-14 07:12:07.368 WARNING cleanup.py:127 [bob] -- [Anonymous_job] Signal SIGINT to exit.
2024-05-14 07:12:08.349 WARNING api.py:60 [bob] -- [Anonymous_job] Stop signal received (e.g. via SIGINT/Ctrl+C), try to shutdown fed. Press CTRL+C (or send SIGINT/SIGKILL/SIGTERM) to skip.
2024-05-14 07:12:08.349 WARNING api.py:325 [bob] -- [Anonymous_job] Shutdowning rayfed unintendedly...
2024-05-14 07:12:08.349 ERROR api.py:330 [bob] -- [Anonymous_job] Cross-silo sending error occured. ray::SenderReceiverProxyActor.send() (pid=1458, ip=172.17.0.11, actor_id=187944f821436ede108d8e3201000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fbf04136710>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.getitem() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::HEUEvaluator.batch_feature_wise_bucket_sum() (pid=1652, ip=172.17.0.11, actor_id=9b6cc4fd9eff5f2d5555aa5201000000, repr=HEUEvaluator(heu_id=140188552727760, party=bob))
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayTaskError: ray::SenderReceiverProxyActor.get_data() (pid=1458, ip=172.17.0.11, actor_id=187944f821436ede108d8e3201000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fbf04136710>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-33:0->1
stacktrace:
0 yacl::link ::Context::RecvInternal()+0x7fbc1ee3e277
1 yacl::link ::Context::Recv()+0x7fbc1ee3f952
2 spu::BindLink()::{lambda() 16}::operator()()+0x7fbc1d384f10
3 pybind11::cpp_function::initialize<>()::{lambda() 3}::_FUN()+0x7fbc1d3c869d
4 pybind11::cpp_function::dispatcher()+0x7fbc1d3967eb
5 cfunction_call+0x4fc697
2024-05-14 07:12:08.349 INFO api.py:337 [bob] -- [Anonymous_job] No wait for data sending.
2024-05-14 07:12:08.351 INFO message_queue.py:70 [bob] -- [Anonymous_job] Notify message polling thread[ErrorSendingQueueThread] to exit.
2024-05-14 07:12:08.351 INFO api.py:352 [bob] -- [Anonymous_job] Shutdowned rayfed.
2024-05-14 07:12:08.351 CRITICAL api.py:356 [bob] -- [Anonymous_job] Exit now due to the previous error.
2024-05-14 07:12:37,370 ERROR worker.py:405 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::SenderReceiverProxyActor.get_data() (pid=1458, ip=172.17.0.11, actor_id=187944f821436ede108d8e3201000000, repr=<fed.proxy.barriers.SenderReceiverProxyActor object at 0x7fbf04136710>)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/barriers.py", line 379, in get_data
data = self._proxy_instance.get_data(src_party, upstream_seq_id, curr_seq_id)
File "/root/miniconda3/lib/python3.10/site-packages/fed/proxy/brpc_link/link.py", line 109, in get_data
msg = self._linker.recv(rank)
RuntimeError: what:
[external/yacl/yacl/link/transport/channel.cc:411] Get data timeout, key=root:P2P-35:0->1
stacktrace:
0 yacl::link ::Context::RecvInternal()+0x7fbc1ee3e277
1 yacl::link ::Context::Recv()+0x7fbc1ee3f952
2 spu::BindLink()::{lambda() 16}::operator()()+0x7fbc1d384f10
3 pybind11::cpp_function::initialize<>()::{lambda() 3}::_FUN()+0x7fbc1d3c869d
4 pybind11::cpp_function::dispatcher()+0x7fbc1d3967eb
5 cfunction_call+0x4fc697
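
The time between bob's last progress message and the timeout can be read from the log timestamps. The sketch below is illustrative only: it uses the timestamps of the "begin train tree" line and the first "Failed to send" warning above, and the helper names `log_time` and `elapsed_seconds` are hypothetical, not part of SecretFlow. The gap of about 60 s is consistent with a receive timeout of roughly one minute, but the log does not state the configured value.

```python
from datetime import datetime

# Leading portions of two lines from bob's log above (message bodies truncated).
BEGIN_LINE = "2024-05-14 07:11:07.288 INFO level_wise_tree_trainer.py:187 [bob]"
TIMEOUT_LINE = "2024-05-14 07:12:07.367 WARNING cleanup.py:154 [bob]"


def log_time(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS.mmm' timestamp (first 23 characters)."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S.%f")


def elapsed_seconds(start_line: str, end_line: str) -> float:
    """Return the number of seconds between the timestamps of two log lines."""
    return (log_time(end_line) - log_time(start_line)).total_seconds()


gap = elapsed_seconds(BEGIN_LINE, TIMEOUT_LINE)
print(f"Time from 'begin train tree' to the timeout warning: {gap:.3f} s")
```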

Version issue:
dev is an unstable version and we do not recommend using it. Please upgrade to an official release.
The latest version is currently secretflow 1.5.0b0.

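Error issue:
Judging from the error, the communication timed out because the computation between the two SPU parties took too long. Please post the run logs from both parties first, so we can check whether one of them has already crashed.
If neither has, try adding the following parameter to the SPU settings. It was described as a maximum wait time of 2 hours, but note that 1200000 ms is 20 minutes; 2 hours would be 7200000 ms.
link_desc = {
"recv_timeout_ms": 1200000,
}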
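
Since `recv_timeout_ms` is given in milliseconds, it is easy to get the unit conversion wrong. Below is a minimal sketch of building the dictionary from a value in minutes. The helper `make_link_desc` is hypothetical and not part of SecretFlow; only the key name `recv_timeout_ms` comes from the reply above.

```python
def make_link_desc(timeout_minutes: float) -> dict:
    """Build a link_desc dict with the receive timeout converted to milliseconds.

    Hypothetical helper for illustration; only the key name comes from the thread.
    """
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return {"recv_timeout_ms": int(timeout_minutes * 60 * 1000)}


# The value quoted in the reply corresponds to 20 minutes.
twenty_minutes = make_link_desc(20)

# A 2-hour limit needs a larger value.
two_hours = make_link_desc(120)

print(twenty_minutes)
print(two_hours)
```

The resulting dictionary would presumably be passed as the `link_desc` argument when the SPU device is created (for example `sf.SPU(cluster_def, link_desc=two_hours)`). That call is an assumption here, so check it against the SecretFlow version you have installed.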

BTW, we plan to release secretflow 1.6.0b0 in the middle to latter part of this month, and the other related repos will be updated as well.

Does Secureboost need SPU to be configured?

Secureboost runs through SecretFlow, and SecretFlow in turn uses SPU's communication layer.
