Multi-Party Resource Coordination
1. Description
Resources refer to the basic engine resources, mainly the CPU and memory resources of the compute engine and the CPU and network resources of the transport engine. Currently, only management of the compute engine's CPU resources is supported.
2. Total resource allocation
- The current version does not automatically detect the resource size of the base engine, so it is configured through the configuration file `$FATE_PROJECT_BASE/conf/service_conf.yaml`, that is, the resource size of the current engine allocated to the FATE cluster.
- `FATE Flow Server` reads all base engine information from the configuration file when it starts and registers it in the database table `t_engine_registry`.
- After `FATE Flow Server` has started, the resource configuration can be modified by restarting `FATE Flow Server` or by reloading the configuration with the command `flow server reload`.
- `total_cores` = `nodes` * `cores_per_node`
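The formula above can be expressed as a minimal sketch (the helper name and config dict are illustrative, not FATE Flow internals):

```python
def total_cores(engine_config: dict) -> int:
    """Total cores registered for an engine: nodes * cores_per_node."""
    return engine_config["nodes"] * engine_config["cores_per_node"]

# Example: matches the fate_on_eggroll configuration below (1 node, 16 cores each)
print(total_cores({"cores_per_node": 16, "nodes": 1}))  # 16
```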
Example
- `fate_on_standalone`: for a standalone engine running on the same machine as `FATE Flow Server`, generally used for quick experiments; `nodes` is usually set to 1, and `cores_per_node` is usually the number of CPU cores of the machine, which may also be moderately over-provisioned

```yaml
fate_on_standalone:
  standalone:
    cores_per_node: 20
    nodes: 1
```
- `fate_on_eggroll`: configured according to the actual deployment of the EggRoll cluster; `nodes` denotes the number of `node manager` machines, and `cores_per_node` denotes the average number of CPU cores per `node manager` machine

```yaml
fate_on_eggroll:
  clustermanager:
    cores_per_node: 16
    nodes: 1
  rollsite:
    host: 127.0.0.1
    port: 9370
```
- `fate_on_spark`: configured according to the resources allocated to the FATE cluster within the Spark cluster; `nodes` indicates the number of Spark nodes, and `cores_per_node` indicates the average number of CPU cores per node allocated to the FATE cluster

```yaml
fate_on_spark:
  spark:
    # default use SPARK_HOME environment variable
    home:
    cores_per_node: 20
    nodes: 2
```
Note: please make sure the Spark cluster actually allocates the configured amount of resources to the FATE cluster. If the Spark cluster allocates fewer resources than are configured here, FATE jobs can still be submitted, but when FATE Flow submits a task to the Spark cluster, the task will not actually execute because the Spark cluster has insufficient resources.
3. Job request resource configuration
We generally use `task_cores` and `task_parallelism` to configure job request resources, for example:
```json
{
  "job_parameters": {
    "common": {
      "job_type": "train",
      "task_cores": 6,
      "task_parallelism": 2,
      "computing_partitions": 8,
      "timeout": 36000
    }
  }
}
```
The total resources requested by the job are `task_cores` * `task_parallelism`. When creating a job, `FATE Flow` distributes the job to each party, and the actual parameters are calculated as described below, based on the above configuration, the running role, and the engine used by the party (via `$FATE_PROJECT_BASE/conf/service_conf.yaml#default_engines`).
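As a small illustration (the variable names are mine, not FATE Flow's), the example configuration above requests `task_cores` * `task_parallelism` cores in total:

```python
# Subset of the job_parameters example above
job_parameters = {
    "task_cores": 6,
    "task_parallelism": 2,
}

# Total cores the job requests from each party's engine
requested = job_parameters["task_cores"] * job_parameters["task_parallelism"]
print(requested)  # 12
```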
4. The process of calculating the actual parameter adaptation for resource requests
1. Calculate `request_task_cores`:
    - guest, host: `request_task_cores` = `task_cores`
    - arbiter, considering that the actual operation consumes very few resources: `request_task_cores` = 1
2. Further calculate `task_cores_per_node`:
    - `task_cores_per_node` = max(1, `request_task_cores` / `task_nodes`)
3. If `eggroll_run` or `spark_run` is used to configure resources in the above `job_parameters`, then the `task_cores` configuration is invalid, and `task_cores_per_node` is calculated as:
    - `task_cores_per_node` = eggroll_run["eggroll.session.processors.per.node"]
    - `task_cores_per_node` = spark_run["executor-cores"]
4. Convert to engine-adapted parameters (which will be presented to the compute engine when running the task):
    - fate_on_standalone/fate_on_eggroll:
        - eggroll_run["eggroll.session.processors.per.node"] = `task_cores_per_node`
    - fate_on_spark:
        - spark_run["num-executors"] = `task_nodes`
        - spark_run["executor-cores"] = `task_cores_per_node`
5. The final calculation results can be seen in the job's `job_runtime_conf_on_party.json`, typically in `$FATE_PROJECT_BASE/jobs/$job_id/$role/$party_id/job_runtime_on_party_conf.json`
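The adaptation steps above can be sketched as follows. This is an illustrative re-implementation, not FATE Flow's actual code; it assumes integer division for the per-node calculation:

```python
def adapt_task_resources(role, task_cores, task_nodes, engine,
                         eggroll_run=None, spark_run=None):
    """Sketch of the parameter adaptation steps described above."""
    # Step 1: request_task_cores depends on the running role
    if role == "arbiter":
        request_task_cores = 1  # arbiter consumes very few resources
    else:  # guest, host
        request_task_cores = task_cores

    # Step 2: per-node cores, at least 1
    task_cores_per_node = max(1, request_task_cores // task_nodes)

    # Step 3: an explicit eggroll_run/spark_run config overrides task_cores
    if eggroll_run and "eggroll.session.processors.per.node" in eggroll_run:
        task_cores_per_node = eggroll_run["eggroll.session.processors.per.node"]
    if spark_run and "executor-cores" in spark_run:
        task_cores_per_node = spark_run["executor-cores"]

    # Step 4: convert to engine-adapted parameters
    if engine in ("fate_on_standalone", "fate_on_eggroll"):
        return {"eggroll.session.processors.per.node": task_cores_per_node}
    if engine == "fate_on_spark":
        return {"num-executors": task_nodes, "executor-cores": task_cores_per_node}
    raise ValueError(f"unknown engine: {engine}")

print(adapt_task_resources("guest", task_cores=6, task_nodes=2, engine="fate_on_spark"))
# {'num-executors': 2, 'executor-cores': 3}
```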
5. Resource Scheduling Policy
- `total_cores`: see total resource allocation above
- `apply_cores`: see job request resource configuration above; `apply_cores` = `task_nodes` * `task_cores_per_node` * `task_parallelism`
- If all participants apply for resources successfully, i.e. (`total_cores` - `apply_cores`) > 0 on each party, then the job's resource application succeeds
- If not all participants apply successfully, a resource rollback command is sent to the participants that did succeed, and the job's resource application fails
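The all-or-nothing policy can be sketched like this (names and the party dicts are hypothetical, not FATE Flow's scheduler; the per-party check compares remaining cores against `apply_cores`):

```python
def try_apply_resources(parties, task_nodes, task_cores_per_node, task_parallelism):
    """Apply apply_cores on every party; roll back all grants if any party fails."""
    apply_cores = task_nodes * task_cores_per_node * task_parallelism
    granted = []
    for party in parties:  # party: dict with 'total_cores' and 'used_cores'
        remaining = party["total_cores"] - party["used_cores"]
        if remaining - apply_cores >= 0:
            party["used_cores"] += apply_cores
            granted.append(party)
        else:
            # rollback: return resources on the parties that already succeeded
            for p in granted:
                p["used_cores"] -= apply_cores
            return False
    return True

parties = [{"total_cores": 32, "used_cores": 0}, {"total_cores": 8, "used_cores": 0}]
# apply_cores = 2 * 3 * 2 = 12; the second party has only 8 cores, so the
# first party's grant is rolled back and the application fails
print(try_apply_resources(parties, task_nodes=2, task_cores_per_node=3, task_parallelism=2))  # False
```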
6. Related commands
query
For querying FATE system resources

```bash
flow resource query
```

Options
Returns
| parameter name | type | description |
| --- | --- | --- |
| retcode | int | return code |
| retmsg | string | return message |
| data | object | return data |
Example
```json
{
  "data": {
    "computing_engine_resource": {
      "f_cores": 32,
      "f_create_date": "2021-09-21 19:32:59",
      "f_create_time": 1632223979564,
      "f_engine_config": {
        "cores_per_node": 32,
        "nodes": 1
      },
      "f_engine_entrance": "fate_on_eggroll",
      "f_engine_name": "EGGROLL",
      "f_engine_type": "computing",
      "f_memory": 0,
      "f_nodes": 1,
      "f_remaining_cores": 32,
      "f_remaining_memory": 0,
      "f_update_date": "2021-11-08 16:56:38",
      "f_update_time": 1636361798812
    },
    "use_resource_job": []
  },
  "retcode": 0,
  "retmsg": "success"
}
```
return
For returning the resources occupied by a job

```bash
flow resource return [options]
```

Options

| parameter name | required | type | description |
| --- | --- | --- | --- |
| job_id | yes | string | job id |

Returns

| parameter name | type | description |
| --- | --- | --- |
| retcode | int | return code |
| retmsg | string | return message |
| data | object | return data |
Example
```json
{
  "data": [
    {
      "job_id": "202111081612427726750",
      "party_id": "8888",
      "resource_in_use": true,
      "resource_return_status": true,
      "role": "guest"
    }
  ],
  "retcode": 0,
  "retmsg": "success"
}
```