Baidu PaddlePaddle Distributed Training on the Volcano System (Part 2)

技术火炬手 posted on 2021/01/28 16:25:29
[Abstract] Running PaddlePaddle jobs on the Volcano platform enables batch creation of compute tasks, automated task management, and self-management of compute jobs. Compared with the plain ReplicaSet + Job approach, Volcano improves the management efficiency of parallel computing.

Baidu PaddlePaddle Distributed Training on the Volcano System (Part 1)

Next, we migrate the compute job described above to the Volcano platform for testing.

Volcano supports multi-pod jobs through the "tasks" field, under which multiple pod descriptions can be defined. Within a task, "replicas" sets how many pods the task will create, "name" sets the task name, and pod names are generated from it; the "template" field is identical to the Kubernetes "podTemplate". The CTR demo contains two tasks, "pserver" and "trainer", each with replicas set to 2, so two PServer pods and two Trainer pods will be created.

To use the Volcano scheduler, the job must set "schedulerName" to "volcano". If "schedulerName" is not set to "volcano", the pods under the job are scheduled by the Kubernetes default scheduler instead.

Volcano provides gang scheduling for a job through the "minAvailable" field. Its value specifies how many of the job's pods must all be schedulable before any of them is actually scheduled, and it must be less than or equal to the total number of pods under the job. A PaddlePaddle job can only start computing once every PServer and Trainer pod is running, so for PaddlePaddle jobs "minAvailable" must equal the total pod count of the job; here that is 2 PServer pods plus 2 Trainer pods, hence minAvailable: 4.

For an application that uses PaddlePaddle distributed training, if a PServer or Trainer pod is evicted or fails during the computation, the compute cluster formed by the PServer and Trainer pods becomes invalid, and all PServer and Trainer pods must be restarted to form a new cluster and start the computation over. Volcano achieves this through "policies": by mapping the "PodEvicted" event to the "RestartJob" action and the "PodFailed" event to the "RestartJob" action, all pods of the job are restarted whenever any pod is evicted or fails. In addition, the trainer task below carries a task-level policy mapping the "TaskCompleted" event to the "CompleteJob" action, so the whole job is marked complete once the trainers finish.

Below is ctr-volcano.yaml, the configuration for running the CTR job on the Volcano platform. The file can be obtained from the Volcano repository.

Volcano repository address:

https://github.com/volcano-sh/volcano/blob/master/example/integrations/paddlepaddle/ctr-paddlepaddle-on-volcano.yaml

apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: ctr-volcano
spec:
  minAvailable: 4
  schedulerName: volcano
  policies:
    - event: PodEvicted
      action: RestartJob
    - event: PodFailed
      action: RestartJob
  tasks:
    - replicas: 2
      name: pserver
      template:
        metadata:
          labels:
            paddle-job-pserver: fluid-ctr
        spec:
          imagePullSecrets:
            - name: default-secret
          volumes:
            - hostPath:
                path: /home/work/
                type: ""
              name: seqdata
          containers:
            - image: volcanosh/edlctr:v1
              command:
                - paddle_k8s
                - start_fluid
              imagePullPolicy: IfNotPresent
              name: pserver
              volumeMounts:
                - mountPath: /mnt/seqdata
                  name: seqdata
              resources:
                limits:
                  cpu: 10
                  memory: 30Gi
                  ephemeral-storage: 10Gi
                requests:
                  cpu: 1
                  memory: 100M
                  ephemeral-storage: 1Gi
              env:
                - name: GLOG_v
                  value: "0"
                - name: GLOG_logtostderr
                  value: "1"
                - name: TOPOLOGY
                  value: ""
                - name: TRAINER_PACKAGE
                  value: /workspace
                - name: NAMESPACE
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.namespace
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: POD_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.name
                - name: PADDLE_CURRENT_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PADDLE_JOB_NAME
                  value: fluid-ctr
                - name: PADDLE_IS_LOCAL
                  value: "0"
                - name: PADDLE_TRAINERS_NUM
                  value: "2"
                - name: PADDLE_PSERVERS_NUM
                  value: "2"
                - name: FLAGS_rpc_deadline
                  value: "36000000"
                - name: ENTRY
                  value: cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1
                - name: PADDLE_PORT
                  value: "30236"
                - name: LD_LIBRARY_PATH
                  value: /usr/local/lib:/usr/local/nvidia/lib64:/usr/local/rdma/lib64:/usr/lib64/mlnx_ofed/valgrind
                - name: PADDLE_TRAINING_ROLE
                  value: PSERVER
                - name: TRAINING_ROLE
                  value: PSERVER
          restartPolicy: OnFailure
    - replicas: 2
      policies:
        - event: TaskCompleted
          action: CompleteJob
      name: trainer
      template:
        metadata:
          labels:
            paddle-job: fluid-ctr
        spec:
          imagePullSecrets:
            - name: default-secret
          volumes:
            - hostPath:
                path: /home/work/
                type: ""
              name: seqdata
          containers:
            - image: volcanosh/edlctr:v1
              command:
                - paddle_k8s
                - start_fluid
              imagePullPolicy: IfNotPresent
              name: trainer
              volumeMounts:
                - mountPath: /mnt/seqdata
                  name: seqdata
              resources:
                limits:
                  cpu: 10
                  memory: 30Gi
                  ephemeral-storage: 10Gi
                requests:
                  cpu: 1
                  memory: 100M
                  ephemeral-storage: 10Gi
              env:
                - name: GLOG_v
                  value: "0"
                - name: GLOG_logtostderr
                  value: "1"
                - name: TOPOLOGY
                  value: ""
                - name: TRAINER_PACKAGE
                  value: /workspace
                - name: CPU_NUM
                  value: "2"
                - name: NAMESPACE
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.namespace
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: POD_NAME
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: metadata.name
                - name: PADDLE_CURRENT_IP
                  valueFrom:
                    fieldRef:
                      apiVersion: v1
                      fieldPath: status.podIP
                - name: PADDLE_JOB_NAME
                  value: fluid-ctr
                - name: PADDLE_IS_LOCAL
                  value: "0"
                - name: FLAGS_rpc_deadline
                  value: "36000000"
                - name: PADDLE_PORT
                  value: "30236"
                - name: PADDLE_PSERVERS_NUM
                  value: "2"
                - name: PADDLE_TRAINERS_NUM
                  value: "2"
                - name: PADDLE_TRAINING_ROLE
                  value: TRAINER
                - name: TRAINING_ROLE
                  value: TRAINER
                - name: LD_LIBRARY_PATH
                  value: /usr/local/lib:/usr/local/nvidia/lib64:/usr/local/rdma/lib64:/usr/lib64/mlnx_ofed/valgrind
                - name: ENTRY
                  value: cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1
          restartPolicy: OnFailure

 

Run the following command on a cluster terminal to create the Volcano job in the default namespace:

root@volcano-paddlepaddle:~# kubectl apply -f ctr-volcano.yaml 
job.batch.volcano.sh/ctr-volcano created

Check the status of the pods: both the pserver pods and the trainer pods have been dispatched to the cluster and have started running. If the free resources in the cluster cannot satisfy the resource requests of the pserver and trainer pods, none of the pods will be created.

root@volcano-paddlepaddle:~# kubectl get pods | grep ctr-volcano

ctr-volcano-pserver-0     1/1     Running     0          16s
ctr-volcano-pserver-1     1/1     Running     0          16s
ctr-volcano-trainer-0     1/1     Running     0          16s
ctr-volcano-trainer-1     1/1     Running     0          16s
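
When the cluster cannot hold the whole gang, the pods stay pending together rather than starting piecemeal. A convenient way to inspect the gang-scheduling decision is the PodGroup object that Volcano creates for each job. This is a sketch: depending on the Volcano version the PodGroup may be named after the job or carry a UID suffix, so list the PodGroups first:

root@volcano-paddlepaddle:~# kubectl get podgroups
root@volcano-paddlepaddle:~# kubectl describe podgroup ctr-volcano

The PodGroup's status and events report how many of the minAvailable pods are schedulable and why the group is not yet ready.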

Pick one of the PServer pods and check its logs: the PServer is listening on its port and serving requests.

root@volcano-paddlepaddle:~# kubectl logs ctr-volcano-pserver-0


+ case "$1" in
+ start_fluid_process
+ pserver_label=paddle-job-pserver=fluid-ctr
+ trainer_label=paddle-job=fluid-ctr
+ hostname=ctr-volcano-pserver-0
+ task_index=
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == PSERVER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job-pserver=fluid-ctr 2
label selector: paddle-job-pserver=fluid-ctr, desired: 2
current cnt: 0 sleep for 5 seconds...
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == WORKER ']'
++ python /root/k8s_tools.py fetch_endpoints paddle-job-pserver=fluid-ctr 30236
+ export PADDLE_PSERVERS=172.20.0.148:30236,172.20.1.134:30237
+ PADDLE_PSERVERS=172.20.0.148:30236,172.20.1.134:30237
++ python /root/k8s_tools.py fetch_ips paddle-job=fluid-ctr
+ export PADDLE_TRAINER_IPS=172.20.0.147,172.20.1.133
+ PADDLE_TRAINER_IPS=172.20.0.147,172.20.1.133
+ '[' PSERVER == TRAINER ']'
+ '[' PSERVER == WORKER ']'
++ python /root/k8s_tools.py fetch_id paddle-job-pserver=fluid-ctr
+ task_index=0
+ export PADDLE_TRAINER_ID=0
+ PADDLE_TRAINER_ID=0
+ export PADDLE_PSERVER_ID=0
+ PADDLE_PSERVER_ID=0
+ stdbuf -oL sh -c 'cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1'
2019-09-03 09:57:55,619 - INFO - run dist training
2019-09-03 09:57:55,708 - INFO - run pserver
get_pserver_program() is deprecated, call get_pserver_programs() to get pserver main and startup in a single call.
I0903 09:57:55.860916    41 grpc_server.cc:435] Server listening on 172.20.0.148:30236 selected port:

Pick one of the Trainer pods and check its logs: the training has started.

root@volcano-paddlepaddle:~# kubectl logs ctr-volcano-trainer-0


+ case "$1" in
+ start_fluid_process
+ pserver_label=paddle-job-pserver=fluid-ctr
+ trainer_label=paddle-job=fluid-ctr
+ hostname=ctr-volcano-trainer-0
+ task_index=
+ '[' TRAINER == TRAINER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job-pserver=fluid-ctr 2
label selector: paddle-job-pserver=fluid-ctr, desired: 2
current cnt: 0 sleep for 5 seconds...
+ '[' TRAINER == TRAINER ']'
+ stdbuf -oL python /root/k8s_tools.py wait_pods_running paddle-job=fluid-ctr 2
label selector: paddle-job=fluid-ctr, desired: 2
++ python /root/k8s_tools.py fetch_endpoints paddle-job-pserver=fluid-ctr 30236
+ export PADDLE_PSERVERS=172.20.0.148:30236,172.20.1.134:30237
+ PADDLE_PSERVERS=172.20.0.148:30236,172.20.1.134:30237
++ python /root/k8s_tools.py fetch_ips paddle-job=fluid-ctr
+ export PADDLE_TRAINER_IPS=172.20.0.147,172.20.1.133
+ PADDLE_TRAINER_IPS=172.20.0.147,172.20.1.133
+ '[' TRAINER == TRAINER ']'
+ check_failed_cnt 1
+ max_failed=1
++ python /root/k8s_tools.py count_pods_by_phase paddle-job=fluid-ctr Failed
+ failed_count=0
+ '[' 0 -gt 1 ']'
++ python /root/k8s_tools.py fetch_id paddle-job=fluid-ctr
+ task_index=0
+ export PADDLE_TRAINER_ID=0
+ PADDLE_TRAINER_ID=0
+ export PADDLE_PSERVER_ID=0
+ PADDLE_PSERVER_ID=0
+ stdbuf -oL sh -c 'cd /workspace/ctr && python train.py --is_local 0 --cloud_train 1'
2019-09-03 09:57:56,712 - INFO - run dist training
2019-09-03 09:57:56,773 - INFO - download the training materials
 % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                Dload  Upload   Total   Spent    Left  Speed
100  433M  100  433M    0     0  96.2M      0  0:00:04  0:00:04 --:--:-- 96.2M
2019-09-03 09:58:27,648 - INFO - run trainer
2019-09-03 09:58:27,732 - WARNING -
I0903 09:58:27.734141    25 parallel_executor.cc:329] The number of CPUPlace, which is used in ParallelExecutor, is 2. And the Program will be copied 2 copies
I0903 09:58:27.937546    25 rpc_client.h:101] init rpc client with trainer_id 0
2019-09-03 09:58:37,957 - INFO - TRAIN --> pass: 0 batch: 0 loss: 0.670620727539 auc: 0.510430537062, batch_auc: 0.510764985415
2019-09-03 09:58:38,264 - INFO - TRAIN --> pass: 0 batch: 1 loss: 0.641319274902 auc: 0.503955813399, batch_auc: 0.503955813399
2019-09-03 09:58:38,585 - INFO - TRAIN --> pass: 0 batch: 2 loss: 0.617138793945 auc: 0.50334993182, batch_auc: 0.50334993182
2019-09-03 09:58:38,873 - INFO - TRAIN --> pass: 0 batch: 3 loss: 0.598490356445 auc: 0.507263818365, batch_auc: 0.507263818365
2019-09-03 09:58:39,182 - INFO - TRAIN --> pass: 0 batch: 4 loss: 0.573976501465 auc: 0.510442316749, batch_auc: 0.51044231674

After roughly 70 minutes, check the job's pods again and observe that the tasks have exited safely.

root@volcano-paddlepaddle:~# kubectl get pod | grep ctr-volcano


ctr-volcano-trainer-0   0/1     Completed   0          77m
ctr-volcano-trainer-1   0/1     Completed   0          77m

Meanwhile, after training finishes, we may want to use the trained model elsewhere. The YAML file assigns this job the volcanosh/edlctr:v1 image, whose working directory is /workspace/ctr. As defined in train.py, the job calls the save_inference_model interface to save the model every 1000 batches and at the end of every pass (one full run over the training set). The saved models are placed under the /workspace/ctr/models folder. So how do we retrieve the model after the job finishes? We suggest the following approaches.
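
For reference, the saving logic in train.py looks roughly like the following sketch (Paddle 1.x fluid API; the function name, feed variable names, and per-checkpoint directory layout are illustrative assumptions, not the demo's exact code):

import paddle.fluid as fluid

def maybe_save_model(exe, predict_var, pass_id, batch_id):
    # train.py checkpoints every 1000 batches and at the end of each pass;
    # each checkpoint goes to its own directory under /workspace/ctr/models.
    if batch_id % 1000 == 0:
        model_dir = "/workspace/ctr/models/pass-%d-batch-%d" % (pass_id, batch_id)
        fluid.io.save_inference_model(
            dirname=model_dir,
            feeded_var_names=["dense_input", "sparse_input"],  # illustrative feeds
            target_vars=[predict_var],  # the network's prediction variable
            executor=exe)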

1) Define a volume in the trainer part of the spec in the YAML file and use the volume mechanism, which maps a container path to a host path, to map the /workspace/ctr/models folder to a folder on the host. Then kubectl describe pod ctr-volcano-trainer-0 tells us which node the model landed on; ssh to that node and go to the mapped host path to retrieve the trained model, as in the sketch below.
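
A sketch of the additions to the trainer template (the host path /home/work/models is an assumption; use any directory available on your nodes):

          # Added to the trainer pod spec, alongside the existing seqdata volume:
          volumes:
            - name: models
              hostPath:
                path: /home/work/models      # assumed host directory
                type: DirectoryOrCreate

              # Added to the trainer container's volumeMounts:
              volumeMounts:
                - mountPath: /workspace/ctr/models
                  name: models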

2) If a more flexible, automated model delivery pipeline is needed, set up a file server and a distributed file system, such as GlusterFS, in the Kubernetes cluster, and map the /workspace/ctr/models folder inside ctr-volcano-trainer-0 onto a GlusterFS PVC (PersistentVolumeClaim). The model can then be fetched and delivered with ftp or wget/curl commands; a sketch follows.
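
A sketch of the PVC variant, assuming a GlusterFS-backed PersistentVolumeClaim named ctr-models-pvc has already been created in the namespace:

          # In the trainer pod spec:
          volumes:
            - name: models
              persistentVolumeClaim:
                claimName: ctr-models-pvc    # assumed pre-created, GlusterFS-backed

              # On the trainer container:
              volumeMounts:
                - mountPath: /workspace/ctr/models
                  name: models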

In summary, using the Volcano platform to run PaddlePaddle jobs enables batch creation of compute tasks, automated task management, and self-management of compute jobs. Compared with the plain ReplicaSet + Job approach, the Volcano platform improves the management efficiency of parallel computing.

Authors

董大祥 (@guru4elephant), PaddlePaddle Architect, Principal Architect, Baidu

王嘉炜 (@wangjiawei04), PaddlePaddle Engineer, Senior Engineer, Baidu

于佃海 (@raindrops2sea), PaddlePaddle Architect, Distinguished Architect, Baidu

张经辉 (@sivanzcw), Volcano Contributor, Cloud Native Software Engineer, Huawei

马达 (@k82cn), Kubernetes Maintainer, SIG-Scheduling Co-Leader, Volcano Lead, Huawei

References

PaddlePaddle website
https://www.paddlepaddle.org.cn

Paddle on Spark
https://github.com/hohdiy/paddle_on_spark/blob/master/doc/paddle_on_spark.md

Run Deep Learning with PaddlePaddle on Kubernetes
https://kubernetes.io/blog/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes/

Volcano website
https://volcano.sh

Volcano community
https://github.com/volcano-sh/volcano

Baidu CTR demo
https://www.paddlepaddle.org.cn/documentation/docs/zh/1.5/user_guides/howto/training/deploy_ctr_on_baidu_cloud_cn.html

CTR-volcano configuration file
https://github.com/volcano-sh/volcano/blob/master/example/integrations/paddlepaddle/ctr-paddlepaddle-on-volcano.yaml

 
