PaddlePaddle EDL: Elastic Deep Learning

While many hardware and software manufacturers are working on improving the running time of deep learning jobs, EDL optimizes

the global utilization of the cluster, and
the waiting time of job submitters.

For more about the EDL project, please refer to this invited blog post on the official Kubernetes blog.

EDL includes two parts:

a Kubernetes controller for the elastic scheduling of distributed deep learning jobs, and
making PaddlePaddle a fault-tolerant deep learning framework.

This directory contains the Kubernetes controller. For more information about fault tolerance, please refer to the design.

We deployed EDL on a real Kubernetes cluster, dlnel.com, which is open to graduate students of Tsinghua University. The performance test report of EDL on this cluster is here.

Tutorials

Usage
How to Build EDL Component
Run CTR Training and Deployment on Baidu Cloud

Design Docs

Fault-Tolerant Training in PaddlePaddle.
Elastic Deep Learning Design Doc.

Future

Resource Adjustments by EDL
Support Fault-Tolerant Distributed Training in PaddlePaddle Fluid.

FAQ

TBD

License

PaddlePaddle EDL is provided under the Apache-2.0 license.
