TonY: A framework to natively run deep learning frameworks on Apache Hadoop.
TonY supports several frameworks, including TensorFlow, PyTorch, MXNet and Horovod. It runs either single-node or distributed training as a Hadoop application on YARN. Support for more frameworks, such as Ray, is planned.
Multiple Training Modes
TonY supports multiple training modes, including Worker + ParameterServer and Ring-All-Reduce. It also supports parallel tasks that do not communicate with each other, as in offline inference scenarios.
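To make the Worker + ParameterServer mode more concrete, here is a minimal sketch of how a training script might work out its own role inside a cluster. It assumes the cluster description arrives in a `TF_CONFIG`-style JSON environment variable, as in TensorFlow's distributed training convention. The helper names `resolve_task` and `is_chief` are hypothetical, and the host names are made up for the example.

```python
import json
import os


def resolve_task(env=None):
    """Return (job_name, task_index, cluster) from a TF_CONFIG-style variable.

    Assumption: the launcher exports a JSON document with "cluster" and
    "task" keys, following TensorFlow's TF_CONFIG convention.
    """
    env = os.environ if env is None else env
    raw = env.get("TF_CONFIG")
    if raw is None:
        # No cluster description: treat the run as single-node training.
        return "worker", 0, {}
    config = json.loads(raw)
    task = config.get("task", {})
    job_name = task.get("type", "worker")
    task_index = int(task.get("index", 0))
    return job_name, task_index, config.get("cluster", {})


def is_chief(job_name, task_index):
    # Convention assumed here: worker 0 handles checkpointing and logging.
    return job_name == "worker" and task_index == 0


if __name__ == "__main__":
    # Hypothetical cluster with one parameter server and two workers.
    example_env = {
        "TF_CONFIG": json.dumps({
            "cluster": {
                "ps": ["host1:2222"],
                "worker": ["host2:2222", "host3:2222"],
            },
            "task": {"type": "worker", "index": 1},
        })
    }
    job, index, cluster = resolve_task(example_env)
    print(job, index, len(cluster.get("worker", [])), is_chief(job, index))
```

A parameter-server task would typically block and serve variables, while worker tasks run the training loop. For the independent parallel tasks used in offline inference, only the task index is needed to pick a shard of the input.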
Easy to use
1. Supports sidecar and inline TensorBoard instances when running training jobs, and helps users visit the TensorBoard site directly
2. Provides the TonY portal to orchestrate many running distributed training jobs
3. Provides the TonY CLI to help users submit training jobs
4. Supports the Docker runtime on newer versions of Hadoop, and remains compatible with older Hadoop versions through TonY's Python virtual environment mechanism
TonY supports heterogeneous resources, including CPU and GPU, on Hadoop YARN. It also provides resource utilization monitoring, so that users can adjust the amount of training resources they request based on actual usage.
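As a rough illustration of how per-task resources might be declared, the sketch below builds a Hadoop-style XML configuration in Python. The property names (`tony.worker.instances`, `tony.worker.memory`, `tony.worker.gpus`, `tony.ps.instances`, `tony.ps.memory`) and all values are assumptions made for this example and should be checked against the TonY configuration reference. `build_config` is a hypothetical helper, not part of TonY.

```python
import xml.etree.ElementTree as ET


def build_config(properties):
    """Serialize a dict into Hadoop-style <configuration> XML."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Assumed property names and values: GPU workers plus CPU-only
    # parameter servers, to show a heterogeneous resource request.
    resources = {
        "tony.worker.instances": 4,
        "tony.worker.memory": "8g",
        "tony.worker.gpus": 1,
        "tony.ps.instances": 1,
        "tony.ps.memory": "4g",
    }
    print(build_config(resources))
```

After observing utilization for a few runs, a user could lower or raise these values and resubmit the job rather than guessing the right allocation up front.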
Easy to extend
The generic runtime interface for training frameworks lets users add support for other deep learning frameworks. Advanced users can also run any job on YARN, not only deep learning workloads.
Existing YARN features to leverage
1. Stable Hadoop scheduler
2. Team-based and hierarchical queues
3. Elasticity between queues
4. User-based limits