Since the inception of the Cloud Native Computing Foundation (CNCF) in 2015, cloud native technologies have made significant strides toward enterprise-level maturity and global adoption. Compared with the traditional monolithic programming paradigm, cloud native offers a faster and more cost-effective way to innovate and develop services on a highly scalable and available application platform. According to the CNCF, cloud native services have been deployed across many industries, including Financial Services, E-commerce, Education, Transportation, Travel, Media & Entertainment, Government, Telecom, and IT, mostly with the goal of improving application development velocity, scalability, efficiency, portability, and availability.
With the emergence and support of Mobile, IoT, and Edge Computing technologies, we are seeing the next wave of workloads running on cloud native platforms: Artificial Intelligence (AI, including Machine Learning and Deep Learning), Big Data, and High-Performance Computing (HPC). These workloads depend on large pools of compute resources running "batch jobs" against massive data lakes.
Kubernetes has been recognized as the de facto cloud native orchestration platform: portable, extensible, and effective at orchestrating and managing container workloads and services. However, Kubernetes still has gaps when it comes to the "batch" job needs of AI, Big Data, and HPC workloads today:
- Kubernetes’ native scheduling function cannot effectively meet the computing requirements of AI, Big Data and HPC workloads.
- Kubernetes’ job management ability cannot meet the complex demands of AI training.
- Data management lacks functions such as data caching on the computing side and data location awareness.
- Resource management lacks time-sharing, resulting in lower resource utilization.
- Support for heterogeneous hardware is insufficient.
Volcano is an open source batch system built on top of Kubernetes to meet the requirements of AI, Big Data and HPC workloads. It includes the following features:
- Versatile batch scheduling, with many policies that can be deployed based on use cases:
- Gang scheduling
- Job-based fair share
- Queue scheduling
- Namespace-based fair share across queues
- Fairness over time
- Job-based priority
- Preemption and reclaim
- Reservation and backfill
- Enhanced job/queue management, such as multiple pod templates and flexible error-handling mechanisms.
- Data caching on the computing side, to improve the efficiency of data transmission and reading.
- A multidimensional comprehensive scoring mechanism, to achieve more efficient management and allocation of resources.
- Support for multiple accelerators, such as GPUs and FPGAs.
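To make several of these features concrete, the sketch below shows what a minimal Volcano setup might look like: a weighted queue, plus a job that uses gang scheduling (`minAvailable`) and multiple pod templates. The queue name, job name, images, and resource values are all hypothetical, and the API versions reflect Volcano's `v1beta1`/`v1alpha1` CRDs at the time of writing.

```yaml
# A weighted queue that jobs can be submitted to; the weight determines
# this queue's relative share when queues compete for cluster resources.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: training            # hypothetical queue name
spec:
  weight: 4
---
# A batch job with two pod templates, scheduled as a gang:
# no pod starts until at least minAvailable pods can be placed.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: tf-training         # hypothetical job name
spec:
  schedulerName: volcano
  queue: training
  minAvailable: 3           # gang scheduling: ps + workers start together
  tasks:
    - replicas: 1
      name: ps
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: ps
              image: example.com/tf-ps:latest       # hypothetical image
              resources:
                requests:
                  cpu: "2"
                  memory: 4Gi
    - replicas: 2
      name: worker
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: worker
              image: example.com/tf-worker:latest   # hypothetical image
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi
                limits:
                  nvidia.com/gpu: 1    # accelerator support via device plugin
```

Because `minAvailable` is 3, the scheduler will only admit the job once the parameter-server pod and both worker pods can all be placed at once, avoiding the partial-start deadlocks that plague distributed training under the default scheduler.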
After over a year of development, the Volcano project team has reached its first major milestone: acceptance as a CNCF Sandbox project in May 2020, with the support of multiple CNCF member companies and developers. Driven by various use cases, Volcano has been integrated with many mainstream computing frameworks and communities, including TensorFlow, Kubeflow, Spark, PyTorch, PaddlePaddle, Horovod (MPI), Cromwell, and MindSpore. Nevertheless, there is still a lot to be done, for both Volcano's capabilities and its ecosystem. We invite you to check out the Volcano project and join the project team.