PFN Develops High-Efficiency Scheduler for AI Training
- PFN unveils custom Kubernetes scheduler to maximize AI training efficiency.
- Deadline compliance rates improved significantly via the Least Slack Time algorithm.
- Custom evaluation tool enables rapid simulation of weeks-long training scenarios.
Preferred Networks (PFN) has unveiled a custom Kubernetes scheduler and evaluation tool designed to push AI computing infrastructure efficiency to its limits. Training modern AI models requires massive computational resources and time, yet standard schedulers often struggle to handle individual priorities and strict deadlines in complex environments with mixed workloads. To address this, the company proposed a new approach where software tightly coordinates with the AI infrastructure to optimize resource scheduling.
The core of this development is the Least Slack Time (LST) scheduling algorithm. It computes each job's "slack time", the time left until the deadline minus the job's estimated remaining execution time, and prioritizes the jobs with the least slack. Because the scheduler can preempt running jobs and resume them later, smaller jobs submitted later, which often timed out while waiting under standard schedulers, can complete within their deadlines. Using deadlines as a quantitative metric allows efficient operation without relying on subjective priority settings.
The system also integrates Gang Scheduling, which is essential for distributed training, and specialized Bin Packing to prevent accelerator fragmentation. It is specifically designed to minimize resource fragmentation even when node failures cause variations in available chip counts. This design maintains high utilization rates even in large-scale environments prone to hardware uncertainty.
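The combination of gang scheduling and bin packing can be sketched in a few lines of Go. This is an illustration under stated assumptions, not PFN's algorithm: it uses a plain best-fit heuristic as a stand-in for the "specialized" bin packing the article mentions, and the function `PlaceGang` and its parameters are hypothetical. The gang property is modeled as all-or-nothing placement: either every worker of a job gets a node, or none is placed.

```go
package main

import "fmt"

// PlaceGang tries to place `workers` workers, each needing `perWorker`
// accelerators, onto nodes whose free accelerator counts are given in free.
// It returns the chosen node index for each worker, or ok=false if the whole
// gang cannot fit. The input slice is never modified, so a failed attempt
// leaves no partial allocation behind.
func PlaceGang(free []int, workers, perWorker int) (placement []int, ok bool) {
	remaining := append([]int(nil), free...) // work on a private copy
	placement = make([]int, 0, workers)
	for w := 0; w < workers; w++ {
		best := -1
		for i, f := range remaining {
			if f < perWorker {
				continue // this node cannot host the worker
			}
			// Best fit: prefer the fitting node with the least free capacity,
			// which keeps larger contiguous capacity available elsewhere.
			if best == -1 || f < remaining[best] {
				best = i
			}
		}
		if best == -1 {
			return nil, false // one worker does not fit, so reject the gang
		}
		remaining[best] -= perWorker
		placement = append(placement, best)
	}
	return placement, true
}

func main() {
	// Uneven capacities, e.g. because a failure removed chips from node 1.
	free := []int{8, 3, 4}

	placement, ok := PlaceGang(free, 2, 4)
	fmt.Println(placement, ok)

	placement, ok = PlaceGang(free, 4, 4)
	fmt.Println(placement, ok)
}
```

In the first call the two workers land on node 2 (an exact fit) and then node 0, leaving node 0 with four free accelerators for a later job. The second call asks for four workers and is rejected as a whole, since only three 4-accelerator slots exist across the nodes.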
To solve the challenge of lengthy evaluation cycles, PFN also introduced "kube-scheduler-evaluator." This tool simulates job execution scenarios spanning several weeks in just a few minutes by advancing internal virtual time without consuming physical resources. By supporting flexible scenario definitions in Go and releasing the tool as open-source software, PFN aims to contribute significan