TY - GEN
T1 - Efficient evaluation of scheduling metrics using emulation
T2 - 47th International Conference on Parallel Processing, ICPP 2018
AU - Barberato, Claudio
AU - Strazdins, Peter E.
AU - McCreath, Eric
AU - Atif, Muhammad
N1 - Publisher Copyright: © 2018 Association for Computing Machinery.
PY - 2018/8/13
Y1 - 2018/8/13
AB - Scheduling algorithms have a significant impact on the optimal utilization of HPC facilities. Waiting time, response time, slowdown and weighted slowdown are classical metrics used to compare the performance of different scheduling algorithms. This paper investigates the effects of four artefacts, namely non-determinism, shuffling, time shrinking and sampling, on these metrics. We present a scheduling framework based on emulation, that is, using a real scheduler (Slurm) with a sleep program able to take into account periods of suspension. The framework is able to emulate a 50K core cluster using 10 virtualized nodes, with the scheduler running on an isolated node. We find that the non-determinism in repeatedly running a workload has a small but discernible effect on these metrics, and that shuffling job order in a workload increases this by a factor of 5-10. Experiments with shuffled workloads indicate that the average difference between the performance of the Backfill and Suspend-Resume strategies is within this variation. We also propose methodologies for time shrinking and sampling to decrease the duration of emulations, while aiming to keep these metrics invariant (or linearly variant) with respect to the original workload. We find that time shrinking to a factor of up to 90% can have a similar effect on the metrics as non-determinism. For sampling, our methodology preserved the distribution of job sizes to a high extent, but had a variation in the metrics somewhat greater than for shuffling. Finally, we use our framework to study Slurm's scheduling performance in depth, and discover a deficiency in the Suspend-Resume implementation.
KW - Classical scheduling metrics
KW - Emulation
KW - Parallel job scheduling
KW - Slurm
UR - http://www.scopus.com/inward/record.url?scp=85054848022&partnerID=8YFLogxK
U2 - 10.1145/3229710.3229751
DO - 10.1145/3229710.3229751
M3 - Conference contribution
SN - 9781450365239
T3 - ACM International Conference Proceeding Series
BT - 47th International Conference on Parallel Processing, ICPP 2018
PB - Association for Computing Machinery (ACM)
Y2 - 13 August 2018 through 16 August 2018
ER -