Intel® Quantum SDK

Workloads and Checkpointing

KevinR_Intel
Moderator

One concept common to the HPC environment is "checkpointing." There is an upper limit on the duration of a single job to facilitate sharing the machine, but that doesn't have to be a fixed limit on your workload's duration. By designing checkpoints into your program, often (but not exclusively) by recording intermediate results or state, you can complete a meaningful step of the workload and then either quit gracefully and resume later, or keep running until the "clock expires" and simply pick up from the latest checkpoint. Checkpointing also makes a program more robust against hardware failures or unanticipated bugs.
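As a concrete illustration, here is a minimal sketch of a shot-batched checkpoint loop in plain C++. The run_kernel_once() function is a hypothetical stand-in for however your program invokes a quantum kernel (here it just returns a random bitstring so the sketch compiles and runs), and the checkpoint file name and format are arbitrary; the point is the record-and-resume pattern, not a specific API.

```cpp
#include <cstdio>
#include <fstream>
#include <map>
#include <random>
#include <string>

// Hypothetical stand-in for one execution of a quantum kernel.
// Here it just returns a random two-bit string so the sketch runs.
std::string run_kernel_once() {
    static std::mt19937 gen{std::random_device{}()};
    static std::uniform_int_distribution<int> bit(0, 1);
    return std::to_string(bit(gen)) + std::to_string(bit(gen));
}

int main() {
    const int total_shots = 100000;
    const int batch = 1000;               // checkpoint every 1000 shots
    const char* ckpt = "counts.ckpt";     // arbitrary checkpoint file name

    std::map<std::string, long> counts;   // measurement histogram
    int done = 0;

    // Resume: reload the shot count and histogram if a checkpoint exists.
    if (std::ifstream in{ckpt}) {
        in >> done;
        std::string bits;
        long n;
        while (in >> bits >> n) counts[bits] = n;
    }

    while (done < total_shots) {
        counts[run_kernel_once()]++;
        if (++done % batch == 0) {
            // Record intermediate results: write to a temp file, then
            // rename over the old checkpoint so a crash mid-write does
            // not destroy the previous good checkpoint.
            std::ofstream out{"counts.tmp"};
            out << done << '\n';
            for (const auto& kv : counts)
                out << kv.first << ' ' << kv.second << '\n';
            out.close();
            std::rename("counts.tmp", ckpt);
        }
    }
    return 0;
}
```

If the job is killed between batches, rerunning the program picks up from the last recorded shot count instead of starting over.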

In many usage and architecture models, this concept won't entirely disappear. If quantum hardware needs to be shared among a population of users, then queues, limits, and checkpointing will follow.

It's also good to keep in mind that on real quantum hardware it will not be possible to 'save state' in the middle of a quantum kernel / circuit execution and resume it later.

There are several ways the workload of a classical-quantum program can grow. The most obvious is an algorithm that requires a large number of qubits. Another is the number of 'shots' needed for statistically meaningful results, or a quantum kernel being invoked inside a classical loop. In most cases, you can run a smaller, trial-sized portion of your workload to understand how the duration scales with size.
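For the shot-count case specifically, a small timing trial gives a rough per-shot cost you can extrapolate. The sketch below assumes the same hypothetical run_kernel_once() placeholder as above, and the extrapolation is only meaningful where the cost really is linear in the number of shots; it is not for growing qubit counts on a simulator, where time and memory grow much faster.

```cpp
#include <chrono>
#include <iostream>

// Hypothetical placeholder for one quantum kernel execution;
// swap in your real kernel invocation to get a meaningful number.
void run_kernel_once() {}

int main() {
    const int trial_shots = 100;      // small, trial-sized run
    const int full_shots  = 100000;   // intended production run

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < trial_shots; ++i) run_kernel_once();
    auto t1 = std::chrono::steady_clock::now();

    // Per-shot cost from the trial, extrapolated to the full workload.
    double per_shot =
        std::chrono::duration<double>(t1 - t0).count() / trial_shots;
    std::cout << "Estimated full run: " << per_shot * full_shots << " s\n";
    return 0;
}
```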
