This post presents CheckFree, a novel recovery method for failures in distributed training that requires neither checkpointing nor redundant computation, enabling efficient training in the presence of frequent failures.
Key Highlights
- Up to 1.6x speedup over conventional checkpointing: In the presence of frequent stage failures, CheckFree and CheckFree+ reduce training time by up to 1.6x compared to conventional checkpointing.
- Novel checkpoint-less recovery: CheckFree uses the weights of neighbouring stages to approximate the weights of the lost stage.
Background
In state-of-the-art recovery strategies, model weights are checkpointed (stored periodically) to non-faulty centralized storage. This can prove incredibly costly: a single LLaMa 70B checkpoint takes over 20 minutes to store even over high-bandwidth connections (above 500 Mb/s). When a fault occurs, the model is rolled back entirely to the previous checkpoint, losing potentially hours of training. Bamboo proposed redundant computation as an alternative to checkpointing: the weights of each stage are also stored on the previous stage, and every microbatch forward pass is executed redundantly on the copies. This way, when a single fault occurs, training can resume immediately. However, this approach proves ineffective for large models, as each node must double its memory requirements to store the redundant layers. CheckFree and CheckFree+ provide a viable alternative for large-scale geo-distributed training, as they incur no additional computation or communication.
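As a rough sanity check on that checkpointing cost, here is a back-of-the-envelope estimate (an illustrative sketch, not a figure from the paper; it assumes 16-bit weights and ignores optimizer state, which would only make the checkpoint larger):

```python
# Back-of-the-envelope: time to upload one LLaMa 70B checkpoint at 500 Mb/s.
params = 70e9            # model parameters
bytes_per_param = 2      # assuming bf16/fp16 weights (optimizer state excluded)
link_bits_per_s = 500e6  # 500 Mb/s connection

checkpoint_bits = params * bytes_per_param * 8
upload_minutes = checkpoint_bits / link_bits_per_s / 60
print(f"~{upload_minutes:.0f} minutes per checkpoint")  # ~37 minutes
```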
How it works
When a fault occurs, the lost stage is recovered via a weighted average of its two neighbouring stages. This exploits the natural redundancy across LLM layers, shown in prior work where removing a few layers does not significantly impact the performance of the model. We demonstrate empirically that averaging significantly outperforms simple copying, which is typically used in layer-stacking works.

A naive approach would be a uniform average of the two neighbouring stages. Such averaging, however, does not account for the importance and convergence of each stage, and thereby leads to slower convergence of the overall model. For that reason, CheckFree weights each neighbouring stage by the norm of its most recent gradient. Conceptually, this gives more weight to stages that have not yet converged as much, partially offloading their functionality to the new stage. To allow the reinitialised stage to “catch up”, CheckFree slightly increases its learning rate for a few steps post-recovery.
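The sketch below illustrates the idea (an illustrative implementation, not the released code; the function name, argument layout, and use of plain state dicts are assumptions):

```python
import torch

def recover_lost_stage(prev_state: dict[str, torch.Tensor],
                       next_state: dict[str, torch.Tensor],
                       prev_grad_norm: float,
                       next_grad_norm: float) -> dict[str, torch.Tensor]:
    """Rebuild a lost pipeline stage as a weighted average of its neighbours.

    Each neighbour is weighted by the norm of its most recent gradient, so
    less-converged stages (larger gradient norms) contribute more to the
    reinitialised stage. Assumes all stages share parameter names and shapes.
    """
    alpha = prev_grad_norm / (prev_grad_norm + next_grad_norm)
    return {name: alpha * prev_state[name] + (1.0 - alpha) * next_state[name]
            for name in prev_state}
```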

However, this strategy cannot recover the weights of the first and last stages, as they each have only one neighbour to average from. To deal with this, we propose CheckFree+. It enables recovery of the edge stages by leveraging out-of-order execution: every other batch swaps the order of the first two and of the last two stages, allowing the intermediate layers to learn their neighbours' behaviour, similar to redundant computation but without the additional memory or computation overhead. In the event of a failure, the "redundant" stages can be copied to replace the missing ones.
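A minimal sketch of this scheduling and recovery idea (illustrative only; the function names, the "every other step" swap schedule, and the partner lookup are assumptions based on the description above):

```python
import copy

def stage_order(num_stages: int, step: int) -> list[int]:
    """Return the pipeline stage order for a given training step.

    On every other step the first two and the last two stages are swapped,
    so stage 1 also trains in stage 0's position (and stage n-2 in stage
    n-1's), learning its neighbour's behaviour without extra memory or
    compute.
    """
    order = list(range(num_stages))
    if step % 2 == 1 and num_stages >= 4:
        order[0], order[1] = order[1], order[0]
        order[-2], order[-1] = order[-1], order[-2]
    return order

def recover_edge_stage(failed_idx: int, num_stages: int, stage_states: dict) -> dict:
    """When the first or last stage fails, copy the weights of its swap partner,
    which has periodically trained in the failed stage's position."""
    partner = 1 if failed_idx == 0 else num_stages - 2
    return copy.deepcopy(stage_states[partner])
```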
Findings
We extensively evaluate CheckFree and CheckFree+ at failure rates from 5% to 16% per hour against conventional checkpointing and redundant computation. Across various model sizes, CheckFree and CheckFree+ converge faster in wall-clock time than the state of the art. Our methods do converge more slowly iteration-wise than a no-failure baseline (to which redundant computation is equivalent in terms of convergence), but thanks to their lightweight recovery procedure, they achieve much higher throughput, making them well suited for geo-distributed large language model training.

Why it matters
In decentralized training, nodes can enter and leave the network at any point, potentially causing the failure of an entire stage. Even in distributed training on preemptible instances, an entire stage can be lost if its nodes are scheduled in the same region. Checkpointing incurs heavy overhead due to frequent restarts, while redundant computation may not even be feasible for large models due to the linear increase in memory. CheckFree offers an efficient way to recover LLM training without any additional computation or communication. We're excited to open source it today.
Learn More
• Read the paper
• Explore the repo
• Join the discussion | Discord & X