NoLoCo is a novel optimisation method for distributed training that replaces the global synchronisation step with a gossip method, enabling training over heterogeneous, low-bandwidth networks.
Current methods for large model training require frequent gradient exchanges between nodes. These demand high-bandwidth, low-latency interconnects; otherwise, nodes will sit idle waiting for updates to arrive. Prior work (including DiLoCo) lowers communication cost by reducing the frequency of all-reduce, yet each event still involves every replica and therefore inherits the same latency constraints.
Today we’re extending this body of research with NoLoCo (No-all-reduce Low-Communication), a method that removes the global all-reduce step altogether. After a short block of local SGD steps, replicas average weights with a single random peer while activations are randomly re-routed between pipeline shards. A modified Nesterov step is used to keep parameters aligned. In experiments with up to 1,000 internet-connected GPUs, NoLoCo delivers the same validation loss as standard data parallelism while reducing synchronisation latency by roughly 10x.
Key Highlights
- No all-reduce: NoLoCo uses data parallelism but avoids global synchronisation (all-reduce) by synchronising only small random groups of replicas, as small as pairs.
- No loss in convergence speed: NoLoCo reduces variance across instances by randomly routing activations between pipeline stages and by adding a new term to Nesterov momentum that controls weight divergence.
- 10x faster synchronisation at scale: With 1,000 replicas, NoLoCo preserves baseline accuracy while reducing each synchronisation step by an order of magnitude relative to tree all-reduce.
Background
In standard data-parallel training every worker computes gradients on its own mini-batch and then participates in a cluster-wide all-reduce so that every replica sees the same update. The collective communication scales linearly with the replica count and is gated by the slowest link. On nodes connected over the internet, this step dominates wall time and wastes compute. Low-communication schemes such as DiLoCo reduce the frequency of the all-reduce, yet each synchronisation still involves every worker and therefore inherits the same latency bound.
NoLoCo shows that we can avoid this all-reduce synchronisation step without hurting convergence.
How it works
Instead of performing an all-reduce operation, NoLoCo synchronises only small, randomly chosen groups of replicas (as small as two). This reduces the time complexity of the synchronisation step by a factor of ~log(N), where N is the number of model replicas; for N = 1,000 that is roughly a 10x reduction. NoLoCo incorporates extra measures to ensure that convergence doesn’t slow down. Inspired by the dynamic and randomised routing methods of SWARM Parallelism and DiPaCo, training samples in NoLoCo are randomly routed between different replicas of pipeline stages.

This forces sufficient mixing of gradients across data-parallel instances.
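
To make the routing idea concrete, here is a minimal, self-contained sketch of how a pipeline stage might forward a micro-batch's activations to a randomly chosen replica of the next stage. The function and parameter names (`route_microbatch`, `next_stage_replicas`, `send`) are illustrative assumptions, not the actual NoLoCo implementation.

```python
import random
from typing import Callable, List

def route_microbatch(
    activations,
    next_stage_replicas: List[int],
    send: Callable[[int, object], None],
) -> int:
    """Forward one micro-batch's activations to a randomly chosen replica of
    the next pipeline stage, so gradients mix across data-parallel instances
    over time. Returns the chosen replica id so the backward pass can follow
    the same route."""
    target = random.choice(next_stage_replicas)
    send(target, activations)
    return target

# Tiny usage example with a stub transport; real code would use a
# point-to-point network send between pipeline shards.
if __name__ == "__main__":
    outbox = {}
    chosen = route_microbatch(
        activations=[0.1, 0.2, 0.3],
        next_stage_replicas=[0, 1, 2, 3],
        send=lambda rank, acts: outbox.setdefault(rank, []).append(acts),
    )
    print(f"micro-batch routed to next-stage replica {chosen}")
```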
NoLoCo also uses a novel variant of Nesterov momentum in which we add a third term that moves the workers' weights closer to each other. With the extra term, the full expression for Nesterov momentum becomes:

If the synchronisation is applied over all workers, the final term vanishes and we recover the original Nesterov momentum expression used in DiLoCo. The different terms are further illustrated in the figure below:

The second term averages the direction in which each individual NoLoCo worker is updating the weights, while the last term can be viewed as an additional update that moves the weights of different workers closer to each other. Applied over many iterations, this third term has an effect similar to a rolling average over randomly sampled replica weights during training.
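
The exact expression is shown as a figure above; as a rough guide to its structure only, one plausible arrangement consistent with the description is sketched below for worker i paired with a random peer j at outer step t. The symbols and coefficients (η, μ, γ, the pseudo-gradients Δ) are assumptions for illustration, not the paper's exact formula.

```latex
% Illustrative sketch only (assumed symbols, not the paper's exact expression):
\theta_i^{t+1} \;=\; \theta_i^{t}
  \;-\; \eta \Big(
      \underbrace{\mu\, m_i^{t}}_{\text{momentum}}
    \;+\; \underbrace{\tfrac{1}{2}\big(\Delta_i^{t} + \Delta_j^{t}\big)}_{\text{averaged update direction}}
    \;+\; \underbrace{\gamma\,\big(\theta_i^{t} - \theta_j^{t}\big)}_{\text{pulls paired weights together}}
  \Big)
```

Here Δ denotes a worker's pseudo-gradient from its local phase, and η, μ, γ would be the outer learning rate, momentum, and alignment coefficients. If the synchronisation ran over all workers, their weights would coincide and the last term would vanish, matching the recovery of the DiLoCo expression noted above.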
In simple terms, NoLoCo includes the following steps:
1. Local phase. Each replica performs k SGD updates.
2. Pairwise averaging. Each replica then chooses one peer at random and averages weights with it using a Nesterov-style rule that bounds drift (see the sketch after this list).
3. Random routing. Throughout the local phase, activations are forwarded to random partner shards, ensuring continuous gradient mixing.
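
To tie the steps together, here is a minimal single-process simulation of the local-phase-plus-pairwise-averaging loop. It is a sketch under assumptions: each "replica" holds a scalar weight, the loss is a toy quadratic, the pairwise Nesterov-style rule is simplified to plain averaging with a mixing coefficient, and random routing of activations is omitted (see the routing sketch above). All names and constants are illustrative, not the paper's implementation.

```python
import random

N_REPLICAS = 8
K_LOCAL_STEPS = 16      # local SGD steps between pairwise syncs ("k" above)
OUTER_ROUNDS = 50
LR, ALIGN = 0.05, 0.5   # inner learning rate and pairwise mixing coefficient

def local_loss_grad(w: float, target: float) -> float:
    """Gradient of a toy quadratic loss 0.5 * (w - target)^2 on one replica's data."""
    return w - target

# Each replica sees a slightly different data distribution (different target).
targets = [1.0 + 0.1 * r for r in range(N_REPLICAS)]
weights = [0.0] * N_REPLICAS

for outer in range(OUTER_ROUNDS):
    # 1. Local phase: k SGD updates per replica, no communication.
    for r in range(N_REPLICAS):
        for _ in range(K_LOCAL_STEPS):
            weights[r] -= LR * local_loss_grad(weights[r], targets[r])

    # 2. Pairwise averaging: shuffle replicas into random pairs and pull each
    #    pair toward its mean (a stand-in for the drift-bounding rule).
    order = list(range(N_REPLICAS))
    random.shuffle(order)
    for a, b in zip(order[::2], order[1::2]):
        mean = 0.5 * (weights[a] + weights[b])
        weights[a] += ALIGN * (mean - weights[a])
        weights[b] += ALIGN * (mean - weights[b])

print("final weights:", [round(w, 3) for w in weights])
```

Because each round pairs replicas at random, information spreads through the whole population over successive rounds without any step ever involving more than two workers at once.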
Findings
We demonstrate, both theoretically and empirically, that NoLoCo maintains convergence while substantially reducing communication requirements. Leveraging these techniques, NoLoCo efficiently trains models from millions to billions of parameters. Our experiments with Llama-style models ranging from 125 million to 6.8 billion parameters, using up to 1,000 replicas, show that NoLoCo's synchronisation steps are an order of magnitude faster than DiLoCo's in practice, while still converging faster. Moreover, both the synchronisation strategy and the modified optimiser can be seamlessly integrated into other training protocols and model architectures.

Why it matters
Removing the global all-reduce lowers the infrastructure threshold for large-model training, allowing researchers to better leverage decentralised hardware without specialised interconnects. We're excited to share NoLoCo as a fully open-source implementation to advance the frontier of open machine learning.
Learn more
Read the paper
Explore the repo – benchmarks, scripts, and a minimal 100-line reference implementation.