A 2025 NeurIPS paper challenges conventional wisdom in language model training: small batch sizes, even down to batch size 1, train stably, outperform larger batches in per-FLOP efficiency, and work with vanilla SGD, requiring no momentum or other optimizer state.[1][5]
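To make the setup concrete, here is a minimal sketch of what "vanilla SGD at batch size 1" means in practice: one micro-batch per optimizer step, momentum set to zero, and no accumulated state. The model and data below are toy placeholders, not the paper's setup.

```python
import torch

# Minimal sketch: batch-size-1 training with vanilla SGD (no momentum, no optimizer state).
# `model` and `train_loader` are toy stand-ins for an LLM and its token batches.
model = torch.nn.Linear(512, 512)
train_loader = [(torch.randn(1, 512), torch.randn(1, 512)) for _ in range(8)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.0)
loss_fn = torch.nn.MSELoss()

model.train()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # one micro-batch per parameter update
    loss.backward()
    optimizer.step()
```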
Authors Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, and Micah Goldblum show that small batches are also more robust to hyperparameter choices such as the learning rate and Adam's β parameters, thanks to a scaling rule that holds the second moment's half-life fixed in tokens rather than fixing the decay rate itself.[1][3] This stability holds across LLM pretraining and fine-tuning, with experiments on datasets like FineWeb-Edu, matching or exceeding large-batch performance while using simpler optimizers.
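One way to translate a fixed half-life in tokens into a decay rate follows directly from the definition of an exponential moving average: if β₂ⁿ = 0.5 after n steps and each step sees a fixed number of tokens, then β₂ = 0.5^(tokens per step / half-life in tokens). The sketch below is an illustrative reading of that rule, not the authors' code; the half-life and batch sizes are hypothetical.

```python
import math

def beta2_for_fixed_halflife(tokens_per_step: int, half_life_tokens: float) -> float:
    """Choose Adam's beta2 so the second-moment EMA decays by half after a
    fixed number of tokens, regardless of batch size.

    An EMA with decay beta2 halves after n steps where beta2**n = 0.5;
    with n = half_life_tokens / tokens_per_step this gives
    beta2 = 0.5 ** (tokens_per_step / half_life_tokens).
    """
    return 0.5 ** (tokens_per_step / half_life_tokens)

# Hypothetical example: a 10M-token half-life at two very different batch sizes.
half_life = 10_000_000
for batch_tokens in (4_096, 524_288):  # e.g. batch size 1 vs 128 at 4k context
    print(batch_tokens, round(beta2_for_fixed_halflife(batch_tokens, half_life), 4))
```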
Key recommendations include avoiding gradient accumulation, which the authors argue is wasteful outside bandwidth-limited multi-device setups, and choosing the smallest batch size that still maximizes throughput, which reduces memory use and tuning effort.[2] Small batches are particularly effective in the far-from-convergence regime typical of LLM training, consistent with compute-optimal scaling laws.
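The gradient-accumulation argument is easiest to see side by side: accumulating spends the same forward/backward FLOPs as stepping on each micro-batch, but yields fewer parameter updates. The toy model and loop below are an illustrative contrast under that framing, not the paper's experimental code.

```python
import torch

# Contrast sketch (toy model/data): same FLOPs, different number of updates.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
micro_batches = [(torch.randn(1, 512), torch.randn(1, 512)) for _ in range(8)]

# (a) Gradient accumulation: 8 micro-batches, 1 optimizer step.
optimizer.zero_grad()
for x, y in micro_batches:
    (loss_fn(model(x), y) / len(micro_batches)).backward()
optimizer.step()

# (b) Small-batch alternative: 8 micro-batches, 8 optimizer steps, same FLOPs.
for x, y in micro_batches:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```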
Practical takeaway: pairing small batches with low-state optimizers can rival full fine-tuning in quality while approaching LoRA's memory footprint, offering a simpler route to memory-efficient LLM optimization.[4]
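A rough back-of-the-envelope comparison shows why the memory footprints end up comparable: most of AdamW's overhead is its two per-parameter state tensors, which vanilla SGD drops entirely. The model size, state precision, and adapter fraction below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope optimizer-state memory (hypothetical 7B-parameter model,
# fp32 optimizer state, ~1% of weights in LoRA adapters; all numbers assumed).
params = 7e9

adamw_state    = params * 2 * 4          # two fp32 moments per parameter
sgd_state      = params * 0              # vanilla SGD keeps no optimizer state
lora_trainable = 0.01 * params           # assumed adapter fraction
lora_adamw     = lora_trainable * 2 * 4  # AdamW state for adapters only

for name, nbytes in [("AdamW, full model", adamw_state),
                     ("Vanilla SGD, full model", sgd_state),
                     ("AdamW on LoRA adapters", lora_adamw)]:
    print(f"{name}: {nbytes / 1e9:.1f} GB of optimizer state")
```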
Embrace small batches: Simplicity scales where complexity stumbles.