A 2025 NeurIPS paper challenges conventional wisdom in language model training: small batch sizes, even down to batch size 1, train stably, outperform larger batches in per-FLOP efficiency, and work with vanilla SGD, requiring no momentum or other optimizer state.[1][5]
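To make the setup concrete, here is a minimal sketch of what "vanilla SGD at batch size 1" means in practice: one micro-batch per optimizer step, momentum set to zero, and no accumulated state. The model and data below are toy placeholders, not the paper's setup.

```python
import torch

# Minimal sketch: batch-size-1 training with vanilla SGD (no momentum, no optimizer state).
# `model` and `train_loader` are toy stand-ins for an LLM and its token batches.
model = torch.nn.Linear(512, 512)
train_loader = [(torch.randn(1, 512), torch.randn(1, 512)) for _ in range(8)]

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.0)
loss_fn = torch.nn.MSELoss()

model.train()
for inputs, targets in train_loader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)  # one micro-batch per parameter update
    loss.backward()
    optimizer.step()
```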
Authors Martin Marek, Sanae Lotfi, Aditya Somasundaram, Andrew Gordon Wilson, and Micah Goldblum show that small batches are also more robust to hyperparameter choices such as the learning rate and Adam's β parameters, thanks to a scaling rule that holds the second moment's half-life fixed in tokens rather than fixing the decay rate itself.[1][3] This stability holds across LLM pretraining and fine-tuning, with experiments on datasets like FineWeb-Edu, matching or exceeding large-batch performance while using simpler optimizers.
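One way to translate a fixed half-life in tokens into a decay rate follows directly from the definition of an exponential moving average: if β₂ⁿ = 0.5 after n steps and each step sees a fixed number of tokens, then β₂ = 0.5^(tokens per step / half-life in tokens). The sketch below is an illustrative reading of that rule, not the authors' code; the half-life and batch sizes are hypothetical.

```python
import math

def beta2_for_fixed_halflife(tokens_per_step: int, half_life_tokens: float) -> float:
    """Choose Adam's beta2 so the second-moment EMA decays by half after a
    fixed number of tokens, regardless of batch size.

    An EMA with decay beta2 halves after n steps where beta2**n = 0.5;
    with n = half_life_tokens / tokens_per_step this gives
    beta2 = 0.5 ** (tokens_per_step / half_life_tokens).
    """
    return 0.5 ** (tokens_per_step / half_life_tokens)

# Hypothetical example: a 10M-token half-life at two very different batch sizes.
half_life = 10_000_000
for batch_tokens in (4_096, 524_288):  # e.g. batch size 1 vs 128 at 4k context
    print(batch_tokens, round(beta2_for_fixed_halflife(batch_tokens, half_life), 4))
```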
Key recommendations include avoiding gradient accumulation, which the authors argue is wasteful outside bandwidth-limited multi-device setups, and choosing the smallest batch size that still maximizes throughput, which reduces memory use and tuning effort.[2] Small batches are particularly effective in the far-from-convergence regime typical of LLM training, consistent with compute-optimal scaling laws.
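The gradient-accumulation argument is easiest to see side by side: accumulating spends the same forward/backward FLOPs as stepping on each micro-batch, but yields fewer parameter updates. The toy model and loop below are an illustrative contrast under that framing, not the paper's experimental code.

```python
import torch

# Contrast sketch (toy model/data): same FLOPs, different number of updates.
model = torch.nn.Linear(512, 512)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.MSELoss()
micro_batches = [(torch.randn(1, 512), torch.randn(1, 512)) for _ in range(8)]

# (a) Gradient accumulation: 8 micro-batches, 1 optimizer step.
optimizer.zero_grad()
for x, y in micro_batches:
    (loss_fn(model(x), y) / len(micro_batches)).backward()
optimizer.step()

# (b) Small-batch alternative: 8 micro-batches, 8 optimizer steps, same FLOPs.
for x, y in micro_batches:
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()
```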
Practical takeaway: pairing small batches with low-state optimizers can rival full fine-tuning in quality while approaching LoRA's memory footprint, offering a simpler route to memory-efficient LLM optimization.[4]
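A rough back-of-the-envelope comparison shows why the memory footprints end up comparable: most of AdamW's overhead is its two per-parameter state tensors, which vanilla SGD drops entirely. The model size, state precision, and adapter fraction below are illustrative assumptions, not figures from the paper.

```python
# Back-of-the-envelope optimizer-state memory (hypothetical 7B-parameter model,
# fp32 optimizer state, ~1% of weights in LoRA adapters; all numbers assumed).
params = 7e9

adamw_state    = params * 2 * 4          # two fp32 moments per parameter
sgd_state      = params * 0              # vanilla SGD keeps no optimizer state
lora_trainable = 0.01 * params           # assumed adapter fraction
lora_adamw     = lora_trainable * 2 * 4  # AdamW state for adapters only

for name, nbytes in [("AdamW, full model", adamw_state),
                     ("Vanilla SGD, full model", sgd_state),
                     ("AdamW on LoRA adapters", lora_adamw)]:
    print(f"{name}: {nbytes / 1e9:.1f} GB of optimizer state")
```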
Embrace small batches: Simplicity scales where complexity stumbles.