The Rounding Bin Becomes the Training Policy
The FP4 pretraining paper by Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou is useful because it makes numerical format choice look like infrastructure policy.
For this essay, a format-training receipt is the record that binds codebook geometry, rounding rule, transform scope, scale hierarchy, kernel overhead, BF16 baseline, and long-run loss gap into one auditable low-precision training claim.
The Claim
The paper, arXiv:2606.20381 [cs.AI], was submitted on June 18, 2026. It studies FP4 training for large language models and argues that the default E2M1 4-bit format has a structural problem under common training recipes.
The problem is not simply "four bits are hard." The authors identify shrinkage bias: a systematic negative rounding error caused by asymmetric Round-to-Nearest-Even (RTNE) bins in non-uniform grids such as E2M1. They then connect that local grid geometry to layerwise signal decay and long-run training loss.
The proposed answer is UFP4, a 4-bit training recipe built on an E1M2/INT4-style uniform grid. The paper's narrow recommendation is that future accelerators should expose uniform 4-bit grids as first-class training primitives alongside E2M1.
The Format Problem
FP4 training promises lower memory and computation cost for pretraining. The hardware context matters: the paper names NVIDIA Blackwell and Rubin-class systems and AMD MI350-series GPUs as examples of current FP4 paths centered on E2M1-style data elements.
E2M1 has one sign bit, two exponent bits, and one mantissa bit. That gives it useful dynamic range for raw outlier-heavy tensors, but its representable values are not uniformly spaced. The authors argue that this geometry becomes a liability once tensor preprocessing changes the quantization regime.
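To see the spacing difference concretely, here is a minimal sketch that enumerates the non-negative magnitudes of both layouts. The helper name `minifloat_magnitudes` is hypothetical, and the decoder assumes subnormals, no infinities or NaNs, and an exponent bias of 1 for both formats. The E1M2 bias in particular is an assumption of this essay, not a detail taken from the paper.

```python
def minifloat_magnitudes(exp_bits: int, man_bits: int, bias: int) -> list[float]:
    """Non-negative values of a tiny float format with subnormals and no inf/NaN."""
    values = []
    for e in range(2 ** exp_bits):
        for m in range(2 ** man_bits):
            frac = m / 2 ** man_bits
            if e == 0:
                # Subnormal range: no implicit leading one.
                values.append(frac * 2.0 ** (1 - bias))
            else:
                values.append((1.0 + frac) * 2.0 ** (e - bias))
    return values


def gaps(grid):
    """Distances between neighbouring grid points."""
    return [b - a for a, b in zip(grid, grid[1:])]


E2M1 = minifloat_magnitudes(exp_bits=2, man_bits=1, bias=1)
E1M2 = minifloat_magnitudes(exp_bits=1, man_bits=2, bias=1)

print("E2M1", E2M1, "gaps", gaps(E2M1))
print("E1M2", E1M2, "gaps", gaps(E1M2))
```

Under these assumptions E2M1 has gaps of 0.5, 1, and 2 as magnitudes grow, while every E1M2 gap is 0.25. The absolute scale of the uniform grid depends on the assumed bias, but the uniform spacing does not.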
Random Hadamard Transform, or RHT, is the key preprocessing step. RHT spreads outlier energy across coordinates, improving codebook utilization. Under E2M1, though, that can move tensor mass into bins whose asymmetric geometry creates a toward-zero rounding bias.
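The spreading effect is easy to check on a single block. The sketch below assumes a block size of 16, a random sign flip followed by an orthonormal Walsh-Hadamard transform, and hypothetical helper names `fwht` and `rht`. It illustrates the general transform, not the paper's kernel.

```python
import math
import random


def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    out = list(block)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def rht(block, signs):
    """Random Hadamard transform: random sign flip, then Hadamard rotation."""
    return fwht([s * v for s, v in zip(signs, block)])


rng = random.Random(0)
signs = [rng.choice((-1.0, 1.0)) for _ in range(16)]

spiky = [0.0] * 16
spiky[3] = 16.0  # one large outlier in a block of 16
rotated = rht(spiky, signs)

print("max before:", max(abs(v) for v in spiky))
print("max after: ", max(abs(v) for v in rotated))
print("energy before/after:", sum(v * v for v in spiky), sum(v * v for v in rotated))
```

A single outlier of magnitude 16 becomes sixteen entries of magnitude 4, with total energy unchanged. The block no longer needs a wide dynamic range, which is exactly the regime where grid spacing near the middle of the codebook starts to matter.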
Shrinkage Bias
The paper defines shrinkage bias as a negative expected RTNE error in normalized magnitude space. In plain terms, the codebook repeatedly rounds values slightly inward, and that small inward error can accumulate through many layers.
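A rough way to probe this definition is to measure the mean signed rounding error directly. The sketch below is an illustration under loud assumptions: Gaussian blocks of 16 stand in for post-RHT tensors, each block is scaled so its largest magnitude lands on the top code, and the helper names `rtne` and `mean_signed_error` are hypothetical. It does not reproduce the paper's measurement protocol or its numbers.

```python
import bisect
import random

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
UNIFORM = [float(i) for i in range(8)]  # INT4-style magnitudes 0..7


def rtne(mag, grid):
    """Round a non-negative magnitude to the nearest grid point, ties to the even code."""
    mag = min(mag, grid[-1])  # saturate at the largest magnitude
    hi = min(max(bisect.bisect_left(grid, mag), 1), len(grid) - 1)
    lo = hi - 1
    d_lo, d_hi = mag - grid[lo], grid[hi] - mag
    if d_hi < d_lo or (d_hi == d_lo and hi % 2 == 0):
        return grid[hi]
    return grid[lo]


def mean_signed_error(blocks, grid):
    """Mean of (quantized - original) magnitude, normalized by the block maximum."""
    total, count = 0.0, 0
    for block in blocks:
        absmax = max(abs(v) for v in block)
        scale = grid[-1] / absmax  # map the block maximum onto the top code
        for v in block:
            m = abs(v) * scale
            total += (rtne(m, grid) - m) / grid[-1]
            count += 1
    return total / count


rng = random.Random(0)
blocks = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(2000)]
for name, grid in (("E2M1", E2M1), ("uniform", UNIFORM)):
    print(name, mean_signed_error(blocks, grid))
```

A negative value means magnitudes are, on average, rounded inward. Comparing the two printed values shows how the sign and size of that average depend on the grid under these assumptions. A different input distribution or scaling rule would change both.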
Uniform grids such as E1M2 or INT4 avoid the particular bin-asymmetry source of this bias. That does not make low-precision training free. It changes the error surface: after RHT reduces the need for extreme dynamic range, local magnitude preservation becomes more important.
The tensor diagnostics make the mechanism concrete. For the outlier-heavy linear_fc2/fwd_x tensor, RHT reverses the format ranking. Before rotation, E2M1 leads at 21.90 dB versus 19.94 dB. After rotation, E1M2 leads at 23.19 dB versus 20.00 dB, and the effective bucket ratio rises from 0.56 to 0.97 on average.
The UFP4 Recipe
UFP4 uses an E1M2/INT4-style uniform grid, a 1x16 quantization block, a single-level FP32 scale hierarchy, an RHT block size of 16, and stochastic rounding only on dY.
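The block-and-scale part of the recipe can be sketched in a few lines. This is a simplified illustration, not the paper's kernel: it assumes a symmetric integer grid with codes from -7 to 7, an absmax-derived scale per block of 16, and hypothetical function names. Python floats are 64-bit, so the FP32 storage of the scale is noted in a comment rather than modeled.

```python
import math
import random

LEVELS = 7  # largest magnitude code on an assumed INT4-style symmetric grid


def quantize_block(block, rng=None):
    """Quantize one 1x16 block to integer codes plus one per-block scale.

    rng=None uses round-to-nearest-even. Passing a random.Random switches to
    stochastic rounding, which rounds up with probability equal to the
    fractional part and is unbiased in expectation (apart from clipping).
    """
    absmax = max(abs(v) for v in block)
    scale = absmax / LEVELS if absmax > 0 else 1.0  # a real kernel would store this as FP32
    codes = []
    for v in block:
        x = v / scale
        if rng is None:
            q = round(x)  # Python's round() breaks ties to even
        else:
            floor = math.floor(x)
            q = floor + (1 if rng.random() < x - floor else 0)
        codes.append(max(-LEVELS, min(LEVELS, q)))
    return codes, scale


def dequantize_block(codes, scale):
    return [c * scale for c in codes]


rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(16)]
codes, scale = quantize_block(block)

# Average many stochastic roundings of the same block.
trials = 4000
sr_mean = [0.0] * 16
for _ in range(trials):
    c, s = quantize_block(block, rng)
    for i, v in enumerate(dequantize_block(c, s)):
        sr_mean[i] += v / trials

print("codes:", codes)
print("max |SR mean - original|:", max(abs(a - b) for a, b in zip(sr_mean, block)))
```

Averaging many stochastic roundings converges toward the original values, which is the usual motivation for applying stochastic rounding on a gradient path. The cost is extra variance per step, so restricting it to dY is a recipe choice, not a free default.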
The major structural difference is RHT scope. The E2M1 reference uses RHT only on the weight-gradient path, bwd_dw. UFP4 applies RHT to all three linear-layer training GEMMs: FPROP fwd_y, DGRAD bwd_dx, and WGRAD bwd_dw.
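Applying RHT to a GEMM is possible at all because an orthonormal rotation along the contraction dimension cancels when both operands are rotated with the same signs. The sketch below checks that identity numerically before any quantization. The shapes, the shared-sign convention, and the helper names are assumptions for illustration. How the paper places the transform within each of the three GEMMs is not described in this essay.

```python
import math
import random


def rht(vec, signs):
    """Sign flip followed by an orthonormal Walsh-Hadamard transform."""
    out = [s * v for s, v in zip(signs, vec)]
    h = 1
    while h < len(out):
        for start in range(0, len(out), 2 * h):
            for i in range(start, start + h):
                out[i], out[i + h] = out[i] + out[i + h], out[i] - out[i + h]
        h *= 2
    return [v / math.sqrt(len(out)) for v in out]


def matmul_rows_cols(rows, cols):
    """GEMM written as dot products between rows of A and columns of B."""
    return [[sum(a * b for a, b in zip(r, c)) for c in cols] for r in rows]


rng = random.Random(0)
K = 16  # contraction dimension, one RHT block
signs = [rng.choice((-1.0, 1.0)) for _ in range(K)]
x_rows = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(4)]  # activations, 4 x K
w_cols = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(3)]  # weights, K x 3, stored by column

plain = matmul_rows_cols(x_rows, w_cols)
rotated = matmul_rows_cols([rht(r, signs) for r in x_rows],
                           [rht(c, signs) for c in w_cols])

max_diff = max(abs(p - q) for pr, qr in zip(plain, rotated) for p, q in zip(pr, qr))
print("largest difference between plain and rotated GEMM:", max_diff)
```

In exact arithmetic the two products are identical, so any change in training behavior comes from how the rotated operands quantize, not from the rotation itself.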
This is the practical claim: full-RHT training is not inherently harmful. The paper argues that it becomes harmful when the post-RHT tensor regime is forced back through a non-uniform E2M1 grid. With a uniform grid, full-RHT coverage becomes beneficial in the tested setting.
Experiments
The paper compares BF16, a controlled E2M1 reference, and UFP4 on Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining. The metric is BF16-relative language-modeling loss error.
Across all three settings, UFP4 stays closer to BF16. The latest-1000-step relative error drops from 1.2570% to 0.9673% on Dense 1.5B, from 2.3596% to 1.8469% on MoE 7.9B, and from 1.7308% to 1.3863% on MoE 124B.
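As a quick piece of arithmetic on the reported figures, the sketch below computes how much of the E2M1 gap to BF16 is removed in each setting. The `gap_reduction` helper is this essay's own derived quantity, not a metric the paper reports.

```python
# Latest-1000-step BF16-relative loss error in percent: (E2M1 reference, UFP4).
reported = {
    "Dense 1.5B": (1.2570, 0.9673),
    "MoE 7.9B": (2.3596, 1.8469),
    "MoE 124B": (1.7308, 1.3863),
}


def gap_reduction(e2m1_pct, ufp4_pct):
    """Fraction of the E2M1 loss-error gap removed by UFP4."""
    return (e2m1_pct - ufp4_pct) / e2m1_pct


reductions = {name: gap_reduction(*pair) for name, pair in reported.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.1%} smaller gap to BF16")
```

By this measure UFP4 removes roughly a fifth to a quarter of the gap in each setting. The gap shrinks but does not close.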
The ablations matter because they show that the E2M1 baseline was not casual. The authors tune the E2M1 configuration through controlled one-factor ablations, selecting WGRAD-only RHT with dY-only stochastic rounding. Each candidate Dense 1.5B run trains for 200B tokens before the E2M1 reference is frozen for long-run and scaling experiments.
For E1M2 FP4 on Dense 1.5B runs trained beyond 100B tokens, full RHT is best. Relative to no RHT, full RHT reduces loss by 0.01123, and stochastic rounding on dY adds another 0.00456 reduction.
Hardware Reading
The paper does not merely propose a software recipe. It argues for a hardware primitive. Range-restricted E2M1 variants underperform the E2M1 reference on Dense 1.5B and MoE 7.9B in the tested recipes, so the authors say range-restricted E2M1 is not a satisfactory substitute for native E1M2/INT4 support.
The fused-kernel result is the practical counterweight. For block size 16, fused RHT plus quantization is about 1.06x and 1.07x the latency of standalone quantization across tested BF16 matrix shapes on SM90 and SM100. Unfused RHT plus quantization is much more expensive, at 1.62x and 1.41x the fused latency.
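The two sets of ratios can be chained to express unfused cost against the same standalone-quantization baseline. The sketch assumes the figures pair up in the order listed, with the first of each pair belonging to SM90 and the second to SM100. That pairing is an assumption of this essay.

```python
# (fused / standalone quantization, unfused / fused); the SM90-then-SM100 pairing is assumed.
ratios = {"SM90": (1.06, 1.62), "SM100": (1.07, 1.41)}


def unfused_vs_standalone(fused_over_quant, unfused_over_fused):
    """Unfused RHT-plus-quantization latency relative to standalone quantization."""
    return fused_over_quant * unfused_over_fused


for arch, (fused, unfused) in ratios.items():
    total = unfused_vs_standalone(fused, unfused)
    print(f"{arch}: fused adds {fused - 1:.0%}, unfused is {total:.2f}x standalone quantization")
```

Read this way, fusion turns RHT from a cost of roughly half again or more on top of quantization into a single-digit percentage overhead.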
That makes the infrastructure claim specific: UFP4 needs native uniform 4-bit data elements and fused implementation paths, not just a new training-script flag.
Governance Reading
The Spiralist reading is that the rounding bin becomes a training policy. A tiny numerical-format choice shapes which models can be trained cheaply, which hardware paths become standard, and which loss gaps are treated as acceptable.
This is not only an optimization detail. Training efficiency changes access, competition, energy demand, deployment economics, and reproducibility. If one accelerator generation makes E2M1 easy and uniform grids awkward, the default format can steer the whole field even when a different grid is better for a recipe.
The governance question is therefore not "is FP4 good?" It is: which FP4 format, under which rounding rule, with which transform, on which hardware, compared against which BF16 baseline, and with what measured degradation?
Training Receipts
A format-training receipt should include the data element format, codebook, rounding rule, block size, scale hierarchy, RHT scope, RHT block size, stochastic-rounding scope, and whether transforms are fused into quantization kernels.
The experiment receipt should include model family and size, dense or MoE architecture, token budget, optimizer and schedule, BF16 baseline, E2M1 reference-selection procedure, latest-window metric, scaling-law fit, ablation matrix, and per-kernel latency.
The hardware receipt should say whether the accelerator supports E2M1, E1M2, INT4-style uniform grids, fused RHT plus quantization, native matrix throughput for the chosen format, and fallback costs when the desired grid is emulated.
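One way to make such a receipt checkable is a small record type whose unset fields are reported as gaps. The class name, field names, and `missing_fields` helper below are hypothetical, and only a subset of the fields listed above is modeled. The filled-in values are the UFP4 settings and the Dense 1.5B figure quoted in this essay. Anything the essay does not state is left unset, not guessed.

```python
from dataclasses import dataclass, fields
from typing import Optional, Tuple


@dataclass(frozen=True)
class FormatTrainingReceipt:
    data_format: Optional[str] = None
    rounding_rule: Optional[str] = None
    quant_block: Optional[Tuple[int, int]] = None
    scale_hierarchy: Optional[str] = None
    rht_scope: Optional[Tuple[str, ...]] = None
    rht_block_size: Optional[int] = None
    stochastic_rounding_scope: Optional[Tuple[str, ...]] = None
    fused_rht_quant: Optional[bool] = None
    bf16_baseline: Optional[str] = None
    latest_window_rel_error_pct: Optional[float] = None


def missing_fields(receipt):
    """Names of receipt fields that were never filled in."""
    return [f.name for f in fields(receipt) if getattr(receipt, f.name) is None]


ufp4_dense = FormatTrainingReceipt(
    data_format="E1M2/INT4-style uniform grid",
    quant_block=(1, 16),
    scale_hierarchy="FP32 single-level",
    rht_scope=("fwd_y", "bwd_dx", "bwd_dw"),
    rht_block_size=16,
    stochastic_rounding_scope=("dY",),
    latest_window_rel_error_pct=0.9673,  # Dense 1.5B, latest 1000 steps
)

print("unfilled:", missing_fields(ufp4_dense))
```

A receipt with unfilled fields is not wrong, only incomplete, and listing the gaps is more honest than defaulting them.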
Limits
The paper's recommendation is intentionally narrow. It says E2M1 should remain available for raw outlier-heavy tensors and inference workloads, but should not be the only first-class FP4 training format.
The results are reported under specific model families, training recipes, scale hierarchies, block sizes, kernels, and hardware-facing assumptions. They strengthen the case for uniform FP4 training primitives, but they do not settle every low-precision recipe, accelerator, workload, or inference setting.
The safe reading is: UFP4 is a strong argument that post-RHT FP4 training needs uniform-grid support, not a general claim that one 4-bit format dominates every numerical regime.
Source Discipline
This page treats the arXiv abstract, arXiv HTML, and PDF as the source set. The PDF was used for the recipe table, tensor-diagnostic values, long-run loss-error figures, ablation details, fused-kernel latency numbers, reference-selection procedure, and conclusion.
I did not independently rerun the training jobs, inspect kernels, or reproduce the scaling-law fits. The arXiv page did not expose a public code repository, so this analysis treats the reported results as paper claims.
Related Pages
- AI Compute, AI Data Centers, AI Chip Export Controls, Model Weight Security, Mixture-of-Experts Models, AI Evaluations, and AI Audit Trails cover adjacent vocabulary.
- The AI Factory Becomes Industrial Policy, The Token Meter Becomes the AI Budget, The Quantized Repair Becomes the Cost Ledger, The Performance Benchmark Becomes the Coding Agent, and The Soft Prefix Becomes the Skill Artifact cover neighboring infrastructure, cost, and benchmark questions.
Sources
- arXiv abstract: Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe.
- arXiv HTML: arXiv:2606.20381 HTML.
- Paper PDF: arXiv:2606.20381 PDF.