Unmasking the True Culprit: Why Temperature=0 Doesn’t Mean Deterministic LLM Inference

A deep technical dive into batch invariant kernels and the path to truly reproducible language model inference

Preamble: The Temperature Parameter Misconception

Recent research by Peeperkorn et al. (2024) in “Is Temperature the Creativity Parameter of Large Language Models?” challenges our fundamental assumptions about the temperature parameter in LLMs. Their empirical analysis reveals that temperature is only weakly correlated with novelty and shows no relationship with cohesion or typicality in narrative generation tasks. This finding exposes a critical gap in our understanding: if temperature doesn’t reliably control creativity as commonly assumed, what other misconceptions do we hold about LLM behavior?

The most pervasive misconception is the belief that temperature=0 guarantees deterministic inference. This assumption has become so entrenched that practitioners often dismiss reproducibility issues as inevitable quirks of probabilistic systems. However, as we’ll demonstrate, the real culprit behind non-deterministic LLM inference has nothing to do with sampling parameters and everything to do with the underlying computational infrastructure.

The Temperature=0 Myth: A Dangerous Misconception

When practitioners first encounter non-deterministic outputs from their “temperature=0” language model inference, the immediate reaction is often confusion followed by resignation. After all, if we’re using greedy sampling (temperature=0), shouldn’t the model always select the highest probability token deterministically?

The widespread belief that “temperature=0 = deterministic inference” has led to a collective shrug in the ML community when faced with irreproducible results. This misconception has been perpetuated across documentation, forums, and even research papers, creating a false sense of understanding about what truly drives non-determinism in LLM inference.

The temperature findings are a reminder that a sampling parameter can do less than its reputation suggests. The same holds for temperature=0: it removes randomness from token selection, but it cannot make the logits themselves reproducible. The problem runs much deeper than sampling strategies.

The Real Problem Hiding in Plain Sight

The uncomfortable truth is that temperature=0 is not sufficient for deterministic inference. Greedy decoding is only as reproducible as the logits it selects from, and those logits can change from one request to the next. Just as temperature's effect on output characteristics turned out to be weaker than commonly believed (Peeperkorn et al., 2024), the determinism problem has been misattributed to sampling parameters when the real culprit sits much deeper in the computational stack: a lack of batch invariance.

When your LLM inference server groups a request with a varying number of other requests, the underlying GPU kernels may pick different reduction strategies and therefore add the same numbers in a different order. Because floating-point addition is not associative, the same request can then produce numerically different results.

This isn’t a minor numerical quirk—it’s a fundamental architectural issue that affects:

Research reproducibility: Different batch sizes during evaluation produce different results
Production reliability: Load-dependent output variations in serving systems
On-policy RL training: Sampling/training distribution mismatches
A/B testing validity: Inconsistent baselines make comparisons meaningless

Debunking the “Concurrency + Floating Point” Hypothesis

The commonly cited explanation for LLM non-determinism suggests that concurrent GPU threads finishing in random order, combined with floating-point non-associativity, creates the observed variability. While this explanation contains elements of truth, it fundamentally misses the mark.

Understanding Floating-Point Non-Associativity

Before examining why this hypothesis fails, let’s understand what floating-point non-associativity means and why it matters.

In mathematics, addition is associative: (a + b) + c = a + (b + c). However, with floating-point numbers, this fundamental property breaks down:

# Mathematical example of non-associativity
a, b, c = 0.1, 1e20, -1e20

# Method 1: Left-to-right evaluation
result1 = (a + b) + c
print(f"(0.1 + 1e20) + (-1e20) = {result1}")  # 0.0

# Method 2: Right-to-left evaluation  
result2 = a + (b + c)
print(f"0.1 + (1e20 + (-1e20)) = {result2}")  # 0.1

print(f"Results equal? {result1 == result2}")  # False!

Why does this happen?

Floating-point numbers have limited precision. When you add numbers with vastly different magnitudes, the smaller number can be “lost” due to rounding:

# Demonstrating precision loss
large_num = 1e20
small_num = 0.1

# The small number gets lost when added to the large number
combined = large_num + small_num
print(f"{large_num} + {small_num} = {combined}")  # Still 1e20!
print(f"Precision lost: {combined == large_num}")  # True

# This is why order matters in floating-point arithmetic

Real-world implications in GPU computing:

import torch

# Create an array where order matters
values = [1e-10, 1e-5, 1e-2, 1.0, -1.0, -1e-2, -1e-5, -1e-10]
values_tensor = torch.tensor(values, dtype=torch.float32)

# Different reduction orders can yield different results
result_forward = torch.sum(values_tensor)
result_reverse = torch.sum(torch.flip(values_tensor, [0]))

print(f"Forward sum: {result_forward}")
print(f"Reverse sum: {result_reverse}")
print(f"Difference: {abs(result_forward - result_reverse)}")

This non-associativity means that the order of operations fundamentally affects numerical results. When GPU kernels execute reductions (like matrix multiplication accumulations), the order in which threads contribute their partial results can change the final answer.
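To connect this to greedy decoding, here is a minimal sketch in plain Python (float64). The values and the helpers `chunked_sum` and `greedy_token` are contrived for illustration and are not taken from any real kernel. The same four terms are summed with one accumulator and with two partial accumulators; the changed grouping moves a hypothetical logit from 1.0 to 0.0, which is enough to flip the argmax against a competing logit of 0.5.

```python
def chunked_sum(values, num_chunks):
    """Sum `values` by splitting them into contiguous chunks, reducing each
    chunk left to right, then adding the partial sums left to right."""
    chunk_len = -(-len(values) // num_chunks)  # ceiling division
    partials = []
    for start in range(0, len(values), chunk_len):
        partial = 0.0
        for v in values[start:start + chunk_len]:
            partial += v
        partials.append(partial)
    total = 0.0
    for p in partials:
        total += p
    return total


def greedy_token(logits):
    """Greedy decoding (temperature=0): index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Contributions to the logit of token 0; in exact arithmetic they sum to 2.0.
terms = [1e16, 1.0, -1e16, 1.0]
competing_logit = 0.5  # logit of token 1

one_accumulator = chunked_sum(terms, num_chunks=1)
two_accumulators = chunked_sum(terms, num_chunks=2)

print(one_accumulator, greedy_token([one_accumulator, competing_logit]))
print(two_accumulators, greedy_token([two_accumulators, competing_logit]))
```

Neither grouping recovers the exact value 2.0. What matters for greedy decoding is that the two groupings disagree with each other.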

Why Most GPU Kernels Don’t Use Atomic Operations

Consider this simple test that demonstrates the flaw in the “concurrency” hypothesis:

import torch

# Test matrix multiplication determinism
A = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)  
B = torch.randn(2048, 2048, device='cuda', dtype=torch.bfloat16)

ref = torch.mm(A, B)
for _ in range(1000):
    result = torch.mm(A, B)
    assert (result - ref).abs().max().item() == 0
print("✅ Matrix multiplication is deterministic!")

This test passes consistently because modern GPU kernels avoid atomic operations whenever possible, so threads never race to accumulate into the same memory location. The LLM forward pass contains virtually no operations requiring atomic adds, making it “run-to-run deterministic”: identical inputs, processed in an identical batch, always produce identical outputs.
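The distinction can be mimicked on the CPU. In the sketch below (plain Python; `fixed_order_sum` and `atomic_style_sum` are illustrative names, and the permutation loop only stands in for threads finishing in arbitrary order), accumulating in arrival order yields several different totals, whereas a fixed reduction order always yields the same one.

```python
from itertools import permutations


def fixed_order_sum(values):
    """Reduce in index order, as a kernel with a fixed reduction schedule does."""
    total = 0.0
    for v in values:
        total += v
    return total


def atomic_style_sum(values, arrival_order):
    """Accumulate in whatever order the contributions 'arrive', which is how
    a reduction built on atomic adds behaves."""
    total = 0.0
    for i in arrival_order:
        total += values[i]
    return total


values = [1e16, 1.0, -1e16, 1.0]

# Every possible arrival order of the four contributions.
atomic_results = {
    atomic_style_sum(values, order) for order in permutations(range(len(values)))
}
# The same fixed-order reduction repeated many times.
fixed_results = {fixed_order_sum(values) for _ in range(1000)}

print("arrival-order totals:", sorted(atomic_results))
print("fixed-order totals:  ", sorted(fixed_results))
```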

The Batch Invariance Problem Revealed

The real issue becomes apparent when we examine how results change with batch size:

import torch
torch.set_default_device('cuda')

B, D = 2048, 4096
a = torch.linspace(-100, 100, B*D).reshape(B, D)
b = torch.linspace(-100, 100, D*D).reshape(D, D)

# Method 1: Process single element
out1 = torch.mm(a[:1], b)

# Method 2: Process full batch, then extract first element  
out2 = torch.mm(a, b)[:1]

difference = (out1 - out2).abs().max()
print(f"Difference: {difference.item()}")  # Often nonzero; the exact value depends on the GPU and library version

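This demonstrates that the batch size affects the result computed for an individual row, even though each output row of a matrix multiplication depends mathematically only on its own input row. The property is surprising because it violates our intuition that matrix operations are independent across the batch dimension.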
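One way to picture the mechanism is a library that picks its reduction strategy from the batch size. The toy matmul below is a sketch in plain Python, not the heuristic of cuBLAS or any real library: `NUM_CORES`, `pick_split_k`, and the input values are assumptions chosen so the effect is visible in float64. With a single row it splits the K dimension across otherwise idle cores; with more rows it does not, so row 0 gets a different result depending on how many other rows share the batch.

```python
NUM_CORES = 2  # hypothetical number of parallel units


def pick_split_k(batch_size):
    """Toy heuristic: with few rows, split the K reduction to keep cores busy."""
    return max(1, NUM_CORES // batch_size)


def dot_split_k(row, col, split_k):
    """Dot product whose K dimension is divided into `split_k` contiguous
    chunks; each chunk is reduced in order, then the partials are added."""
    chunk_len = -(-len(row) // split_k)  # ceiling division
    total = 0.0
    for start in range(0, len(row), chunk_len):
        partial = 0.0
        for k in range(start, min(start + chunk_len, len(row))):
            partial += row[k] * col[k]
        total += partial
    return total


def toy_matmul(batch, col):
    """Multiply each row of `batch` by the column vector `col`."""
    split_k = pick_split_k(len(batch))
    return [dot_split_k(row, col, split_k) for row in batch]


col = [1.0, 1.0, 1.0, 1.0]
request = [1e16, 1.0, -1e16, 1.0]
other = [1.0, 2.0, 3.0, 4.0]

alone = toy_matmul([request], col)[0]                          # split_k = 2
in_batch = toy_matmul([request, other, other, other], col)[0]  # split_k = 1

print(f"alone: {alone}, in a batch of 4: {in_batch}")
```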

Technical Deep Dive: Batch Invariant Kernels

To achieve true determinism, we need kernels that produce identical results regardless of batch size. Let’s examine the technical implementation from Thinking Machines’ batch_invariant_ops library.

The Data-Parallel Strategy

The key insight is using a fixed reduction strategy that doesn’t change based on batch size. Here’s how their batch-invariant matrix multiplication kernel works:

@triton.jit(launch_metadata=_matmul_launch_metadata)
def matmul_kernel_persistent(
    a_ptr, b_ptr, c_ptr, bias_ptr,
    M, N, K,
    stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
    BLOCK_SIZE_M: tl.constexpr,
    BLOCK_SIZE_N: tl.constexpr, 
    BLOCK_SIZE_K: tl.constexpr,
    GROUP_SIZE_M: tl.constexpr,
    NUM_SMS: tl.constexpr,
    A_LARGE: tl.constexpr,
    B_LARGE: tl.constexpr, 
    C_LARGE: tl.constexpr,
    HAS_BIAS: tl.constexpr,
):
    start_pid = tl.program_id(axis=0)
    num_pid_m = tl.cdiv(M, BLOCK_SIZE_M)
    num_pid_n = tl.cdiv(N, BLOCK_SIZE_N)
    k_tiles = tl.cdiv(K, BLOCK_SIZE_K)
    num_tiles = num_pid_m * num_pid_n
    
    # K offsets within a tile, used to mask the final (possibly partial) K tile
    offs_k_for_mask = tl.arange(0, BLOCK_SIZE_K)
    num_pid_in_group = GROUP_SIZE_M * num_pid_n

    # Persistent loop: each program walks the output tiles with stride NUM_SMS.
    # Which program owns a tile never changes how that tile is reduced.
    for tile_id in tl.range(start_pid, num_tiles, NUM_SMS, flatten=True):
        pid_m, pid_n = _compute_pid(tile_id, num_pid_in_group, num_pid_m, GROUP_SIZE_M, NUM_SMS)

        # Compute output tile coordinates
        start_m = pid_m * BLOCK_SIZE_M
        start_n = pid_n * BLOCK_SIZE_N

        # Data-parallel reduction: one program owns the whole K reduction of its tile
        accumulator = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
        for ki in range(k_tiles):
            # Load input tiles (a_ptrs / b_ptrs are derived from start_m, start_n,
            # and ki; that pointer arithmetic is elided in this excerpt)
            a = tl.load(a_ptrs, mask=offs_k_for_mask[None, :] < K - ki * BLOCK_SIZE_K, other=0.0)
            b = tl.load(b_ptrs, mask=offs_k_for_mask[:, None] < K - ki * BLOCK_SIZE_K, other=0.0)

            # Accumulate: reduction order is fixed by the ki iteration
            accumulator = tl.dot(a, b, accumulator)

        # (Epilogue elided: add the bias if HAS_BIAS, cast, and store the tile to c_ptr)

Why This Achieves Batch Invariance

The critical design decisions that ensure batch invariance:

Fixed Block Sizes: BLOCK_SIZE_M, BLOCK_SIZE_N, and BLOCK_SIZE_K are constant regardless of input size
Data-Parallel Strategy: Each output tile is computed entirely by one program instance on a single SM, so no partial results are combined across SMs
Consistent Reduction Order: The ki loop always processes K-dimension tiles in the same order
No Split-K: The K reduction is never divided across SMs, a strategy that would otherwise be switched on or off depending on the batch size and change the summation order with it
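The design decisions above can be imitated on the CPU. This is a plain-Python sketch, not the Triton kernel: `BLOCK_K`, `dot_fixed_tiles`, and `batch_invariant_matmul` are illustrative names. Because the K dimension is always walked in the same fixed-size tiles by a single accumulator, a row's result does not depend on how many other rows are in the batch.

```python
BLOCK_K = 2  # fixed K-tile size, never adapted to the input shape


def dot_fixed_tiles(row, col, block_k=BLOCK_K):
    """Reduce over K tile by tile, in ascending tile order, into one
    accumulator (the analogue of the `ki` loop in the kernel)."""
    accumulator = 0.0
    for k_start in range(0, len(row), block_k):
        tile_sum = 0.0
        for k in range(k_start, min(k_start + block_k, len(row))):
            tile_sum += row[k] * col[k]
        accumulator += tile_sum
    return accumulator


def batch_invariant_matmul(batch, col):
    """Each row is reduced independently with the same schedule (no split-K)."""
    return [dot_fixed_tiles(row, col) for row in batch]


col = [1.0, 1.0, 1.0, 1.0]
request = [1e16, 1.0, -1e16, 1.0]
other = [1.0, 2.0, 3.0, 4.0]

# Result for the same request when 0, 1, 3, or 7 other rows share the batch.
results = [
    batch_invariant_matmul([request] + [other] * extra_rows, col)[0]
    for extra_rows in (0, 1, 3, 7)
]
print(results)
```

The value is identical at every batch size, yet it is still not the mathematically exact 2.0. Batch invariance buys consistency, not extra accuracy.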

Implementation Analysis: The matmul_persistent Function

def matmul_persistent(a: torch.Tensor, b: torch.Tensor, bias: torch.Tensor | None = None):
    assert a.shape[1] == b.shape[0], "Incompatible dimensions"
    assert a.dtype == b.dtype, "Incompatible dtypes"
    
    # Get GPU properties for kernel configuration
    NUM_SMS = torch.cuda.get_device_properties("cuda").multi_processor_count
    M, K = a.shape
    K, N = b.shape
    dtype = a.dtype
    
    # Fixed configurations per data type - key for batch invariance!
    configs = {
        torch.bfloat16: {
            "BLOCK_SIZE_M": 128,
            "BLOCK_SIZE_N": 128, 
            "BLOCK_SIZE_K": 64,
            "GROUP_SIZE_M": 8,
            "num_stages": 3,
            "num_warps": 8,
        },
        torch.float16: {
            "BLOCK_SIZE_M": 128,
            "BLOCK_SIZE_N": 256,
            "BLOCK_SIZE_K": 64,
            "GROUP_SIZE_M": 8,
            "num_stages": 3,
            "num_warps": 8,
        },
        # ... other dtypes
    }
    
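    # Allocate the output; the persistent grid launches at most one program per SM
    # (grid needs `import triton` at module level)
    c = torch.empty((M, N), device=a.device, dtype=dtype)

    def grid(META):
        tiles = triton.cdiv(M, META["BLOCK_SIZE_M"]) * triton.cdiv(N, META["BLOCK_SIZE_N"])
        return (min(NUM_SMS, tiles),)

    # Launch kernel with fixed configuration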
    matmul_kernel_persistent[grid](
        a, b, c, bias, M, N, K,
        a.stride(0), a.stride(1),
        b.stride(0), b.stride(1), 
        c.stride(0), c.stride(1),
        NUM_SMS=NUM_SMS,
        A_LARGE=a.numel() > 2**31,
        B_LARGE=b.numel() > 2**31,
        C_LARGE=c.numel() > 2**31,
        HAS_BIAS=bias is not None,
        **configs[dtype],  # Fixed config ensures batch invariance
    )
    return c

The crucial insight is that fixed configurations prevent the kernel from adapting its strategy based on input shape, maintaining identical reduction orders across different batch sizes.

Beyond Matrix Multiplication: Other Critical Operations

Understanding RMSNorm: Root Mean Square Normalization

Before diving into batch-invariant implementations, let’s understand what RMSNorm actually does. Root Mean Square Normalization is a simpler alternative to LayerNorm that has become popular in modern transformer architectures (especially in models like LLaMA).

RMSNorm normalizes activations by their root mean square, without centering around the mean:

def rms_norm_standard(x, weight, eps=1e-6):
    """
    Standard RMSNorm implementation
    x: input tensor of shape [batch_size, seq_len, hidden_dim]
    weight: learnable scale parameter of shape [hidden_dim]
    """
    # Mean of squares along the last dimension (hidden_dim); despite the
    # variable name this is not a true variance, since no mean is subtracted
    variance = x.pow(2).mean(dim=-1, keepdim=True)
    
    # Normalize by RMS
    x_normalized = x * torch.rsqrt(variance + eps)
    
    # Apply learnable scale
    return x_normalized * weight

Key characteristics of RMSNorm:

Simpler than LayerNorm: No mean subtraction, just RMS scaling
Per-token normalization: Each token in the sequence is normalized independently
Reduction operation: Requires summing/averaging across the hidden dimension
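Written as a formula, with hidden size $d$, learnable scale $w$, and stabilizer $\epsilon$ (symbols chosen here to mirror `hidden_dim`, `weight`, and `eps` in the code above), the operation on one token's activation vector $x$ is:

```latex
\mathrm{RMSNorm}(x)_i \;=\; \frac{x_i}{\sqrt{\dfrac{1}{d}\sum_{j=1}^{d} x_j^{2} \;+\; \epsilon}} \; w_i
```

The sum over $j$ is the reduction whose order has to stay fixed for the result to be batch-invariant.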

Here’s a concrete example showing RMSNorm in action:

import torch

batch_size, seq_len, hidden_dim = 2, 4, 8
x = torch.randn(batch_size, seq_len, hidden_dim)
weight = torch.ones(hidden_dim)

print("Input shape:", x.shape)
print("Input sample:", x[0, 0, :4])  # First 4 elements of first token

# Apply RMSNorm
normalized = rms_norm_standard(x, weight)
print("Normalized sample:", normalized[0, 0, :4])
print("RMS of normalized:", normalized[0, 0].pow(2).mean().sqrt())  # Should be close to 1

Batch-Invariant RMSNorm

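RMSNorm presents similar challenges to matrix multiplication: a standard implementation might switch parallelization strategies between small and large batch sizes, for example by splitting one row's reduction across several cores when there are too few rows to keep the GPU busy. The batch-invariant kernel below avoids this by always assigning one program per row, so the reduction over the hidden dimension is performed the same way at any batch size (as written, it assumes BLOCK_SIZE >= N so that a row fits in one block):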

@triton.jit
def _rms_norm_kernel(
    x_ptr, w_ptr, output_ptr,
    M, N,
    eps,
    x_row_stride, output_row_stride,
    BLOCK_SIZE: tl.constexpr,
):
    row_idx = tl.program_id(0)
    
    # Data-parallel strategy: each row processed independently
    row_start = row_idx * x_row_stride
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < N
    
    # Load row data
    x_row = tl.load(x_ptr + row_start + cols, mask=mask, other=0.0)
    w_row = tl.load(w_ptr + cols, mask=mask, other=0.0)
    
    # Compute RMS in fixed order (crucial for batch invariance)
    x_squared = x_row * x_row
    mean_x_squared = tl.sum(x_squared, axis=0) / N
    rrms = 1 / tl.sqrt(mean_x_squared + eps)
    
    # Apply normalization
    output = x_row * rrms * w_row
    tl.store(output_ptr + row_idx * output_row_stride + cols, output, mask=mask)

Log-Softmax with Fixed Reduction Strategy

@triton.jit
def _log_softmax_kernel(
    input_ptr, output_ptr,
    input_row_stride, output_row_stride,
    n_cols,
    BLOCK_SIZE: tl.constexpr,
):
    row_idx = tl.program_id(0)
    
    # Fixed block processing strategy
    row_start_ptr = input_ptr + row_idx * input_row_stride
    col_offsets = tl.arange(0, BLOCK_SIZE)
    input_ptrs = row_start_ptr + col_offsets
    mask = col_offsets < n_cols
    
    # Load and find max (reduction in fixed order)
    row = tl.load(input_ptrs, mask=mask, other=-float('inf'))
    row_max = tl.max(row, axis=0)
    
    # Compute exp and sum (maintaining order)
    row_shifted = row - row_max
    exp_row = tl.exp(row_shifted)
    row_sum = tl.sum(exp_row, axis=0)
    
    # Final log-softmax computation
    output = row_shifted - tl.log(row_sum)
    
    # Store result
    output_row_start_ptr = output_ptr + row_idx * output_row_stride  
    output_ptrs = output_row_start_ptr + col_offsets
    tl.store(output_ptrs, output, mask=mask)
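For reference, here is the same computation on a single row in plain Python (a sketch; `log_softmax_row` is an illustrative name, not part of the library). Subtracting the row maximum before exponentiating is what keeps large logits from overflowing, and the accumulation of the exponentials runs in a fixed left-to-right order.

```python
import math


def log_softmax_row(row):
    """Numerically stable log-softmax of one row, reduced in index order."""
    row_max = max(row)
    shifted = [v - row_max for v in row]
    sum_exp = 0.0
    for s in shifted:  # fixed left-to-right accumulation
        sum_exp += math.exp(s)
    log_sum = math.log(sum_exp)
    return [s - log_sum for s in shifted]


logits = [1000.0, 999.0, 998.0]  # math.exp(1000.0) on its own would overflow
log_probs = log_softmax_row(logits)
print(log_probs)
print(sum(math.exp(lp) for lp in log_probs))  # probabilities sum to about 1
```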

Integration with PyTorch: The torch.Library Approach

The batch_invariant_ops library cleverly uses PyTorch’s torch.library mechanism to substitute standard operations with batch-invariant versions:

import contextlib
import torch

# Simplified sketch of the mechanism; the library's own code differs in detail.
# Handle to the active override library (None while the mode is off)
_batch_invariant_lib = None

def _batch_invariant_mm(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Batch-invariant matrix multiplication implementation."""
    return matmul_persistent(a, b)

def _batch_invariant_addmm(bias: torch.Tensor, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Batch-invariant addmm implementation."""
    return matmul_persistent(a, b, bias)

def _register_overrides() -> torch.library.Library:
    # An "IMPL" library on the aten namespace can override existing CUDA kernels
    lib = torch.library.Library("aten", "IMPL")
    lib.impl("aten::mm", _batch_invariant_mm, "CUDA")
    lib.impl("aten::addmm", _batch_invariant_addmm, "CUDA")
    # log_softmax dispatches to aten::_log_softmax; the override
    # (_batch_invariant_log_softmax, not shown) must match that signature
    lib.impl("aten::_log_softmax", _batch_invariant_log_softmax, "CUDA")
    return lib

def _set_enabled(enabled: bool) -> None:
    global _batch_invariant_lib
    if enabled and _batch_invariant_lib is None:
        _batch_invariant_lib = _register_overrides()
    elif not enabled and _batch_invariant_lib is not None:
        _batch_invariant_lib._destroy()  # private PyTorch API; removes the overrides
        _batch_invariant_lib = None

@contextlib.contextmanager
def set_batch_invariant_mode(enabled: bool = True):
    """Context manager to enable/disable batch invariant operations."""
    was_enabled = _batch_invariant_lib is not None
    _set_enabled(enabled)
    try:
        yield
    finally:
        # Restore whatever state was active before entering the context
        _set_enabled(was_enabled)

This approach allows existing PyTorch models to gain batch invariance with minimal code changes:

# Assumes batch_size_4_input[:1] holds the same sequence as batch_size_1_input

# Standard usage (not batch-invariant)
model = MyTransformerModel()
output1 = model(batch_size_1_input)
output2 = model(batch_size_4_input)[:1]  # May differ from output1

# Batch-invariant usage
with set_batch_invariant_mode(True):
    output1 = model(batch_size_1_input)
    output2 = model(batch_size_4_input)[:1]  # Bitwise identical to output1

Performance Analysis: The Cost of Determinism

Batch-invariant kernels achieve determinism at a performance cost. The Thinking Machines blog reports approximately 20% performance degradation compared to highly optimized libraries like cuBLAS.

Why Performance Degrades

Suboptimal Parallelization: Fixed block sizes may leave GPU cores underutilized for certain input shapes
No Adaptive Optimization: Can’t use split-K or other advanced strategies that depend on input dimensions
Conservative Configuration: Must choose kernel parameters that work across all batch sizes

Performance vs. Determinism Trade-off

import time
import torch
from batch_invariant_ops import set_batch_invariant_mode

def benchmark_matmul(a, b, num_trials=100):
    # Warmup
    for _ in range(10):
        _ = torch.mm(a, b)
    torch.cuda.synchronize()
    
    # Benchmark
    start = time.perf_counter()  # monotonic, higher resolution than time.time()
    for _ in range(num_trials):
        result = torch.mm(a, b)
    torch.cuda.synchronize()
    end = time.perf_counter()
    
    return (end - start) / num_trials

# Test data
M, N, K = 2048, 2048, 2048
a = torch.randn(M, K, device='cuda', dtype=torch.bfloat16)
b = torch.randn(K, N, device='cuda', dtype=torch.bfloat16)

# Standard PyTorch
standard_time = benchmark_matmul(a, b)

# Batch-invariant version
with set_batch_invariant_mode(True):
    batch_invariant_time = benchmark_matmul(a, b)

slowdown = batch_invariant_time / standard_time
print(f"Standard PyTorch: {standard_time*1000:.2f}ms")
print(f"Batch Invariant: {batch_invariant_time*1000:.2f}ms") 
print(f"Slowdown: {slowdown:.2f}x")

When the Performance Trade-off Makes Sense

The performance cost is justified in scenarios where determinism is critical:

Research experiments requiring reproducible results
Production A/B testing needing consistent baselines
Regulatory compliance demanding auditable model behavior
On-policy RL training requiring exact policy matching

Looking Forward: Integration Possibilities with SGLang

SGLang’s architecture provides several potential integration points for batch-invariant kernels:

1. Custom Operator Registration

SGLang’s srt/custom_op.py and srt/_custom_ops.py provide mechanisms for registering custom CUDA kernels:

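# Hypothetical sketch of an integration point in sglang/srt/custom_op.py.
# register_custom_op, batch_invariant_ops.rms_norm, and
# batch_invariant_ops.attention are placeholder names, not confirmed APIs.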
def register_batch_invariant_ops():
    """Register batch-invariant operations in SGLang."""
    import batch_invariant_ops
    
    # Register through SGLang's custom op system
    register_custom_op("batch_invariant_mm", batch_invariant_ops.matmul_persistent)
    register_custom_op("batch_invariant_rmsnorm", batch_invariant_ops.rms_norm)
    register_custom_op("batch_invariant_attention", batch_invariant_ops.attention)

2. Model Executor Integration

The srt/model_executor/ directory contains SGLang’s core execution logic, where batch-invariant operations could be conditionally enabled:

# Hypothetical sketch for sglang/srt/model_executor/forward_batch_info.py
class ForwardBatchInfo:
    def __init__(self, *args, deterministic_mode: bool = False, **kwargs):
        self.deterministic_mode = deterministic_mode
        
        if deterministic_mode:
            import batch_invariant_ops
            batch_invariant_ops.enable_batch_invariant_mode()

3. Server-Level Configuration

SGLang’s server arguments could include deterministic inference options:

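# Hypothetical additions to sglang/srt/server_args.py
import logging
from dataclasses import dataclass

logger = logging.getLogger(__name__)

@dataclass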
class ServerArgs:
    # ... existing args ...
    deterministic_inference: bool = False
    batch_invariant_attention_block_size: int = 256
    
    def __post_init__(self):
        if self.deterministic_inference:
            logger.info("Enabling deterministic inference mode")
            # Enable batch-invariant operations

Technical Challenges for SGLang Integration

Attention Kernel Complexity: SGLang’s attention implementations may require significant modification
Performance Impact: Need careful benchmarking to understand real-world performance implications
Memory Layout: Batch-invariant kernels may have different memory access patterns
Testing Coverage: Extensive testing needed to ensure numerical correctness across model types

Real-World Impact: Beyond Academic Curiosity

The implications of batch-invariant inference extend far beyond theoretical interest:

Production LLM Serving

# Illustrative pseudocode: process_request and deterministic_serving_mode are placeholders
# Problem: Non-deterministic results in production
# Same prompt, different batch contexts = different outputs
prompt = "Explain quantum computing"
server_load_1 = process_request(prompt, current_batch_size=1)   # Result A
server_load_2 = process_request(prompt, current_batch_size=32)  # Result B ≠ A

# Solution: Batch-invariant inference
with deterministic_serving_mode():
    result_1 = process_request(prompt, current_batch_size=1)   # Result A
    result_2 = process_request(prompt, current_batch_size=32)  # Result A (identical!)

Research Reproducibility

Hidden non-determinism in model evaluation is one plausible contributor to the reproducibility problems in AI research. When researchers run the same benchmark on the same model but get different results because the computational context varied, scientific validity suffers. Some gaps between reported and reproduced numbers may stem not from methodological issues but from subtle numerical differences caused by batch size variations during evaluation.

Batch-invariant inference provides a path toward reproducible benchmarks. By ensuring that evaluation results are independent of batch size and server load on a given hardware and software stack, researchers can focus on actual methodological differences rather than fighting numerical artifacts. This is particularly crucial for:

Benchmark leaderboards where slight numerical differences can change rankings
Ablation studies requiring precise comparison between model variants
Cross-institution collaborations that run the same hardware and software stack but differ in batch size and server load (batch invariance alone does not make different GPU models agree bit for bit)
Longitudinal studies that re-run evaluations over time under different serving loads

# Enabling true scientific reproducibility
def run_evaluation_suite():
    with set_batch_invariant_mode(True):
        # On a fixed hardware and software stack, results no longer depend on:
        # - Server load during evaluation
        # - Batch size choices
        # - Which other requests share the batch
        results = evaluate_model_on_benchmark(model, dataset)
    return results

On-Policy Reinforcement Learning

Reinforcement Learning (RL) trains agents to make decisions by learning from the consequences of their actions. In the context of LLMs, Reinforcement Learning from Human Feedback (RLHF) has become crucial for aligning models with human preferences—think ChatGPT’s training process.

Understanding On-Policy vs Off-Policy Learning:

On-Policy Learning: The agent learns from actions taken by the current policy being trained. Think of it as learning to drive by practicing with your current driving skills.
Off-Policy Learning: The agent learns from actions taken by a different policy (often an older version). Like learning to drive by watching videos of someone else driving.

In LLM training, the distinction becomes critical:

# Typical RLHF training loop
def rlhf_training_step():
    # 1. Sample responses using current policy
    prompts = get_training_prompts()
    responses = current_policy.sample(prompts)  # Current policy generates text
    
    # 2. Get human feedback/rewards  
    rewards = reward_model.score(prompts, responses)
    
    # 3. Update policy using these samples
    loss = compute_policy_loss(current_policy, responses, rewards)
    current_policy.update(loss)

The Critical Problem: Policy Drift

Here’s where numerical non-determinism becomes catastrophic. If the policy used for sampling (step 1) produces different outputs than the policy used for training (step 3) due to batch-size differences, you’re no longer doing true on-policy learning:

# What happens with non-deterministic inference:
sampling_policy = load_policy_for_sampling()    # Batch size varies with server load
training_policy = load_policy_for_training()    # Fixed batch size for training

# Even though these are the same model with the same weights, numerical differences mean:
# sampling_policy ≠ training_policy (due to the lack of batch invariance)
# This creates a subtle but damaging off-policy situation!

Why This Matters:

When your sampling policy differs from your training policy (even slightly), the training algorithm receives biased gradients. The policy updates are based on actions that weren’t actually taken by the policy being updated. This can lead to:

Training instability: Gradients pointing in wrong directions
Performance degradation: Policy “forgets” how to generate good responses
Reward hacking: In the worst case, the policy may exploit the numerical discrepancy rather than learn genuine improvements
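A rough way to quantify the mismatch is the sequence-level importance ratio between trainer and sampler. The numbers below are made up for illustration (a constant per-token log-probability gap of 1e-3 over 1,000 tokens), not measurements, and `importance_ratio` is an illustrative helper rather than part of any RL library.

```python
import math


def importance_ratio(trainer_logprobs, sampler_logprobs):
    """pi_trainer(sequence) / pi_sampler(sequence), from per-token log-probs."""
    return math.exp(sum(trainer_logprobs) - sum(sampler_logprobs))


num_tokens = 1000
sampler_logprobs = [-2.0] * num_tokens

# Bitwise-identical numerics: the trainer reproduces the sampler exactly.
matched_trainer = list(sampler_logprobs)
# Mismatched numerics: a hypothetical drift of 1e-3 per token.
drifted_trainer = [lp + 1e-3 for lp in sampler_logprobs]

matched_ratio = importance_ratio(matched_trainer, sampler_logprobs)
drifted_ratio = importance_ratio(drifted_trainer, sampler_logprobs)

print(matched_ratio)  # exactly 1.0
print(drifted_ratio)  # roughly e, about 2.7
```

With a ratio of exactly 1 no importance correction is needed. Any other value means the update is, strictly speaking, off-policy, and small per-token gaps compound over long sequences.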

In code, the contrast between the two regimes looks like this (pseudocode: `variable` and `training_batch` stand for whatever batch sizes the sampler and the trainer happen to use):

# Traditional approach: sampling ≠ training policy
def traditional_rl_step():
    # Sampling policy (different numerical behavior)
    samples = sample_from_policy(policy, batch_size=variable)
    
    # Training policy (different numerical behavior)  
    loss = compute_policy_loss(policy, samples, batch_size=training_batch)
    return loss  # Biased due to off-policy sampling!

# Batch-invariant approach: sampling = training policy
def deterministic_rl_step():
    with set_batch_invariant_mode(True):
        # Identical numerical behavior regardless of batch size
        samples = sample_from_policy(policy, batch_size=variable)
        loss = compute_policy_loss(policy, samples, batch_size=training_batch)
    return loss  # True on-policy learning!

Conclusion: Toward a Deterministic Future

The path to deterministic LLM inference requires abandoning comfortable misconceptions and embracing the technical complexity of modern GPU computing. The “temperature=0” myth has obscured the real challenge: ensuring that mathematical operations produce consistent results regardless of computational context.

Thinking Machines’ batch-invariant kernels represent a crucial first step toward this goal. By implementing fixed reduction strategies in Triton, they’ve demonstrated that deterministic inference is not only possible but practical, even with moderate performance trade-offs.

The Road Ahead

For the broader community, several opportunities emerge:

Framework Integration: Libraries like SGLang, vLLM, and TGI could offer deterministic modes
Hardware Optimization: GPU vendors could provide deterministic primitives at the hardware level
Standards Development: The community could establish deterministic inference standards
Research Applications: True reproducibility would accelerate scientific progress in LLM research

A Call to Action

As practitioners, we must move beyond accepting non-determinism as an inevitable consequence of high-performance computing. The tools and techniques exist to achieve true reproducibility—what remains is the collective will to prioritize correctness alongside performance.

The choice is ours: continue accepting the status quo of irreproducible results, or embrace the engineering rigor required for truly deterministic language model inference. The foundation has been laid; now it’s time to build upon it.

This blog post provides a technical deep-dive into the implementation and implications of batch-invariant kernels for deterministic LLM inference. For hands-on experimentation, explore the batch_invariant_ops repository and consider how these techniques might apply to your own use cases.

References and Further Reading

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is Temperature the Creativity Parameter of Large Language Models? Proceedings of the 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492 [cs.CL]
He, H. & Thinking Machines Lab (2025). Defeating Nondeterminism in LLM Inference. Thinking Machines Lab: Connectionism.
HuggingFace Kernel Hub - Practical guide to optimized compute kernels
batch_invariant_ops GitHub Repository - Implementation of batch-invariant kernels
Triton Documentation - GPU kernel programming framework
SGLang Framework - Structured generation language for LLMs
