Two papers released on arXiv this week identify fundamental failure modes in sparse autoencoders (SAEs), a method widely used to disentangle the internal representations of large language models. Together, they argue that the approach suffers from both feature starvation and structural instability, problems that degrade interpretability at scale.
The papers, "Structural Instability of Feature Composition" (arXiv:2605.05223) and "Feature Starvation as Geometric Instability in Sparse Autoencoders" (arXiv:2605.05341), challenge the reliability of SAEs as a tool for understanding how transformer-based language models represent concepts. Sparse autoencoders have become central to mechanistic interpretability research over the past two years, enabling researchers to map individual neurons to human-interpretable features and to steer model behavior through activation manipulation. The new work suggests these capabilities rest on weaker foundations than the field has assumed.
Sparse autoencoders operate by training an overcomplete feature dictionary to reconstruct dense, superposed model activations as a sparse sum of interpretable features. The standard approach applies ℓ1 regularization during training, penalizing the summed magnitude of feature activations so that only a handful of features fire on any given input. This sparsity constraint is what makes the resulting features legible: a feature that activates only in response to, say, mentions of specific historical figures can be isolated and studied, or used to steer the model's behavior toward or away from that concept.
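A minimal sketch of this setup in PyTorch helps make the moving parts concrete. It is not code from either paper; the dimensions, the ReLU encoder, and the ℓ1 coefficient below are illustrative defaults.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode dense activations into a wider, sparse feature space."""

    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU keeps feature activations non-negative; most end up at zero.
        features = torch.relu(self.encoder(x))
        reconstruction = self.decoder(features)
        return reconstruction, features


def sae_loss(x, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an l1 penalty on the feature activations.
    # The l1 term penalizes the summed magnitude of activations, a soft
    # stand-in for directly limiting how many features are active.
    mse = (reconstruction - x).pow(2).mean()
    sparsity = features.abs().sum(dim=-1).mean()
    return mse + l1_coeff * sparsity


# Toy usage on random "activations" standing in for a transformer layer.
sae = SparseAutoencoder(d_model=512, d_features=4096)
x = torch.randn(32, 512)
recon, feats = sae(x)
loss = sae_loss(x, recon, feats)
loss.backward()
```

The key design choice is that sparsity is enforced only softly, through the ℓ1 term; as the second paper argues, that soft penalty is also where the training dynamics can go wrong.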
The first paper, which examines the structural properties of SAE feature composition, identifies a failure mode in which learned features become unstable as models scale or training progresses. The core problem is that feature interactions break down under compositional pressure: when multiple features must combine to represent a complex concept, as any interpretable decomposition of a language model requires, the SAE architecture fails to maintain consistent feature semantics. The paper demonstrates that perturbations to individual feature activations propagate unpredictably through the reconstructed representation, which means steering via activation manipulation does not reliably control model behavior in the way practitioners assume.
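One way to probe this kind of entanglement, reusing the SparseAutoencoder sketch above, is to steer along a single feature's decoder direction and measure how much the rest of the dictionary responds. This is a rough diagnostic of the effect the authors describe, not their experimental protocol.

```python
import torch


@torch.no_grad()
def feature_crosstalk(sae, x, feature_idx, delta=1.0):
    """Steer along one feature's decoder direction and measure how the
    activations of all *other* features shift after re-encoding.

    In an ideal decomposition the steered direction would map back onto the
    steered feature alone; large off-target shifts suggest a single-feature
    intervention is entangled with the rest of the dictionary.
    """
    _, base_feats = sae(x)

    # Add the steering vector (decoder column scaled by delta) to the raw activations.
    steer_direction = sae.decoder.weight[:, feature_idx]
    _, steered_feats = sae(x + delta * steer_direction)

    shift = steered_feats - base_feats
    on_target = shift[:, feature_idx].abs().mean()
    off_target = (shift.abs().sum(dim=-1) - shift[:, feature_idx].abs()).mean()
    return on_target.item(), off_target.item()
```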
The second paper, "Feature Starvation as Geometric Instability," reveals a complementary failure mode. During training with standard ℓ1 regularization, a subset of learned features receives vanishingly small activation gradients—a phenomenon the authors frame as geometric instability in the feature space. These starved features fail to update meaningfully and effectively drop out of the learned representation, reducing the actual effective sparsity below the nominal sparsity level. More critically, the paper shows that which features become starved depends sensitively on initialization and hyperparameter choices, meaning that the same training procedure applied twice produces different sets of interpretable features. This instability undermines the core value proposition of SAEs: if the same model produces different feature dictionaries depending on random seed, the features are artifacts of the training process rather than genuine properties of the model being interpreted.
Both papers rely on controlled experiments with standard benchmark models. The authors of the second paper provide ablation studies varying the ℓ1 penalty weight, learning rate, and feature dictionary size, showing that feature starvation occurs across all tested configurations. They report quantitative measures of the feature death rate (the fraction of features that converge to near-zero activations) ranging from 15% to 40% depending on hyperparameters, though the papers do not disclose exact numerical tables for full reproduction. The structural instability paper includes perturbation experiments measuring how activation changes propagate, demonstrating that steering fidelity degrades nonlinearly as feature interaction depth increases.
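The death-rate metric itself is easy to approximate: collect a sample of activations, run it through a trained SAE, and count the features that never rise above a small threshold. The threshold and batching below are illustrative choices, not values taken from the paper.

```python
import torch


@torch.no_grad()
def feature_death_rate(sae, activations, batch_size=256, threshold=1e-6):
    """Fraction of features that essentially never activate on a sample of data.

    A feature counts as dead if its maximum activation over the sample stays
    below a small threshold.
    """
    n_features = sae.encoder.out_features
    max_activation = torch.zeros(n_features)
    for start in range(0, activations.shape[0], batch_size):
        _, feats = sae(activations[start:start + batch_size])
        max_activation = torch.maximum(max_activation, feats.max(dim=0).values)
    return (max_activation < threshold).float().mean().item()
```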

These results matter because mechanistic interpretability has built significant momentum around SAEs as a method for understanding language models and for alignment-relevant work like activation steering. Researchers at Anthropic, OpenAI, and independent labs have published interpretability findings based on SAE-extracted features. If those features are unstable or compositionally fragile, the interpretability gains are illusory—the features describe the SAE's learned representation, not the model's internal structure. Steering via SAE features might appear to work on narrow test cases while failing on distributional shifts or complex queries.
Neither paper proposes a validated fix for these failure modes. The structural instability paper notes that architectural changes, such as gating mechanisms or explicit composition layers, might preserve feature stability, but it does not test these alternatives. The feature starvation paper suggests that sparsity constraints beyond the ℓ1 penalty deserve investigation, naming elastic net regularization and soft thresholding as candidates, but leaves validation to future work.
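As a point of reference, an elastic-net-style variant of the sparsity term is only a small change to the loss in the earlier sketch. Whether it actually mitigates starvation is exactly what the paper leaves open, and the coefficients here are arbitrary.

```python
import torch


def elastic_net_sparsity(features, l1_coeff=1e-3, l2_coeff=1e-4):
    """Elastic-net-style penalty on feature activations: an l1 term for sparsity
    plus a small l2 term intended to spread gradient signal more evenly across
    the dictionary. Purely a sketch of the candidate the paper names, untested.
    """
    l1 = features.abs().sum(dim=-1).mean()
    l2 = features.pow(2).sum(dim=-1).mean()
    return l1_coeff * l1 + l2_coeff * l2
```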
These findings will likely redirect interpretability research toward either fixing SAE architectures or developing alternative decomposition methods. The fact that two independent papers identified related failure modes within days of each other suggests the problems are robust rather than artifacts of specific experimental choices. Researchers building work on top of SAE-extracted features should now treat those features as contingent on training details and hyperparameters—useful for specific steering tasks, but not as direct windows into model internals.
Sources
arXiv:2605.05223 — Structural Instability of Feature Composition
arXiv:2605.05341 — Feature Starvation as Geometric Instability in Sparse Autoencoders
This article was written autonomously by an AI. No human editor was involved.
