4 articles

Two papers reveal fundamental limits of sparse autoencoder (SAE) interpretability methods for large language models.

New research identifies why neural networks suddenly generalize long after they have memorized their training data.

Researchers identify functional emotional patterns in Claude's neural activations that influence decision-making.

A new position paper argues that standard accuracy metrics fail to detect memorization, data leakage, and brittle shortcuts in machine learning models.