New benchmarks and multi-agent systems expose performance gaps when language models must reason through long chains of decisions.