Researchers tackle post-training quantization errors that distort model behavior when models are compressed to meet memory and latency constraints.
New quantization algorithm enables longer context windows and 3.2× memory savings for local inference.
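Neither summary describes the specific algorithm, but as background on the technique both items reference, here is a minimal sketch of generic post-training quantization (symmetric absmax int8 weight quantization). The function names are illustrative, not taken from either article, and real schemes typically add per-channel or per-group scales.

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    """Symmetric absmax post-training quantization of a weight tensor to int8."""
    scale = np.abs(w).max() / 127.0  # map the largest magnitude onto the int8 range
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximate float tensor for inference-time matmuls."""
    return q.astype(np.float32) * scale

# Example: int8 storage is ~4x smaller than float32, minus the overhead of
# stored scales; finer-grained scales reduce error but shrink the savings,
# which is one way a scheme ends up at a ratio like 3.2x rather than 4x.
w = np.random.randn(4096, 4096).astype(np.float32)
q, scale = quantize_absmax_int8(w)
max_error = np.abs(w - dequantize_int8(q, scale)).max()
```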