New Google algorithm cuts memory usage sixfold. Is expensive hardware doomed?

Arkadiy Andrienko

Google Research has published a paper on TurboQuant, an algorithm that, according to its authors, shrinks the KV cache used by AI models by at least a factor of six, without compromising response accuracy and without any additional model training.

During text generation, models rely on the so-called KV cache, a memory buffer that stores previously computed attention data so it does not have to be recalculated at every step. But the cache grows with the context window: at long contexts it can consume tens of gigabytes, and even graphics cards with large amounts of VRAM run out of room. Traditional quantization methods have long been used to compress the cache, but they come with a hidden cost: along with the compressed data, you also have to store so-called quantization constants, the extra parameters needed to decode the values, loosely comparable to the lookup table a ZIP or RAR archiver keeps alongside compressed data. That overhead eats into the savings, and it is what TurboQuant is designed to avoid.
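
To make the numbers concrete, here is a minimal Python sketch of how the cache grows and where the extra constants come from. It is not TurboQuant itself: the model shape is hypothetical, and `quantize_block` is a generic min-max quantizer of the traditional kind, shown only to illustrate the constants that must be stored with each block.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bits_per_value=16, batch_size=1):
    """Approximate KV cache size in bytes for a decoder-only transformer."""
    # Factor 2: each layer stores one key tensor and one value tensor.
    count = 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size
    return count * bits_per_value / 8


def quantize_block(block, bits=4):
    """Generic uniform min-max quantization of one block of floats.

    Returns the integer codes plus the two constants (scale, offset)
    that have to be stored next to them to decode the block.
    """
    levels = 2 ** bits - 1
    lo, hi = min(block), max(block)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a flat block
    codes = [round((x - lo) / scale) for x in block]
    return codes, scale, lo


def dequantize_block(codes, scale, offset):
    """Rebuild approximate floats from codes and the stored constants."""
    return [offset + c * scale for c in codes]


def effective_bits(payload_bits, block_size,
                   constant_bits=16, constants_per_block=2):
    """Bits per value once the per-block constants are counted."""
    return payload_bits + constants_per_block * constant_bits / block_size


# Hypothetical model shape, chosen only for illustration.
cfg = dict(num_layers=32, num_kv_heads=8, head_dim=128)
fp16_gib = kv_cache_bytes(seq_len=128_000, **cfg) / 2**30
print(fp16_gib)  # 15.625 GiB for a single 128,000-token sequence

block = [0.0, 0.5, 1.0, 1.5]
codes, scale, offset = quantize_block(block, bits=2)
print(codes)                                   # [0, 1, 2, 3]
print(dequantize_block(codes, scale, offset))  # [0.0, 0.5, 1.0, 1.5]

# A nominal 4-bit scheme with two 16-bit constants per 32 values
# actually costs 5 bits per value.
print(effective_bits(4, block_size=32))  # 5.0
```

Under these assumptions, the constants add a full extra bit to every stored value, which is the kind of overhead the paragraph above describes.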

The researchers tested TurboQuant on open-source models such as Gemma and Mistral, using long-context benchmark suites including LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, and L-Eval. On simple tasks, the algorithm showed no loss of accuracy while cutting the KV cache size by at least six times. In more complex scenarios, such as question answering, code generation, and summarization, its advantage was less dramatic, but it still outperformed the existing KIVI compression algorithm. On NVIDIA H100 accelerators, the 4-bit version of TurboQuant delivered an eightfold speedup.

The market has already reacted to the announcement: shares of major memory manufacturers fell, reflecting a shift in investor expectations. If widespread adoption of TurboQuant lowers VRAM requirements, companies could either cut hardware costs or expand model context windows without adding hardware.
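
The trade-off between cheaper hardware and longer context can be sketched with simple arithmetic. The figures below are assumptions for illustration only: a 16-bit uncompressed cache, a hypothetical model that needs 128 KiB of cache per token, and a fixed 16 GiB memory budget.

```python
def bits_per_value_after(compression_ratio, baseline_bits=16):
    """Average bits per cached value after compressing by the given ratio."""
    return baseline_bits / compression_ratio


def max_context_tokens(memory_bytes, baseline_bytes_per_token,
                       compression_ratio=1):
    """Tokens that fit in a fixed memory budget (integer arithmetic)."""
    return memory_bytes * compression_ratio // baseline_bytes_per_token


# Hypothetical shape: 32 layers, 8 KV heads, head dimension 128,
# keys and values stored at 2 bytes each.
bytes_per_token = 2 * 32 * 8 * 128 * 2  # 131072 bytes = 128 KiB
budget = 16 * 2**30                     # 16 GiB reserved for the cache

print(round(bits_per_value_after(6), 2))                  # 2.67
print(max_context_tokens(budget, bytes_per_token))        # 131072
print(max_context_tokens(budget, bytes_per_token, 6))     # 786432
```

In this toy setup, the same memory budget holds six times as many tokens, or the same context fits in one sixth of the memory.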

The study's authors emphasize that their work isn't just an engineering fix—it's a way to curb memory consumption at a time when memory is becoming increasingly scarce.

Can an algorithm like this actually help put an end to the "memory crisis" in the market, or will the shortage remain a problem for everyday users no matter what software tricks are thrown at it? Share your thoughts in the comments.
