A research paper that has not shipped to production wiped roughly $25 billion off Micron's market cap in 48 hours.
Google Research published TurboQuant, a method for compressing the key-value cache in large language models from 16-bit precision to 3 bits. The paper claims up to 6x less memory required during inference, up to 8x faster attention computation on Nvidia H100s, no accuracy loss, and no retraining needed. Cloudflare CEO Matthew Prince called it "Google's DeepSeek". The market took the comparison literally.
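The paper's exact procedure is not described here, but the general shape of low-bit cache quantization is easy to sketch. The snippet below is a generic per-channel 3-bit quantize/dequantize round trip in Python; it illustrates the family of techniques, not TurboQuant itself, and every detail in it is an assumption.

```python
import numpy as np

# Generic per-channel asymmetric quantization to 3 bits (8 levels).
# Illustrative only: this is NOT the TurboQuant algorithm, just a sketch of
# what round-tripping cached tensors through a low-bit format looks like.
def quantize_3bit(x, axis=-1):
    lo = x.min(axis=axis, keepdims=True)
    hi = x.max(axis=axis, keepdims=True)
    scale = np.maximum((hi - lo) / 7.0, 1e-8)        # 2**3 - 1 = 7 steps
    q = np.round((x - lo) / scale).astype(np.uint8)  # integers in 0..7
    return q, scale, lo

def dequantize_3bit(q, scale, lo):
    return q.astype(np.float32) * scale + lo

cached_keys = np.random.randn(8, 128).astype(np.float32)  # stand-in KV slice
q, scale, lo = quantize_3bit(cached_keys)
err = np.abs(dequantize_3bit(q, scale, lo) - cached_keys).max()
print(f"max absolute round-trip error: {err:.3f}")
```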
What sold off
Micron fell 3.4% on the day the paper landed and kept falling. By Thursday, Micron was the worst performer in the US 100, down 16% on the week. SK Hynix dropped about 6% in Seoul. Samsung lost roughly 5%. Sandisk and Kioxia followed. The selloff dragged the broader chip sector and pulled the Nasdaq down with it.
The market's reasoning was straightforward. If AI models need 6x less memory at inference time, data centers need 6x fewer HBM chips, and HBM is the high-margin, high-growth product line that has driven memory revenue for the last two years. Compress the KV cache, and the AI infrastructure thesis cracks.
What the paper actually does
Inference memory and training memory are different things. TurboQuant targets the key-value cache, which stores the attention keys and values for tokens already processed so the model does not recompute them for every new token. The cache grows linearly with sequence length, so it dominates memory cost at long context. Compressing it 6x means a model that needed 60 GB of cache memory to serve a 100k-token conversation now needs about 10. In practical terms: the same hardware serves more concurrent conversations, or longer ones.
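A back-of-envelope way to see where figures like these come from: cache size scales linearly with layer count, number of KV heads, head dimension, sequence length, and bits per value. The sketch below uses a hypothetical model configuration, not one taken from the paper, to compare fp16 against 3-bit storage at a 100k-token context.

```python
# KV-cache sizing: 2 tensors (keys and values) per layer, per token.
# The model dimensions below are hypothetical, chosen only to illustrate
# the fp16-versus-3-bit gap at a 100k-token context.
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value):
    values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return values * bits_per_value / 8 / 1e9

cfg = dict(n_layers=80, n_kv_heads=16, head_dim=128, seq_len=100_000)
print(f"fp16 : {kv_cache_gb(**cfg, bits_per_value=16):.1f} GB")  # ~65.5 GB
print(f"3-bit: {kv_cache_gb(**cfg, bits_per_value=3):.1f} GB")   # ~12.3 GB
```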
Training, where models actually get their weights, is a separate budget. TurboQuant does not touch training memory. The training-side capex that drives most data-center HBM demand stays intact.
Bank of America analysts called the selloff overdone. Compression algorithms are not new. Nvidia has shipped similar techniques before. And Quilter Cheviot's tech research head pointed out that memory stocks came into the week off a strong run, so investors were already looking for reasons to take profit.
The "running on a MacBook" demo
Independent developers ported TurboQuant to Apple Silicon within days. One team ran a 9-billion-parameter model on a base MacBook Air with 16 GB of RAM and a 20,000-token context window. That configuration was previously impossible. The algorithm is open research. Anyone with a couple of weekends and a working knowledge of CUDA kernels can implement it.
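A rough budget shows why the cache, not the weights, is the binding constraint on a 16 GB machine. Every number below is an assumption chosen for illustration (roughly 4-bit weights, a made-up layer and head count), not a detail reported from the demo.

```python
GB = 1e9

# All numbers here are assumptions for illustration, not details from the demo:
# ~4-bit weights, 40 layers, 32 KV heads, head_dim 128, 20,000-token context.
weights      = 9e9 * 4 / 8 / GB                # ~4.5 GB of quantized weights
cache_values = 2 * 40 * 32 * 128 * 20_000      # keys + values stored
cache_fp16   = cache_values * 16 / 8 / GB      # ~13.1 GB
cache_3bit   = cache_values * 3 / 8 / GB       # ~2.5 GB

print(f"fp16 cache + weights:  {cache_fp16 + weights:.1f} GB (overruns 16 GB)")
print(f"3-bit cache + weights: {cache_3bit + weights:.1f} GB (fits with headroom)")
```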
That is the part the market arguably underweighted. Compression algorithms are evolutionary in the data center. Nvidia bakes them in over time. But on the long tail of consumer hardware and edge inference, the same compression is revolutionary. The number of GPUs in the world that can run a 9-billion-parameter model just expanded by an order of magnitude.
Who actually benefits
This is the part of the story that makes the strategic read hard to ignore. Google released TurboQuant as open research. But Google also runs more inference than nearly any other company on Earth. The savings flow to Google first. The compute cost of every Search AI Overview, every Gemini chat session, and every Vertex AI customer query drops by a meaningful factor.
Releasing a memory-efficiency breakthrough as open research is a costly signal of confidence in your own infrastructure advantage. You give the algorithm to your competitors knowing they cannot operationalize it as fast or as widely as you can. Open weights and open research are tools of competitive position when you are the largest operator. They look generous from the outside. From inside the spreadsheet, they reshape the cost basis of everyone you compete against.
Why it matters
The market priced this as a structural shift in memory demand. The technical reality is more modest in the short term: KV-cache compression is one optimization among many, training capex stays put, and the largest memory customers will continue ordering HBM. But a structural argument about the medium term has merit. Compression algorithms are evolving fast. Each generation reduces the per-token cost of inference. The trajectory of inference cost is downward and probably steeper than the chip industry's revenue model currently bakes in.
A research paper that has not yet shipped erased $25 billion of Micron's market cap. The next compression paper, also from Google, also released as open research, lands somewhere between the writing of this article and the next earnings report. What does the model price in then?
Originally published as an Instagram carousel on @recul.ai.