Tether’s AI Research Group has released an open-source production version of TurboQuant, a memory compression algorithm originally developed by Google Research.
The release is part of QVAC SDK 0.12.0 and targets laptops, phones, edge devices, and decentralized networks. It allows local AI models to handle longer sessions without relying on cloud infrastructure.
This marks a practical shift in how on-device AI manages memory-intensive tasks.
Memory has long been a barrier to running capable AI models on consumer hardware. When an AI assistant processes a long document or conversation, it stores that context in the key-value (KV) cache, which grows with every token the model reads or generates.
At roughly 262,000 tokens, the KV cache alone for a 4B-parameter model can consume around 8 GB of memory. Four concurrent sessions can push that figure to 32 GB before accounting for the model itself.
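Those figures follow from how a KV cache scales: it holds one key vector and one value vector per token, per layer, per key/value head. A rough sketch of that arithmetic is below. The layer count, head count, and head dimension are hypothetical values chosen so the result lands on the roughly 8 GB quoted; the release does not name the model or its configuration, and 16-bit storage is likewise an assumption.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2, sessions=1):
    """Estimate KV cache size: keys and values for every token, layer, and KV head."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = key + value
    return per_token * tokens * sessions


# Hypothetical 4B-class configuration (an assumption, not taken from the release).
config = dict(layers=32, kv_heads=2, head_dim=128)
tokens = 262_144  # 2**18, i.e. "roughly 262,000"

one_session = kv_cache_bytes(tokens, **config)
four_sessions = kv_cache_bytes(tokens, sessions=4, **config)

print(f"one session:   {one_session / GIB:.1f} GiB")
print(f"four sessions: {four_sessions / GIB:.1f} GiB")
```

Because the cache grows linearly with both token count and session count, doubling either one doubles the memory required, which is why long contexts run into consumer hardware limits so quickly.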
TurboQuant addresses this by compressing the KV cache by a factor of up to five while keeping output quality close to that of an uncompressed model.
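To make that ratio concrete, here is a small sketch that applies a 5x reduction to the roughly 8 GB per-session figure quoted earlier. The 5x value is the "up to" upper bound from the release, so real savings may be smaller. The 16 GB cache budget and the 16-bit baseline are assumptions for illustration only, and the function name is hypothetical.

```python
def sessions_that_fit(budget_gb: int, cache_gb: int, ratio: int = 1) -> int:
    """How many sessions fit in a cache budget at a given compression ratio.

    Uses integer arithmetic: n * (cache_gb / ratio) <= budget_gb
    is equivalent to n <= budget_gb * ratio / cache_gb.
    """
    return (budget_gb * ratio) // cache_gb


BUDGET_GB = 16  # assumed memory left for KV caches after loading the model
CACHE_GB = 8    # per-session figure quoted in the article
RATIO = 5       # best-case compression ratio from the release

print("uncompressed sessions:", sessions_that_fit(BUDGET_GB, CACHE_GB))
print("compressed sessions:  ", sessions_that_fit(BUDGET_GB, CACHE_GB, RATIO))
print("compressed cache per session (GB):", CACHE_GB / RATIO)
# If the baseline stores 16 bits per value, 5x compression averages 3.2 bits per value.
print("average bits per value:", 16 / RATIO)
```

Under these assumptions the same budget goes from two long sessions to ten, which is the kind of headroom the release is pointing at.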
A user can now ask a laptop-based assistant to analyze a hundred-page legal document without uploading it to a remote server.
Students, developers, journalists, and researchers can all benefit from longer, more context-aware AI sessions on devices they already own.
Speaking on the broader reasoning behind the release, Tether CEO Paolo Ardoino pointed to the gap between research and practical software.
“Google’s research showed that AI memory could be compressed far more efficiently than most people assumed,” he said. “Our work brings that breakthrough into production software that developers, startups, and users can actually build with.”
The production release includes a full quantization pipeline, framework adapters, developer documentation, and workload-tuned profiles.
These components are designed for real environments outside hyperscale data centers, covering constrained memory, mixed hardware, and latency-sensitive deployments.
TurboQuant ships as part of QVAC SDK 0.12.0, integrated directly into Fabric, a core component of the QVAC stack.
Fabric began as a llama.cpp fork and has since grown to incorporate multiple research advances. The SDK gives developers a unified set of tools, libraries, and runtime components for building local AI applications.
For startups and independent developers, this removes the assumption that large AI products require expensive GPU clusters.
Teams can now design for longer context windows, larger file workloads, and flexible deployment across consumer and edge hardware. That opens practical paths for building AI products without cloud-only architecture.
Addressing concerns around data privacy and cloud dependency, Ardoino made the case for keeping AI tasks on local devices.
“People should be able to ask an AI assistant to read a long document or work through private information without every task being forced through a remote data center,” he said. TurboQuant, in that sense, gives local AI more operational room.
Tether’s strategy centers on AI that runs closer to users, across personal devices and decentralized networks. The company sees software efficiency and portability as defining factors in the next phase of AI development, alongside large-scale compute infrastructure.
