Same models. Lower hourly rate. Faster.
Quantization, Multi-Token Prediction (MTP), speculative decoding, and ongoing research into new optimization techniques reduce memory footprint and improve inference performance across model groups.
Optimization techniques.
Built into the scheduler layer. No client-side changes required.
Quantization
Reduce model weight precision from FP16 to INT8 or lower with minimal quality loss. Smaller models fit in less VRAM, enabling more concurrent routes behind each balancer.
MTP
Multi-Token Prediction decodes multiple tokens per step, reducing the number of decode steps required per response. Built into the scheduler layer without client-side changes.
Active Research
Continuous investigation into pruning strategies, architecture-specific optimizations, and refinements to speculative decoding that improve cost-per-token across heterogeneous hardware.
Speculative Decoding
A small draft model proposes tokens; a larger verifier accepts or rejects them. Faster inference with the same output quality. Built into the scheduler, with no configuration needed.
Available now on every endpoint.
Optimization techniques are already live in the inference runtime. When you deploy a model behind a Compute or Smart Balancers endpoint, quantization, MTP, and speculative decoding apply automatically wherever the model qualifies. The scheduler handles the details, so your client code stays untouched.
Currently enabled
Quantization
INT8 / INT4 on supported models
MTP
Multi-token prediction on qualifying models
Speculative Decoding
Draft-verifier on qualifying groups
Coming soon
In the future, you will be able to control optimization settings directly from QDivZero—choose quantization levels, enable experimental techniques, and preview the impact on VRAM and throughput before applying them to your models.
Quantization
Quantization reduces the precision of model weights from floating-point (FP16, FP32) to lower bit-width representations (INT8, INT4, etc.). Each halving of the bit width roughly halves the VRAM needed for the weights, allowing more models to run concurrently on the same hardware and lowering cost-per-token.
QDivZero applies quantization at the scheduler level, meaning your existing OpenAI-compatible endpoint automatically benefits from reduced memory footprint without changes to client code or model configuration.
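To make the idea concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It illustrates the general technique only; the scheme QDivZero actually uses is not described here, and the function names and sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    # One shared scale for the whole tensor (assumption: per-tensor scaling).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 0.9, -0.07]  # made-up sample values
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)

# Rounding error per weight is at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, f"max error = {max_error:.5f}")
```

Each weight is stored as one byte plus a shared scale, instead of two bytes at FP16. The rounding error stays within half a quantization step, which is why quality loss is typically small.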
Example impact
FP16: 70 GB VRAM
INT8: 35 GB VRAM
INT4: ~18 GB VRAM
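The example figures follow from simple arithmetic: weight memory is roughly parameter count times bits per weight. The sketch below assumes a 35-billion-parameter model (the size that yields 70 GB at FP16) and counts weights only. Real deployments also need VRAM for the KV cache, activations, and quantization metadata, so treat these as lower bounds.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: parameter count chosen so that FP16 weights take 70 GB.
NUM_PARAMS = 35_000_000_000

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
# FP16: 70.0 GB
# INT8: 35.0 GB
# INT4: 17.5 GB
```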
MTP — Multi-Token Prediction
Standard autoregressive decoding produces one token per forward pass. MTP modifies the decoding process to predict several tokens per pass, reducing the total number of decode steps required per response.
This is built into the inference scheduler, so workloads running behind Smart Balancers or direct Compute endpoints can benefit from faster decode times without model changes or client updates.
Benefit
Fewer compute steps per response = lower cost per token + faster responses
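As a rough illustration of the step-count saving, the sketch below compares decode steps for a response of a given length. It assumes the best case, where every step yields the full number of predicted tokens; in practice some predicted tokens are discarded, so real savings are smaller. The function name and the numbers are illustrative only.

```python
import math


def decode_steps(response_tokens, tokens_per_step):
    """Decode steps needed if every step yields tokens_per_step tokens (best case)."""
    return math.ceil(response_tokens / tokens_per_step)


baseline = decode_steps(256, 1)  # one token per step
with_mtp = decode_steps(256, 4)  # assumed: four tokens per step
print(baseline, with_mtp)
# 256 64
```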
Speculative Decoding
Standard autoregressive decoding runs the full model once for every token. Speculative decoding pairs a small draft model with a larger verifier. The draft generates several candidate tokens one after another at low cost; the verifier then checks the whole sequence in a single parallel pass and accepts it up to the first token it disagrees with.
The result: near-identical output quality at significantly lower latency. The draft predicts which tokens the verifier would have picked. Accepted tokens cost only the draft's small compute plus a share of one verifier pass, and a rejected token costs little more than the verifier pass that would have been needed anyway.
Benefit
Up to 3x faster token generation without changing the output
Already enabled on qualifying model groups. No configuration required—the scheduler selects the best draft-verifier pair automatically.
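The draft-and-verify loop can be sketched with two toy "models" that map a context to the next token. This is a minimal illustration using greedy acceptance, not the runtime's implementation. Both model functions are made-up stand-ins, and whether the scheduler uses greedy or sampling-based acceptance is not stated here.

```python
def verifier_next(context):
    """Toy stand-in for the large model: deterministic next token."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next(context):
    """Toy stand-in for the draft: agrees with the verifier except at every 4th position."""
    token = verifier_next(context)
    return (token + 1) % 50 if len(context) % 4 == 0 else token


def plain_decode(prompt, num_tokens):
    """Baseline: one verifier call per generated token."""
    out = list(prompt)
    for _ in range(num_tokens):
        out.append(verifier_next(out))
    return out[len(prompt):]


def speculative_decode(prompt, num_tokens, draft_len=4):
    """Greedy speculative decoding; returns (tokens, verifier passes used)."""
    out = list(prompt)
    target_len = len(prompt) + num_tokens
    verifier_passes = 0
    while len(out) < target_len:
        # 1. The draft proposes draft_len tokens, one after another.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(out + proposal))
        # 2. The verifier checks every proposed position. A real verifier does
        #    this in one batched forward pass; the toy version loops instead.
        verifier_passes += 1
        accepted = []
        for token in proposal:
            expected = verifier_next(out + accepted)
            if token == expected:
                accepted.append(token)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Everything accepted: the same pass also yields one extra token.
            accepted.append(verifier_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):target_len], verifier_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(prompt, 20)
print(tokens == plain_decode(prompt, 20), passes)
```

With greedy acceptance the output matches what the verifier alone would produce, while the number of verifier passes drops below the number of generated tokens. The better the draft's agreement rate, the larger the saving.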
Research-driven optimization.
New optimization techniques are evaluated and integrated continuously as the research landscape evolves.