In active development

Same models. Lower hourly rate. Faster.

Quantization, MTP (Multi-Token Prediction), and active research into new optimization techniques that reduce memory footprint and improve inference performance across model groups.

Optimization techniques.

Built into the scheduler layer. No client-side changes required.

Quantization

Reduce model weight precision from FP16 to INT8 or lower with minimal quality loss. Smaller models fit in less VRAM, enabling more concurrent routes behind each balancer.

MTP

Multi-Token Prediction decodes multiple tokens per step, reducing the total compute required per response. Built into the scheduler layer without client-side changes.

Active Research

Continuous investigation into speculative decoding, pruning strategies, and architecture-specific optimizations that improve cost-per-token across heterogeneous hardware.

Speculative Decoding

A small draft model suggests tokens, a larger verifier accepts or rejects them. Faster inference with the same output quality. Built into the scheduler—no configuration needed.

Available now on every endpoint.

Optimization techniques are already live in the inference runtime. When you deploy a model behind a Compute or Smart Balancers endpoint, quantization and MTP apply automatically—the scheduler handles the details so your client code stays untouched.

Currently enabled

Quantization

INT8 / INT4 on supported models

MTP

Multi-token prediction on qualifying models

Speculative Decoding

Draft-verifier on qualifying groups

Coming soon

In the future, you will be able to control optimization settings directly from QDivZero—choose quantization levels, enable experimental techniques, and preview the impact on VRAM and throughput before applying them to your models.

Quantization

Quantization reduces the precision of model weights from floating-point (FP16, FP32) to lower bit-width representations (INT8, INT4, etc.). This dramatically reduces VRAM requirements, allowing more models to run concurrently on the same hardware and lowering cost-per-token.

QDivZero applies quantization at the scheduler level, meaning your existing OpenAI-compatible endpoint automatically benefits from reduced memory footprint without changes to client code or model configuration.

Example impact

FP16

70 GB VRAM

INT8

35 GB VRAM

INT4

~18 GB VRAM

MTP — Multi-Token Prediction

Standard language models predict one token at a time. MTP modifies the decoding process to predict multiple tokens simultaneously, reducing the total number of compute steps required per response.

This is built into the inference scheduler, so workloads running behind Smart Balancers or direct Compute endpoints can benefit from faster decode times without model changes or client updates.

Benefit

Fewer compute steps per response = lower cost per token + faster responses

Speculative Decoding

Standard autoregressive decoding runs the full model for every token. Speculative decoding pairs a small draft model with a larger verifier. The draft generates several tokens in parallel; the verifier accepts or rejects the sequence in a single pass.

The result: near-identical output quality at significantly lower latency. The draft learns which tokens the verifier would have picked—accepted tokens are free, rejected tokens cost only what the verifier would have spent anyway.

Benefit

Up to 3x faster token generation without changing the output

Already enabled on qualifying model groups. No configuration required—the scheduler selects the best draft-verifier pair automatically.

Research-driven optimization.

New optimization techniques are evaluated and integrated continuously as the research landscape evolves.

Access the platform Explore Smart Balancers