Launch a new instance
The launch wizard collects the model, the runtime preset, and a small set of scheduling choices. The default path is smart scheduling: you describe what you need, and the scheduler picks the cheapest compatible capacity that fits.
Launch costs
The instance is billed while it is running. Stop the instance when you are done to stop the meter.
Fields
| Field | Notes |
|---|---|
| Hugging Face repo | The exact repo id, e.g. Qwen/Qwen3-VL-Embedding-2B. The catalog validates architecture, tokenizer, and approximate VRAM. |
| Serving name | Public model name used by the OpenAI endpoint. Lowercase, dashes, no spaces. Must be unique inside the account. |
| Workload kind | chat (text generation) or embeddings (vector encoding). |
| Runtime preset | balanced, lower_vram, or long_context. The preset is fixed at launch time. |
| Context size | Defaults to 0, which means the platform picks the model’s maximum context window automatically. Set an explicit value only when you want a smaller KV cache than the model supports. |
| Scheduling mode | Smart (the default): the scheduler picks the best capacity for your constraints. Manual: you pick a specific GPU. |
| Smart constraints | Optional max hourly price, allowed regions, provider preference, allow spot, allow community. Ignored in manual mode. |
| Firewall | Optional pre-prompt policy. Applies to chat workloads only. |
| Name & description | Display metadata for the workspace. Free text. |
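The constraints in the table can be checked on the client before a launch is attempted. The following is a minimal sketch, not the platform's own validation: the regular expression is one reading of "lowercase, dashes, no spaces", the `context_size` key is a hypothetical name for the Context size field, and the function name is illustrative.

```python
import re

# Allowed values taken from the fields table above.
WORKLOAD_KINDS = {"chat", "embeddings"}
RUNTIME_PRESETS = {"balanced", "lower_vram", "long_context"}

# Assumption: lowercase letters and digits separated by single dashes.
# The platform's real rule for serving names may be stricter or looser.
SERVING_NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")


def check_launch_fields(fields: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    # A repo id has the form owner/name, e.g. Qwen/Qwen3-VL-Embedding-2B.
    repo = fields.get("huggingface_repo_id", "")
    if repo.count("/") != 1 or not all(repo.split("/")):
        problems.append("huggingface_repo_id should look like owner/name")

    if not SERVING_NAME_RE.match(fields.get("serving_name", "")):
        problems.append("serving_name should be lowercase with dashes, no spaces")

    if fields.get("workload_kind") not in WORKLOAD_KINDS:
        problems.append("workload_kind should be chat or embeddings")

    # balanced is the documented default preset.
    if fields.get("runtime_preset", "balanced") not in RUNTIME_PRESETS:
        problems.append("runtime_preset should be balanced, lower_vram, or long_context")

    # 0 means the platform picks the model's maximum context window.
    # "context_size" is a hypothetical key name for this field.
    context_size = fields.get("context_size", 0)
    if not isinstance(context_size, int) or context_size < 0:
        problems.append("context_size should be 0 or a positive integer")

    return problems
```

A check like this only catches formatting mistakes early. Uniqueness of the serving name and the catalog's architecture, tokenizer, and VRAM validation still happen on the platform side.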
Runtime presets
The preset picks the runtime configuration. Three are available today. Pick balanced unless you have a specific reason to reach for the others.
| Preset | When to pick it |
|---|---|
| balanced | Default. bfloat16 weights, 0.9 GPU memory utilisation. Use when in doubt — it serves the broadest range of models cleanly. |
| lower_vram | FP8 weights, 0.88 GPU memory utilisation. Prefer when the model is close to the GPU memory limit and you need a tighter fit. |
| long_context | 0.92 GPU memory utilisation, tuned for the model’s maximum context window. Pick this when the workload is dominated by long prompts. |
The preset is fixed at launch time. Changing the underlying runtime configuration is not exposed in the wizard; if you need a non-default dtype, quantization, or tensor parallel size, switch to manual mode and override the values there.
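To make the preset table concrete, the sketch below records the documented values and estimates how much GPU memory the runtime may claim under each preset. It assumes that "GPU memory utilisation" means the fraction of the GPU's memory the runtime is allowed to use; the class and function names are illustrative, and the dtype for long_context is left unset because the table does not state it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PresetSettings:
    weights_dtype: Optional[str]   # None where the table does not name a dtype
    gpu_memory_utilisation: float  # assumed: fraction of GPU memory the runtime may claim


# Values copied from the preset table above.
PRESETS = {
    "balanced": PresetSettings("bfloat16", 0.90),
    "lower_vram": PresetSettings("fp8", 0.88),
    "long_context": PresetSettings(None, 0.92),
}


def usable_vram_gb(preset: str, gpu_vram_gb: float) -> float:
    """Estimate the memory (GB) the runtime may claim on a GPU of the given size."""
    return PRESETS[preset].gpu_memory_utilisation * gpu_vram_gb


if __name__ == "__main__":
    for name in PRESETS:
        print(f"{name}: {usable_vram_gb(name, 80.0):.1f} GB of an 80 GB GPU")
```

Under this reading, an 80 GB GPU gives the runtime about 72 GB with balanced, about 70.4 GB with lower_vram, and about 73.6 GB with long_context.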
Smart scheduling (default)
Smart mode replaces the GPU picker with a small set of constraints. At launch, the scheduler picks the cheapest compatible option that satisfies every constraint. The same constraints are re-evaluated on every start, so the same instance can run on different providers over time.
- Max price: a hard ceiling in EUR/h. The scheduler never picks above it.
- Region: pin to a single region or a list of allowed regions.
- Provider preference: name a preferred provider, a list, or leave empty for cheapest.
- Allow spot / allow community: include non-guaranteed capacity to lower cost.
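The selection rule described above can be sketched as a filter followed by a minimum over price. This is an illustration of the idea, not the scheduler's implementation: the `Offer` type, the `pick_offer` function, and the provider names in the usage example are hypothetical, and treating the provider preference as a soft preference (fall back to the cheapest offer when no preferred provider qualifies) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Offer:
    provider: str
    region: str
    price_per_hour: float
    spot: bool = False
    community: bool = False


def pick_offer(
    offers: Sequence[Offer],
    max_price: Optional[float] = None,
    regions: Sequence[str] = (),
    preferred_providers: Sequence[str] = (),
    allow_spot: bool = False,
    allow_community: bool = False,
) -> Optional[Offer]:
    """Return the cheapest offer that satisfies every constraint, or None."""
    candidates = [
        o for o in offers
        if (max_price is None or o.price_per_hour <= max_price)  # hard price ceiling
        and (not regions or o.region in regions)                 # empty list = any region
        and (allow_spot or not o.spot)
        and (allow_community or not o.community)
    ]
    if not candidates:
        return None
    # Assumption: preferred providers win when any of them qualifies;
    # otherwise the cheapest remaining offer is used.
    preferred = [o for o in candidates if o.provider in preferred_providers]
    return min(preferred or candidates, key=lambda o: o.price_per_hour)


if __name__ == "__main__":
    offers = [
        Offer("provider-a", "eu-west", 2.0),
        Offer("provider-b", "eu-west", 1.2, spot=True),
        Offer("provider-c", "us-east", 0.9),
        Offer("provider-d", "eu-west", 1.5),
    ]
    print(pick_offer(offers, max_price=2.5, regions=["eu-west"]))
```

Because the same function would run again on every start with fresh offers, the result can differ between starts even when the constraints do not change.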
Allowed regions
Use one of the following values for smart_region or smart_regions. Anything else is silently ignored.
| Continent | Accepted values |
|---|---|
| Europe | eu-west, eu-central, eu-north, europe |
| North America | us-east, us-west, north-america |
| Asia | asia, ap-south, ap-northeast, ap-southeast |
| South America | south-america |
| Africa | africa |
| Oceania | oceania |
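Because unrecognised values are silently ignored, a typo in a region name can widen the search without any error. The sketch below is a client-side check built from the table above; the function name is illustrative, and exact, case-sensitive matching is an assumption.

```python
# Accepted values copied from the regions table above.
ACCEPTED_REGIONS = {
    "Europe": {"eu-west", "eu-central", "eu-north", "europe"},
    "North America": {"us-east", "us-west", "north-america"},
    "Asia": {"asia", "ap-south", "ap-northeast", "ap-southeast"},
    "South America": {"south-america"},
    "Africa": {"africa"},
    "Oceania": {"oceania"},
}

ALL_REGIONS = frozenset().union(*ACCEPTED_REGIONS.values())


def split_regions(requested):
    """Split requested regions into (accepted, unknown), preserving order.

    Assumption: values must match the table exactly, including case.
    """
    accepted = [r for r in requested if r in ALL_REGIONS]
    unknown = [r for r in requested if r not in ALL_REGIONS]
    return accepted, unknown


if __name__ == "__main__":
    accepted, unknown = split_regions(["eu-west", "EU-West", "mars"])
    print("accepted:", accepted)
    print("would be ignored:", unknown)
```

Rejecting a launch locally when the unknown list is non-empty avoids the case where every requested region is dropped and the instance starts somewhere unintended.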
Re-selection on resume
Smart is the right choice for almost every launch: the scheduler has visibility into capacity and pricing across every supported provider, and the constraints you set are the only thing you need to reason about. Manual mode is available below for the cases where you need it.
Manual scheduling
Manual mode skips the smart scheduler and lets you pick a specific GPU from the catalog. Use it when you need a fixed hardware profile (compliance, predictable latency), or when you want to override the runtime configuration that the preset would otherwise pick for you.
The cost is locked at launch from the catalog price snapshot, so manual mode does not move to a cheaper provider on its own. The cost only changes when you stop the instance and start it again.
Programmatic launch (API)
The same fields are accepted by the platform API. The frontend builds a CreateInstanceRequest payload and sends it to POST /instances. The example below uses the smart mode defaults:
```bash
curl -X POST https://api.qdiv0.com/v1/instances \
  -H "Authorization: Bearer $QDIV0_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "huggingface_repo_id": "Qwen/Qwen3-VL-Embedding-2B",
    "serving_name": "qwen3-vl-2b-demo",
    "workload_kind": "embeddings",
    "runtime_preset": "balanced",
    "scheduling_mode": "smart",
    "smart_regions": ["eu-west"],
    "smart_max_price_per_hour_usd": 2.5,
    "name": "Embedding demo",
    "description": "Sandbox for multimodal embeddings"
  }'
```
After launch
The wizard redirects to the instance detail page. From there you can monitor the launch, attach a firewall, build a Smart Balancer that targets the instance, or attach a cron schedule to start and stop it on a cadence.