Launch a new instance

The launch wizard collects the model, the runtime preset, and a small set of scheduling choices. The default path is smart scheduling: you describe what you need, and the scheduler picks the cheapest compatible capacity that fits.

Launch costs

Compute bills from the moment the instance transitions to running. Stop the instance when you are done to stop the meter.
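As a rough illustration of the billing window described above, the sketch below estimates compute cost for the time between an instance entering running and being stopped. The function name, the hourly rate, and the assumption that cost scales linearly with elapsed time (no rounding, no minimum charge) are illustrative and not documented platform behaviour.

```python
from datetime import datetime, timezone


def estimate_compute_cost(running_at: datetime, stopped_at: datetime,
                          price_per_hour: float) -> float:
    """Estimate the cost of the window between 'running' and 'stopped'.

    Assumption: cost is linear in elapsed time. The platform's actual
    rounding and minimum-charge rules are not modelled here.
    """
    if stopped_at < running_at:
        raise ValueError("stopped_at must not precede running_at")
    elapsed_hours = (stopped_at - running_at).total_seconds() / 3600
    return elapsed_hours * price_per_hour


# Hypothetical timestamps and rate, for illustration only.
started = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
stopped = datetime(2025, 1, 1, 11, 30, tzinfo=timezone.utc)
cost = estimate_compute_cost(started, stopped, price_per_hour=2.0)
print(f"{cost:.2f}")
```

Time spent before the instance reaches running, or after it is stopped, falls outside this window and is not counted by the sketch.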

Fields

Field | Notes
Hugging Face repo | The exact repo id, e.g. Qwen/Qwen3-VL-Embedding-2B. The catalog validates architecture, tokenizer, and approximate VRAM.
Serving name | Public model name used by the OpenAI endpoint. Lowercase, dashes, no spaces. Must be unique inside the account.
Workload kind | chat (text generation) or embeddings (vector encoding).
Runtime preset | balanced, lower_vram, or long_context. The preset is fixed at launch time.
Context size | Defaults to 0, which means the platform picks the model’s maximum context window automatically. Set an explicit value only when you want a smaller KV cache than the model supports.
Scheduling mode | Smart (the default): the scheduler picks the best capacity for your constraints. Manual: you pick a specific GPU.
Smart constraints | Optional max hourly price, allowed regions, provider preference, allow spot, allow community. Ignored in manual mode.
Firewall | Optional pre-prompt policy. Applies to chat workloads only.
Name & description | Display metadata for the workspace. Free text.
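The serving name rule above ("lowercase, dashes, no spaces") can be checked on the client before submitting. The sketch below is one possible reading of that rule: the exact pattern the platform enforces is not documented here, so the regular expression (lowercase letters and digits in dash-separated groups) and the function name are assumptions. Uniqueness inside the account can only be verified by the platform.

```python
import re

# Assumed pattern: groups of lowercase letters/digits separated by single
# dashes. The documentation only states "lowercase, dashes, no spaces".
SERVING_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_serving_name(name: str) -> bool:
    """Return True if the name fits the assumed serving-name pattern."""
    return SERVING_NAME_RE.fullmatch(name) is not None


for candidate in ["qwen3-vl-2b-demo", "Qwen3 VL", "my_model"]:
    print(candidate, is_valid_serving_name(candidate))
```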

Runtime presets

The preset picks the runtime configuration. Three are available today. Pick balanced unless you have a specific reason to reach for the others.

Preset | When to pick it
balanced | Default. bfloat16 weights, 0.9 GPU memory utilisation. Use when in doubt; it serves the broadest range of models cleanly.
lower_vram | FP8 weights, 0.88 GPU memory utilisation. Prefer when the model is close to the GPU memory limit and you need a tighter fit.
long_context | 0.92 GPU memory utilisation, tuned for the model’s maximum context window. Pick this when the workload is dominated by long prompts.

The preset is fixed at launch time. Changing the underlying runtime configuration is not exposed in the wizard; if you need a non-default dtype, quantization, or tensor parallel size, switch to manual mode and override the values there.
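To make the differences between presets concrete, the sketch below records the values from the preset table as data and computes how much GPU memory each preset would let the runtime use on a given card. It assumes that "GPU memory utilisation" means the fraction of total GPU memory available to the runtime; the dictionary layout and function name are hypothetical, and the weight dtype for long_context is left unset because the table does not state it.

```python
from typing import Optional, TypedDict


class PresetInfo(TypedDict):
    weights_dtype: Optional[str]
    gpu_memory_utilisation: float


# Values copied from the preset table; the structure itself is illustrative.
PRESETS: dict[str, PresetInfo] = {
    "balanced": {"weights_dtype": "bfloat16", "gpu_memory_utilisation": 0.90},
    "lower_vram": {"weights_dtype": "fp8", "gpu_memory_utilisation": 0.88},
    # The table does not state a weight dtype for long_context.
    "long_context": {"weights_dtype": None, "gpu_memory_utilisation": 0.92},
}


def usable_vram_gb(preset: str, gpu_vram_gb: float) -> float:
    """Memory the runtime may use, assuming utilisation is a plain fraction."""
    return PRESETS[preset]["gpu_memory_utilisation"] * gpu_vram_gb


# Hypothetical 80 GB card, for illustration only.
for name in PRESETS:
    print(name, round(usable_vram_gb(name, 80.0), 1))
```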

Smart scheduling (default)

Smart mode replaces the GPU picker with a small set of constraints. At launch, the scheduler picks the cheapest compatible option that satisfies every constraint. The same constraints are re-evaluated on every start, so the same instance can run on different providers over time.

  • Max price: a hard ceiling in EUR/h. The scheduler never picks above it.
  • Region: pin to a single region or a list of allowed regions.
  • Provider preference: name a preferred provider, a list, or leave empty for cheapest.
  • Allow spot / allow community: include non-guaranteed capacity to lower cost.
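The selection rule described above can be sketched as a filter followed by a minimum. This is not the scheduler's implementation: the Offer and Constraints types, their field names, and the sample offers are all hypothetical. The sketch also assumes that a preferred provider wins over a cheaper non-preferred one whenever both are eligible, which the text does not specify.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Offer:
    provider: str
    region: str
    price_per_hour: float
    spot: bool = False
    community: bool = False


@dataclass
class Constraints:
    max_price: Optional[float] = None                    # None = no ceiling
    regions: list[str] = field(default_factory=list)     # empty = any region
    providers: list[str] = field(default_factory=list)   # empty = cheapest
    allow_spot: bool = False
    allow_community: bool = False


def pick_offer(offers: list[Offer], c: Constraints) -> Optional[Offer]:
    """Return the cheapest offer that satisfies every constraint, or None."""
    eligible = [
        o for o in offers
        if (c.max_price is None or o.price_per_hour <= c.max_price)
        and (not c.regions or o.region in c.regions)
        and (c.allow_spot or not o.spot)
        and (c.allow_community or not o.community)
    ]
    if not eligible:
        return None
    # Assumed ordering: preferred providers first, then lowest price.
    return min(eligible,
               key=lambda o: (o.provider not in c.providers, o.price_per_hour))


offers = [
    Offer("alpha", "eu-west", 1.80),
    Offer("beta", "eu-west", 1.20, spot=True),
    Offer("gamma", "us-east", 0.90),
    Offer("delta", "eu-west", 3.10),
]
on_demand = pick_offer(offers, Constraints(max_price=2.5, regions=["eu-west"]))
with_spot = pick_offer(
    offers, Constraints(max_price=2.5, regions=["eu-west"], allow_spot=True))
print(on_demand.provider, with_spot.provider)
```

Running the same function again with different offers or constraints mirrors the re-evaluation on every start: the result can change even though the instance definition does not.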

Allowed regions

Use one of the following values for smart_region or smart_regions. Anything else is silently ignored.

Continent | Accepted values
Europe | eu-west, eu-central, eu-north, europe
North America | us-east, us-west, north-america
Asia | asia, ap-south, ap-northeast, ap-southeast
South America | south-america
Africa | africa
Oceania | oceania
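Because unrecognised values are ignored without an error, it can help to check region values on the client before submitting them. The sketch below uses the accepted values listed above; the function name is hypothetical, and it assumes matching is exact and case-sensitive, which is not stated here.

```python
# Accepted values copied from the region table above.
ACCEPTED_REGIONS = frozenset({
    "eu-west", "eu-central", "eu-north", "europe",
    "us-east", "us-west", "north-america",
    "asia", "ap-south", "ap-northeast", "ap-southeast",
    "south-america", "africa", "oceania",
})


def split_regions(values: list[str]) -> tuple[list[str], list[str]]:
    """Split values into (accepted, rejected), preserving input order.

    Assumption: the platform compares values exactly, including case.
    """
    accepted = [v for v in values if v in ACCEPTED_REGIONS]
    rejected = [v for v in values if v not in ACCEPTED_REGIONS]
    return accepted, rejected


accepted, rejected = split_regions(["eu-west", "EU-West", "us-central"])
print(accepted, rejected)
```

Surfacing the rejected list to the user avoids launching with fewer region constraints than intended.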

Re-selection on resume

If you change the constraints on a stopped instance, the next start uses the new constraints. The previous provider stays visible in the audit log so you can compare.

Smart is the right choice for almost every launch: the scheduler has visibility into capacity and pricing across every supported provider, and the constraints you set are the only thing you need to reason about. Manual mode is available below for the cases where you need it.

Manual scheduling

Manual mode skips the smart scheduler and lets you pick a specific GPU from the catalog. Use it when you need a fixed hardware profile (compliance, predictable latency), or when you want to override the runtime configuration that the preset would otherwise pick for you.

The cost is locked at launch from the catalog price snapshot, so manual mode does not move to a cheaper provider on its own. The cost only changes when you stop the instance and start it again.

Programmatic launch (API)

The same fields are accepted by the platform API. The frontend builds a CreateInstanceRequest payload and sends it to POST /instances. The example below launches in smart mode with one allowed region and a price ceiling:

curl
curl -X POST https://api.qdiv0.com/v1/instances \
  -H "Authorization: Bearer $QDIV0_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "huggingface_repo_id": "Qwen/Qwen3-VL-Embedding-2B",
    "serving_name": "qwen3-vl-2b-demo",
    "workload_kind": "embeddings",
    "runtime_preset": "balanced",
    "scheduling_mode": "smart",
    "smart_regions": ["eu-west"],
    "smart_max_price_per_hour_usd": 2.5,
    "name": "Embedding demo",
    "description": "Sandbox for multimodal embeddings"
  }'
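The same request can be built from Python using only the standard library. The sketch below mirrors the curl call field for field; the helper name build_launch_request is hypothetical, and the response format is not documented here, so the send step is left commented out.

```python
import json
import os
import urllib.request

API_URL = "https://api.qdiv0.com/v1/instances"


def build_launch_request(api_key: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for launching an instance."""
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Same payload as the curl example.
payload = {
    "huggingface_repo_id": "Qwen/Qwen3-VL-Embedding-2B",
    "serving_name": "qwen3-vl-2b-demo",
    "workload_kind": "embeddings",
    "runtime_preset": "balanced",
    "scheduling_mode": "smart",
    "smart_regions": ["eu-west"],
    "smart_max_price_per_hour_usd": 2.5,
    "name": "Embedding demo",
    "description": "Sandbox for multimodal embeddings",
}

request = build_launch_request(os.environ.get("QDIV0_API_KEY", ""), payload)

# Uncomment to send the request; the response shape is not covered here.
# with urllib.request.urlopen(request) as response:
#     print(response.status, response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes the payload easy to inspect or log before any billable instance is created.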

After launch

The wizard redirects to the instance detail page. From there you can monitor the launch, attach a firewall, build a Smart Balancer that targets the instance, or attach a cron schedule to start and stop it on a cadence.