Tokenized Completions and Constrained Decoding

Wafer’s self-hosted SGLang routes accept pre-tokenized prompts on the OpenAI-compatible /v1/completions endpoint. On the same endpoint you can constrain decoding with an SGLang/XGrammar-compatible EBNF grammar by passing it in the ebnf field.

Model Support

/v1/completions is only available on Wafer’s self-hosted SGLang routes. Models proxied to an upstream that does not expose a text-completion endpoint return 404 for /v1/completions; use /v1/chat/completions for those models instead.

Model	`/v1/completions`	Notes
`GLM-5.2`	Yes	Full token-ID + EBNF / regex / json_schema support
`glm5.2-fast`	Yes	Full token-ID + EBNF / regex / json_schema support
`GLM-5.1`	Yes	Full token-ID + EBNF / regex / json_schema support
`Qwen3.5-397B-A17B`	Yes	Full token-ID + constrained decoding
`Kimi-K2.7-Code`	No	Upstream proxy — chat completions only
`Kimi-K2.6`	No	Upstream proxy — chat completions only
`Qwen3.6-35B-A3B`	No	DashScope upstream — chat completions only
`qwen3.7-max`	No	DashScope upstream — chat completions only
`MiniMax-M3`	No	Upstream proxy — chat completions only

The same restriction applies to ebnf, regex, and json_schema constrained decoding: these fields require an SGLang backend, so they only work on the “Yes” rows above. For structured output on the rest of the catalog, use response_format: {"type": "json_object"} on /v1/chat/completions, which is broadly supported.
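
The routing rule in the table above can be sketched as a small client-side lookup. This is a minimal illustration, not part of any Wafer SDK: the COMPLETIONS_SUPPORT mapping is transcribed by hand from the table, and endpoint_for is a hypothetical helper name.

```python
# Support flags transcribed from the Model Support table; update by hand
# if the catalog changes. This mapping is an illustration, not an API.
COMPLETIONS_SUPPORT = {
    "GLM-5.2": True,
    "glm5.2-fast": True,
    "GLM-5.1": True,
    "Qwen3.5-397B-A17B": True,
    "Kimi-K2.7-Code": False,
    "Kimi-K2.6": False,
    "Qwen3.6-35B-A3B": False,
    "qwen3.7-max": False,
    "MiniMax-M3": False,
}


def endpoint_for(model: str) -> str:
    """Return the request path to use for `model` (hypothetical helper)."""
    if model not in COMPLETIONS_SUPPORT:
        raise KeyError(f"unknown model: {model}")
    if COMPLETIONS_SUPPORT[model]:
        # Self-hosted SGLang route: token-ID prompts and grammar constraints work.
        return "/v1/completions"
    # Upstream-proxied model: only chat completions are exposed.
    return "/v1/chat/completions"
```

Checking the path before sending avoids a 404 round trip for the upstream-proxied models.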

Curl Request

curl -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'
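
For callers who prefer Python over curl, the same request can be assembled with the standard library. This is a sketch under stated assumptions: build_completion_request and WAFER_COMPLETIONS_URL are hypothetical names, and the function only constructs the request object without sending it.

```python
import json
import urllib.request

# Same URL as the curl example above.
WAFER_COMPLETIONS_URL = "https://pass.wafer.ai/v1/completions"


def build_completion_request(api_key, token_ids, ebnf=None,
                             model="GLM-5.1", max_tokens=2):
    """Build (but do not send) a POST mirroring the curl example."""
    payload = {
        "model": model,
        "prompt": token_ids,      # token IDs, not text
        "max_tokens": max_tokens,
        "temperature": 0,         # deterministic decoding
    }
    if ebnf is not None:
        payload["ebnf"] = ebnf    # grammar constraint, e.g. 'root ::= "A" | "B"'
    return urllib.request.Request(
        WAFER_COMPLETIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen sends the request. Building the JSON with json.dumps also avoids the shell quoting needed for the escaped quotes inside the ebnf string.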

Set stream to true to receive text completion chunks as server-sent events, and pass -N so curl prints them without buffering:

curl -N -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 16,
    "temperature": 0.2,
    "stream": true
  }'
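
A client consuming the stream has to pull the text out of each event. The sketch below assumes the chunks follow the usual OpenAI streaming convention, where each event is a `data: <json>` line and the stream ends with `data: [DONE]`; this page only states that chunks arrive as server-sent events, so treat the exact framing as an assumption. iter_stream_text is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from SSE lines (assumed OpenAI-style framing)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # assumed end-of-stream sentinel
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text
```

The function takes any iterable of decoded lines, so it can be fed from an HTTP response body or from a recorded transcript when testing.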

Request Body

Field	Type	Required	Notes
`model`	string	Yes	Use a model ID that supports tokenized completions, such as `GLM-5.1`.
`prompt`	array	Yes	Non-empty array of non-negative token IDs for one request, or an array of token-ID arrays for batched requests. Text prompts should use `/v1/chat/completions`.
`max_tokens`	integer	No	Maximum generated tokens. Must be positive when provided.
`min_tokens`	integer	No	Minimum generated tokens before stop conditions can end generation.
`temperature`	number	No	Sampling temperature. Use `0` for deterministic decoding.
`top_p`	number	No	Nucleus sampling cutoff.
`top_k`	integer	No	Limits sampling to the top K candidate tokens.
`min_p`	number	No	Minimum probability threshold for candidate tokens.
`frequency_penalty`	number	No	Penalizes tokens based on frequency in the generated text.
`presence_penalty`	number	No	Penalizes tokens that have already appeared.
`repetition_penalty`	number	No	SGLang repetition penalty.
`stop`	string or array	No	Stop sequence or sequences.
`stop_token_ids`	array	No	Stop generation when one of these token IDs is emitted.
`stream`	boolean	No	When `true`, returns streaming completion chunks.
`ebnf`	string	No	SGLang/XGrammar-compatible grammar.
`regex`	string	No	Regex constraint for constrained decoding.
`json_schema`	object	No	JSON schema constraint for structured output.
`logit_bias`	object	No	Token logit adjustments.
`n`	integer	No	Number of completions to generate.
`skip_special_tokens`	boolean	No	Controls whether special tokens are removed from output text.

Advanced SGLang passthrough fields are also accepted when you need lower-level control: custom_params, ignore_eos, no_stop_trim, spaces_between_special_tokens, stop_regex, structural_tag, custom_logit_processor, logprob_start_len, lora_path, priority, return_hidden_states, return_logprob, return_routed_experts, return_text_in_logprobs, rid, token_ids_logprob, and top_logprobs_num.

prompt must be token IDs on this endpoint. A non-empty array like [9703] is valid; ["hello"], an empty array, booleans, negative integers, and mixed token/string arrays are rejected.
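
The acceptance rules above can be mirrored in a client-side check before sending a request. This is a minimal sketch of the rules as described here; the server may apply further checks (for example, a vocabulary-size bound) that are not covered, and is_valid_prompt is a hypothetical helper name.

```python
def _is_token_id(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def _is_token_list(value) -> bool:
    # A single request: non-empty list of non-negative integer token IDs.
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(_is_token_id(v) for v in value)
    )


def is_valid_prompt(prompt) -> bool:
    """Accept one token-ID list, or a non-empty list of such lists (batch)."""
    if _is_token_list(prompt):
        return True
    return (
        isinstance(prompt, list)
        and len(prompt) > 0
        and all(_is_token_list(p) for p in prompt)
    )
```

For example, is_valid_prompt([9703]) and is_valid_prompt([[1, 2], [3]]) return True, while ["hello"], [], [True], [-1], and [1, "a"] all return False.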

Response Shape

Non-streaming responses use the OpenAI text completion shape:

{
  "id": "<request_id>",
  "object": "text_completion",
  "created": 1770000000,
  "model": "GLM-5.1",
  "choices": [
    {
      "index": 0,
      "text": "A",
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": null
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0},
    "reasoning_tokens": 0
  }
}
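
Reading the fields a caller usually needs from this shape might look like the following. It is a sketch that relies only on the keys shown in the example response; summarize_completion is a hypothetical helper name, and optional keys are read defensively in case a backend omits them.

```python
def summarize_completion(response: dict) -> dict:
    """Extract the first choice's text and token usage from a parsed response."""
    choice = response["choices"][0]
    usage = response.get("usage") or {}
    details = usage.get("prompt_tokens_details") or {}
    return {
        "text": choice["text"],
        "finish_reason": choice.get("finish_reason"),
        "prompt_tokens": usage.get("prompt_tokens"),
        "completion_tokens": usage.get("completion_tokens"),
        "cached_tokens": details.get("cached_tokens"),
    }
```

When n is greater than 1, the response carries one entry per completion in choices, so iterate over the list instead of taking index 0.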

For ordinary chat-style messages, use POST https://pass.wafer.ai/v1/chat/completions instead.
