GLM-5.1 supports pre-tokenized prompts on the OpenAI-compatible /v1/completions endpoint. You can also constrain decoding by passing an SGLang/XGrammar-compatible EBNF grammar in the ebnf field.
Curl Request
curl -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'
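The same request can be assembled from Python. This is a minimal sketch using only the standard library; the helper name build_completion_request and its default arguments are illustrative and not part of the API. It builds the request object without sending it.

```python
import json
import urllib.request

API_URL = "https://pass.wafer.ai/v1/completions"  # endpoint from the curl example


def build_completion_request(api_key, token_ids, ebnf=None, max_tokens=2, temperature=0):
    """Build (but do not send) a POST request for a pre-tokenized prompt.

    Hypothetical helper: the name and defaults are chosen for this example.
    """
    body = {
        "model": "GLM-5.1",
        "prompt": list(token_ids),
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    if ebnf is not None:
        body["ebnf"] = ebnf  # grammar string, as in the curl example
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_completion_request("<YOUR_WAFER_API_KEY>", [9703], ebnf='root ::= "A" | "B"')
# To actually send it: json.load(urllib.request.urlopen(req))
```

Writing the grammar as a Python string and letting json.dumps serialize it avoids the manual quote escaping needed inside the shell command.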
Set stream to true and add -N to stream text completion chunks as server-sent events:
curl -N -sS "https://pass.wafer.ai/v1/completions" \
  -H "Authorization: Bearer <YOUR_WAFER_API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 16,
    "temperature": 0.2,
    "stream": true
  }'
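A client has to reassemble the streamed chunks into text. The sketch below assumes the stream follows the usual OpenAI-style convention: each event is a line of the form data: <json>, each chunk carries a choices array with a text field, and the stream ends with data: [DONE]. This page does not spell out that framing, so treat it as an assumption and check it against real output. The function name iter_completion_text and the sample lines are hypothetical.

```python
import json


def iter_completion_text(lines):
    """Yield text fragments from an iterable of decoded SSE lines.

    Assumes OpenAI-style framing ("data: <json>" events, "data: [DONE]" at the end).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream marker
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            yield choice.get("text", "")


# Hand-written sample lines, not captured from the service.
sample = [
    'data: {"choices": [{"index": 0, "text": "Hel"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": "lo"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_completion_text(sample)))
```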
Request Body
| Field | Type | Required | Notes |
|---|---|---|---|
| model | string | Yes | Use a model ID that supports tokenized completions, such as GLM-5.1. |
| prompt | array | Yes | Non-empty array of non-negative token IDs for one request, or an array of token-ID arrays for batched requests. Text prompts should use /v1/chat/completions. |
| max_tokens | integer | No | Maximum generated tokens. Must be positive when provided. |
| min_tokens | integer | No | Minimum generated tokens before stop conditions can end generation. |
| temperature | number | No | Sampling temperature. Use 0 for deterministic decoding. |
| top_p | number | No | Nucleus sampling cutoff. |
| top_k | integer | No | Limits sampling to the top K candidate tokens. |
| min_p | number | No | Minimum probability threshold for candidate tokens. |
| frequency_penalty | number | No | Penalizes tokens based on frequency in the generated text. |
| presence_penalty | number | No | Penalizes tokens that have already appeared. |
| repetition_penalty | number | No | SGLang repetition penalty. |
| stop | string or array | No | Stop sequence or sequences. |
| stop_token_ids | array | No | Stop generation when one of these token IDs is emitted. |
| stream | boolean | No | When true, returns streaming completion chunks. |
| ebnf | string | No | SGLang/XGrammar-compatible grammar. |
| regex | string | No | Regex constraint for constrained decoding. |
| json_schema | object | No | JSON schema constraint for structured output. |
| logit_bias | object | No | Token logit adjustments. |
| n | integer | No | Number of completions to generate. |
| skip_special_tokens | boolean | No | Controls whether special tokens are removed from output text. |
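Because most fields are optional, a client can build the body from only the options it sets. The following sketch shows one way to do that, including a json_schema constraint passed as an object. The helper build_body and the schema contents are hypothetical; only the field names come from the table. The example sets a single constraint field, since the table does not say how ebnf, regex, and json_schema interact.

```python
def build_body(model, prompt, **options):
    """Assemble a request body, dropping options left as None."""
    body = {"model": model, "prompt": prompt}
    for name, value in options.items():
        if value is None:
            continue
        if name == "max_tokens":
            # The table states max_tokens must be positive when provided.
            if isinstance(value, bool) or not isinstance(value, int) or value <= 0:
                raise ValueError("max_tokens must be a positive integer")
        body[name] = value
    return body


# Illustrative schema; the property names are made up for this example.
answer_schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

body = build_body(
    "GLM-5.1",
    [9703],
    max_tokens=32,
    temperature=0,
    top_p=None,  # left out of the body entirely
    json_schema=answer_schema,
)
```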
Advanced SGLang passthrough fields are also accepted when you need lower-level control: custom_params, ignore_eos, no_stop_trim, spaces_between_special_tokens, stop_regex, structural_tag, custom_logit_processor, logprob_start_len, lora_path, priority, return_hidden_states, return_logprob, return_routed_experts, return_text_in_logprobs, rid, token_ids_logprob, and top_logprobs_num.
prompt must be token IDs on this endpoint. A non-empty array like [9703] is valid; ["hello"], an empty array, booleans, negative integers, and mixed token/string arrays are rejected.
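These rules can be checked on the client before a request is sent. The sketch below mirrors the cases listed above; the function names are hypothetical, and the handling of cases this page does not mention (for example, an empty inner array in a batch, which is rejected here) is an assumption rather than documented server behavior.

```python
def is_token_id(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0


def is_token_sequence(value):
    """True for a non-empty list made up only of non-negative integers."""
    return isinstance(value, list) and len(value) > 0 and all(is_token_id(v) for v in value)


def is_valid_prompt(prompt):
    """True for a single token-ID list or a non-empty batch of such lists."""
    if is_token_sequence(prompt):
        return True
    return (
        isinstance(prompt, list)
        and len(prompt) > 0
        and all(is_token_sequence(p) for p in prompt)
    )


print(is_valid_prompt([9703]))           # single request
print(is_valid_prompt([[1, 2], [3]]))    # batched request
print(is_valid_prompt(["hello"]))        # text is rejected
```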
Response Shape
Non-streaming responses use the OpenAI text completion shape:
{
  "id": "<request_id>",
  "object": "text_completion",
  "created": 1770000000,
  "model": "GLM-5.1",
  "choices": [
    {
      "index": 0,
      "text": "A",
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": null
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0},
    "reasoning_tokens": 0
  }
}
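Reading the generated text and token counts out of this shape takes only a few lines. This is a sketch that parses the sample response shown above; the helper name completion_texts is hypothetical.

```python
import json

# The sample response from this section, as the raw JSON text a client would receive.
raw = """
{
  "id": "<request_id>",
  "object": "text_completion",
  "created": 1770000000,
  "model": "GLM-5.1",
  "choices": [
    {"index": 0, "text": "A", "logprobs": null,
     "finish_reason": "stop", "matched_stop": null}
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0},
    "reasoning_tokens": 0
  }
}
"""


def completion_texts(response):
    """Return the generated texts ordered by choice index."""
    ordered = sorted(response["choices"], key=lambda c: c["index"])
    return [choice["text"] for choice in ordered]


response = json.loads(raw)
texts = completion_texts(response)
usage = response["usage"]
print(texts, usage["total_tokens"])
```

Sorting by index keeps results aligned with their position when a request asks for more than one completion through n.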
For ordinary chat-style messages, use POST https://pass.wafer.ai/v1/chat/completions instead.