Dedicated endpoints expose OpenAI-compatible inference at https://<ENDPOINT_HOST>/v1. On supported routes such as GLM-5.1, you can send pre-tokenized prompts to /v1/completions and constrain decoding with SGLang/XGrammar-compatible EBNF by passing ebnf.

Curl Request

curl -sS "https://<ENDPOINT_HOST>/v1/completions" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 2,
    "temperature": 0,
    "ebnf": "root ::= \"A\" | \"B\""
  }'
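The grammar in the request above sits inside a JSON string, so its inner quotes have to be escaped. As a minimal sketch using only the Python standard library, the same body can be built as a dictionary and serialized with json.dumps, which adds the escapes. The helper name build_ebnf_request is hypothetical; the field names and values are taken from the curl request.

```python
import json

def build_ebnf_request(model, token_ids, grammar, max_tokens=2):
    """Assemble a /v1/completions body that constrains decoding with EBNF.

    Hypothetical helper for illustration; it only builds the dictionary
    and does not send anything.
    """
    return {
        "model": model,
        "prompt": list(token_ids),
        "max_tokens": max_tokens,
        "temperature": 0,
        "ebnf": grammar,
    }

# The grammar is written as ordinary text; json.dumps escapes the quotes.
body = build_ebnf_request("GLM-5.1", [9703], 'root ::= "A" | "B"')
encoded = json.dumps(body)
print(encoded)
```

The encoded string can be sent as the request body with any HTTP client, together with the Authorization and Content-Type headers shown in the curl example.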
To stream text completion chunks as server-sent events, set stream to true in the request body and pass -N to curl so that output is not buffered:
curl -N -sS "https://<ENDPOINT_HOST>/v1/completions" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "GLM-5.1",
    "prompt": [9703],
    "max_tokens": 16,
    "temperature": 0.2,
    "stream": true
  }'
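On the client side, the streamed lines have to be split back into chunks. The sketch below assumes OpenAI-style framing, where each chunk arrives as a `data: <json>` line and the stream ends with `data: [DONE]`. This page does not show the chunk format, so both the framing and the per-chunk `choices[].text` field are assumptions to verify against your endpoint. The function name iter_completion_text is hypothetical.

```python
import json

def iter_completion_text(lines):
    """Yield the text of each streamed chunk from an iterable of SSE lines.

    Assumes 'data: <json>' per chunk and a final 'data: [DONE]' sentinel;
    neither is confirmed by this page.
    """
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("text")
            if text:
                yield text

# Hand-written sample lines standing in for a real response stream.
sample = [
    'data: {"choices": [{"index": 0, "text": "Hel"}]}',
    "",
    'data: {"choices": [{"index": 0, "text": "lo"}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_completion_text(sample)))
```

Because the function takes any iterable of strings, it can be fed decoded lines from whichever HTTP client reads the response.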

Request Body

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| model | string | Yes | Use a model ID configured on your dedicated endpoint. |
| prompt | array | Yes | Non-empty array of non-negative token IDs for one request, or an array of token-ID arrays for batched requests. Text prompts should use /v1/chat/completions. |
| max_tokens | integer | No | Maximum generated tokens. Must be positive when provided. |
| min_tokens | integer | No | Minimum generated tokens before stop conditions can end generation. |
| temperature | number | No | Sampling temperature. Use 0 for deterministic decoding. |
| top_p | number | No | Nucleus sampling cutoff. |
| top_k | integer | No | Limits sampling to the top K candidate tokens. |
| min_p | number | No | Minimum probability threshold for candidate tokens. |
| frequency_penalty | number | No | Penalizes tokens based on frequency in the generated text. |
| presence_penalty | number | No | Penalizes tokens that have already appeared. |
| repetition_penalty | number | No | SGLang repetition penalty. |
| stop | string or array | No | Stop sequence or sequences. |
| stop_token_ids | array | No | Stop generation when one of these token IDs is emitted. |
| stream | boolean | No | When true, returns streaming completion chunks. |
| ebnf | string | No | SGLang/XGrammar-compatible grammar. |
| regex | string | No | Regex constraint for constrained decoding. |
| json_schema | object | No | JSON schema constraint for structured output. |
| logit_bias | object | No | Token logit adjustments. |
| n | integer | No | Number of completions to generate. |
| skip_special_tokens | boolean | No | Controls whether special tokens are removed from output text. |
Advanced SGLang passthrough fields are also accepted when you need lower-level control: custom_params, ignore_eos, no_stop_trim, spaces_between_special_tokens, stop_regex, structural_tag, custom_logit_processor, logprob_start_len, lora_path, priority, return_hidden_states, return_logprob, return_routed_experts, return_text_in_logprobs, rid, token_ids_logprob, and top_logprobs_num.
prompt must be token IDs on this endpoint. A non-empty array like [9703] is valid; ["hello"], an empty array, booleans, negative integers, and mixed token/string arrays are rejected.
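The prompt rules above can be checked on the client before a request is sent. The following is a minimal sketch of such a check, not the server's actual validation logic. The function names are hypothetical, and rejecting an empty inner array in a batched prompt is an assumption extrapolated from the single-request rule.

```python
def is_token_id(value):
    """True for a non-negative int; bool is excluded even though it subclasses int."""
    return isinstance(value, int) and not isinstance(value, bool) and value >= 0

def is_token_sequence(value):
    """True for a non-empty list made up only of token IDs."""
    return (
        isinstance(value, list)
        and len(value) > 0
        and all(is_token_id(v) for v in value)
    )

def is_valid_prompt(prompt):
    """Accept one non-empty list of token IDs, or a non-empty list of such lists."""
    if is_token_sequence(prompt):
        return True
    return (
        isinstance(prompt, list)
        and len(prompt) > 0
        and all(is_token_sequence(p) for p in prompt)
    )

# Cases taken from the rules above.
for candidate in ([9703], [[1, 2], [3]], ["hello"], [], [True], [-1], [9703, "hi"]):
    print(candidate, is_valid_prompt(candidate))
```

The explicit bool check matters in Python because True and False are instances of int and would otherwise pass as token IDs 1 and 0.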

Response Shape

Non-streaming responses use the OpenAI text completion shape:
{
  "id": "<request_id>",
  "object": "text_completion",
  "created": 1770000000,
  "model": "GLM-5.1",
  "choices": [
    {
      "index": 0,
      "text": "A",
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": null
    }
  ],
  "usage": {
    "prompt_tokens": 1,
    "completion_tokens": 1,
    "total_tokens": 2,
    "prompt_tokens_details": {"cached_tokens": 0},
    "reasoning_tokens": 0
  }
}
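A short sketch of reading this shape in Python follows, using the sample response above as input. The helper name summarize_completion is hypothetical, and sorting choices by index is a defensive assumption for requests where n is greater than 1, not documented behavior.

```python
import json

def summarize_completion(response):
    """Return (texts, usage) from a parsed non-streaming completion response."""
    choices = sorted(response.get("choices", []), key=lambda c: c.get("index", 0))
    texts = [c.get("text", "") for c in choices]
    usage = response.get("usage", {})
    return texts, usage

# The sample response from this page, trimmed to the fields read here.
raw = """
{
  "id": "<request_id>",
  "object": "text_completion",
  "model": "GLM-5.1",
  "choices": [
    {"index": 0, "text": "A", "logprobs": null, "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 1, "completion_tokens": 1, "total_tokens": 2}
}
"""

texts, usage = summarize_completion(json.loads(raw))
print(texts, usage["total_tokens"])
```

Using .get with defaults keeps the helper tolerant of responses that omit optional fields such as usage details.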
Use the model IDs and capabilities configured for your dedicated endpoint. If a model route on your endpoint does not support /v1/completions, use the standard chat completions path instead.