API reference

Streaming

How responses arrive — complete on the hosted API today, and token-by-token when you self-host.

Hosted API

The managed API currently returns a single complete response. You may send "stream": true for OpenAI-SDK compatibility; it is accepted and ignored, and you get the full completion in one object. These models are small and fast, so end-to-end latency is low for typical front-desk turns.
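As a rough illustration of reading that single object, the sketch below pulls the assistant text out of a non-streamed response body. It assumes the body follows the standard OpenAI chat.completion JSON shape (choices[0].message.content), which the OpenAI-SDK compatibility suggests but this page does not spell out. The sample body and the helper name completion_text are hypothetical.

```python
import json


def completion_text(body: str) -> str:
    """Return the assistant text from a non-streamed chat.completion JSON body.

    Assumes the standard OpenAI response shape; returns "" if content is null.
    """
    obj = json.loads(body)
    return obj["choices"][0]["message"]["content"] or ""


# Hypothetical response body, trimmed to the fields this helper reads.
sample_body = json.dumps(
    {
        "object": "chat.completion",
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": "Try three sets of squats."},
                "finish_reason": "stop",
            }
        ],
    }
)

print(completion_text(sample_body))
```

Because the whole reply arrives at once, there is no delta handling on this path: the text is available as soon as the body is parsed.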

Tip. Token-by-token streaming on the hosted API is on the roadmap and will use the standard OpenAI server-sent-events format below, so SDK code written against stream=True will keep working unchanged when it lands.

When self-hosting

Run the weights with vLLM or llama.cpp and you get native, real-time streaming today — both expose an OpenAI-compatible streaming endpoint. Set stream=True and consume deltas:

python

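from openai import OpenAI

# Point the SDK at the local server; the API key here is only a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="fitness",
    messages=[{"role": "user", "content": "Beginner full-body workout?"}],
    stream=True,  # request incremental chunks instead of one complete response
)
for chunk in stream:
    # Some chunks (for example a trailing usage chunk) carry no choices; skip them.
    if not chunk.choices:
        continue
    # delta.content is None on role-only and final chunks, so fall back to "".
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()  # finish the streamed output with a newline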

See Self-hosting to stand up that endpoint.

The SSE format

When streaming, the response is text/event-stream: a sequence of data: lines, each carrying a chat.completion.chunk object whose choices[0].delta.content holds the next piece of text. The stream ends with a data: [DONE] sentinel. This is byte-for-byte the OpenAI streaming contract, so any OpenAI streaming client parses it without changes.
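To make the wire format concrete, here is a minimal sketch of parsing such a stream by hand, without the SDK. It works on an iterable of already-decoded text lines. The helper name iter_deltas and the sample lines are hypothetical, and the sample chunks are trimmed to the fields described above (real chunks also carry fields such as id and model).

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield each text piece from an OpenAI-style SSE stream of data: lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and any non-data fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # sentinel: the stream is finished
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue  # tolerate chunks that carry no choices
        text = (choices[0].get("delta") or {}).get("content")
        if text:
            yield text


# Hypothetical stream: a role-only chunk, two content chunks, then the sentinel.
sample_stream = [
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant"}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "Start with "}}]}',
    "",
    'data: {"object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "squats."}}]}',
    "",
    "data: [DONE]",
]

reply = "".join(iter_deltas(sample_stream))
print(reply)
```

In practice an OpenAI streaming client does this parsing for you; the sketch is only meant to show what each data: line contributes to the final text.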
