Streaming
Server-Sent Events shape, the [DONE] sentinel, and how to reconnect.
Set "stream": true on any chat completions call and the response becomes a Server-Sent Events stream.
Wire format
The response uses Content-Type: text/event-stream. Each chunk is a line prefixed with data: followed by either a JSON object or the literal string [DONE]:
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1700000000,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}
data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Chunks are separated by a blank line. The first chunk usually contains the role. Subsequent chunks contain delta.content fragments to concatenate. The last JSON chunk, the one just before [DONE], carries the final finish_reason. The literal data: [DONE] line then terminates the stream.
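The framing rules above can be sketched as a small line classifier. This is an illustration only, not part of any SDK: the names parseSseLine and collectText and the returned shape are assumptions made for the example.

```javascript
// Hypothetical helper: classify one line of the event stream.
// Returns { kind: "chunk", chunk }, { kind: "done" }, or null for
// blank separator lines and anything that is not a data: line.
function parseSseLine(line) {
  const trimmed = line.trim();
  if (!trimmed.startsWith("data:")) return null;
  const payload = trimmed.slice("data:".length).trim();
  if (payload === "") return null;
  if (payload === "[DONE]") return { kind: "done" };
  return { kind: "chunk", chunk: JSON.parse(payload) };
}

// Walk a fully captured stream body and concatenate the content fragments.
function collectText(body) {
  let text = "";
  for (const line of body.split("\n")) {
    const event = parseSseLine(line);
    if (event === null) continue;
    if (event.kind === "done") break; // sentinel: nothing after this matters
    text += event.chunk.choices[0]?.delta?.content ?? "";
  }
  return text;
}
```

collectText expects the whole body as one string, which is handy for tests and logged captures. A live stream needs incremental buffering instead.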
Concatenating
Build the assistant message by appending every delta.content:
let text = "";
for await (const chunk of stream) {
  text += chunk.choices[0]?.delta?.content ?? "";
}
For tool calls, deltas arrive on delta.tool_calls[i].function.arguments and may split a single JSON value across many chunks. Buffer until the stream ends, then parse.
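A minimal sketch of that buffering, assuming each delta.tool_calls entry carries an index field that says which call a fragment belongs to (check this against the API version you target). The name createToolCallAccumulator and the result shape are made up for the example.

```javascript
// Hypothetical accumulator for streamed tool calls.
// Assumption: every delta.tool_calls entry has an `index` identifying its call.
function createToolCallAccumulator() {
  const calls = new Map(); // index -> { id, name, args }

  return {
    // Feed every parsed chunk here as it arrives.
    add(chunk) {
      for (const part of chunk.choices[0]?.delta?.tool_calls ?? []) {
        const call = calls.get(part.index) ?? { id: "", name: "", args: "" };
        if (part.id) call.id = part.id;
        if (part.function?.name) call.name = part.function.name;
        call.args += part.function?.arguments ?? ""; // raw text, not yet valid JSON
        calls.set(part.index, call);
      }
    },
    // Call once the stream has ended; only now is the argument JSON complete.
    finish() {
      return [...calls.entries()]
        .sort(([a], [b]) => a - b)
        .map(([, call]) => ({
          id: call.id,
          name: call.name,
          arguments: JSON.parse(call.args || "{}"),
        }));
    },
  };
}
```

Parsing inside add() would throw on most chunks, because a fragment such as {"ci is not valid JSON on its own.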
Finish reasons
The last JSON chunk, the one sent just before [DONE], carries finish_reason:
| Value | Meaning |
|---|---|
| stop | The model emitted an end-of-turn token, or hit a stop sequence. |
| length | The completion hit max_tokens. |
| tool_calls | The model wants to call a tool. Continue the conversation with the tool result. |
| content_filter | An upstream safety filter cut the response. |
stop is the only "clean" termination from a generation standpoint — the others are signals to act.
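One way to act on these values is a small dispatcher. The function name nextAction and the action labels it returns are invented for this sketch; map them to whatever your application does.

```javascript
// Hypothetical dispatcher: turn a finish_reason into the app's next step.
function nextAction(finishReason) {
  switch (finishReason) {
    case "stop":
      return "done"; // clean end of turn
    case "length":
      return "truncated"; // hit max_tokens; consider raising it or continuing
    case "tool_calls":
      return "run_tools"; // execute the tools, then send the results back
    case "content_filter":
      return "filtered"; // response was cut by a safety filter
    default:
      return "incomplete"; // null or unknown: the stream ended without a reason
  }
}
```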
Reconnecting
There is no resume protocol. If the stream drops mid-response (network blip, proxy timeout, runtime idle-out), the partial output is lost. Your options:
- Re-request the full prompt. Simplest. You'll pay for the prompt tokens again.
- Cache by request id. Save chunk.id from the first chunk. On retry, send the prompt again and dedupe by id in your app layer.
For interactive UIs, surface the disconnect to the user rather than auto-retrying silently — the second response may diverge from the first.
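To surface a disconnect you first have to detect one. The sketch below treats a stream that closed without any finish_reason as interrupted. The name createStreamTracker and the fields it returns are assumptions for illustration.

```javascript
// Hypothetical tracker: records progress so the app can tell a finished
// response from one that was cut off.
function createStreamTracker() {
  let id = null;
  let text = "";
  let finishReason = null;

  return {
    // Feed every parsed chunk here as it arrives.
    onChunk(chunk) {
      if (id === null && chunk.id) id = chunk.id; // keep the id from the first chunk
      const choice = chunk.choices[0];
      text += choice?.delta?.content ?? "";
      if (choice?.finish_reason) finishReason = choice.finish_reason;
    },
    // Call when the connection closes, whether cleanly or with an error.
    onClose() {
      return { id, text, finishReason, interrupted: finishReason === null };
    },
  };
}
```

When interrupted is true, the UI can show the partial text with a notice and let the user decide whether to retry.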
Heads up
Aborting a stream mid-response does not cancel the upstream generation in every case. You are billed for the tokens the model actually generated, not the bytes you read.
Common pitfalls
- Buffering proxies. Some serverless platforms hold the response until the connection closes. Test in your actual runtime, not on localhost only.
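- Native fetch without streaming. If you call await response.json(), the client waits for the whole body and then fails to parse it, because an SSE body is not a single JSON document. Use response.body.getReader() or the OpenAI SDK's stream: true mode.
- JSON across chunk boundaries. Each data: line is one complete JSON object, never a partial one. A network read can still end in the middle of a line, so buffer until you see a newline and parse line-by-line, not character-by-character.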
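Line-by-line parsing over a raw reader needs a buffer, because a single read can end in the middle of a line. A minimal sketch follows. createLineBuffer is a hypothetical helper, and it assumes the bytes have already been decoded to text.

```javascript
// Hypothetical buffer: accepts text in arbitrary pieces (as a reader might
// deliver it) and returns only the payloads of complete data: lines.
function createLineBuffer() {
  let pending = "";

  return {
    push(text) {
      pending += text;
      const lines = pending.split("\n");
      pending = lines.pop(); // the last piece may be an unfinished line
      return lines
        .map((line) => line.trim())
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice("data:".length).trim());
    },
  };
}
```

Each returned payload is either a complete JSON string ready for JSON.parse or the literal [DONE]. With response.body.getReader(), decode each read with a TextDecoder using { stream: true } before pushing, so multi-byte characters split across reads are handled.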