Streaming

Set "stream": true on any chat completions call and the response becomes a Server-Sent Events stream.

POST /v1/chat/completions

Wire format

The response uses Content-Type: text/event-stream. Each chunk is a line prefixed with data: followed by either a JSON object or the literal string [DONE]:

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","created":1700000000,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":" there"},"finish_reason":null}]}

data: {"id":"chatcmpl-abc","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Chunks are separated by a blank line. The first chunk usually contains the role. Subsequent chunks contain delta.content fragments to concatenate. The last JSON chunk has an empty delta and carries the final finish_reason. The literal data: [DONE] line that follows it terminates the stream.
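The wire format above can be read with very little code. The following is a minimal sketch, assuming the whole response body is already available as one string; parseSseLine, readStreamBody, and the DONE sentinel are illustrative names, not part of any SDK.

```javascript
// Sentinel returned when the stream signals completion.
const DONE = Symbol("done");

// Parse one line of the event stream.
// Returns DONE for "data: [DONE]", the parsed chunk for any other data line,
// and null for blank separator lines or lines without a data: prefix.
function parseSseLine(line) {
  const trimmed = line.trim();
  if (!trimmed.startsWith("data:")) return null;
  const payload = trimmed.slice("data:".length).trim();
  if (payload === "[DONE]") return DONE;
  return JSON.parse(payload);
}

// Fold a complete response body into the assistant text and finish reason.
function readStreamBody(body) {
  let text = "";
  let finishReason = null;
  for (const line of body.split(/\r?\n/)) {
    const event = parseSseLine(line);
    if (event === null) continue; // blank separator line
    if (event === DONE) break; // nothing meaningful follows the sentinel
    const choice = event.choices?.[0];
    text += choice?.delta?.content ?? "";
    if (choice?.finish_reason) finishReason = choice.finish_reason;
  }
  return { text, finishReason };
}
```

Run against the four sample chunks shown earlier, readStreamBody should yield the text "Hello there" with a finish reason of stop.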

Concatenating

Build the assistant message by appending every delta.content:

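// `stream` is an async iterable of parsed chunks, for example the OpenAI SDK's stream: true result.
let text = "";
for await (const chunk of stream) {
  // Role-only and final chunks carry no content, so fall back to "".
  text += chunk.choices[0]?.delta?.content ?? "";
}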

For tool calls, argument fragments arrive on delta.tool_calls[i].function.arguments, and a single JSON value may be split across many chunks. Concatenate the fragments for each tool call and parse only once the stream ends.
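A sketch of that buffering follows. It assumes each tool_calls entry carries an index field identifying which call a fragment belongs to, and that id and function.name arrive on the first fragment of each call; collectToolCalls is a hypothetical helper name.

```javascript
// Accumulate streamed tool-call fragments, then parse the arguments once at the end.
// Assumption: each delta.tool_calls entry has an `index`, and `id` / `function.name`
// appear on the first fragment for that index.
function collectToolCalls(chunks) {
  const calls = new Map(); // index -> { id, name, args }
  for (const chunk of chunks) {
    const parts = chunk.choices?.[0]?.delta?.tool_calls ?? [];
    for (const part of parts) {
      const call = calls.get(part.index) ?? { id: "", name: "", args: "" };
      if (part.id) call.id = part.id;
      if (part.function?.name) call.name = part.function.name;
      call.args += part.function?.arguments ?? "";
      calls.set(part.index, call);
    }
  }
  // Parse only after every fragment has been appended.
  return [...calls.entries()]
    .sort(([a], [b]) => a - b)
    .map(([, call]) => ({
      id: call.id,
      name: call.name,
      arguments: call.args === "" ? {} : JSON.parse(call.args),
    }));
}
```

Parsing inside the loop would throw on most chunks, because an individual fragment such as {"ci is not valid JSON on its own.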

Finish reasons

The last JSON chunk, the one just before data: [DONE], carries finish_reason:

Value	Meaning
`stop`	The model emitted an end-of-turn token, or hit a stop sequence.
`length`	The completion hit `max_tokens`.
`tool_calls`	The model wants to call a tool. Continue the conversation with the tool result.
`content_filter`	An upstream safety filter cut the response.

stop is the only "clean" termination from a generation standpoint — the others are signals to act.
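One way to make "signals to act" concrete is to map each value to a follow-up step. This is only a sketch: nextStep and the step labels it returns are made up for illustration, and a real application would choose its own actions.

```javascript
// Map a finish_reason to a hypothetical follow-up step for the caller.
function nextStep(finishReason) {
  switch (finishReason) {
    case "stop":
      return "done"; // clean end of turn, nothing more to do
    case "length":
      return "truncated"; // hit max_tokens: warn the user or request a continuation
    case "tool_calls":
      return "run_tools"; // execute the tools, then send the results back
    case "content_filter":
      return "filtered"; // output was cut: show a notice instead of partial text
    default:
      return "unknown"; // null or an unrecognized value, e.g. a stream that never finished
  }
}
```

The default branch matters in practice: a stream that drops before its last JSON chunk never reports a finish reason at all.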

Reconnecting

There is no resume protocol. If the stream drops mid-response (network blip, proxy timeout, runtime idle-out), the rest of that response cannot be recovered. Your options:

Re-request the full prompt. Simplest. You'll pay for the prompt tokens again.
Cache by request id. Save chunk.id from the first chunk. On retry, send the prompt again; the new response carries a different id, so your app layer can tell the two attempts apart and discard chunks from the abandoned one.

For interactive UIs, surface the disconnect to the user rather than auto-retrying silently — the second response may diverge from the first.
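To surface a disconnect, the reading loop has to notice it first. The sketch below treats a thrown error, or a stream that ends without any finish_reason, as an interruption and returns the partial text with a flag instead of retrying. readUntilDone and the shape of its result are assumptions for illustration.

```javascript
// Consume a stream of parsed chunks and report whether it ended cleanly.
// `stream` is any async iterable of chunk objects.
async function readUntilDone(stream) {
  let text = "";
  let finishReason = null;
  let error = null;
  try {
    for await (const chunk of stream) {
      const choice = chunk.choices?.[0];
      text += choice?.delta?.content ?? "";
      if (choice?.finish_reason) finishReason = choice.finish_reason;
    }
  } catch (err) {
    error = err; // network drop, proxy timeout, and similar failures
  }
  // No finish_reason means the last JSON chunk never arrived.
  const interrupted = error !== null || finishReason === null;
  return { text, finishReason, interrupted, error };
}
```

A UI could render the returned text as it is and, when interrupted is true, show a retry prompt so the user decides whether to regenerate.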

Heads up

Aborting a stream mid-response does not cancel the upstream generation in every case. You are billed for the tokens the model actually generated, not the bytes you read.

Common pitfalls

Buffering proxies. Some serverless platforms hold the response until the connection closes, so every chunk arrives at once. Test in your actual runtime, not only on localhost.
Native fetch without streaming. await response.json() waits for the whole body and then fails to parse it, because an event stream is not a single JSON document. Read incrementally with response.body.getReader(), or use the OpenAI SDK's stream: true mode.
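JSON across chunk boundaries. Each data: line is one complete JSON object, never a partial one, but a network read can end in the middle of a line. Buffer incoming bytes until a line break, then parse line by line rather than character by character.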
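When reading with response.body.getReader(), the byte arrays you receive do not line up with data: lines. The following sketch buffers bytes and emits only complete lines; createLineSplitter is a hypothetical helper, and it relies on the standard TextDecoder with its stream option so that multi-byte characters split across reads decode correctly.

```javascript
// Turn arbitrary byte chunks into complete, non-empty lines.
function createLineSplitter() {
  const decoder = new TextDecoder();
  let pending = ""; // text after the last line break seen so far

  return {
    // Feed one Uint8Array from the reader; get back any lines it completed.
    push(bytes) {
      pending += decoder.decode(bytes, { stream: true });
      const lines = pending.split("\n");
      pending = lines.pop(); // the last piece may be an unfinished line
      return lines
        .map((line) => line.replace(/\r$/, ""))
        .filter((line) => line !== "");
    },
    // Call once when the reader reports done, to drain any leftover text.
    flush() {
      pending += decoder.decode();
      const rest = pending.trim();
      pending = "";
      return rest === "" ? [] : [rest];
    },
  };
}
```

In a read loop, each value returned by reader.read() would go through push, and every resulting line that starts with data: can then be handed to JSON.parse (after checking for [DONE]).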