
LLM Streaming Responses

Author: Venkata Sudhakar

When an LLM takes 3-5 seconds to generate a long response, showing a blank screen until it finishes makes your application feel slow and unresponsive - even though the total time is the same. Streaming solves this by sending each token (roughly a word, or part of a word) to the user as it is generated, so the reply appears live on screen piece by piece. This is exactly how ChatGPT and Claude work in the browser. For a customer service chatbot, a product recommendation engine, or a report drafting tool, streaming transforms the user experience from "waiting" to "watching the AI think."

The OpenAI API supports streaming by passing stream=True to the chat completions call. Instead of returning a single response object, it returns an iterator of chunk objects. Each chunk contains a small piece of the response - sometimes a single word, sometimes a few characters. You iterate over the chunks and print or yield each piece as it arrives. The total content is identical to the non-streaming response; only the delivery timing differs. In a web application, you would use Server-Sent Events (SSE) to push each chunk to the browser in real time.

The example below shows a retail customer support chatbot streaming a personalised response about a delayed order - the customer sees words appear immediately rather than staring at a spinner for 4 seconds.
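The original code for this example is not reproduced here, but a minimal sketch that would produce transcripts like the ones below might look as follows. It assumes the openai Python SDK (v1.x) with OPENAI_API_KEY set in the environment; the model name and prompt wording are illustrative.

```python
import time


def print_stream(chunks):
    """Print each streamed piece as it arrives; return total characters printed."""
    total = 0
    start = time.time()
    for chunk in chunks:
        delta = chunk.choices[0].delta.content  # None for role/finish chunks
        if delta:
            if total == 0:
                print(f"({time.time() - start:.1f}s to first word)")
            print(delta, end="", flush=True)
            total += len(delta)
    print(f"\n\nTotal characters streamed: {total}")
    return total


def main():
    from openai import OpenAI  # assumes OPENAI_API_KEY is set in the environment
    client = OpenAI()
    messages = [
        {"role": "system",
         "content": "You are a QuickShop support agent. Be warm, concise and specific."},
        {"role": "user",
         "content": ("Customer Rahul asks about delayed order ORD-58291 "
                     "(Sony WH-1000XM5 Headphones). Apologise, explain the warehouse "
                     "processing issue, promise priority dispatch by 26 March 2025, "
                     "and offer Rs 200 QuickShop credit.")},
    ]

    # WITHOUT streaming: one blocking call; nothing appears until it completes
    print("WITHOUT STREAMING - customer waits for full response:")
    start = time.time()
    response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(f"(waited {time.time() - start:.1f}s before seeing anything)")
    print(response.choices[0].message.content)

    # WITH streaming: stream=True makes the same call return an iterator of chunks
    print("\nWITH STREAMING - customer sees words appear immediately:")
    stream = client.chat.completions.create(
        model="gpt-4o-mini", messages=messages, stream=True
    )
    print_stream(stream)


if __name__ == "__main__":
    main()
```

Note that print_stream works on anything that yields OpenAI-style chunk objects, so the same display logic serves both the command line and a web handler.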


Non-streaming output - the customer waits for the full reply:

WITHOUT STREAMING - customer waits for full response:
(waited 3.8s before seeing anything)
Hi Rahul, we sincerely apologise for the delay with your Sony WH-1000XM5
Headphones order ORD-58291. We experienced a processing issue at our warehouse
that has affected a small number of orders this week. Your order is now on
priority dispatch and will arrive by 26 March 2025. As a gesture of goodwill,
we are adding Rs 200 QuickShop credit to your account.

Streaming output - words appear one by one as they are generated:

WITH STREAMING - customer sees words appear immediately:
(0.0s to first word)
Hi Rahul, we sincerely apologise for the delay with your Sony WH-1000XM5
Headphones order ORD-58291. We experienced a processing issue at our warehouse
that has affected a small number of orders this week. Your order is now on
priority dispatch and will arrive by 26 March 2025. As a gesture of goodwill,
we are adding Rs 200 QuickShop credit to your account.

Total characters streamed: 312

# Total time is the same 3.8s - but customer sees "Hi Rahul, we..." after 0.3s
# instead of a blank screen for 3.8s followed by the full text appearing at once
# For long responses (reports, summaries), streaming saves even more perceived time

For web applications, wrap the streaming call in a FastAPI endpoint using StreamingResponse, and on the frontend use the EventSource API or fetch with a readable stream to receive and display each chunk. LangChain LCEL chains also stream natively - just call chain.stream(input) instead of chain.invoke(input) and iterate the same way. Use streaming for any response longer than two sentences, for any chatbot-style interface, and whenever users need to feel like the system is actively working. For very short responses like yes/no answers or single numbers, streaming adds no meaningful benefit.
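For the server side, the core task is framing each chunk as a Server-Sent Events message. The sketch below shows that framing in plain Python; the endpoint path and helper names are illustrative, and the FastAPI wiring (which assumes FastAPI and the openai SDK are installed) is shown in comments so the framing logic stays self-contained.

```python
def sse_event(token: str) -> str:
    """Frame one token as an SSE message: 'data: <token>' followed by a blank line."""
    # SSE data fields may not contain raw newlines; multi-line tokens are
    # split into consecutive 'data:' lines of the same event.
    lines = token.split("\n")
    return "".join(f"data: {line}\n" for line in lines) + "\n"


def token_stream(chunks):
    """Turn OpenAI-style stream chunks into SSE-framed strings."""
    for chunk in chunks:
        delta = chunk.choices[0].delta.content  # None for role/finish chunks
        if delta:
            yield sse_event(delta)


# In a FastAPI app this generator would be wrapped roughly like so
# (hypothetical endpoint path):
#
#   from fastapi import FastAPI
#   from fastapi.responses import StreamingResponse
#   from openai import OpenAI
#
#   app = FastAPI()
#   client = OpenAI()
#
#   @app.get("/chat")
#   def chat(q: str):
#       stream = client.chat.completions.create(
#           model="gpt-4o-mini",
#           messages=[{"role": "user", "content": q}],
#           stream=True,
#       )
#       return StreamingResponse(token_stream(stream),
#                                media_type="text/event-stream")
#
# and the browser would consume it with:
#
#   new EventSource("/chat?q=...").onmessage = (e) => append(e.data);
```

The same token_stream generator can wrap a LangChain chain.stream(input) iterator instead, since both yield pieces of text as they arrive.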
