Streaming LLM Tokens to the Browser

Spinners lie to users. They hide progress.

LLMs take 15 to 40 seconds to write a report. Waiting for the full text feels slow. Streaming tokens makes the text appear live. It feels like ChatGPT.

EventSource is a common tool for this. It has limits. It only supports GET requests. You need POST requests for LLM prompts.

Use fetch and a stream reader instead. This gives you POST requests with a body, plus an AbortSignal for cancelling.
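
A minimal sketch of the fetch-and-reader approach, not a definitive implementation. The url parameter, the { prompt } body shape, and the names streamText and readChunks are assumptions for illustration.

```typescript
// Reads a byte stream and yields decoded text as it arrives.
async function* readChunks(
  body: ReadableStream<Uint8Array>,
): AsyncGenerator<string> {
  const reader = body.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunk boundaries.
      const text = decoder.decode(value, { stream: true });
      if (text) yield text;
    }
    const tail = decoder.decode(); // flush any bytes still held by the decoder
    if (tail) yield tail;
  } finally {
    reader.releaseLock();
  }
}

// POSTs the prompt and streams the response text.
// The endpoint and the { prompt } payload are hypothetical.
async function* streamText(
  url: string,
  prompt: string,
  signal?: AbortSignal,
): AsyncGenerator<string> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ prompt }),
    signal, // aborting this signal cancels the request
  });
  if (!res.ok || !res.body) {
    throw new Error(`Request failed with status ${res.status}`);
  }
  yield* readChunks(res.body);
}
```

Wire a stop button to an AbortController and pass controller.signal as the third argument. Calling controller.abort() ends the request.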

Three rules for production:

  1. Stop the model. Pass the AbortSignal from the browser to the model. When a user hits stop, the GPU stops. Do not pay for tokens nobody reads.

  2. Fix the headers. Proxies often buffer your data. This makes tokens arrive in one big clump. Set Cache-Control: no-transform and X-Accel-Buffering: no. This keeps the stream smooth.

  3. Handle timeouts. LLM responses are long. Set maxDuration (a serverless function setting on platforms such as Vercel) so the function is not killed mid-stream.
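
The first two rules can be sketched on the server side like this. It is a rough outline under assumptions: TokenSource is a hypothetical stand-in for your model client, and streamModel and sseHeaders are illustrative names. In a route handler you would pass the incoming request's signal. The third rule is platform config; in a Next.js route on Vercel that would be a line such as "export const maxDuration = 60".

```typescript
// Hypothetical model interface: yields tokens and is expected to honor the signal.
type TokenSource = (prompt: string, signal: AbortSignal) => AsyncIterable<string>;

// Headers that ask proxies not to buffer or rewrite the stream.
function sseHeaders(): Record<string, string> {
  return {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache, no-transform",
    "X-Accel-Buffering": "no", // read by nginx; other proxies may ignore it
  };
}

// Wraps the model output in a streaming SSE response.
function streamModel(
  model: TokenSource,
  prompt: string,
  signal: AbortSignal,
): Response {
  const encoder = new TextEncoder();
  const iterator = model(prompt, signal)[Symbol.asyncIterator]();

  const body = new ReadableStream<Uint8Array>({
    async pull(controller) {
      // The client hit stop: end the model iterator and close the stream.
      if (signal.aborted) {
        await iterator.return?.();
        controller.close();
        return;
      }
      const { done, value } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      // One SSE event per token; JSON keeps newlines inside tokens safe.
      controller.enqueue(encoder.encode(`data: ${JSON.stringify(value)}\n\n`));
    },
    async cancel() {
      // The consumer went away: stop pulling tokens from the model.
      await iterator.return?.();
    },
  });

  return new Response(body, { headers: sseHeaders() });
}
```

The signal is handed to the model as well as checked in the stream, because only the model client can actually stop generation upstream.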

Watch for chunk boundaries. A read can end anywhere, so a data line often splits across two reads. If you parse a half-line, the parse fails and your app can crash. Use a buffer. Hold the last fragment until the next read.
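
One way to hold the last fragment, as a small sketch. The names createLineBuffer and dataPayloads are made up for this example, and it only handles "data:" lines, not the full SSE format.

```typescript
// Splits incoming text into complete lines and keeps the unfinished tail.
function createLineBuffer() {
  let pending = ""; // fragment carried over to the next read
  return {
    push(chunk: string): string[] {
      const parts = (pending + chunk).split("\n");
      // The last element is either "" or a half-line; hold it back.
      pending = parts.pop() ?? "";
      // Tolerate CRLF line endings.
      return parts.map((line) => (line.endsWith("\r") ? line.slice(0, -1) : line));
    },
  };
}

// Extracts the payload of each "data:" line and skips everything else.
function dataPayloads(lines: string[]): string[] {
  const payloads: string[] = [];
  for (const line of lines) {
    if (!line.startsWith("data:")) continue; // blank separators, comments, other fields
    // Drop the field name and the single optional space after the colon.
    payloads.push(line.slice(5).replace(/^ /, ""));
  }
  return payloads;
}
```

Feed every decoded chunk through push and parse only what it returns. A half-line never reaches the JSON parser.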

Keep the UI fast. Rapid token updates cause too many re-renders. Append tokens to a ref. Flush them to state in batches.
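
A framework-agnostic sketch of the batching idea. In React, pending would live in a useRef and flush would call a state setter. The name createTokenBatcher and the 50 ms default interval are assumptions, not values from the article.

```typescript
// Collects tokens and hands them to flush in batches.
function createTokenBatcher(
  flush: (text: string) => void,
  // How to defer the flush; in a browser requestAnimationFrame is another option.
  schedule: (run: () => void) => void = (run) => {
    setTimeout(run, 50);
  },
) {
  let pending = ""; // plays the role of the ref
  let scheduled = false;

  return {
    push(token: string): void {
      pending += token;
      if (scheduled) return; // a flush is already queued
      scheduled = true;
      schedule(() => {
        scheduled = false;
        const text = pending;
        pending = "";
        if (text) flush(text); // one state update for the whole batch
      });
    },
  };
}
```

Many tokens arriving within one interval cause a single flush, so the UI re-renders once per batch, not once per token.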

Source: https://dev.to/pavelespitia/streaming-llm-tokens-to-the-browser-the-production-sse-setup-knh

Optional learning community: https://t.me/GyaanSetuAi