𝗦𝘁𝗿𝗲𝗮𝗺𝗶𝗻𝗴 𝗟𝗟𝗠 𝗧𝗼𝗸𝗲𝗻𝘀 𝘁𝗼 𝘁𝗵𝗲 𝗕𝗿𝗼𝘄𝘀𝗲𝗿

📅 1 week ago · ⏱ 1 min read

Spinners lie to users. They hide progress.

An LLM can take 15 to 40 seconds to write a report. Waiting for the full text feels slow. Streaming tokens makes the text appear as it is generated. It feels like ChatGPT.

EventSource is the common tool for this. It has limits. It only supports GET requests, and it cannot send custom headers. LLM prompts need a POST body.

Use fetch and a reader instead. This gives you:

POST request support.
Custom auth headers.
Full control over cancellation.
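
A minimal sketch of the fetch-and-reader approach, not a definitive implementation. The URL, the `{ prompt }` body shape, the Bearer token, and the `fetchImpl` parameter are assumptions made for illustration; swap in whatever your API expects.

```typescript
// Hypothetical client helper: POSTs a prompt and forwards decoded text
// to onText as each chunk arrives.
async function streamCompletion(
  url: string,
  prompt: string,
  authToken: string,
  onText: (text: string) => void,
  signal?: AbortSignal,
  fetchImpl: typeof fetch = fetch, // injectable so the helper can be tested
): Promise<void> {
  const response = await fetchImpl(url, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${authToken}`,
    },
    body: JSON.stringify({ prompt }),
    signal, // aborting this signal cancels the request and the read loop
  });
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // stream: true keeps multi-byte characters intact across chunks.
      onText(decoder.decode(value, { stream: true }));
    }
    const tail = decoder.decode(); // flush any bytes still held by the decoder
    if (tail) onText(tail);
  } finally {
    reader.releaseLock();
  }
}
```

To cancel, create an `AbortController`, pass `controller.signal` as the `signal` argument, and call `controller.abort()` from the stop button.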

Three rules for production:

Stop the model. When a user hits stop, the browser aborts the request. Forward that AbortSignal to the model call on the server, so generation stops too. Do not pay for tokens nobody reads.
Fix the headers. Proxies often buffer your data. This makes tokens arrive in one big clump. Set Cache-Control to no-transform and X-Accel-Buffering to no (nginx honors the second one). This keeps the stream smooth.
Handle timeouts. LLM responses are long. On serverless platforms such as Vercel, raise maxDuration so the function is not killed mid-stream.
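
The first two rules can be sketched on the server with standard Web APIs. This is an illustration under assumptions: `sseResponse` is a hypothetical helper, the `tokens` iterable stands in for a model SDK stream, and the `{ token }` payload and `[DONE]` sentinel are conventions chosen for the example. In a real route handler you would pass the incoming request's `signal`, and, if your SDK accepts one, hand the same signal to the model call.

```typescript
// In a Next.js route file you would also add: export const maxDuration = 60;
// (platform-specific config, shown here only as a comment).

const SSE_HEADERS = {
  "Content-Type": "text/event-stream",
  // no-transform asks intermediaries not to rewrite or compress the body.
  "Cache-Control": "no-cache, no-transform",
  // Tells nginx not to buffer this response.
  "X-Accel-Buffering": "no",
};

// Hypothetical helper: wraps a token source in an SSE response and stops
// pulling from the source once the client's signal is aborted.
function sseResponse(tokens: AsyncIterable<string>, signal: AbortSignal): Response {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();

  const body = new ReadableStream<Uint8Array>({
    async pull(controller) {
      if (signal.aborted) {
        await iterator.return?.(); // let the source clean up
        controller.close();
        return;
      }
      const { done, value } = await iterator.next();
      if (done) {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
        return;
      }
      const payload = JSON.stringify({ token: value });
      controller.enqueue(encoder.encode(`data: ${payload}\n\n`));
    },
    async cancel() {
      // Runs when the consumer cancels the stream (for example on disconnect).
      await iterator.return?.();
    },
  });

  return new Response(body, { headers: SSE_HEADERS });
}
```

Because the stream is pull-based, nothing is requested from the token source after the abort is observed, which is what keeps an abandoned request from consuming more tokens.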

Watch for split chunks. Network reads do not line up with message boundaries, so a data line often splits across two reads. If you parse a half-line as JSON, the parse throws and your stream handler crashes. Use a buffer. Hold the last fragment until the next read.
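
One way to sketch that buffer, assuming the server sends `data:` lines separated by newlines. `createSseLineBuffer` is a hypothetical name, and the sketch ignores other SSE fields such as `event:` and `id:`.

```typescript
// Hypothetical helper: returns a push function that accepts decoded text
// chunks and yields only the payloads of complete "data:" lines.
function createSseLineBuffer(): (chunk: string) => string[] {
  let pending = ""; // trailing fragment that has not seen its newline yet

  return function push(chunk: string): string[] {
    pending += chunk;
    const lines = pending.split("\n");
    // The last element is either "" (chunk ended on a newline) or a
    // partial line. Either way, hold it back for the next read.
    pending = lines.pop() ?? "";

    return lines
      .map((line) => line.replace(/\r$/, "")) // tolerate CRLF line endings
      .filter((line) => line.startsWith("data:"))
      .map((line) => line.slice(5).replace(/^ /, "")); // drop one leading space
  };
}
```

Only the returned payloads are safe to hand to `JSON.parse`; the held fragment is completed by a later chunk.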

Keep the UI fast. A state update per token causes too many re-renders. Append tokens to a ref. Flush them to state in batches, for example once per animation frame.
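
The batching idea can be sketched without any framework. `createTokenBatcher` is a hypothetical helper: its internal `buffered` string plays the role of the ref, and `flush` is where a React component would call its state setter. The 50 ms default delay is an arbitrary assumption; in a browser you could pass `requestAnimationFrame` as the scheduler instead.

```typescript
// Hypothetical helper: collects tokens and delivers them to flush in
// batches, scheduling at most one flush at a time.
function createTokenBatcher(
  flush: (text: string) => void,
  schedule: (run: () => void) => void = (run) => {
    setTimeout(run, 50); // assumed default interval
  },
): (token: string) => void {
  let buffered = ""; // tokens received since the last flush
  let scheduled = false; // true while a flush is already queued

  return function push(token: string): void {
    buffered += token;
    if (scheduled) return; // a pending flush will pick this token up
    scheduled = true;
    schedule(() => {
      scheduled = false;
      const text = buffered;
      buffered = "";
      flush(text);
    });
  };
}
```

With this shape, a burst of tokens between two flushes produces a single update instead of one per token.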

Source: https://dev.to/pavelespitia/streaming-llm-tokens-to-the-browser-the-production-sse-setup-knh

