๐ฆ๐๐ฟ๐๐ฐ๐๐๐ฟ๐ฒ๐ฑ ๐ข๐๐๐ฝ๐๐ ๐ณ๐ฟ๐ผ๐บ ๐๐๐ ๐
LLMs generate tokens, not data structures.
You ask for JSON. The model gives you valid JSON. Then, it adds a conversational sentence at the end. Your parser fails. Your pipeline crashes. Your system breaks at 2 AM.
To build production systems, you must enforce schemas at the token level.
There are three ways to do this:
Prompt-only JSON You tell the model to output JSON. This works about 85% to 95% of the time. It fails because the prompt is just a suggestion. It does not stop the model from adding extra text or missing braces. Use this only for prototyping.
API-level JSON mode and Function Calling Providers like OpenAI, Anthropic, and Gemini use this. They validate tokens during generation. This ensures the output matches your schema. It is the standard for most production apps. It has very low latency.
Grammar-constrained decoding This is for local or self-hosted models. Tools like Outlines or llama.cpp modify the probability of every token. If a token violates your schema, the system masks it out. The model cannot pick an invalid character. This is the most reliable method.
When to use each:
โข Prompt-only: Quick scripts and testing. โข API-level: Production apps using cloud models. โข Grammar-constrained: Self-hosted models and sensitive data.
Key takeaways:
- Token masking is better than resampling. Masking prevents errors before they happen.
- Grammar compilation adds latency. Cache your schemas to save time.
- Avoid constraints for creative writing. Constraints reduce diversity in text.
If you need 99.9% reliability, do not rely on prompts alone. Move your enforcement to the token level.
Optional learning community: https://t.me/GyaanSetuAi