Production AI API Failures
Error messages rarely tell the whole story when your AI feature breaks at 2 AM. After a year of running OpenAI and Anthropic integrations in production, I have learned to group failures by what they mean for debugging.
Handling Rate Limits
OpenAI returns 429 for several different causes. Check the error code in the response body before you decide how to react.
- Requests-per-minute (RPM) limits recover in seconds.
- Tokens-per-minute (TPM) limits typically recover within about 60 seconds, once the per-minute window rolls over.
- Monthly quota exhaustion stays broken until you add credits or the billing cycle resets.
Do not use exponential backoff for quota exhaustion. No amount of waiting makes the retry succeed, so fail fast and alert instead.
Anthropic 529 errors mean the provider is overloaded. Treat this like a 503 error. The problem is on their side. Back off and alert your team.
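Here is a minimal sketch of that decision logic, not a definitive implementation. The names `RetryDecision` and `decide_on_limit_error` are hypothetical, and the error code string `"insufficient_quota"` is an assumption you should verify against the error bodies you actually receive. `retry_after` stands for a parsed `Retry-After` header value, if the response carried one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryDecision:
    retry: bool           # should the caller try again at all?
    delay_seconds: float  # how long to wait before the next attempt
    alert: bool           # should a human be notified?
    reason: str


def decide_on_limit_error(
    status: int,
    error_code: Optional[str],
    attempt: int,
    retry_after: Optional[float] = None,
    max_delay: float = 60.0,
) -> RetryDecision:
    """Map a 429 or 529 response to a retry decision.

    `attempt` is zero-based. `error_code` is the code string parsed from the
    error body; the value "insufficient_quota" is an assumption to verify.
    """
    # Capped exponential backoff: 1s, 2s, 4s, ... up to max_delay.
    backoff = min(max_delay, 2.0 ** min(attempt, 30))

    if status == 429 and error_code == "insufficient_quota":
        # Quota exhaustion: waiting does not help, so fail fast and alert.
        return RetryDecision(False, 0.0, True, "quota exhausted")

    if status == 429:
        # RPM or TPM limit: prefer the server's hint, else use the backoff.
        delay = retry_after if retry_after is not None else backoff
        return RetryDecision(True, delay, False, "rate limited")

    if status == 529:
        # Provider overloaded: back off and tell the team.
        return RetryDecision(True, backoff, True, "provider overloaded")

    return RetryDecision(False, 0.0, False, "not a limit or overload error")
```

The delays are deterministic to keep the sketch easy to test. With many clients retrying at once, you would probably add random jitter to the delay.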
Handling 400 Errors
These failures are usually your fault. Watch for these three patterns:
- Model version mismatches. You updated the model name in one place but not in another, such as your retry handler.
- Context window overflow. The conversation history outgrew the model's context limit, usually because the truncation logic is missing or wrong.
- Schema validation failures. Your JSON structure has unsupported types or recursive references.
To fix these, log the full request payload for every 400 error, and redact user data first. The response body usually names the field or parameter that failed.
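Redacted logging can look roughly like the sketch below. The key names in `SENSITIVE_KEYS` are assumptions about where user text lives in a chat-style payload, and `redact_payload` and `log_bad_request` are hypothetical helpers, so adjust both to your own request shape. Masked strings keep their length, which helps when you suspect a context window overflow.

```python
import json
import logging
from typing import Any

logger = logging.getLogger("ai_client")

# Assumed keys that may hold user text; extend this for your own payloads.
SENSITIVE_KEYS = {"content", "prompt", "input", "system"}


def _mask(value: Any) -> str:
    """Replace a sensitive value with a placeholder that keeps its size."""
    if isinstance(value, str):
        return f"[REDACTED str len={len(value)}]"
    return f"[REDACTED {type(value).__name__}]"


def redact_payload(value: Any) -> Any:
    """Return a copy of the payload with sensitive values masked."""
    if isinstance(value, dict):
        return {
            key: _mask(item) if key in SENSITIVE_KEYS else redact_payload(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [redact_payload(item) for item in value]
    return value


def log_bad_request(payload: dict, response_body: str) -> str:
    """Log the redacted request next to the provider's error body."""
    record = json.dumps(
        {"request": redact_payload(payload), "response_body": response_body},
        sort_keys=True,
    )
    logger.error("400 from provider: %s", record)
    return record
```

The model name, token limits, and message roles survive redaction, and those are usually the fields a 400 response complains about.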
Handling Timeouts
Timeouts are hard to track because the provider sees nothing wrong.
- Connect timeout. The handshake failed. This happens during provider brownouts or DNS issues. Check your outbound network.
- Read timeout. The model started but did not finish. Your app must handle partial streaming outputs.
- Gateway timeout (504). Your proxy timed out first. The request might still be running at the provider. Deduplicate before you retry so the same request does not run twice.
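The deduplication step for the 504 case can be sketched as a small in-process check. This is an illustration under assumptions: `request_fingerprint` and `InFlightRegistry` are hypothetical names, the registry lives in one process's memory, and the TTL is a guess at how long a request might still be running upstream. A multi-instance service would need a shared store instead.

```python
import hashlib
import json
import time
from typing import Callable, Dict


def request_fingerprint(payload: dict) -> str:
    """Hash the payload so identical requests map to the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class InFlightRegistry:
    """Remembers recently sent requests for `ttl_seconds`."""

    def __init__(
        self,
        ttl_seconds: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}

    def try_claim(self, key: str) -> bool:
        """Return True if the caller may send, False if it is a duplicate."""
        now = self._clock()
        # Forget entries older than the TTL.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._ttl}
        if key in self._seen:
            return False
        self._seen[key] = now
        return True
```

Before retrying after a 504, call `try_claim(request_fingerprint(payload))`. If it returns False, an identical request went out recently and may still complete, so wait rather than send it again.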
To debug, separate your connect timeout from your read timeout. Log the time-to-first-token to find where the latency sits.
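Time-to-first-token can be measured with a thin wrapper around whatever yields your streamed chunks. The sketch below assumes the stream is a plain Python iterable of strings, and `StreamTiming` and `timed_stream` are hypothetical names. For the connect and read split itself, check your HTTP client: the `requests` library, for example, accepts `timeout=(connect, read)` as a tuple.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTiming:
    started_at: float                       # set just before the request is sent
    first_chunk_at: Optional[float] = None  # set when the first chunk arrives
    finished_at: Optional[float] = None     # stays None if the stream breaks
    chunks: int = 0

    @property
    def time_to_first_token(self) -> Optional[float]:
        if self.first_chunk_at is None:
            return None
        return self.first_chunk_at - self.started_at

    @property
    def total_time(self) -> Optional[float]:
        if self.finished_at is None:
            return None
        return self.finished_at - self.started_at


def timed_stream(
    chunks: Iterable[str],
    timing: StreamTiming,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Yield chunks unchanged while recording when they arrive."""
    for chunk in chunks:
        if timing.first_chunk_at is None:
            timing.first_chunk_at = clock()
        timing.chunks += 1
        yield chunk
    # Only reached when the stream ends cleanly.
    timing.finished_at = clock()
```

If a read timeout interrupts the stream, `finished_at` stays `None` while `chunks` and `time_to_first_token` still describe the partial output. A long time to first token points at queueing or connection latency. A short one followed by a stall points at generation.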
Handling Provider Issues
- A 500 error often resolves with one retry after two seconds.
- A 503 error means the service is degraded. If the provider status page shows an incident, open a circuit breaker and stop sending requests until it clears.
- Always record which model version failed. Different models have different reliability levels.
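A circuit breaker for the 503 case can be as small as the sketch below. It is a simplified version under assumptions: `CircuitBreaker` is a hypothetical class, the threshold and cooldown values are placeholders, and the recovery behavior is cruder than a full half-open state.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops traffic after repeated failures, then retries after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 3,
        cooldown_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True  # circuit closed: traffic flows normally
        # Circuit open: allow traffic again only once the cooldown has passed.
        return self._clock() - self._opened_at >= self._cooldown

    def record_success(self) -> None:
        # Any success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self._threshold:
            # Open the circuit, or reopen it and restart the cooldown.
            self._opened_at = self._clock()
```

Keeping one breaker per model name, for example in a dict keyed by model, ties in with recording which model version failed: a flaky model can be cut off without blocking the others.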
Stop jumping from logs to Slack. Check the provider status page first. It saves you 20 minutes of panic.
