chore: Test and Document Provider Error Tracking System
Fixes #776
1 parent f3c3875 commit 2bb5f62

3 files changed

Lines changed: 1216 additions & 0 deletions

Lines changed: 215 additions & 0 deletions
@@ -0,0 +1,215 @@
# Provider Error Tracking

## Overview

Conduit automatically tracks provider API errors and can disable API keys that consistently fail. This system helps maintain service reliability by detecting and isolating problematic credentials before they impact your users.

## How It Works

When API requests fail, Conduit:

1. Classifies the error type based on HTTP status code
2. Stores the error in Redis for tracking
3. Evaluates whether the key should be disabled
4. Publishes events for real-time dashboard updates
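
The four steps above can be pictured as a single handler. The sketch below is illustrative only: the `Tracker` class, its in-memory lists (standing in for Redis and the event publisher), and the simplified disable rule are assumptions made for this example, not Conduit's actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Tracker:
    """In-memory stand-in for the Redis store and event publisher (hypothetical)."""
    error_log: List[Tuple[str, int, str]] = field(default_factory=list)
    events: List[str] = field(default_factory=list)
    disabled_keys: Set[str] = field(default_factory=set)

    def handle_provider_error(self, key_id: str, status: int) -> str:
        # 1. Classify the error from the HTTP status code.
        if status in (401, 402, 403):
            kind = "fatal"
        elif status in (404, 429, 503):
            kind = "warning"
        else:
            kind = "transient"
        # 2. Store the error for tracking.
        self.error_log.append((key_id, status, kind))
        # 3. Evaluate whether the key should be disabled.
        #    Simplified: only the "immediate on 401" rule is modelled here.
        if status == 401:
            self.disabled_keys.add(key_id)
        # 4. Publish an event for dashboard updates.
        self.events.append(f"{kind}:{key_id}:{status}")
        return kind
```

Keeping the four steps in one place makes the ordering explicit: an error is always recorded before the disable decision is taken, and the event is published last so that it reflects the final state.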

## Error Types

### Fatal Errors (Auto-Disable Keys)

These errors indicate fundamental issues that require intervention:

| Error Type | HTTP Status | Description | Disable Policy |
|------------|-------------|-------------|----------------|
| Invalid API Key | 401 | API key is invalid, revoked, or malformed | **Immediate** - disabled on first occurrence |
| Insufficient Balance | 402 | Account has no credits or its quota is exhausted | 2 occurrences within 5 minutes |
| Access Forbidden | 403 | Account lacks permission (not balance-related) | 3 occurrences within 10 minutes |
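
One way to read the disable policies in this table is as a sliding-window counter per key. The following is a minimal sketch under that reading; `DisableEvaluator` and `should_disable` are hypothetical names, and counting occurrences separately per key and status code is an assumption, since the table does not say how occurrences are grouped.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Tuple

# (required occurrences, window in seconds), taken from the table above.
DISABLE_POLICY: Dict[int, Tuple[int, int]] = {
    401: (1, 0),        # immediate: disabled on first occurrence
    402: (2, 5 * 60),   # 2 occurrences within 5 minutes
    403: (3, 10 * 60),  # 3 occurrences within 10 minutes
}


class DisableEvaluator:
    """Hypothetical helper: decides whether a fatal error should disable a key."""

    def __init__(self) -> None:
        # Timestamps of recent fatal errors, per (key, status) pair (assumption).
        self._seen: Dict[Tuple[str, int], Deque[float]] = defaultdict(deque)

    def should_disable(self, key_id: str, status: int, now: float) -> bool:
        policy = DISABLE_POLICY.get(status)
        if policy is None:
            return False  # non-fatal status codes never disable a key
        required, window = policy
        times = self._seen[(key_id, status)]
        times.append(now)
        # Drop occurrences that have fallen out of the sliding window.
        while times and now - times[0] > window:
            times.popleft()
        return len(times) >= required
```

For example, two 402 responses 100 seconds apart would disable the key, while two 402 responses 400 seconds apart would not, because the first has already left the 5-minute window.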

### Warning Errors (Tracked, No Auto-Disable)

These errors are typically transient and don't disable keys:

| Error Type | HTTP Status | Description | Alert Threshold |
|------------|-------------|-------------|-----------------|
| Rate Limit Exceeded | 429 | Too many requests to provider | 10 in 5 minutes |
| Model Not Found | 404 | Requested model doesn't exist | Not tracked |
| Service Unavailable | 503 | Provider experiencing issues | 5 in 10 minutes |

### Transient Errors (Minimal Tracking)

Network errors, timeouts, and unclassified errors are tracked minimally because they are usually temporary:
- Network connectivity issues
- Request timeouts
- Unknown/unclassified errors
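
The three categories can be summarised as a lookup from HTTP status code to error type and severity. This is a sketch, not Conduit's classifier: the `Severity` enum, the `classify` function, and the "Network Error" and "Unknown" labels are assumptions introduced for illustration, and `status=None` is used here to model a failure that produced no HTTP response.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class Severity(Enum):
    FATAL = "fatal"
    WARNING = "warning"
    TRANSIENT = "transient"


# Status-to-type mapping taken from the tables in this section.
_BY_STATUS: Dict[int, Tuple[str, Severity]] = {
    401: ("Invalid API Key", Severity.FATAL),
    402: ("Insufficient Balance", Severity.FATAL),
    403: ("Access Forbidden", Severity.FATAL),
    404: ("Model Not Found", Severity.WARNING),
    429: ("Rate Limit Exceeded", Severity.WARNING),
    503: ("Service Unavailable", Severity.WARNING),
}


def classify(status: Optional[int]) -> Tuple[str, Severity]:
    """Return (error type, severity) for a failed request."""
    if status is None:
        # No HTTP response at all: network failure or timeout (label is illustrative).
        return ("Network Error", Severity.TRANSIENT)
    # Any status not listed in the tables falls back to unclassified/transient.
    return _BY_STATUS.get(status, ("Unknown", Severity.TRANSIENT))
```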

## Managing Provider Errors

### Viewing Error Dashboard

Access provider errors through the Admin Panel:

1. Navigate to **Providers** in the sidebar
2. Look for error indicators on provider cards
3. Click a provider to see detailed error information

**Dashboard shows:**
- Total errors in the last 24 hours
- Fatal vs. warning error breakdown
- Number of disabled keys
- Errors grouped by provider

### Viewing Recent Errors

The recent errors view shows:
- Which key credential caused the error
- Error type and HTTP status code
- Error message from the provider
- Timestamp of occurrence
- Whether it was a fatal or warning error

### Managing Disabled Keys

When a key is disabled:

1. **Identify the Issue**
   - Check the error type (Invalid Key, Insufficient Balance, etc.)
   - Review the error message from the provider
   - Verify the key in your provider's dashboard

2. **Resolve the Problem**
   - For **Invalid API Key**: Generate a new key or check for typos
   - For **Insufficient Balance**: Add credits to your provider account
   - For **Access Forbidden**: Check API permissions and access level

3. **Re-enable the Key**
   - Navigate to the disabled key
   - Click **Clear Errors & Re-enable**
   - Confirm the action

### Manually Disabling Keys

You can manually disable a key for maintenance:

1. Select the provider key
2. Click **Disable Key**
3. Provide a reason (for audit purposes)
4. The key will stop receiving traffic immediately

## Error Retention

- **Fatal errors**: Persisted until manually cleared
- **Warnings**: Retained for 30 days (last 100 per key)
- **Recent error feed**: Last 1,000 errors across all providers
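
The retention rules above can be modelled with bounded queues. The sketch below is an in-memory illustration only (Conduit stores errors in Redis); the `ErrorStore` class and its method names are hypothetical, and the tuple layouts are chosen purely for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

WARNING_TTL_SECONDS = 30 * 24 * 60 * 60  # warnings are retained for 30 days
WARNINGS_PER_KEY = 100                   # last 100 warnings per key
RECENT_FEED_SIZE = 1_000                 # last 1,000 errors across all providers


class ErrorStore:
    """Hypothetical in-memory model of the retention rules."""

    def __init__(self) -> None:
        self.fatal: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
        self.warnings: Dict[str, Deque[Tuple[float, str]]] = defaultdict(
            lambda: deque(maxlen=WARNINGS_PER_KEY)
        )
        self.recent: Deque[Tuple[float, str, str]] = deque(maxlen=RECENT_FEED_SIZE)

    def record(self, key_id: str, message: str, fatal: bool, now: float) -> None:
        if fatal:
            self.fatal[key_id].append((now, message))     # kept until cleared
        else:
            self.warnings[key_id].append((now, message))  # oldest drops past 100
        self.recent.append((now, key_id, message))        # oldest drops past 1,000

    def prune(self, now: float) -> None:
        """Drop warnings older than 30 days."""
        for entries in self.warnings.values():
            while entries and now - entries[0][0] > WARNING_TTL_SECONDS:
                entries.popleft()

    def clear(self, key_id: str) -> None:
        """Manual clear: in this model, the only way fatal errors are removed."""
        self.fatal.pop(key_id, None)
```

A `deque` with `maxlen` discards its oldest entry automatically when a new one is appended, which matches the "last N" wording of the warning and recent-feed limits.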

## Provider-Level Disabling

When all keys for a provider are disabled:
- The entire provider is marked as unavailable
- Requests fail over to other providers (if configured)
- The provider shows a "Disabled" status in the dashboard
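
The rule above amounts to "a provider is available while at least one of its keys is enabled". A minimal sketch of that rule and of failover selection follows; the `Provider` dataclass and `pick_provider` function are hypothetical, and treating a provider with no keys at all as unavailable is an assumption of this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Provider:
    name: str
    # Maps key ID -> True if the key is enabled.
    keys: Dict[str, bool] = field(default_factory=dict)

    @property
    def available(self) -> bool:
        # Unavailable once every key is disabled (or when there are no keys).
        return any(self.keys.values())


def pick_provider(ordered: List[Provider]) -> Optional[Provider]:
    """Return the first available provider in priority order, or None."""
    for provider in ordered:
        if provider.available:
            return provider
    return None
```

With a primary provider whose keys are all disabled and a backup with one enabled key, `pick_provider([primary, backup])` returns the backup; re-enabling any primary key makes the primary the first choice again.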

## Best Practices

### Monitoring

1. **Check the dashboard regularly** - Review error trends daily
2. **Set up alerts** - Use webhook integrations for error notifications
3. **Watch for patterns** - Sudden spikes may indicate provider issues

### Multiple Keys

1. **Use multiple API keys** - Distribute load and provide redundancy
2. **Different accounts** - Separate keys from different billing accounts
3. **Primary/Secondary** - Configure primary key with backup alternatives

### Error Prevention

1. **Monitor provider balance** - Keep accounts funded
2. **Rotate keys periodically** - Update credentials before they expire
3. **Test new keys** - Verify keys work before deploying to production

## Troubleshooting

### Key Won't Re-enable

**Symptoms:** Clicking "Clear Errors & Re-enable" doesn't work

**Solutions:**
- Ensure you've actually fixed the underlying issue
- Check that the confirmation checkbox is selected
- Verify you have admin permissions
- Check the browser console for errors

### Errors Not Appearing

**Symptoms:** Errors occur but don't show in the dashboard

**Solutions:**
- Verify Redis is connected and healthy
- Check that the Admin API is running
- Ensure the error tracking service is enabled
- Review Admin API logs for errors
151+
152+
### False Positive Disables
153+
154+
**Symptoms:** Keys disabled but actually valid
155+
156+
**Solutions:**
157+
- Check if provider had temporary outage
158+
- Review error timestamps for clustering
159+
- Consider adjusting thresholds if needed
160+
- Report patterns to Conduit team

### High Warning Count

**Symptoms:** Many rate limit warnings with no other issues

**Solutions:**
- Rate limit warnings are informational and do not disable keys
- Consider distributing load across more keys
- Implement request rate limiting on your side
- Contact the provider for higher rate limits
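
For rate limiting on your side, a token bucket is one common approach. The sketch below is a generic client-side limiter, not part of Conduit; `TokenBucket`, its parameters, and the injectable `clock` argument (used so the behaviour can be tested without sleeping) are all assumptions of this example.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last = clock()

    def try_acquire(self) -> bool:
        """Return True if a request may be sent now; False means wait and retry."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        elapsed = now - self._last
        self._tokens = min(float(self.capacity), self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

Calling `try_acquire()` before each upstream request and backing off when it returns `False` keeps the request rate below the provider's limit, which should reduce the number of 429 warnings recorded against the key.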

## API Reference

### View Error Statistics
```http
GET /api/provider-errors/stats?hours=24
```

### View Recent Errors
```http
GET /api/provider-errors/recent?limit=100
```

### View Specific Key Errors
```http
GET /api/provider-errors/keys/{keyId}
```
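
As a rough illustration of calling the three read endpoints above from Python, the sketch below builds the requests with the standard library without sending them. `BASE_URL` is a placeholder, `build_get` is a hypothetical helper, and authentication headers are omitted because this page does not describe how the Admin API is authenticated.

```python
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://conduit.example.com"  # assumption: replace with your Admin API address


def build_get(path: str, params: Optional[dict] = None) -> Request:
    """Build (but do not send) a GET request for a provider-errors endpoint."""
    url = f"{BASE_URL}{path}"
    if params:
        url += "?" + urlencode(params)
    # Authentication headers are deployment-specific and intentionally left out.
    return Request(url, method="GET", headers={"Accept": "application/json"})


stats_request = build_get("/api/provider-errors/stats", {"hours": 24})
recent_request = build_get("/api/provider-errors/recent", {"limit": 100})
# "42" is a made-up key ID; quote() guards against unsafe characters in the path.
key_request = build_get(f"/api/provider-errors/keys/{quote('42', safe='')}")
```

Each request can then be sent with `urllib.request.urlopen(stats_request)` once the base URL and any required credentials are in place.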

### Clear Errors and Re-enable Key
```http
POST /api/provider-errors/keys/{keyId}/clear
Content-Type: application/json

{
  "reenableKey": true,
  "confirmReenable": true,
  "reason": "Credits added to account"
}
```

### Manually Disable Key
```http
POST /api/provider-errors/keys/{keyId}/disable
Content-Type: application/json

{
  "reason": "Scheduled maintenance"
}
```
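
The two write endpoints above take a JSON body. A minimal sketch of building those requests in Python follows; `BASE_URL`, `build_post`, `clear_and_reenable`, and `disable_key` are hypothetical names for this example, the requests are built but not sent, and authentication is again left out as deployment-specific.

```python
import json
from urllib.request import Request

BASE_URL = "https://conduit.example.com"  # assumption: replace with your Admin API address


def build_post(path: str, payload: dict) -> Request:
    """Build (but do not send) a JSON POST request."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        f"{BASE_URL}{path}",
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )


def clear_and_reenable(key_id: str, reason: str) -> Request:
    # Body fields mirror the "Clear Errors and Re-enable Key" example above.
    return build_post(
        f"/api/provider-errors/keys/{key_id}/clear",
        {"reenableKey": True, "confirmReenable": True, "reason": reason},
    )


def disable_key(key_id: str, reason: str) -> Request:
    # Body mirrors the "Manually Disable Key" example above.
    return build_post(f"/api/provider-errors/keys/{key_id}/disable", {"reason": reason})
```

Sending both `reenableKey` and `confirmReenable` mirrors the documented request body; the `reason` string is the same free-text field the Admin Panel asks for when disabling a key.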

## Related Documentation

- [Provider Architecture](../architecture/provider-system/provider-architecture.md)
- [Error Tracking Developer Guide](../development/error-tracking-architecture.md)
- [Error Tracking Runbook](../operations/error-tracking-runbook.md)
