Configuration Diagnostics - Troubleshooting Runbook
Emergency Response Guide
Critical Issues (Red Status)
Scenario: "All Services Down"
Symptoms: All services show connection failed
Priority: P0 - Immediate action required
Investigation Steps:
- Check Cloudflare Status: https://www.cloudflarestatus.com/
- Verify Account Access:
```bash
npx wrangler whoami  # Should show your account details
```
- Test Individual Services:
```bash
curl https://stratiqx-identity-server-prod.your-account.workers.dev/health
curl https://stratiqx-communication-prod.your-account.workers.dev/health
curl https://stratiqx-payment-prod.your-account.workers.dev/health
```
Resolution Steps:
- If Cloudflare is down: Wait for service restoration
- If account access fails: Re-authenticate with `wrangler login`
- If specific services are down: Check deployment logs and redeploy if needed
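The individual health checks can be combined into a single pass. This is a minimal sketch, assuming each service exposes the `/health` endpoint used in the investigation steps and answers with HTTP 200 when healthy. The function names `check_health` and `check_all` are illustrative and not part of any existing tooling.

```shell
# Sketch: probe /health endpoints and report the HTTP status of each.
# Assumption: a healthy service answers with HTTP 200.

# Print the HTTP status code for a URL ("000" if the connection fails).
check_health() {
  local url="$1" code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url") || code="000"
  printf '%s' "${code:-000}"
}

# Check every URL passed as an argument; return non-zero if any is unhealthy.
check_all() {
  local url code failed=0
  for url in "$@"; do
    code=$(check_health "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $code $url"
    else
      echo "FAIL $code $url"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all` with the three `/health` URLs prints one line per service and exits non-zero if any of them is down, which makes it easy to tell a platform-wide outage from a single failed deployment.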
Scenario: "Database Connection Failed"
Symptoms: D1 database shows red status, connection test failed
Priority: P0 - Data access impacted
Investigation Steps:
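- Verify Database Exists:
```bash
npx wrangler d1 list  # Should show strategic-intelligence-hub
```
- Test Direct Connection:
```bash
# Recent Wrangler versions run against the local database by default;
# add --remote to test the deployed database.
npx wrangler d1 execute strategic-intelligence-hub --command "SELECT 1"
```
- Check Binding Configuration:
```bash
grep -A 5 "d1_databases" wrangler.toml
```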
Resolution Steps:
- If database missing: Restore from backup or recreate
- If binding incorrect: Update `wrangler.toml` with correct database_id
- If permissions issue: Check D1 access permissions in Cloudflare dashboard
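To confirm the binding, the `database_id` in `wrangler.toml` can be compared with the ID reported by `npx wrangler d1 list`. The sketch below assumes the usual `[[d1_databases]]` table layout with `database_name` and `database_id` keys. The helper name `d1_database_id` is hypothetical.

```shell
# Print the database_id configured for a given database_name in a wrangler.toml.
# Handles top-level [[d1_databases]] and [[env.<name>.d1_databases]] tables.
# Returns non-zero if no matching entry is found.
d1_database_id() {
  local name="$1" file="${2:-wrangler.toml}"
  awk -v want="$name" '
    # Extract the first double-quoted value on a line.
    function value(line,   parts) {
      split(line, parts, "\"")
      return parts[2]
    }
    # A new table header: note whether it is a D1 table and reset state.
    /^[ \t]*\[/ {
      in_d1 = ($0 ~ /d1_databases\]\]/)
      db = ""; id = ""
      next
    }
    in_d1 && /^[ \t]*database_name[ \t]*=/ { db = value($0) }
    in_d1 && /^[ \t]*database_id[ \t]*=/   { id = value($0) }
    in_d1 && db == want && id != "" { print id; found = 1; exit }
    END { exit (found ? 0 : 1) }
  ' "$file"
}
```

Running `d1_database_id strategic-intelligence-hub` in the project directory prints the configured ID. If it differs from the ID in the `d1 list` output, the binding is pointing at the wrong database.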
Warning Issues (Yellow Status)
Scenario: "Environment Mismatch"
Symptoms: Production frontend connecting to staging services
Priority: P1 - Plan fix for next deployment
Investigation Steps:
- Check Current Bindings:
```bash
grep -A 10 "\[env.production\]" wrangler.toml
```
- Verify Service Names:
```bash
npx wrangler pages deployment list --limit 1
```
- Check Service Environments:
  - Visit `/api/config?token=debug-config-2024`
  - Look for `configurationPaths` section
Resolution Steps:
- Update wrangler.toml to point to correct services:
```toml
[env.production]
[[env.production.services]]
binding = "AUTH_SERVICE"
service = "stratiqx-identity-server-prod"  # Ensure -prod suffix
```
- Redeploy after configuration update
- Verify in diagnostics dashboard
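A mismatch of this kind can also be caught before deploying. The following is a minimal sketch, assuming production services follow the `-prod` naming convention used in this runbook and are declared in `[[env.production.services]]` tables. The function name `find_non_prod_services` is hypothetical.

```shell
# List services bound under [[env.production.services]] whose name lacks
# the -prod suffix. Prints each offending name; returns non-zero if any exist.
find_non_prod_services() {
  local file="${1:-wrangler.toml}"
  awk '
    # Track whether the current table is a production service binding.
    /^[ \t]*\[/ { in_prod = ($0 ~ /\[\[env\.production\.services\]\]/); next }
    in_prod && /^[ \t]*service[ \t]*=/ {
      split($0, parts, "\"")
      if (parts[2] !~ /-prod$/) { print parts[2]; bad = 1 }
    }
    END { exit (bad ? 1 : 0) }
  ' "$file"
}
```

Because the function exits non-zero when it finds a mismatch, it could be run as a pre-deployment gate, for example `find_non_prod_services wrangler.toml || exit 1`.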
Scenario: "Communication Service Secrets Warning"
Symptoms: Secrets show warning status, emails may be failing
Priority: P1 - Email functionality impacted
Investigation Steps:
- Check Secret Status:
```bash
cd stratiqx-communication
npx wrangler secret list --env production
```
- Test Email Sending:
```bash
curl -X POST https://stratiqx-communication-prod.your-account.workers.dev/send \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer stratiqx-admin-88de24b7538d03f265d4e94cd96cd055478c76fe3bf259f2e9dd611cf143f8c8" \
  -H "X-Service: stratiqx-onboarding" \
  -d '{"communication_type": "welcome_email", "user_id": "test", "email": "test@yourdomain.com", "name": "Test"}'
```
Resolution Steps:
- Missing Secrets - Set them:
```bash
cd stratiqx-communication
npx wrangler secret put RESEND_API_KEY --env production
npx wrangler secret put AUTH_ADMIN_TOKEN --env production
```
- Invalid RESEND_API_KEY:
  - Check Resend dashboard for API key status
  - Verify billing/quota limits
  - Generate new API key if needed
- Redeploy communication service after secret updates
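The secret listing can be checked automatically against the names the service needs. This sketch assumes `wrangler secret list` prints a JSON array of objects with a `name` field, as recent Wrangler versions do. The helper name `missing_secrets` is hypothetical, and the required names are the two secrets mentioned above.

```shell
# Read `wrangler secret list` output on stdin and print every required
# secret name (given as arguments) that is absent from it.
# Returns non-zero if at least one required secret is missing.
missing_secrets() {
  local listing name missing=0
  listing=$(cat)
  for name in "$@"; do
    if ! printf '%s\n' "$listing" | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$name\""; then
      echo "$name"
      missing=1
    fi
  done
  return "$missing"
}
```

A typical invocation would be `npx wrangler secret list --env production | missing_secrets RESEND_API_KEY AUTH_ADMIN_TOKEN`. Note that this only verifies the secrets exist, not that their values are valid.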
Monitoring & Prevention
Scenario: "Service Performance Degradation"
Symptoms: Services responding slowly, high latency
Priority: P2 - Monitor and optimize
Investigation Steps:
- Check Service Metrics in Cloudflare dashboard
- Review Recent Deployments for performance regressions
- Monitor Resource Usage (CPU, memory limits)
Resolution Steps:
- Scale Resources if hitting limits
- Optimize Code if performance regression identified
- Add Caching where appropriate
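Alongside the dashboard metrics, latency can be sampled from the client side to confirm a regression before and after a change. This is a rough sketch only: it measures end-to-end request time from wherever it runs, so results include network distance. The function name `mean_latency` is illustrative.

```shell
# Print the mean total request time in seconds over N requests to a URL.
# Usage: mean_latency <url> [samples]   (samples defaults to 5)
mean_latency() {
  local url="$1" samples="${2:-5}" i=0
  while [ "$i" -lt "$samples" ]; do
    # time_total is the full request duration as reported by curl.
    curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 "$url"
    i=$((i + 1))
  done | awk '{ sum += $1; n++ } END { if (n) printf "%.3f\n", sum / n }'
}
```

Comparing the output for a service's `/health` URL across deployments gives a quick signal, though the Cloudflare dashboard remains the better source for CPU time and percentile data.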
Scenario: "Intermittent Connection Issues"
Symptoms: Services sometimes fail connectivity tests
Priority: P2 - Network stability issue
Investigation Steps:
- Check for Rate Limiting:
  - Review service logs for 429 responses
  - Check API quotas and limits
- Network Path Analysis:
  - Test from different locations
  - Check for DNS resolution issues
Resolution Steps:
- Implement Retry Logic for transient failures
- Add Circuit Breakers for failing services
- Configure Appropriate Timeouts
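For scripted checks, retry logic with a timeout can be sketched in a few lines of shell. This example uses exponential backoff, which is one common choice rather than a prescribed policy. The function name `retry` and the attempt and delay values are assumptions for illustration.

```shell
# Run a command up to max_attempts times, doubling the delay after each failure.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max_attempts="$1" delay="$2" attempt=1
  shift 2
  while true; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed, sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For example, `retry 4 1 curl -fsS --max-time 10 https://stratiqx-communication-prod.your-account.workers.dev/health` tries up to four times with waits of 1, 2, and 4 seconds. Here `-f` makes curl treat HTTP error responses as failures and `--max-time` bounds each attempt.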
Diagnostic Commands Reference
Quick Health Checks
```bash
# Full system check
npm run check:config
# API health check
curl -s "https://your-app.pages.dev/api/config?token=debug-config-2024" | jq '.summary'
# Individual service health
curl -s "https://stratiqx-communication-prod.your-account.workers.dev/health" | jq '.'
```
Secret Management
```bash
# List all secrets
npx wrangler secret list --env production
# Set new secret
echo "your-secret-value" | npx wrangler secret put SECRET_NAME --env production
# Delete secret
npx wrangler secret delete SECRET_NAME --env production
```
Database Operations
```bash
# List databases
npx wrangler d1 list
# Execute query
npx wrangler d1 execute strategic-intelligence-hub --command "SELECT COUNT(*) FROM client_profiles"
# Create backup
npx wrangler d1 backup create strategic-intelligence-hub
```
Service Management
```bash
# List deployments
npx wrangler pages deployment list --limit 5
# Deploy specific environment
npx wrangler pages deploy dist --env production
# Tail live deployment logs (useful for spotting service binding errors)
npx wrangler pages deployment tail
```
Escalation Procedures
P0 - Critical (Service Down)
- Immediate: Check diagnostics dashboard
- Within 5 min: Run emergency diagnostic commands
- Within 15 min: Identify root cause
- Within 30 min: Implement fix or rollback
- Post-incident: Document root cause and prevention steps
P1 - High (Functionality Impacted)
- Within 1 hour: Investigate and identify root cause
- Within 4 hours: Implement fix
- Within 24 hours: Deploy and verify resolution
- Follow-up: Monitor for recurrence
P2 - Medium (Performance/Warning)
- Within 1 day: Investigate and assess impact
- Within 1 week: Plan and implement fix
- Follow-up: Include in regular maintenance cycle
Common Error Messages
| Error Message | Cause | Solution |
|---|---|---|
| "Missing required environment variables" | Secrets not set in worker | Set secrets via `wrangler secret put` |
| "Service binding not configured" | Missing binding in wrangler.toml | Add service binding configuration |
| "Invalid or expired authentication token" | Wrong admin token | Update token in service configuration |
| "Connection failed" | Service unreachable | Check service deployment and network |
| "Database connection test failed" | D1 binding issue | Verify database ID and binding |
| "Unsupported communication type" | Invalid email type | Use supported types (welcome_email, etc.) |
| "Email provider error" | Resend API issue | Check Resend dashboard and API key |
Prevention Checklist
Pre-Deployment
- [ ] Run `npm run check:config` locally
- [ ] Verify all services show green status
- [ ] Check for environment mismatches
- [ ] Validate secret expiration dates
Post-Deployment
- [ ] Check diagnostics dashboard within 5 minutes
- [ ] Verify all services operational
- [ ] Test critical user flows
- [ ] Monitor for any new warnings
Regular Maintenance
- [ ] Weekly diagnostics review
- [ ] Monthly secret rotation assessment
- [ ] Quarterly performance optimization
- [ ] Update documentation for configuration changes
Emergency Contact: Check service logs and Cloudflare dashboard first
Last Updated: August 2025