Configuration Diagnostics - Troubleshooting Runbook

Emergency Response Guide

Critical Issues (Red Status)

Scenario: "All Services Down"

Symptoms: All services show connection failed
Priority: P0 - Immediate action required

Investigation Steps:

Check Cloudflare Status: https://www.cloudflarestatus.com/

Verify Account Access:

bash

npx wrangler whoami
# Should show your account details

Test Individual Services:

bash

curl https://stratiqx-identity-server-prod.your-account.workers.dev/health
curl https://stratiqx-communication-prod.your-account.workers.dev/health  
curl https://stratiqx-payment-prod.your-account.workers.dev/health
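
The three checks above can be folded into one loop that prints a status per service. This is a minimal sketch, not part of the project's tooling: the function names classify_status and check_services are hypothetical, and your-account is the same placeholder used in the commands above.

```shell
# Map an HTTP status code, as printed by curl's %{http_code}, to a label.
classify_status() {
  case "$1" in
    2??) echo "up" ;;
    000) echo "unreachable" ;;  # curl prints 000 when it received no HTTP response
    *)   echo "degraded ($1)" ;;
  esac
}

# Usage: check_services ACCOUNT SERVICE [SERVICE...]
# Requests /health on each service and prints one line per service.
check_services() {
  local account="$1" svc code
  shift
  for svc in "$@"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
      "https://${svc}.${account}.workers.dev/health")
    printf '%-40s %s\n' "$svc" "$(classify_status "$code")"
  done
}
```

For example: check_services your-account stratiqx-identity-server-prod stratiqx-communication-prod stratiqx-payment-prod. If every service reports unreachable, the cause is more likely the platform or the account than a single deployment.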

Resolution Steps:

If Cloudflare is down: Wait for service restoration
If account access fails: Re-authenticate with npx wrangler login
If specific services are down: Check deployment logs and redeploy if needed

Scenario: "Database Connection Failed"

Symptoms: D1 database shows red status, connection test failed
Priority: P0 - Data access impacted

Investigation Steps:

Verify Database Exists:

bash

npx wrangler d1 list
# Should show strategic-intelligence-hub

Test Direct Connection:

bash

npx wrangler d1 execute strategic-intelligence-hub --remote --command "SELECT 1"

Check Binding Configuration:

bash

grep -A 5 "d1_databases" wrangler.toml

Resolution Steps:

If the database is missing: Restore from backup or recreate it
If the binding is incorrect: Update wrangler.toml with the correct database_id
If it is a permissions issue: Check D1 access permissions in the Cloudflare dashboard
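
To check the binding, it can help to pull the configured database_id out of wrangler.toml and compare it with the ID that npx wrangler d1 list reports. The helper below is a sketch under assumptions: d1_binding_id is a hypothetical name, and it expects database_name and database_id to sit on their own lines with double-quoted values inside each table entry.

```shell
# Usage: d1_binding_id WRANGLER_TOML DATABASE_NAME
# Prints the database_id of the first table entry whose database_name matches.
d1_binding_id() {
  awk -v want="$2" '
    /^[ \t]*\[/ { db = ""; id = "" }   # any new table header resets the state
    /^[ \t]*database_name[ \t]*=/ { split($0, p, "\""); db = p[2] }
    /^[ \t]*database_id[ \t]*=/   { split($0, p, "\""); id = p[2] }
    db == want && id != "" { print id; exit }
  ' "$1"
}
```

For example: d1_binding_id wrangler.toml strategic-intelligence-hub. Empty output means no entry with that database_name was found in the file.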

Warning Issues (Yellow Status)

Scenario: "Environment Mismatch"

Symptoms: Production frontend connecting to staging services
Priority: P1 - Plan fix for next deployment

Investigation Steps:

Check Current Bindings:

bash

grep -A 10 "\[env.production\]" wrangler.toml

Verify Service Names:

bash

npx wrangler pages deployment list --limit 1

Check Service Environments:
- Visit /api/config?token=debug-config-2024
- Look for configurationPaths section

Resolution Steps:

Update wrangler.toml to point to correct services:

toml

[env.production]
[[env.production.services]]
binding = "AUTH_SERVICE"
service = "stratiqx-identity-server-prod"  # Ensure -prod suffix

Redeploy after configuration update
Verify in diagnostics dashboard
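
A quick way to spot a mismatch before redeploying is to list every production service binding whose target lacks the -prod suffix. This is a sketch only: find_non_prod_services is a hypothetical helper, and it assumes the bindings are declared as [[env.production.services]] entries with a double-quoted service value, as in the snippet above.

```shell
# Usage: find_non_prod_services WRANGLER_TOML
# Prints each production service target that does not end in "-prod".
find_non_prod_services() {
  awk '
    # Track whether the current table is a production service binding.
    /^[ \t]*\[/ { in_prod = ($0 ~ /^[ \t]*\[\[env\.production\.services\]\]/) }
    in_prod && /^[ \t]*service[ \t]*=/ {
      split($0, p, "\"")
      if (p[2] !~ /-prod$/) print p[2]
    }
  ' "$1"
}
```

For example: find_non_prod_services wrangler.toml. No output suggests every production binding follows the naming convention; any name printed is a candidate for the mismatch.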

Scenario: "Communication Service Secrets Warning"

Symptoms: Secrets show warning status, emails may be failing
Priority: P1 - Email functionality impacted

Investigation Steps:

Check Secret Status:

bash

cd stratiqx-communication
npx wrangler secret list --env production

Test Email Sending:

bash

curl -X POST https://stratiqx-communication-prod.your-account.workers.dev/send \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AUTH_ADMIN_TOKEN" \
  -H "X-Service: stratiqx-onboarding" \
  -d '{"communication_type": "welcome_email", "user_id": "test", "email": "test@yourdomain.com", "name": "Test"}'

Resolution Steps:

Missing Secrets - Set them:

bash

cd stratiqx-communication
npx wrangler secret put RESEND_API_KEY --env production
npx wrangler secret put AUTH_ADMIN_TOKEN --env production

Invalid RESEND_API_KEY:
- Check Resend dashboard for API key status
- Verify billing/quota limits
- Generate new API key if needed

Redeploy communication service after secret updates
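
To see at a glance which required secrets are absent, the output of the secret list can be compared against the expected names. The sketch below rests on assumptions: missing_secrets is a hypothetical helper, jq is installed, and wrangler secret list prints a JSON array of objects with a name field (check the output format of your wrangler version).

```shell
# Usage: <names on stdin, one per line> | missing_secrets REQUIRED [REQUIRED...]
# Prints every required name that does not appear on stdin.
missing_secrets() {
  local present name
  present=$(cat)
  for name in "$@"; do
    printf '%s\n' "$present" | grep -qxF -- "$name" || echo "$name"
  done
}

# Example (run from the stratiqx-communication directory):
#   npx wrangler secret list --env production \
#     | jq -r '.[].name' \
#     | missing_secrets RESEND_API_KEY AUTH_ADMIN_TOKEN
```

Any name printed still needs to be set with npx wrangler secret put; empty output means both names are present, though it says nothing about whether their values are valid.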

Monitoring & Prevention

Scenario: "Service Performance Degradation"

Symptoms: Services responding slowly, high latency
Priority: P2 - Monitor and optimize

Investigation Steps:

Check service metrics in the Cloudflare dashboard
Review recent deployments for performance regressions
Monitor resource usage (CPU, memory limits)
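
Alongside the dashboard, latency can be sampled from a shell to get a rough baseline before and after a deployment. This is a minimal sketch: sample_latency and latency_stats are hypothetical helpers, and the numbers reflect the network path from wherever the script runs, not only the service itself.

```shell
# Usage: sample_latency URL COUNT
# Prints curl's total request time in seconds, one line per request.
# A failed request still prints the time spent before it failed.
sample_latency() {
  local url="$1" count="$2" i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 "$url"
    i=$((i + 1))
  done
}

# Reads one duration per line and prints count, minimum, mean, and maximum.
latency_stats() {
  awk '
    NR == 1 { min = $1; max = $1 }
    { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
    END { if (NR) printf "n=%d min=%.3f avg=%.3f max=%.3f\n", NR, min, sum / NR, max }
  '
}
```

For example: sample_latency https://stratiqx-communication-prod.your-account.workers.dev/health 20 | latency_stats. Comparing the output across deployments gives a coarse signal of a regression.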

Resolution Steps:

Scale resources if services are hitting limits
Optimize code if a performance regression is identified
Add caching where appropriate

Scenario: "Intermittent Connection Issues"

Symptoms: Services sometimes fail connectivity tests
Priority: P2 - Network stability issue

Investigation Steps:

Check for Rate Limiting:
- Review service logs for 429 responses
- Check API quotas and limits
Network Path Analysis:
- Test from different locations
- Check for DNS resolution issues
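
Intermittent failures are easier to reason about with a count than with single spot checks. The sketch below polls an endpoint and tallies the status codes, including 429 responses. The helper names sample_codes and summarize_codes are hypothetical, and the one-second pause is an arbitrary choice meant to keep the probe itself from triggering rate limits.

```shell
# Usage: sample_codes URL COUNT
# Prints one HTTP status code per request (000 means no HTTP response).
sample_codes() {
  local url="$1" count="$2" i=0
  while [ "$i" -lt "$count" ]; do
    curl -s -o /dev/null -w '%{http_code}\n' --max-time 5 "$url"
    sleep 1
    i=$((i + 1))
  done
}

# Reads one status code per line and prints totals.
summarize_codes() {
  awk '
    { total++ }
    $1 !~ /^2[0-9][0-9]$/ { failed++ }
    $1 == "429" { limited++ }
    END { printf "total=%d failed=%d rate_limited=%d\n", total, failed, limited }
  '
}
```

For example: sample_codes https://stratiqx-communication-prod.your-account.workers.dev/health 30 | summarize_codes. Running the same probe from a second location helps separate a service problem from a network path problem.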

Resolution Steps:

Implement retry logic for transient failures
Add circuit breakers for failing services
Configure appropriate timeouts
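
For ad hoc checks from a shell, retry logic with a timeout might look like the following. This is an illustrative sketch: retry is a hypothetical helper that doubles the delay after each failed attempt, and retries inside the services themselves would live in their own code rather than in a script like this.

```shell
# Usage: retry MAX_ATTEMPTS INITIAL_DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached,
# doubling the delay between attempts.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))
    n=$((n + 1))
  done
}

# Example: up to 4 attempts, waiting 1s, 2s, then 4s between them,
# with a 5-second timeout on each request:
#   retry 4 1 curl -fsS --max-time 5 \
#     https://stratiqx-communication-prod.your-account.workers.dev/health
```

The -f flag makes curl exit non-zero on HTTP errors, so a 429 or 5xx response counts as a failed attempt and triggers the next retry.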

Diagnostic Commands Reference

Quick Health Checks

bash

# Full system check
npm run check:config

# API health check
curl -s "https://your-app.pages.dev/api/config?token=debug-config-2024" | jq '.summary'

# Individual service health
curl -s "https://stratiqx-communication-prod.your-account.workers.dev/health" | jq '.'

Secret Management

bash

# List all secrets
npx wrangler secret list --env production

# Set new secret
echo "your-secret-value" | npx wrangler secret put SECRET_NAME --env production

# Delete secret
npx wrangler secret delete SECRET_NAME --env production

Database Operations

bash

# List databases
npx wrangler d1 list

# Execute query
npx wrangler d1 execute strategic-intelligence-hub --remote --command "SELECT COUNT(*) FROM client_profiles"

# Create backup (SQL export; d1 backup only applies to legacy alpha databases)
npx wrangler d1 export strategic-intelligence-hub --remote --output=backup.sql

Service Management

bash

# List deployments
npx wrangler pages deployment list --limit 5

# Deploy specific environment
npx wrangler pages deploy dist --env production

# Stream live logs from a deployment
npx wrangler pages deployment tail

Escalation Procedures

P0 - Critical (Service Down)

Immediate: Check diagnostics dashboard
Within 5 min: Run emergency diagnostic commands
Within 15 min: Identify root cause
Within 30 min: Implement fix or rollback
Post-incident: Document root cause and prevention steps

P1 - High (Functionality Impacted)

Within 1 hour: Investigate and identify root cause
Within 4 hours: Implement fix
Within 24 hours: Deploy and verify resolution
Follow-up: Monitor for recurrence

P2 - Medium (Performance/Warning)

Within 1 day: Investigate and assess impact
Within 1 week: Plan and implement fix
Follow-up: Include in regular maintenance cycle

Common Error Messages

Error Message	Cause	Solution
`"Missing required environment variables"`	Secrets not set in worker	Set secrets via `wrangler secret put`
`"Service binding not configured"`	Missing binding in wrangler.toml	Add service binding configuration
`"Invalid or expired authentication token"`	Wrong admin token	Update token in service configuration
`"Connection failed"`	Service unreachable	Check service deployment and network
`"Database connection test failed"`	D1 binding issue	Verify database ID and binding
`"Unsupported communication type"`	Invalid email type	Use supported types (welcome_email, etc.)
`"Email provider error"`	Resend API issue	Check Resend dashboard and API key
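
The table can also be applied mechanically when scanning log output. The function below is a sketch that maps a log line to the solution listed above. suggest_fix is a hypothetical helper, and it matches only the exact message fragments from the table.

```shell
# Usage: suggest_fix "LOG LINE"
# Prints the solution from the error table for the first matching message.
suggest_fix() {
  case "$1" in
    *"Missing required environment variables"*)  echo "Set secrets via wrangler secret put" ;;
    *"Service binding not configured"*)          echo "Add service binding configuration" ;;
    *"Invalid or expired authentication token"*) echo "Update token in service configuration" ;;
    *"Database connection test failed"*)         echo "Verify database ID and binding" ;;
    *"Connection failed"*)                       echo "Check service deployment and network" ;;
    *"Unsupported communication type"*)          echo "Use supported types (welcome_email, etc.)" ;;
    *"Email provider error"*)                    echo "Check Resend dashboard and API key" ;;
    *)                                           echo "No known fix: check service logs" ;;
  esac
}
```

For example, suggest_fix "Error: Email provider error" prints the Resend hint, and any unrecognized line falls through to the generic advice.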

Prevention Checklist

Pre-Deployment

[ ] Run npm run check:config locally
[ ] Verify all services show green status
[ ] Check for environment mismatches
[ ] Validate secret expiration dates

Post-Deployment

[ ] Check diagnostics dashboard within 5 minutes
[ ] Verify all services operational
[ ] Test critical user flows
[ ] Monitor for any new warnings

Regular Maintenance

[ ] Weekly diagnostics review
[ ] Monthly secret rotation assessment
[ ] Quarterly performance optimization
[ ] Update documentation for configuration changes

Emergency Contact: Check service logs and Cloudflare dashboard first

Last Updated: August 2025
