
Configuration Diagnostics - Troubleshooting Runbook

Emergency Response Guide

Critical Issues (Red Status)

Scenario: "All Services Down"

Symptoms: All services show connection failed
Priority: P0 - Immediate action required

Investigation Steps:

  1. Check Cloudflare Status: https://www.cloudflarestatus.com/
  2. Verify Account Access:
    bash
    npx wrangler whoami
    # Should show your account details
  3. Test Individual Services:
    bash
    curl https://stratiqx-identity-server-prod.your-account.workers.dev/health
    curl https://stratiqx-communication-prod.your-account.workers.dev/health  
    curl https://stratiqx-payment-prod.your-account.workers.dev/health
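
The three curl checks above can be wrapped in a small loop so the result is readable at a glance. This is a minimal sketch, not part of the project's tooling: the function names `http_status` and `check_all` are hypothetical, `your-account` stands in for the real workers.dev subdomain, and it assumes each /health endpoint answers HTTP 200 when healthy.

```bash
#!/usr/bin/env bash
# Sketch: probe each service's /health endpoint and report the HTTP status.
# Service names mirror the curl commands above.
SERVICES="stratiqx-identity-server-prod stratiqx-communication-prod stratiqx-payment-prod"

# Print the HTTP status code for a URL, or 000 if the request fails outright.
http_status() {
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$1") || code=000
  printf '%s\n' "${code:-000}"
}

# Check every service for the given account subdomain.
# Returns non-zero if any service does not answer 200 (assumed healthy status).
check_all() {
  local account failed svc url code
  account="$1"
  failed=0
  for svc in $SERVICES; do
    url="https://${svc}.${account}.workers.dev/health"
    code=$(http_status "$url")
    if [ "$code" = "200" ]; then
      echo "OK   $svc"
    else
      echo "FAIL $svc (HTTP $code)"
      failed=1
    fi
  done
  return "$failed"
}
```

Calling `check_all your-account` prints one OK or FAIL line per service and returns non-zero if any check fails, so it can gate a script or CI step.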

Resolution Steps:

  1. If Cloudflare is down: Wait for service restoration
  2. If account access issues: Re-authenticate with wrangler login
  3. If specific services down: Check deployment logs and redeploy if needed

Scenario: "Database Connection Failed"

Symptoms: D1 database shows red status, connection test failed
Priority: P0 - Data access impacted

Investigation Steps:

  1. Verify Database Exists:
    bash
    npx wrangler d1 list
    # Should show strategic-intelligence-hub
  2. Test Direct Connection:
    bash
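    # --remote targets the deployed database; recent Wrangler versions default to the local dev copy
    npx wrangler d1 execute strategic-intelligence-hub --remote --command "SELECT 1"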
  3. Check Binding Configuration:
    bash
    grep -A 5 "d1_databases" wrangler.toml
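
To compare the binding in wrangler.toml with what `npx wrangler d1 list` reports, it can help to pull the configured values out directly instead of reading grep output by eye. The sketch below is an assumption-laden helper, not project tooling: `d1_field` is a hypothetical name, and it only understands simple `key = "value"` lines inside `[[d1_databases]]` style blocks rather than full TOML.

```bash
#!/usr/bin/env bash
# Sketch: print the value of one key (e.g. database_id) from every
# d1_databases block in a wrangler.toml file.
# Usage: d1_field <path-to-wrangler.toml> <key>
d1_field() {
  awk -v key="$2" '
    # Any table header starts or ends a d1_databases block.
    /^[ \t]*\[/ { in_d1 = (index($0, "d1_databases]]") > 0); next }
    # Inside a block, match lines of the form: key = "value"
    in_d1 && $0 ~ ("^[ \t]*" key "[ \t]*=") {
      if (match($0, /"[^"]*"/)) print substr($0, RSTART + 1, RLENGTH - 2)
    }
  ' "$1"
}
```

For example, `d1_field wrangler.toml database_name` should print strategic-intelligence-hub if the binding is present, and `d1_field wrangler.toml database_id` gives the ID to check against the `d1 list` output.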

Resolution Steps:

  1. If database missing: Restore from backup or recreate
  2. If binding incorrect: Update wrangler.toml with correct database_id
  3. If permissions issue: Check D1 access permissions in Cloudflare dashboard

Warning Issues (Yellow Status)

Scenario: "Environment Mismatch"

Symptoms: Production frontend connecting to staging services
Priority: P1 - Plan fix for next deployment

Investigation Steps:

  1. Check Current Bindings:
    bash
    grep -A 10 "\[env.production\]" wrangler.toml
  2. Verify Service Names:
    bash
    npx wrangler pages deployment list --limit 1
  3. Check Service Environments:
    • Visit /api/config?token=debug-config-2024
    • Look for configurationPaths section

Resolution Steps:

  1. Update wrangler.toml to point to correct services:
    toml
    [env.production]
    [[env.production.services]]
    binding = "AUTH_SERVICE"
    service = "stratiqx-identity-server-prod"  # Ensure -prod suffix
  2. Redeploy after configuration update
  3. Verify in diagnostics dashboard
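
The "-prod suffix" rule above can be checked mechanically before redeploying. The following is a rough sketch under stated assumptions: `non_prod_services` is a hypothetical helper, it assumes production bindings live under `[[env.production.services]]` as in the snippet above, and it only handles simple `service = "name"` lines.

```bash
#!/usr/bin/env bash
# Sketch: list production service bindings whose target does not end in -prod.
# Usage: non_prod_services <path-to-wrangler.toml>
non_prod_services() {
  awk '
    # Track whether we are inside a production service binding block.
    /^[ \t]*\[/ { in_prod = (index($0, "[[env.production.services]]") > 0); next }
    in_prod && /^[ \t]*service[ \t]*=/ {
      if (match($0, /"[^"]*"/)) {
        name = substr($0, RSTART + 1, RLENGTH - 2)
        if (name !~ /-prod$/) print name
      }
    }
  ' "$1"
}
```

An empty result from `non_prod_services wrangler.toml` suggests every production binding points at a -prod service; any name printed is a candidate mismatch to fix before redeploying.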

Scenario: "Communication Service Secrets Warning"

Symptoms: Secrets show warning status, emails may be failing
Priority: P1 - Email functionality impacted

Investigation Steps:

  1. Check Secret Status:
    bash
    cd stratiqx-communication
    npx wrangler secret list --env production
  2. Test Email Sending:
    bash
    # Export AUTH_ADMIN_TOKEN in your shell first; never paste the real token into docs or tickets
    curl -X POST https://stratiqx-communication-prod.your-account.workers.dev/send \
      -H "Content-Type: application/json" \
      -H "Authorization: Bearer $AUTH_ADMIN_TOKEN" \
      -H "X-Service: stratiqx-onboarding" \
      -d '{"communication_type": "welcome_email", "user_id": "test", "email": "test@yourdomain.com", "name": "Test"}'

Resolution Steps:

  1. Missing Secrets - Set them:
    bash
    cd stratiqx-communication
    npx wrangler secret put RESEND_API_KEY --env production
    npx wrangler secret put AUTH_ADMIN_TOKEN --env production
  2. Invalid RESEND_API_KEY:
    • Check Resend dashboard for API key status
    • Verify billing/quota limits
    • Generate new API key if needed
  3. Redeploy communication service after secret updates
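
Checking for missing secrets can be scripted against the output of `wrangler secret list`. This sketch assumes that command prints JSON objects containing a `"name"` field per secret, and that the two secrets named above are the full required set; `missing_secrets` and `REQUIRED_SECRETS` are hypothetical names introduced for illustration.

```bash
#!/usr/bin/env bash
# Sketch: report which required secrets are absent from a secret listing.
# Assumption: the listing contains entries like {"name": "RESEND_API_KEY", ...}.
REQUIRED_SECRETS="RESEND_API_KEY AUTH_ADMIN_TOKEN"

# Reads the listing on stdin and prints each required secret that is not present.
missing_secrets() {
  local listing name
  listing=$(cat)
  for name in $REQUIRED_SECRETS; do
    printf '%s\n' "$listing" \
      | grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$name\"" \
      || echo "$name"
  done
}
```

A possible invocation is `npx wrangler secret list --env production | missing_secrets` from the stratiqx-communication directory; no output means both secrets are set, though it says nothing about whether their values are valid.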

Monitoring & Prevention

Scenario: "Service Performance Degradation"

Symptoms: Services responding slowly, high latency
Priority: P2 - Monitor and optimize

Investigation Steps:

  1. Check Service Metrics in Cloudflare dashboard
  2. Review Recent Deployments for performance regressions
  3. Monitor Resource Usage (CPU, memory limits)
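
Alongside the dashboard metrics, a quick client-side latency sample can confirm whether a service is actually slow from where you sit. The helpers below are a hedged sketch: `time_request`, `summarize_latency`, and `sample_latency` are hypothetical names, the 10 second timeout and one second spacing are arbitrary choices, and requests that fail or time out still contribute their elapsed time to the summary.

```bash
#!/usr/bin/env bash
# Sketch: sample request latency with curl and summarize it with awk.

# Print the total time in seconds that curl reports for one request.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' --max-time 10 "$1"
}

# Read one latency (seconds) per line on stdin; print count, mean, and max.
summarize_latency() {
  awk '
    $1 ~ /^[0-9.]+$/ { n++; sum += $1; if ($1 > max) max = $1 }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d mean=%.3fs max=%.3fs\n", n, sum / n, max
    }
  '
}

# Take N samples (default 5) against a URL, one second apart.
sample_latency() {
  local url n i
  url="$1"
  n="${2:-5}"
  i=0
  while [ "$i" -lt "$n" ]; do
    time_request "$url"
    i=$((i + 1))
    sleep 1
  done | summarize_latency
}
```

For instance, `sample_latency https://stratiqx-communication-prod.your-account.workers.dev/health 5` prints a single summary line that can be compared before and after a deployment.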

Resolution Steps:

  1. Scale Resources if hitting limits
  2. Optimize Code if performance regression identified
  3. Add Caching where appropriate

Scenario: "Intermittent Connection Issues"

Symptoms: Services sometimes fail connectivity tests
Priority: P2 - Network stability issue

Investigation Steps:

  1. Check for Rate Limiting:
    • Review service logs for 429 responses
    • Check API quotas and limits
  2. Network Path Analysis:
    • Test from different locations
    • Check for DNS resolution issues

Resolution Steps:

  1. Implement Retry Logic for transient failures
  2. Add Circuit Breakers for failing services
  3. Configure Appropriate Timeouts
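
For ad hoc checks from the command line, retry logic with a timeout can be sketched as a small shell wrapper. This is an illustration only, not how the services themselves implement retries: `retry` is a hypothetical helper, and the attempt count and starting delay are parameters you would tune.

```bash
#!/usr/bin/env bash
# Sketch: retry a command with exponential backoff.
# Usage: retry <max_attempts> <initial_delay_seconds> <command> [args...]
retry() {
  local max delay attempt
  max="$1"
  delay="$2"
  attempt=1
  shift 2
  while true; do
    # Succeed as soon as the command exits with status 0.
    "$@" && return 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))   # double the wait before the next attempt
  done
}
```

Combined with curl's own timeout and fail-on-HTTP-error flags, a call such as `retry 4 1 curl -sf --max-time 5 https://stratiqx-communication-prod.your-account.workers.dev/health` tolerates a transient failure but still returns non-zero if the service stays down.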

Diagnostic Commands Reference

Quick Health Checks

bash
# Full system check
npm run check:config

# API health check
curl -s "https://your-app.pages.dev/api/config?token=debug-config-2024" | jq '.summary'

# Individual service health
curl -s "https://stratiqx-communication-prod.your-account.workers.dev/health" | jq '.'

Secret Management

bash
# List all secrets
npx wrangler secret list --env production

# Set new secret
echo "your-secret-value" | npx wrangler secret put SECRET_NAME --env production

# Delete secret
npx wrangler secret delete SECRET_NAME --env production

Database Operations

bash
# List databases
npx wrangler d1 list

# Execute query
# (--remote queries the deployed database rather than the local dev copy)
npx wrangler d1 execute strategic-intelligence-hub --remote --command "SELECT COUNT(*) FROM client_profiles"

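# Create backup as a SQL export (the older "d1 backup" commands apply only to legacy alpha databases)
npx wrangler d1 export strategic-intelligence-hub --remote --output=./backup.sql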

Service Management

bash
# List deployments
npx wrangler pages deployment list --limit 5

# Deploy specific environment
npx wrangler pages deploy dist --env production

# Stream live logs from the latest deployment (useful when debugging service bindings)
npx wrangler pages deployment tail

Escalation Procedures

P0 - Critical (Service Down)

  1. Immediate: Check diagnostics dashboard
  2. Within 5 min: Run emergency diagnostic commands
  3. Within 15 min: Identify root cause
  4. Within 30 min: Implement fix or rollback
  5. Post-incident: Document root cause and prevention steps

P1 - High (Functionality Impacted)

  1. Within 1 hour: Investigate and identify root cause
  2. Within 4 hours: Implement fix
  3. Within 24 hours: Deploy and verify resolution
  4. Follow-up: Monitor for recurrence

P2 - Medium (Performance/Warning)

  1. Within 1 day: Investigate and assess impact
  2. Within 1 week: Plan and implement fix
  3. Follow-up: Include in regular maintenance cycle

Common Error Messages

| Error Message | Cause | Solution |
|---|---|---|
| "Missing required environment variables" | Secrets not set in worker | Set secrets via wrangler secret put |
| "Service binding not configured" | Missing binding in wrangler.toml | Add service binding configuration |
| "Invalid or expired authentication token" | Wrong admin token | Update token in service configuration |
| "Connection failed" | Service unreachable | Check service deployment and network |
| "Database connection test failed" | D1 binding issue | Verify database ID and binding |
| "Unsupported communication type" | Invalid email type | Use supported types (welcome_email, etc.) |
| "Email provider error" | Resend API issue | Check Resend dashboard and API key |

Prevention Checklist

Pre-Deployment

  • [ ] Run npm run check:config locally
  • [ ] Verify all services show green status
  • [ ] Check for environment mismatches
  • [ ] Validate secret expiration dates

Post-Deployment

  • [ ] Check diagnostics dashboard within 5 minutes
  • [ ] Verify all services operational
  • [ ] Test critical user flows
  • [ ] Monitor for any new warnings

Regular Maintenance

  • [ ] Weekly diagnostics review
  • [ ] Monthly secret rotation assessment
  • [ ] Quarterly performance optimization
  • [ ] Update documentation for configuration changes

Emergency Contact: Check service logs and Cloudflare dashboard first

Last Updated: August 2025

Strategic Intelligence Hub Documentation