AI Consulting Platform - Administrative Dashboard Guide
🎯 Overview
The AI Consulting Platform Administrative Dashboard provides comprehensive oversight and monitoring capabilities for system administrators. It offers real-time visibility into execution health, failure patterns, performance trends, and automated troubleshooting recommendations.
This dashboard system was designed to enable quick identification of failure sources during orchestrator execution, with specific focus on external service dependencies (D1 Database, R2 Storage, AI Services).
📊 Dashboard API Endpoints
1. System Health Overview
Endpoint: GET /api/admin/dashboard/overview
Purpose: High-level system health monitoring and performance metrics
Key Features:
- Health Score (0-100): Overall system performance rating
- Success Rate: Percentage of successful executions over last 7 days
- Performance Metrics: Average processing times and execution counts
- Error Source Analysis: Categorized failures by service type
- Real-time Alerts: Automated warnings for system issues
- Hourly Performance Trends: 24-hour execution pattern analysis
Sample Response:
{
"dashboard": {
"health": {
"score": 87,
"status": "GOOD",
"total_executions": 156,
"success_rate": 87.18,
"avg_processing_time_ms": 45230
},
"failures": {
"by_source": [
{"error_source": "D1_DATABASE", "error_count": 8},
{"error_source": "AI_SERVICE", "error_count": 5},
{"error_source": "R2_STORAGE", "error_count": 2}
]
},
"alerts": [
{
"level": "MEDIUM",
"type": "DATABASE_ISSUES",
"message": "Database errors detected: 8 occurrences",
"recommendation": "Check D1 database connectivity and schema integrity"
}
]
}
}
2. Comprehensive Failure Analysis
Endpoint: GET /api/admin/dashboard/failures
Purpose: Detailed failure analysis and troubleshooting guidance
Query Parameters:
- timeframe: 24h (default), 7d, 30d
Key Features:
- Error Pattern Detection: Automatic categorization of failure types
- Step Failure Analysis: Identification of problematic orchestrator steps
- Troubleshooting Recommendations: Automated suggestions based on patterns
- Affected Client Context: Company and industry impact analysis
- Failure Timeline: Historical failure tracking
Error Pattern Categories:
- DATABASE_CONSTRAINT: Schema/constraint violations
- SCHEMA_MISSING: Missing database tables
- D1_CONNECTION: Database connectivity issues
- R2_CONNECTION: Storage access problems
- AI_SERVICE: AI model generation failures
- TIMEOUT: Execution timeouts
- RATE_LIMIT: Service quota exceeded
- OTHER: Uncategorized failures
Sample Response:
{
"failure_analysis": {
"error_patterns": [
{
"error_pattern": "DATABASE_CONSTRAINT",
"occurrences": 12,
"latest_occurrence": "2024-01-15T14:30:00Z",
"sample_messages": "NOT NULL constraint failed: orchestrator_processing_log.processing_step"
}
],
"troubleshooting_recommendations": [
{
"issue": "Database Constraint Violations",
"count": 12,
"recommendation": "Run schema migration: migrations/add_workflow_logger_columns.sql",
"priority": "HIGH"
}
]
}
}
3. Bulk Execution Monitoring
Endpoint: GET /api/admin/dashboard/executions
Purpose: Recent execution monitoring with bulk status overview
Query Parameters:
- limit: Number of executions to return (default: 50, max: 100)
- status: Filter by status (all, completed, failed, processing)
Key Features:
- Recent Executions: Detailed list with progress tracking
- Long-running Detection: Executions stuck processing >10 minutes
- Progress Visualization: Step completion percentages
- Client Context: Company names and industry identification
- Performance Metrics: Duration analysis and trends
- Health Status Indicators: Visual status classification
Sample Response:
{
"executions": {
"summary": {
"total": 45,
"completed": 38,
"failed": 5,
"processing": 2,
"average_duration_ms": 42150,
"long_running_count": 1
},
"recent": [
{
"tracking_id": "M4A4SSPV",
"processing_status": "completed",
"company_name": "TechCorp Inc",
"industry_id": "technology",
"progress_percentage": 100,
"health_status": "HEALTHY",
"duration_ms": 38420
}
],
"long_running": [
{
"tracking_id": "DJPN8WSX",
"running_seconds": 1240,
"processing_status": "processing"
}
]
}
}
🚨 Alert System & Health Indicators
Health Score Calculation
- 90-100: EXCELLENT - System operating optimally
- 85-89: GOOD - Minor issues, monitoring recommended
- 70-84: FAIR - Performance degradation detected
- <70: POOR - Immediate attention required
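As a quick reference, the banding above can be expressed as a small client-side helper. This is a minimal sketch; the function and type names are illustrative and assume only the numeric score field returned by the overview endpoint:
// Map a 0-100 health score to the status bands documented above.
type HealthStatus = 'EXCELLENT' | 'GOOD' | 'FAIR' | 'POOR';

function healthStatusFromScore(score: number): HealthStatus {
  if (score >= 90) return 'EXCELLENT';
  if (score >= 85) return 'GOOD';
  if (score >= 70) return 'FAIR';
  return 'POOR';
}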
Automated Alert Types
High Priority Alerts
- FAILURE_RATE: >20% failure rate in recent executions
- AI_SERVICE_ISSUES: >10 AI service failures detected
- SCHEMA_MISSING: Critical database schema errors
Medium Priority Alerts
- DATABASE_ISSUES: >5 D1 database errors detected
- STORAGE_ISSUES: >3 R2 storage errors detected
- LONG_RUNNING: Executions stuck >10 minutes
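The thresholds above translate directly into a rule check. The sketch below assumes a pre-aggregated failure summary (field names are illustrative, not the actual API shape) and shows how the documented high- and medium-priority alerts could be derived:
// Illustrative alert derivation from the documented thresholds.
interface FailureSummary {
  failureRate: number;       // fraction of recent executions that failed (0-1)
  aiServiceErrors: number;
  schemaErrors: number;
  databaseErrors: number;
  storageErrors: number;
  longRunningCount: number;  // executions stuck processing > 10 minutes
}

interface Alert {
  level: 'HIGH' | 'MEDIUM';
  type: string;
  message: string;
}

function buildAlerts(s: FailureSummary): Alert[] {
  const alerts: Alert[] = [];
  if (s.failureRate > 0.2) alerts.push({ level: 'HIGH', type: 'FAILURE_RATE', message: `Failure rate at ${(s.failureRate * 100).toFixed(1)}%` });
  if (s.aiServiceErrors > 10) alerts.push({ level: 'HIGH', type: 'AI_SERVICE_ISSUES', message: `${s.aiServiceErrors} AI service failures detected` });
  if (s.schemaErrors > 0) alerts.push({ level: 'HIGH', type: 'SCHEMA_MISSING', message: 'Critical database schema errors detected' });
  if (s.databaseErrors > 5) alerts.push({ level: 'MEDIUM', type: 'DATABASE_ISSUES', message: `Database errors detected: ${s.databaseErrors} occurrences` });
  if (s.storageErrors > 3) alerts.push({ level: 'MEDIUM', type: 'STORAGE_ISSUES', message: `Storage errors detected: ${s.storageErrors} occurrences` });
  if (s.longRunningCount > 0) alerts.push({ level: 'MEDIUM', type: 'LONG_RUNNING', message: `${s.longRunningCount} execution(s) stuck > 10 minutes` });
  return alerts;
}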
Alert Response Actions
- Immediate: Check Cloudflare logs for error details
- Investigate: Review failure patterns in dashboard
- Resolve: Apply recommended troubleshooting steps
- Monitor: Track system health improvement
🔧 Troubleshooting Guide
Common Issues & Solutions
1. Database Constraint Violations
Symptoms:
- Error: "NOT NULL constraint failed: orchestrator_processing_log.processing_step"
- High DATABASE_CONSTRAINT error pattern count
Solution:
# Run database migration
wrangler d1 execute STRATEGIC_INTELLIGENCE_DB --file=migrations/add_workflow_logger_columns.sql
Prevention:
- Verify schema integrity before major deployments
- Monitor constraint violation alerts
2. AI Service Rate Limiting
Symptoms:
- Multiple AI_SERVICE failures
- Error messages containing "rate limit" or "quota exceeded"
Solutions:
- Implement exponential backoff in AI service calls
- Monitor Cloudflare AI service quotas
- Consider request batching optimization
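The exponential backoff mentioned above can be wrapped around any AI call. This is a minimal sketch; the env.AI binding and model name are assumptions borrowed from the logging example later in this guide:
// Retry an AI call with exponential backoff and jitter.
async function withBackoff<T>(call: () => Promise<T>, maxRetries = 4, baseDelayMs = 500): Promise<T> {
  let attempt = 0;
  while (true) {
    try {
      return await call();
    } catch (err) {
      attempt++;
      if (attempt > maxRetries) throw err;
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 250; // 500ms, 1s, 2s, 4s + jitter
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Hypothetical usage with a Workers AI binding:
// const result = await withBackoff(() => env.AI.run('@cf/meta/llama-3.1-8b-instruct', { prompt }));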
3. R2 Storage Access Issues
Symptoms:
- R2_STORAGE error pattern increases
- Storage processing steps failing consistently
Solutions:
- Verify R2 bucket permissions and accessibility
- Check CORS configuration for admin interface
- Monitor R2 usage quotas
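A lightweight accessibility probe can confirm the first point above from inside a Worker. This sketch assumes the R2 binding is named STRATIQX_REPORTS (listed under Dependencies) and that Cloudflare Workers types are available:
// Verify that the R2 bucket binding resolves and the bucket is reachable.
async function checkR2Access(env: { STRATIQX_REPORTS: R2Bucket }): Promise<boolean> {
  try {
    await env.STRATIQX_REPORTS.list({ limit: 1 }); // cheap call; we only care that it succeeds
    return true;
  } catch (err) {
    console.error('R2 access check failed:', err);
    return false;
  }
}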
4. Long-Running Executions
Symptoms:
- Executions stuck in 'processing' status >10 minutes
- No recent WorkflowLogger updates
Investigation Steps:
- Check current step via status API
- Review WorkflowLogger for last successful step
- Identify bottleneck (D1, R2, or AI service)
- Consider manual intervention or restart
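The first two investigation steps can be automated against the individual status endpoint. The response field names below are assumptions for illustration; adjust them to the actual API shape:
// Flag an execution that has been processing for more than 10 minutes.
async function inspectExecution(trackingId: string) {
  const res = await fetch(`/api/admin/status/${trackingId}`);
  if (!res.ok) throw new Error(`Status lookup failed: ${res.status}`);
  const status = await res.json() as { processing_status: string; current_step?: string; updated_at?: string };

  const stalledMs = status.updated_at ? Date.now() - Date.parse(status.updated_at) : 0;
  if (status.processing_status === 'processing' && stalledMs > 10 * 60 * 1000) {
    console.warn(`Execution ${trackingId} appears stuck at step: ${status.current_step ?? 'unknown'}`);
  }
  return status;
}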
Database Schema Verification
Use the schema check endpoint to verify WorkflowLogger compatibility:
GET /health?schema=check
Expected result: isWorkflowLoggerCompatible: true
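A minimal scripted version of this check, assuming the flag is returned at the top level of the health response:
// Verify WorkflowLogger schema compatibility before a deployment.
const healthRes = await fetch('/health?schema=check');
const health = await healthRes.json() as { isWorkflowLoggerCompatible?: boolean };

if (!health.isWorkflowLoggerCompatible) {
  console.error('Schema mismatch: run migrations/add_workflow_logger_columns.sql');
}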
📈 Performance Optimization Insights
Key Performance Indicators (KPIs)
Execution Performance
- Target Average Duration: <60 seconds for comprehensive analysis
- Success Rate Target: >95% for production stability
- Step Failure Rate: <2% per individual step
Service Reliability Metrics
- D1 Database Uptime: >99.9%
- R2 Storage Availability: >99.9%
- AI Service Success Rate: >98%
Performance Optimization Strategies
1. Database Optimization
- Index Management: Ensure proper indexing on tracking_id columns
- Connection Pooling: Monitor D1 connection efficiency
- Query Optimization: Review slow queries in processing logs
2. Storage Efficiency
- Asset Size Monitoring: Track R2 storage usage patterns
- Upload Optimization: Implement parallel asset processing
- CDN Integration: Consider CloudFront for asset delivery
3. AI Service Optimization
- Model Selection: Optimize model choice for analysis depth
- Prompt Engineering: Reduce token usage through prompt optimization
- Batch Processing: Group similar analysis requests
📊 Dashboard Integration Recommendations
Frontend Integration
Create admin dashboard components that consume these APIs:
1. Health Overview Widget
// Fetch system health every 30 seconds
const healthRes = await fetch('/api/admin/dashboard/overview');
const healthData = await healthRes.json();
// Display health score, alerts, and key metrics
2. Failure Analysis Panel
// Filter failures by timeframe and pattern
const failuresRes = await fetch('/api/admin/dashboard/failures?timeframe=24h');
const failures = await failuresRes.json();
// Show error patterns, troubleshooting recommendations
3. Execution Monitor Table
// Real-time execution status with auto-refresh
const executionsRes = await fetch('/api/admin/dashboard/executions?status=all&limit=25');
const executions = await executionsRes.json();
// Display with progress bars, health indicators
Monitoring & Alerting Integration
Cloudflare Analytics Integration
- Set up custom metrics for dashboard KPIs
- Configure alerts for critical thresholds
- Track performance trends over time
Webhook Notifications
Consider implementing webhook notifications for:
- Health score drops below 80
- Critical alert generation
- System-wide failure pattern detection
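A minimal sketch of the first case, assuming a generic JSON webhook receiver (the URL and payload shape are illustrative):
// Post a notification when the health score drops below 80.
async function notifyIfUnhealthy(webhookUrl: string): Promise<void> {
  const res = await fetch('/api/admin/dashboard/overview');
  const { dashboard } = await res.json() as { dashboard: { health: { score: number; status: string } } };

  if (dashboard.health.score < 80) {
    await fetch(webhookUrl, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        event: 'HEALTH_SCORE_DROP',
        score: dashboard.health.score,
        status: dashboard.health.status,
        timestamp: new Date().toISOString(),
      }),
    });
  }
}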
Data Export Capabilities
Future enhancement opportunities:
- CSV export of failure analysis data
- Performance report generation
- Historical trend analysis
🔍 Advanced Administrative Features
WorkflowLogger Deep Dive
The WorkflowLogger integration provides step-by-step execution tracking:
Step Categories:
- Profile Operations: Database queries for client data
- AI Processing: Model inference and content generation
- Storage Operations: R2 uploads and metadata storage
- System Operations: Status updates and error handling
Logging Metadata:
{
"step_name": "ai_financial_analysis",
"step_number": 5,
"status": "completed",
"duration_ms": 3240,
"tokens_used": 289,
"metadata": {
"model": "@cf/meta/llama-3.1-8b-instruct",
"content_length": 1847,
"step_type": "ai_analysis"
}
}
Error Source Identification
The system automatically categorizes errors by service:
D1 Database Errors:
- Connection timeouts
- Schema constraint violations
- Query syntax errors
- Permission issues
R2 Storage Errors:
- Upload failures
- Access permission errors
- Quota exceeded errors
- Network connectivity issues
AI Service Errors:
- Model inference failures
- Rate limit exceeded
- Invalid prompt formats
- Service unavailability
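Conceptually, the categorization amounts to pattern-matching error messages against the lists above. The rules below are illustrative assumptions, not the platform's actual implementation:
// Rough message-based categorization mirroring the error source lists.
type ErrorSource = 'D1_DATABASE' | 'R2_STORAGE' | 'AI_SERVICE' | 'OTHER';

function categorizeError(message: string): ErrorSource {
  const m = message.toLowerCase();
  if (m.includes('constraint') || m.includes('no such table') || m.includes('d1')) return 'D1_DATABASE';
  if (m.includes('r2') || m.includes('bucket') || m.includes('upload')) return 'R2_STORAGE';
  if (m.includes('rate limit') || m.includes('quota') || m.includes('inference') || m.includes('model')) return 'AI_SERVICE';
  return 'OTHER';
}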
🚀 Best Practices for Administrators
Daily Monitoring Routine
- Check Health Overview - Review overall system health score
- Scan Recent Failures - Identify any new failure patterns
- Monitor Long-Running Executions - Address stuck processes
- Review Performance Trends - Track system performance over time
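The routine above can be partially scripted against the three dashboard endpoints. This sketch only assumes the response shapes shown in the sample responses earlier in this guide:
// One-shot daily health summary for administrators.
async function dailyCheck(): Promise<void> {
  const [overview, failures, executions] = await Promise.all([
    fetch('/api/admin/dashboard/overview').then((r) => r.json()),
    fetch('/api/admin/dashboard/failures?timeframe=24h').then((r) => r.json()),
    fetch('/api/admin/dashboard/executions?status=processing&limit=25').then((r) => r.json()),
  ]);
  console.log('Health score:', overview.dashboard.health.score);
  console.log('Error patterns (24h):', failures.failure_analysis.error_patterns.length);
  console.log('Long-running executions:', executions.executions.long_running.length);
}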
Weekly Analysis Tasks
- Failure Pattern Analysis - Deep dive into recurring issues
- Performance Optimization - Identify bottlenecks and improvements
- Capacity Planning - Monitor resource usage trends
- System Maintenance - Apply necessary updates and fixes
Incident Response Workflow
- Detection: Automated alerts or manual discovery
- Assessment: Use dashboard APIs to understand scope
- Investigation: Identify root cause using error categorization
- Resolution: Apply appropriate troubleshooting steps
- Follow-up: Monitor system recovery and prevent recurrence
Proactive Monitoring
- Set up automated health checks
- Configure alert thresholds appropriate for your usage
- Regularly review and update troubleshooting procedures
- Maintain documentation of common issues and solutions
🔗 API Reference Summary
| Endpoint | Method | Purpose | Key Parameters |
|---|---|---|---|
| /api/admin/dashboard/overview | GET | System health & performance metrics | None |
| /api/admin/dashboard/failures | GET | Detailed failure analysis | timeframe: 24h, 7d, 30d |
| /api/admin/dashboard/executions | GET | Bulk execution monitoring | status: all, completed, failed, processing; limit: 1-100 |
| /api/admin/status/:trackingId | GET | Individual execution details | trackingId: specific execution |
| /api/admin/profiles | GET | Client profile management | page, limit, status, search |
📝 Changelog & Version Information
Version: 2.3.1-enhanced
Last Updated: January 2024
Features Added:
- Comprehensive admin dashboard API system
- Automated error categorization and troubleshooting
- WorkflowLogger integration for step-by-step tracking
- Real-time health monitoring and alerting
- Performance trend analysis and optimization insights
Dependencies:
- Cloudflare Workers Runtime
- D1 Database (STRATEGIC_INTELLIGENCE_DB)
- R2 Storage (STRATIQX_REPORTS)
- Cloudflare AI Services
- WorkflowLogger integration
For technical support or feature requests, please refer to the project repository or contact the development team.