Phase 5: Monitoring & Observability
In this phase, we learn that a model’s journey is not over once it is deployed. We must detect and fix “model decay”: the gradual loss of predictive quality that sets in after deployment.
🟢 Level 1: Standard Software Metrics
The basics of system health.
- Latency: How long does a prediction take?
- Throughput: Requests per second.
- Error Rate: Fraction of requests returning HTTP 4xx or 5xx responses.
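The three metrics above can be tracked together over a rolling window of recent requests. This is a minimal sketch (the class name, the 1000-request window, and the nearest-rank p95 indexing are illustrative choices, not a specific library's API):

```python
from collections import deque

class ServiceMetrics:
    """Rolling window of recent request outcomes for basic health metrics."""

    def __init__(self, window_size=1000):
        # Each entry is (latency_seconds, http_status_code).
        self.requests = deque(maxlen=window_size)

    def record(self, latency_s, status_code):
        self.requests.append((latency_s, status_code))

    def p95_latency(self):
        """95th-percentile latency (nearest-rank) over the window."""
        latencies = sorted(lat for lat, _ in self.requests)
        return latencies[int(0.95 * (len(latencies) - 1))]

    def error_rate(self):
        """Fraction of requests in the window with a 4xx/5xx status."""
        errors = sum(1 for _, status in self.requests if status >= 400)
        return errors / len(self.requests)

metrics = ServiceMetrics()
for i in range(100):
    status = 500 if i % 20 == 0 else 200        # synthetic 5% error rate
    metrics.record(0.050 + 0.001 * i, status)    # synthetic rising latency
```

In production you would export these as gauges/counters to a metrics backend rather than computing them in-process, but the quantities being measured are the same.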
🟡 Level 2: Data & Concept Drift
The challenge unique to ML: models fail not because the code changes, but because the world changes.
1. Data Drift (Feature Drift)
The distribution of input data changes.
- Example: You trained on young users, but your production users are older.
2. Concept Drift
The relationship between input and output changes.
- Example: A “Luxury” brand in 2010 might not be considered “Luxury” in 2024.
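Data drift like the age example above is commonly detected by comparing the training distribution of a feature against its live distribution, e.g. with a two-sample Kolmogorov–Smirnov test. Below is a minimal pure-Python sketch of the KS statistic (maximum gap between the two empirical CDFs); the 0.1 decision threshold and the synthetic age data are assumptions for illustration:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max vertical gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Ages seen at training time vs. in production (illustrative numbers).
train_ages = [20 + (i % 15) for i in range(300)]  # mostly 20-34
prod_ages = [35 + (i % 20) for i in range(300)]   # mostly 35-54

drift_score = ks_statistic(train_ages, prod_ages)
drift_detected = drift_score > 0.1  # threshold is an assumption
```

Libraries like Evidently AI wrap this kind of per-feature statistical test (and report a p-value rather than a raw threshold on the statistic), but the underlying comparison is the same.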
🔴 Level 3: Closing the Feedback Loop
3. Ground Truth & Accuracy
How do we know whether a prediction was right? It depends on when the true outcome (the ground truth) becomes available.
- Immediate: The user clicks (or ignores) the recommendation within seconds.
- Delayed: The user repays (or defaults on) their loan 6 months later.
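For delayed ground truth, the usual pattern is to log predictions at serving time and join them against outcomes as they arrive, computing accuracy only over the predictions that have resolved. A minimal sketch (the dict-based log and the `delayed_accuracy` helper are illustrative, not a specific tool's API):

```python
from datetime import date

# Predictions logged at serving time, keyed by prediction id.
predictions = {
    "p1": {"predicted": "will_repay", "made_on": date(2024, 1, 10)},
    "p2": {"predicted": "will_repay", "made_on": date(2024, 1, 11)},
    "p3": {"predicted": "will_default", "made_on": date(2024, 1, 12)},
}

# Ground truth that arrives months later (e.g. actual loan outcomes).
ground_truth = {"p1": "will_repay", "p2": "will_default"}  # p3 still pending

def delayed_accuracy(predictions, ground_truth):
    """Accuracy over predictions whose true outcome is already known."""
    resolved = [(pid, p["predicted"]) for pid, p in predictions.items()
                if pid in ground_truth]
    if not resolved:
        return None  # nothing has resolved yet; accuracy is undefined
    correct = sum(1 for pid, pred in resolved if pred == ground_truth[pid])
    return correct / len(resolved)

acc = delayed_accuracy(predictions, ground_truth)
```

One design consequence: accuracy dashboards for delayed-feedback models always lag reality, which is exactly why the drift signals in Level 2 matter as an early warning.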
4. Alerting Strategies
Use Evidently AI or Deepchecks to monitor drift.
- Rule: If the KS test flags data drift on a key feature, trigger a Slack alert and an automatic retraining job.
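The alert-and-retrain rule can be sketched as a small handler. Everything here is hypothetical glue code: `send_slack_alert` and `trigger_retrain_job` stand in for your real webhook and orchestration integrations, and the 0.05 significance level is an assumed convention, not a mandated value:

```python
DRIFT_P_VALUE_THRESHOLD = 0.05  # assumed KS-test significance level

def send_slack_alert(message):
    """Placeholder for a real Slack webhook call."""
    print(f"[slack] {message}")

def trigger_retrain_job(feature):
    """Placeholder for kicking off a retraining pipeline run."""
    print(f"[pipeline] retraining triggered by drift in '{feature}'")
    return {"feature": feature, "status": "queued"}

def handle_drift_report(feature, ks_p_value):
    """Apply the rule: drift detected -> Slack alert + automatic retrain."""
    if ks_p_value < DRIFT_P_VALUE_THRESHOLD:
        send_slack_alert(f"Data drift on '{feature}' (p={ks_p_value:.4f})")
        return trigger_retrain_job(feature)
    return None  # no drift: do nothing

job = handle_drift_report("user_age", ks_p_value=0.001)
```

A common refinement is to require drift on several consecutive monitoring windows before retraining, so a single noisy batch does not trigger an expensive pipeline run.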