Phase 5: Monitoring & Observability
In this phase, we learn that a model’s journey is not over once it is deployed. We must detect and fix “model decay”: the gradual loss of predictive quality that sets in after deployment.
🟢 Level 1: Standard Software Metrics
The basics of system health.
- Latency: How long does a prediction take?
- Throughput: Requests per second.
- Error Rate: Fraction of requests returning HTTP 4xx or 5xx responses.
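The three metrics above can be tracked together over a rolling window of recent requests. This is a minimal sketch (the class name, the 1000-request window, and the nearest-rank p95 indexing are illustrative choices, not a specific library's API):

```python
from collections import deque

class ServiceMetrics:
    """Rolling window of recent request outcomes for basic health metrics."""

    def __init__(self, window_size=1000):
        # Each entry is (latency_seconds, http_status_code).
        self.requests = deque(maxlen=window_size)

    def record(self, latency_s, status_code):
        self.requests.append((latency_s, status_code))

    def p95_latency(self):
        """95th-percentile latency (nearest-rank) over the window."""
        latencies = sorted(lat for lat, _ in self.requests)
        return latencies[int(0.95 * (len(latencies) - 1))]

    def error_rate(self):
        """Fraction of requests in the window with a 4xx/5xx status."""
        errors = sum(1 for _, status in self.requests if status >= 400)
        return errors / len(self.requests)

metrics = ServiceMetrics()
for i in range(100):
    status = 500 if i % 20 == 0 else 200        # synthetic 5% error rate
    metrics.record(0.050 + 0.001 * i, status)    # synthetic rising latency
```

In production you would export these as gauges/counters to a metrics backend rather than computing them in-process, but the quantities being measured are the same.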
🟡 Level 2: Data & Concept Drift
The challenge unique to ML: models fail not because the code changes, but because the world changes.
1. Data Drift (Feature Drift)
The distribution of input data changes.
- Example: You trained on young users, but your production users are older.
2. Concept Drift
The relationship between input and output changes.
- Example: A “Luxury” brand in 2010 might not be considered “Luxury” in 2024.
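Data drift like the age example above is commonly detected by comparing the training distribution of a feature against its live distribution, e.g. with a two-sample Kolmogorov–Smirnov test. Below is a minimal pure-Python sketch of the KS statistic (maximum gap between the two empirical CDFs); the 0.1 decision threshold and the synthetic age data are assumptions for illustration:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max vertical gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for v in sorted(set(a) | set(b)):
        cdf_a = bisect.bisect_right(a, v) / len(a)
        cdf_b = bisect.bisect_right(b, v) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Ages seen at training time vs. in production (illustrative numbers).
train_ages = [20 + (i % 15) for i in range(300)]  # mostly 20-34
prod_ages = [35 + (i % 20) for i in range(300)]   # mostly 35-54

drift_score = ks_statistic(train_ages, prod_ages)
drift_detected = drift_score > 0.1  # threshold is an assumption
```

Libraries like Evidently AI wrap this kind of per-feature statistical test (and report a p-value rather than a raw threshold on the statistic), but the underlying comparison is the same.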
🔴 Level 3: Closing the Feedback Loop
3. Ground Truth & Accuracy
How do we know whether a prediction was right? It depends on when the true outcome (the ground truth) becomes available.
- Immediate: The user clicks (or ignores) the recommendation within seconds.
- Delayed: The user repays (or defaults on) their loan 6 months later.
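For delayed ground truth, the usual pattern is to log predictions at serving time and join them against outcomes as they arrive, computing accuracy only over the predictions that have resolved. A minimal sketch (the dict-based log and the `delayed_accuracy` helper are illustrative, not a specific tool's API):

```python
from datetime import date

# Predictions logged at serving time, keyed by prediction id.
predictions = {
    "p1": {"predicted": "will_repay", "made_on": date(2024, 1, 10)},
    "p2": {"predicted": "will_repay", "made_on": date(2024, 1, 11)},
    "p3": {"predicted": "will_default", "made_on": date(2024, 1, 12)},
}

# Ground truth that arrives months later (e.g. actual loan outcomes).
ground_truth = {"p1": "will_repay", "p2": "will_default"}  # p3 still pending

def delayed_accuracy(predictions, ground_truth):
    """Accuracy over predictions whose true outcome is already known."""
    resolved = [(pid, p["predicted"]) for pid, p in predictions.items()
                if pid in ground_truth]
    if not resolved:
        return None  # nothing has resolved yet; accuracy is undefined
    correct = sum(1 for pid, pred in resolved if pred == ground_truth[pid])
    return correct / len(resolved)

acc = delayed_accuracy(predictions, ground_truth)
```

One design consequence: accuracy dashboards for delayed-feedback models always lag reality, which is exactly why the drift signals in Level 2 matter as an early warning.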
4. Alerting Strategies
Use Evidently AI or Deepchecks to monitor drift.
- Rule: If the KS test flags data drift on a key feature, trigger a Slack alert and an automatic retraining job.
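The alert-and-retrain rule can be sketched as a small handler. Everything here is hypothetical glue code: `send_slack_alert` and `trigger_retrain_job` stand in for your real webhook and orchestration integrations, and the 0.05 significance level is an assumed convention, not a mandated value:

```python
DRIFT_P_VALUE_THRESHOLD = 0.05  # assumed KS-test significance level

def send_slack_alert(message):
    """Placeholder for a real Slack webhook call."""
    print(f"[slack] {message}")

def trigger_retrain_job(feature):
    """Placeholder for kicking off a retraining pipeline run."""
    print(f"[pipeline] retraining triggered by drift in '{feature}'")
    return {"feature": feature, "status": "queued"}

def handle_drift_report(feature, ks_p_value):
    """Apply the rule: drift detected -> Slack alert + automatic retrain."""
    if ks_p_value < DRIFT_P_VALUE_THRESHOLD:
        send_slack_alert(f"Data drift on '{feature}' (p={ks_p_value:.4f})")
        return trigger_retrain_job(feature)
    return None  # no drift: do nothing

job = handle_drift_report("user_age", ks_p_value=0.001)
```

A common refinement is to require drift on several consecutive monitoring windows before retraining, so a single noisy batch does not trigger an expensive pipeline run.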