You set an alert: "notify me if cycle time exceeds 9 days." Monday, 9.2 days. Tuesday, 9.4. Wednesday, 9.1. Thursday, 9.6. Friday, 9.3.
Five alerts in a week. All expected variance. Zero useful signals.
By month two you stopped reading the emails. By month four you removed yourself from the alert. By month six the shop had a real problem and no one knew for three weeks.
This is the static-threshold trap, and it's why alerting doesn't work at most MSOs. Anomaly detection is the fix—not with hype-grade AI, but with techniques that have existed for decades and just aren't built into CCC.
Here's what actually works, what doesn't, and how to deploy it at MSO scale.
Why Static Thresholds Fail
A static threshold is a single number: "alert if X > 9." It doesn't know:
- Normal variance. If your cycle time bounces between 7 and 10 naturally, a 9.2 is noise.
- Seasonality. Q4 is always slower. Winter storms spike volume. Thresholds don't care.
- Shop context. Shop A's 9-day average is Shop B's crisis.
- Trend direction. A slow climb from 6 to 9 is a problem. A bounce from 8 to 9 isn't.
Static thresholds fire on ordinary variance, which is most of what a KPI does day to day. The signal gets buried and people stop reading.
What Anomaly Detection Actually Does
Anomaly detection compares each new observation to the distribution of recent observations for that specific shop, KPI, and context. If today's value is within the expected range given history, no alert. If it's an outlier, alert.
The techniques range from dead simple to elaborate. For collision KPI work at MSO scale, you do not need elaborate.
Technique 1: Rolling Z-Score
Compute the mean and standard deviation of the KPI over the last 28 days. Flag any observation more than 2.5 or 3 standard deviations from the mean.
Dead simple. Runs in SQL. Works for cycle time, severity, supplement rate, delivery counts, and most collision KPIs.
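A minimal sketch of the rolling Z-score in Python with pandas; the table layout (shop_id, date, value columns) is an assumption, and the same logic maps directly to SQL window functions:

```python
import pandas as pd

def rolling_zscore_flags(df: pd.DataFrame, window: int = 28, threshold: float = 3.0) -> pd.DataFrame:
    """Flag values more than `threshold` standard deviations from the trailing mean.

    Expects one row per shop per day with columns shop_id, date, value
    (illustrative names, not a fixed schema).
    """
    df = df.sort_values(["shop_id", "date"]).copy()
    grouped = df.groupby("shop_id")["value"]
    # shift(1) keeps today's value out of its own baseline.
    df["roll_mean"] = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=14).mean())
    df["roll_std"] = grouped.transform(lambda s: s.shift(1).rolling(window, min_periods=14).std())
    df["zscore"] = (df["value"] - df["roll_mean"]) / df["roll_std"]
    df["is_anomaly"] = df["zscore"].abs() > threshold
    return df
```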
Limitations: assumes the KPI is roughly normally distributed and stable. Breaks on seasonality.
Technique 2: Rolling Z-Score With Seasonal Adjustment
Same as above, but compare against the same day-of-week or same-week-of-year history. Handles "Mondays are always slow" or "December always spikes."
Still runs in SQL with a slightly fancier query. Handles most collision seasonality.
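The day-of-week variant is mostly a change to the grouping; a sketch under the same assumed columns:

```python
import pandas as pd

def seasonal_zscore_flags(df: pd.DataFrame, lookback_weeks: int = 12, threshold: float = 3.0) -> pd.DataFrame:
    """Compare each value only against the same weekday's recent history,
    so 'Mondays are always slow' stops generating alerts."""
    df = df.sort_values(["shop_id", "date"]).copy()
    df["dow"] = pd.to_datetime(df["date"]).dt.dayofweek
    grouped = df.groupby(["shop_id", "dow"])["value"]
    # Each group holds one weekday for one shop, so a 12-row window is roughly 12 weeks.
    df["roll_mean"] = grouped.transform(lambda s: s.shift(1).rolling(lookback_weeks, min_periods=4).mean())
    df["roll_std"] = grouped.transform(lambda s: s.shift(1).rolling(lookback_weeks, min_periods=4).std())
    df["zscore"] = (df["value"] - df["roll_mean"]) / df["roll_std"]
    df["is_anomaly"] = df["zscore"].abs() > threshold
    return df
```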
Technique 3: Prophet or ARIMA
Time-series models that decompose trend, seasonality, and noise. Forecast an expected range for today. Flag observations outside the range.
Real model, real library (Facebook's Prophet, statsmodels). Slight operational overhead but not hard. Handles fancier patterns—multiple seasonalities, trend changes.
Use this when rolling Z-score stops giving you precision.
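A sketch of the Prophet version; `ds` and `y` are the column names the library expects, everything else here is an assumption:

```python
import pandas as pd
from prophet import Prophet  # pip install prophet

def is_today_anomalous(history: pd.DataFrame, today: str, today_value: float) -> bool:
    """Fit trend + seasonality on history (columns: ds, y), forecast an expected
    range for today, and flag the observation if it falls outside that range."""
    model = Prophet(weekly_seasonality=True, yearly_seasonality=True,
                    interval_width=0.99)  # wide interval means fewer, higher-confidence alerts
    model.fit(history)
    forecast = model.predict(pd.DataFrame({"ds": [pd.Timestamp(today)]}))
    lower = forecast["yhat_lower"].iloc[0]
    upper = forecast["yhat_upper"].iloc[0]
    return not (lower <= today_value <= upper)
```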
Technique 4: Multivariate Models
Some problems only show up when multiple KPIs move together. Severity and supplement rate spiking at the same time is a different signal from either spiking alone. Models like isolation forests can catch these joint patterns.
Real AI territory. Worth it for mature deployments. Overkill for a first pass.
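For a sense of what the multivariate version looks like, a sketch with scikit-learn's IsolationForest over daily per-shop KPI vectors (feature names are illustrative):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

def multivariate_flags(df: pd.DataFrame,
                       features=("cycle_time", "severity", "supplement_rate")) -> pd.DataFrame:
    """Flag days where the joint KPI pattern is unusual even if no single KPI is."""
    df = df.dropna(subset=list(features)).copy()
    model = IsolationForest(contamination=0.01, random_state=42)  # expect ~1% of days to be outliers
    df["is_anomaly"] = model.fit_predict(df[list(features)]) == -1  # -1 marks outliers
    return df
```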
What to Alert On
Not every KPI deserves alerting. A rough framework:
| Priority | KPI examples | Why |
|---|---|---|
| Alert-worthy | Cycle time, WIP aging, supplement rate, delivery count, severity | Direct operational impact, high variance = real problem |
| Alert-worthy | Carrier-specific performance, tech productivity anomalies, CSI drops | Actionable at GM level |
| Watch only | Gross profit percentage, parts margin, labor rate | Correlated with many things; hard to action from an alert alone |
| Rarely alert | Administrative metrics | These don't move fast and don't need same-day response |
Start with 5–8 alert-worthy metrics per shop. Expand only when the team is handling the alerts well.
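One way to pin that down is a small per-shop config the nightly job reads; the structure below is hypothetical, not a fixed schema:

```python
# Hypothetical starting config: 5-8 alert-worthy KPIs per shop, tuned over time.
ALERT_CONFIG = {
    "shop_a": {
        "kpis": ["cycle_time", "wip_aging", "supplement_rate", "delivery_count", "severity"],
        "zscore_threshold": 3.0,
    },
    "shop_b": {
        "kpis": ["cycle_time", "wip_aging", "supplement_rate", "delivery_count", "severity", "csi"],
        "zscore_threshold": 2.5,  # tighter for a shop with a steadier baseline
    },
}
```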
Deployment That Actually Works
Run Nightly, Not Real-Time
For collision KPIs, the business doesn't need minute-by-minute anomaly detection. Nightly at 5 a.m., before the shop opens, is plenty. This simplifies everything—cost, reliability, debugging.
One Alert Per Anomaly, Not Per Metric
If a shop's cycle time spiked because a tech walked out and it's cascading into WIP aging and delivery counts, don't send three alerts. Send one: "Shop X has cascading anomalies; likely root cause: reduced capacity."
This requires joining anomaly outputs and narrating them—the same LLM layer that produces ops briefs works here.
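A sketch of the grouping step, assuming each detector emits simple dicts (shop_id, kpi, detail); the narration sits on top of the grouped bundle:

```python
from collections import defaultdict

def group_alerts(anomalies: list[dict]) -> list[dict]:
    """Collapse per-metric anomalies into one bundle per shop.

    Assumes items like {"shop_id": "shop_x", "kpi": "cycle_time", "detail": "7.1 -> 9.8 days"}.
    """
    by_shop = defaultdict(list)
    for a in anomalies:
        by_shop[a["shop_id"]].append(a)
    bundles = []
    for shop_id, items in by_shop.items():
        bundles.append({
            "shop_id": shop_id,
            "kpis": [i["kpi"] for i in items],
            "details": [i["detail"] for i in items],
            # One narrated message per bundle, not one ping per metric.
        })
    return bundles
```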
Include Context
An alert that says "cycle time anomaly at Shop X" is useless. An alert that says "cycle time spiked from 7.1 to 9.8 days week-over-week. Tech absenteeism up 40%. Three ROs aged past 14 days: #12431, #12445, #12457" is actionable.
Give the alert everything the GM needs to act without opening a dashboard.
Feedback Loop
When an alert fires, log whether it was useful. Over time, tune thresholds by KPI. Suppress the ones that produce noise. Promote the ones that reliably fire on real problems.
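A minimal version of that log, assuming a flat CSV (a warehouse table works the same way):

```python
import csv
from datetime import date

def log_alert_outcome(path: str, shop_id: str, kpi: str, useful: bool, note: str = "") -> None:
    """Append one row per fired alert; review monthly to tune thresholds by KPI."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([date.today().isoformat(), shop_id, kpi, int(useful), note])
```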
Delivery Channel
Email, Teams, Slack—pick the channel the GM already reads. Do not build a separate alerts portal. It won't get opened.
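Delivery can be as plain as an incoming-webhook POST; the sketch below assumes a Slack webhook URL held in an environment variable, and Teams works the same way with its own webhook:

```python
import os
import requests

def send_alert(text: str) -> None:
    """Push the narrated alert into the channel the GM already reads."""
    webhook = os.environ["SLACK_WEBHOOK_URL"]  # assumed configuration, not a fixed name
    resp = requests.post(webhook, json={"text": text}, timeout=10)
    resp.raise_for_status()
```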
Why Most MSOs Don't Do This
Three reasons:
- Data isn't in a warehouse. Can't run a rolling Z-score on CCC reports you're pulling manually.
- No one owns it. Ops doesn't know the stats; data doesn't know the ops. Anomaly detection sits in the gap.
- False-start fatigue. The team tried alerts with static thresholds, got buried, and wrote off alerting generally.
All three are solvable. The warehouse work unlocks the technique. The ownership gap is a people problem. The false-start fatigue takes one well-deployed round of sane alerts to reverse.
The Payoff
A well-deployed anomaly detection layer catches problems 3–10 days earlier than month-end review or human pattern-matching. At a 20-shop MSO, that's several problems per month caught before they propagate.
The ROI isn't in the AI. It's in the reduced time from problem-emerging to problem-acknowledged.