
Your Model Learned the Wrong Thing

Grad-CAM says the model is looking at the defect. But is it looking at the defect, or the lighting around the defect?

A quality inspection model ships with 98% accuracy. It catches scratches, dents, and discoloration with near-perfect precision on the test set. The team celebrates.

Three weeks later, the model starts missing scratches on a new batch of components. Nothing changed in the model. The components are built to the same specification. But the rate of defects slipping past inspection climbed from 2% to 11% overnight.

What happened? The model never learned to detect scratches. It learned to detect the shadow that scratches cast under the specific lighting angle used during training. When the overhead light was replaced during routine maintenance, the shadow pattern changed — and the model lost the feature it was actually using.


The shortcut problem

Neural networks are optimization machines. Given training data, they will find the easiest statistical pattern that separates the classes — and that pattern isn't always the one you intended.

Classic examples: a model that detects pneumonia from chest X-rays actually learned to detect the “portable” label on bedside X-ray machines (sicker patients get bedside scans). A model that classifies wolves vs. huskies learned to detect snow in the background. A defect detection model learned to detect lighting artifacts instead of physical damage.

These aren't bugs in the model. They're rational responses to training data that contained spurious correlations. The model found a shortcut. It works — until the correlation breaks.


Why interpretability is harder than it sounds

Grad-CAM and attention maps are the standard tools for checking what a model is “looking at.” They produce heatmaps overlaid on the input image, highlighting regions that most influenced the prediction.

The problem is that these heatmaps are approximations. They show which regions contributed to the output, not why they contributed. A heatmap highlighting the defect region doesn't prove the model learned “defect.” It might have learned “this region has different pixel statistics” — which could be the defect, or the shadow, or the lighting gradient, or a camera artifact.
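To make that concrete, here is a deliberately tiny toy: a linear scorer in NumPy standing in for a trained classifier (this is not real Grad-CAM, and all the names are illustrative). Its gradient heatmap is identical whether the centre patch is bright because of a scratch or because of a lighting gradient — the map says where the model looked, not which feature it used.

```python
import numpy as np

H, W = 32, 32
weights = np.zeros((H, W))
weights[12:20, 12:20] = 1.0  # the toy model "attends" to the centre patch

def score(image):
    # Toy linear scorer standing in for a trained classifier.
    return float((weights * image).sum())

def saliency(image):
    # d(score)/d(pixel). For a linear model this is just `weights`,
    # no matter what actually put brightness in that region.
    return weights.copy()

scratch = np.zeros((H, W))
scratch[14:18, 14:18] = 1.0                          # a genuine "defect"
shadow = np.tile(np.linspace(0.0, 1.0, W), (H, 1))   # a lighting gradient

same = np.allclose(saliency(scratch), saliency(shadow))
print(same)  # True: identical heatmap for two very different causes
```

Real networks are nonlinear, so their attributions do vary with the input — but the underlying ambiguity is the same: a hot region is consistent with many different learned features.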

For deep networks with millions of parameters, honest interpretability has fundamental limits. We can visualize individual features, attention patterns, and gradient-based attributions. But we cannot fully explain how 25 million parameters interact to produce a specific decision on a specific input.


What Day Two validation looks like

Test with controlled perturbations. Intentionally change the things that shouldn't matter — lighting angle, background color, camera position — and verify the model's predictions are stable. If accuracy drops when you change the lighting, the model learned the lighting, not the defect.
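A minimal sketch of such a perturbation test, using NumPy. The `predict` function here is a hypothetical stand-in for your model — and it deliberately keys on brightness, so it exhibits exactly the lighting shortcut described above:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict(image):
    # Hypothetical stand-in for a trained model. It fires on mean
    # brightness -- the lighting shortcut, not the defect.
    return 1 if image.mean() > 0.5 else 0

def perturbation_stability(image, n_trials=100, max_gain=0.3):
    # Fraction of lighting-perturbed copies whose prediction matches
    # the prediction on the unperturbed image.
    baseline = predict(image)
    stable = 0
    for _ in range(n_trials):
        gain = 1.0 + rng.uniform(-max_gain, max_gain)  # simulated lighting change
        perturbed = np.clip(image * gain, 0.0, 1.0)
        stable += int(predict(perturbed) == baseline)
    return stable / n_trials

part = rng.uniform(0.4, 0.6, size=(64, 64))  # synthetic part near the threshold
print(f"stability under lighting changes: {perturbation_stability(part):.2f}")
```

A stability score well below 1.0 on perturbations that should not matter is the signal: the model's output tracks the lighting, not the defect. In practice you would apply the same check to camera position, background, and any other nuisance factor you can vary.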

Validate on production data, not lab data. Your training set was curated. Production data has variance your lab never replicated. Run your model on two weeks of real production output before trusting its accuracy number.

Rotate your validation set. A static test set tells you how the model performs on that specific data. It tells you nothing about new conditions. Continuously add fresh production samples to your validation pipeline.
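One way to sketch this rotation is a fixed-capacity buffer that keeps only the most recent labelled production samples, so measured accuracy always reflects current conditions. The `RollingValidationSet` class below is illustrative, not a real library API:

```python
from collections import deque

class RollingValidationSet:
    # Keeps only the most recent `capacity` labelled production
    # samples; older samples fall off automatically via deque(maxlen).
    def __init__(self, capacity=1000):
        self._samples = deque(maxlen=capacity)

    def add(self, image, label):
        self._samples.append((image, label))

    def accuracy(self, predict):
        # Accuracy of `predict` over the current window, or None if empty.
        if not self._samples:
            return None
        hits = sum(int(predict(x) == y) for x, y in self._samples)
        return hits / len(self._samples)

val = RollingValidationSet(capacity=500)
for i in range(5):
    val.add(image=i, label=i % 2)          # toy stand-in samples
print(val.accuracy(lambda x: x % 2))       # a perfect predictor scores 1.0
```

Tracking this rolling accuracy over time — rather than a single frozen test-set number — is what turns "98% at ship time" into an ongoing measurement that can catch a lighting change within days instead of weeks.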

Accept the uncertainty. For critical applications, a well-calibrated model that says “I'm 60% confident this is a defect — please have a human check” is more valuable than a model that says “I'm 99% confident” and is wrong about which feature it learned.
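A triage policy built on that idea can be a few lines. The thresholds and label names below are illustrative, and the policy assumes the model's probabilities are calibrated — which is itself something to validate:

```python
def route(prob_defect, auto_band=0.9):
    # Hypothetical triage policy: act automatically only when the
    # calibrated defect probability is decisive; otherwise escalate
    # to a human inspector.
    if prob_defect >= auto_band:
        return "auto-reject"
    if prob_defect <= 1.0 - auto_band:
        return "auto-pass"
    return "human-review"

for p in (0.99, 0.60, 0.05):
    print(p, route(p))
```

With this policy, the "60% confident" prediction from the paragraph above lands in `human-review` — the model's uncertainty becomes a routing signal instead of a silent failure.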

The 98% accuracy was real. The model's understanding of why it was accurate was not. Day Two is when the difference between correlation and causation stops being a philosophy-class distinction and becomes a production incident.
