Class Activation Mapping (CAM) and its extensions have become indispensable tools for visualizing the evidence behind deep network predictions. However, by relying on a final softmax classifier, these methods suffer from two fundamental distortions: additive logit shifts that arbitrarily bias importance scores, and sign collapse that conflates excitatory and inhibitory features.
We propose a simple, architecture-agnostic dual-branch sigmoid head that
decouples localization from classification. Given any pretrained model, we clone
its classification head into a parallel branch ending in per-class sigmoid outputs,
freeze the original softmax head, and fine-tune only the sigmoid branch with
class-balanced binary supervision. At inference, softmax retains recognition accuracy,
while class evidence maps are generated from the sigmoid branch — preserving both magnitude
and sign of feature contributions. Our method integrates seamlessly with most CAM variants
and incurs negligible overhead. Extensive evaluations on fine-grained tasks (CUB-200-2011, Stanford Cars)
and WSOL benchmarks (ImageNet-1K, OpenImages-30K) show improved explanation fidelity and consistent
Top-1 Localization gains — without any drop in classification accuracy.
All CAM variants ultimately form a heatmap by linearly combining feature maps with per-channel weights.
However, when these weights are derived from softmax-based scores, softmax's shift invariance distorts the linear combination in two ways:
(a) Additive Logit Shift. Adding a constant δ to all feature weights leaves the softmax probability yk
unchanged but disproportionately amplifies each feature fi's contribution to the heatmap.
(b) Sign Collapse. Subtracting δ flips formerly positive feature weights to negative without affecting yk,
causing previously highlighted regions to vanish.
In both cases, identical classification outputs produce drastically different localization maps.
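The effect is easy to verify numerically. The NumPy sketch below is purely illustrative (the array names and sizes are ours, not the paper's): shifting every classifier weight by a constant δ leaves the softmax prediction unchanged while the CAM-style heatmap changes, and subtracting δ drives formerly positive weights negative.

import numpy as np

rng = np.random.default_rng(0)
C, D, H, W = 5, 8, 7, 7                # classes, channels, spatial size (arbitrary)
fmaps = rng.normal(size=(D, H, W))     # last-layer feature maps f_i
pooled = fmaps.mean(axis=(1, 2))       # global average pooling
weights = rng.normal(size=(C, D))      # per-channel classifier weights

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cam(w_k):
    # CAM heatmap: linear combination of feature maps with per-channel weights
    return np.tensordot(w_k, fmaps, axes=(0, 0))

logits = weights @ pooled
k = int(np.argmax(logits))

# (a) Additive logit shift: add the same delta to every weight of every class.
delta = 3.0
shifted = weights + delta
print(np.allclose(softmax(logits), softmax(shifted @ pooled)))    # True: prediction unchanged
print(np.abs(cam(shifted[k]) - cam(weights[k])).max() > 0)        # True: heatmap changed

# (b) Sign collapse: subtracting delta flips positive weights negative,
# again without affecting the softmax output.
collapsed = weights - delta
print(np.allclose(softmax(logits), softmax(collapsed @ pooled)))  # True
print((weights[k] > 0).sum(), (collapsed[k] > 0).sum())           # far fewer positive channels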
To eliminate these distortions, we introduce a dual-branch sigmoid head that decouples
localization from classification.
Training. Starting from a pretrained classifier, we copy its head h into a new branch h̃ with
identical architecture but fresh parameters. The sigmoid branch outputs per-class sigmoid scores and is
fine-tuned with class-balanced binary supervision, while the original softmax head and the backbone remain frozen.
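The PyTorch sketch below illustrates this training setup under assumptions of ours: a backbone exposed as model.backbone, a linear head model.fc, one-hot binary targets, and BCE with a per-class pos_weight as one possible realization of class-balanced binary supervision. It is a minimal sketch, not the authors' released code.

import copy
import torch
import torch.nn as nn

def build_dual_branch(model, class_freq):
    # class_freq: float tensor of per-class training-sample counts (assumed available).
    sigmoid_head = copy.deepcopy(model.fc)          # identical architecture ...
    for m in sigmoid_head.modules():                # ... but fresh parameters
        if hasattr(m, "reset_parameters"):
            m.reset_parameters()

    for p in model.parameters():                    # freeze backbone and softmax head
        p.requires_grad = False

    # One way to realize class-balanced binary supervision: BCE with a
    # per-class pos_weight set to the negative/positive sample ratio.
    pos_weight = (class_freq.sum() - class_freq) / class_freq.clamp(min=1)
    criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)
    optimizer = torch.optim.SGD(sigmoid_head.parameters(), lr=1e-3, momentum=0.9)
    return sigmoid_head, criterion, optimizer

def train_step(model, sigmoid_head, criterion, optimizer, images, labels, num_classes):
    with torch.no_grad():                           # frozen feature extractor
        feats = model.backbone(images)              # (B, D, H, W)
        pooled = feats.mean(dim=(2, 3))             # global average pooling
    logits = sigmoid_head(pooled)                   # per-class sigmoid logits
    targets = nn.functional.one_hot(labels, num_classes).float()
    loss = criterion(logits, targets)               # binary one-vs-rest supervision
    optimizer.zero_grad()
    loss.backward()                                 # updates only the sigmoid branch
    optimizer.step()
    return loss.item()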
Inference. After feature extraction, the frozen softmax head predicts the class label k*. In parallel, any CAM variant computes
per-channel importance scores w̃k* for the sigmoid score sk* (from the branch's weights or gradients).
These scores are rectified by clamping negative values to zero and then linearly combined with
the feature maps to produce the final class evidence heatmap M̃k*.
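A minimal PyTorch sketch of this inference path is shown below. It uses Grad-CAM-style gradients of the sigmoid score sk* as one way to obtain the per-channel scores; the module names (model.backbone, model.fc) follow the training sketch above and remain assumptions rather than the authors' implementation.

import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(model, images):
    feats = model.backbone(images)                       # (B, D, H, W)
    probs = model.fc(feats.mean(dim=(2, 3))).softmax(dim=1)
    return feats, probs.argmax(dim=1)                    # frozen softmax picks k*

def class_evidence_map(model, sigmoid_head, images):
    feats, k_star = predict(model, images)
    feats = feats.detach().requires_grad_(True)          # re-attach features to the graph
    scores = sigmoid_head(feats.mean(dim=(2, 3))).sigmoid()  # sigmoid-branch class scores
    s_k = scores.gather(1, k_star.unsqueeze(1)).sum()    # sk* for each image in the batch
    grads = torch.autograd.grad(s_k, feats)[0]           # Grad-CAM-style channel gradients
    w = grads.mean(dim=(2, 3)).clamp(min=0)              # rectified per-channel scores
    cam = (w[:, :, None, None] * feats).sum(dim=1)       # linear combination over channels
    return F.interpolate(cam.detach().unsqueeze(1), size=images.shape[-2:],
                         mode="bilinear", align_corners=False).squeeze(1)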
Fine-grained explanation fidelity on CUB-200-2011 and Stanford Cars. For % Average Drop (lower is better) and % Increase in Confidence (higher is better), improved values are shown in blue and worsened values in red; parentheses indicate the change relative to the baseline.
WSOL results on ImageNet-1K and OpenImages-30K. For each base method we shade the baseline row in gray; “+ Ours” rows report updated scores with their Δ shown in parentheses (blue for gains, red for drops).
Additional qualitative explanation examples on fine-grained datasets: VGG-16 on CUB-200-2011 (top) and ResNet-50 on Stanford Cars (bottom).
Additional qualitative WSOL examples on ImageNet-1K using VGG-16 (top), ResNet-50 (middle), and InceptionV3 (bottom). Predicted bounding boxes are shown in green, and ground-truth boxes in red.
@inproceedings{oh:2025:beyondsoftmax,
title = {Beyond Softmax: Dual-Branch Sigmoid Architecture for Accurate Class Activation Maps},
author = {Oh, Yoojin and Noh, Junhyug},
booktitle = {British Machine Vision Conference (BMVC)},
year = {2025}
}