The AI world has been buzzing about the triumphs of Vision Transformers (ViTs) on all sorts of image tasks. As it turns out, though, these models have literally been caught staring at the wrong things. A recent collaboration between researchers at Meta and INRIA offers a simple yet effective remedy.
What’s the Fuss About?
ViTs produce attention maps that show where the model's focus lies, and these maps are a popular window into what the model has learned. Strangely, the models often fixate on inconsequential patches of background while neglecting the main subject of the image. The researchers illustrated this with attention maps, showing it isn't a quirk of one training recipe but a consistent issue: it shows up in supervised models like DeiT, text-supervised ones like CLIP, and self-supervised ones like DINOv2.
The Root of the Problem
On closer scrutiny, the researchers found that a small fraction of patch tokens (around 2%) had exceptionally high L2 norms. In plain terms, these tokens dominated the attention computation, pulling the model's focus toward otherwise uninformative patches and distorting the resulting attention maps.
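As a toy illustration (not the paper's code), here's one way to flag such high-norm outliers, assuming a `[num_tokens, dim]` NumPy array of patch embeddings; the threshold of 3x the median norm is an illustrative choice, not the paper's criterion:

```python
import numpy as np

def find_outlier_tokens(tokens: np.ndarray, threshold: float = 3.0) -> np.ndarray:
    """Flag patch tokens whose L2 norm is far above the median norm."""
    norms = np.linalg.norm(tokens, axis=1)       # per-token L2 norm
    return norms > threshold * np.median(norms)  # boolean outlier mask

# Simulated embeddings: 100 ordinary tokens plus 2 artificially inflated ones
rng = np.random.default_rng(0)
tokens = rng.normal(size=(102, 64))
tokens[:2] *= 10                                 # blow up two tokens' norms

mask = find_outlier_tokens(tokens)
print(mask.sum())                                # prints 2: the inflated tokens
```

In a real model you'd run this on the output tokens of a trained ViT; here the "outliers" are injected by hand just to show the mechanics.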
The Recycling Theory
The research team believes this odd behavior isn't random. They hypothesize that during training, the models repurpose patches that carry little useful local information, overwriting them with broader, more global image features. This ‘recycling’ strategy may be efficient, but it comes at a cost: unpredictable attention maps and degraded results on dense tasks like image segmentation.
The Ingenious Fix: Registers
To steer ViTs back on track, the researchers propose adding “registers,” dedicated extra tokens, to the input sequence. These act as placeholders for global features, so the model no longer needs to repurpose patch tokens for that job. The result? Attention maps that are far more coherent and, as a bonus, slight performance gains across various benchmarks.
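A minimal NumPy sketch of the idea, with hypothetical dimensions (196 patches of width 768 and 4 registers; a real ViT would also carry a CLS token, and the registers would be learned parameters, not random values):

```python
import numpy as np

NUM_PATCHES, DIM, NUM_REGISTERS = 196, 768, 4  # hypothetical sizes

rng = np.random.default_rng(0)
patches = rng.normal(size=(NUM_PATCHES, DIM))       # patch embeddings for one image
registers = rng.normal(size=(NUM_REGISTERS, DIM))   # learned parameters in a real model

# Before the transformer blocks: append registers to the token sequence,
# giving the model scratch space for global features.
tokens = np.concatenate([patches, registers], axis=0)
print(tokens.shape)  # (200, 768)

# ... transformer blocks would attend over all 200 tokens here ...

# After the blocks: the registers are simply discarded; only the patch
# tokens are used for downstream outputs.
patch_out = tokens[:NUM_PATCHES]
print(patch_out.shape)  # (196, 768)
```

The appeal of the fix is exactly this simplicity: a handful of extra tokens at the input, dropped at the output, with no other architectural change.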
This eye-opening study illuminates two significant points: ViTs can develop unexpected behaviors, like repurposing patches, and simple architectural tweaks can bring about significant improvements. It’s a reminder that peering into the black boxes of neural networks can offer invaluable insights for fine-tuning their performance. In the ever-evolving field of AI, even small changes can make a big splash.
Hope you enjoyed today’s newsletter.
⚡️ Join over 200,000 people using the Superpower ChatGPT extension on Chrome and Firefox.