T-MARS: Improving Visual Representations
by Circumventing Text Feature Learning

Pratyush Maini*1    Sachin Goyal*1    Zachary Lipton1    Zico Kolter1,2    Aditi Raghunathan1
1Carnegie Mellon University          2Bosch Center for AI

TLDR

We propose an algorithm that filters the web datasets used to train CLIP so that the model learns better visual representations, achieving state-of-the-art zero-shot accuracy on vision tasks.


Goal

  • 1. Vision-language models like CLIP are trained on web-crawled image-caption pairs.
  • 2. We aim to filter these web datasets for better visual representation learning and improved zero-shot performance.
  • 3. Filtering out uninformative samples allows compute to be allocated to useful data points.


A Look at the LAION Dataset


  • 1. Our analysis reveals an interesting observation: in image-caption web datasets (such as LAION), a large fraction of images contain text inside them. Often, that text is the only feature correlated with the caption.
  • 2. We aim to remove such images, since they encourage the model to learn optical character recognition rather than better visual features.



Method

  • 1. Text Detection: We perform text detection using off-the-shelf OCR models.
  • 2. Text Masking: We inpaint the pixels where text is detected.
  • 3. Re-scoring and Filtering: Finally, we retain only those images whose masked versions have a high CLIP similarity score with the original caption, i.e., images whose visual features are correlated with the caption. A minimal sketch of the pipeline follows this list.
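Below is a minimal sketch of this three-step pipeline, assuming easyocr for text detection, OpenCV inpainting for masking, and open_clip for re-scoring. The function masked_clip_score and the example threshold are illustrative assumptions, not the paper's exact implementation.

import cv2
import easyocr
import numpy as np
import open_clip
import torch
from PIL import Image

# Off-the-shelf components (illustrative choices):
reader = easyocr.Reader(["en"])
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def masked_clip_score(image_bgr: np.ndarray, caption: str) -> float:
    # 1. Text detection: bounding boxes from an off-the-shelf OCR model.
    detections = reader.readtext(image_bgr, detail=1)
    # 2. Text masking: inpaint every detected text region.
    mask = np.zeros(image_bgr.shape[:2], dtype=np.uint8)
    for box, _text, _conf in detections:
        cv2.fillPoly(mask, [np.array(box, dtype=np.int32)], 255)
    masked = cv2.inpaint(image_bgr, mask, inpaintRadius=3, flags=cv2.INPAINT_TELEA)
    # 3. Re-scoring: CLIP similarity between the masked image and the caption.
    pil = Image.fromarray(cv2.cvtColor(masked, cv2.COLOR_BGR2RGB))
    with torch.no_grad():
        img_feat = model.encode_image(preprocess(pil).unsqueeze(0))
        txt_feat = model.encode_text(tokenizer([caption]))
        img_feat /= img_feat.norm(dim=-1, keepdim=True)
        txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

# Filtering: retain a sample only if the masked image still matches its caption.
# The threshold is a design choice (e.g., keep the top-scoring fraction of the pool).
# keep = masked_clip_score(image, caption) > 0.25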



Results

Logarithmic Scaling Trends

Experiments on data pool sizes ranging from 2M to 64M show that the accuracy gains enjoyed by T-MARS increase linearly as data and compute are scaled exponentially.
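In schematic form (our paraphrase of the observed trend, not an equation fit in the paper), the gain over the baseline behaves roughly as

$$ \Delta_{\text{acc}}(N) \approx \alpha + \beta \log N $$

where $N$ is the pool size and $\alpha, \beta$ are empirical constants, so each doubling of data and compute buys a roughly constant additive gain.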

State-of-the-Art on DataComp

T-MARS outperforms the top-ranked method on the “medium scale” of DataComp (a data filtering benchmark) by a margin of 6.5% on ImageNet and 4.7% on VTAB.

Zero-shot accuracies for various filtering strategies on the small and medium pools of the DataComp benchmark. ∩ denotes the intersection of two filtering strategies. T-MARS outperforms the state-of-the-art on DataComp by a margin of 5% on the medium scale (ImageNet).

The Marginal Utility of Different Data Types

We investigate the marginal utility of various data types in LAION. Images where text is the only predictive feature hurt the model as much as adding mislabeled examples to the dataset. Images with both visual and text features are as useful as those with no text, and should not be removed. A hedged sketch of this taxonomy follows.
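One way to operationalize this taxonomy is to compare each sample's CLIP score before and after text masking, building on the masked_clip_score sketch above; the thresholds below are illustrative assumptions, not values from the paper.

def categorize(orig_score: float, masked_score: float,
               visual_thresh: float = 0.25, drop_thresh: float = 0.05) -> str:
    # Illustrative thresholds (assumptions, not the paper's values).
    has_visual = masked_score > visual_thresh             # caption still matches after masking
    has_text = (orig_score - masked_score) > drop_thresh  # score falls once text is erased
    if has_visual and has_text:
        return "visual + text features"    # as useful as text-free images; keep
    if has_visual:
        return "visual features only"      # keep
    if has_text:
        return "text-only (OCR shortcut)"  # hurts like mislabeled data; drop
    return "no predictive features"        # uncorrelated pair; drop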