Large vision models may seem attractive, but domain-specific models can get you farther.

It may seem like AI is at the peak of its hype cycle, but some application areas are just getting started. Large language models (LLMs) stole our attention a little over a year ago, but the enabling technology has been incubating for years. Now, the lessons we’ve learned from LLMs are trickling into other areas, leaving them well-poised for their own advancements.

Computer vision is one such area. Just as foundation models like GPT set the stage for chatbots and various other language applications, image-based foundation models are enabling a revolution in advanced image analysis, from personalized medicine to precision agriculture to industrial automation.

While early LLMs had fewer than 1 billion parameters, today’s largest models, such as GPT-4, are reported to exceed one trillion. The largest computer vision models, like DINOv2 and Segment Anything, top out around 1 billion parameters. They’re not yet as large as LLMs, but they’re heading in that direction.

Training such a large model requires an enormous amount of data; DINOv2, for example, was trained on 142 million images. Thanks to advances in self-supervised learning, that data doesn’t even need to be labeled: massive amounts of unlabeled imagery are enough to learn meaningful patterns.
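To make that concrete, here is a minimal sketch of self-supervised pretraining with a SimCLR-style contrastive objective in PyTorch. This is not DINOv2’s actual recipe; the backbone, augmentations, and loss are simplified stand-ins, but it shows why no labels are needed anywhere in the pipeline.

```python
import torch
import torch.nn.functional as F
from torchvision import models
from torchvision.transforms import v2

# Random augmentations: two views of the same image should map to similar
# features. (Real pipelines augment each image independently.)
augment = v2.Compose([
    v2.RandomResizedCrop(224, antialias=True),
    v2.RandomHorizontalFlip(),
    v2.ColorJitter(0.4, 0.4, 0.4),
])

encoder = models.resnet50(weights=None)  # trained from scratch, no labels
encoder.fc = torch.nn.Identity()         # expose the 2048-dim features

def contrastive_loss(z1, z2, temperature=0.1):
    # Matching views of the same image sit on the diagonal of the
    # similarity matrix; treat each row as a classification problem.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(len(z1))
    return F.cross_entropy(logits, targets)

images = torch.rand(8, 3, 256, 256)  # stand-in for an unlabeled batch
loss = contrastive_loss(encoder(augment(images)), encoder(augment(images)))
loss.backward()  # note: no labels appear anywhere above
```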

For general-purpose applications, large training sets and large models are paving the way for new capabilities. They can be readily adapted for classification, detection, or segmentation tasks across many different types of imagery.

In many ways, bigger is better.

The Problem with Large Models

The problem comes when you take a massive general-purpose model and apply it to data that looks different from what it was trained on: images that contain different patterns. Instead of faces, buildings, and street signs, perhaps it’s roads and trees viewed from a drone or satellite. Or cells and glands imaged through a microscope. Or parts on a manufacturing line.

To apply an existing foundation model to one of these domains, you adapt it to a particular task, perhaps distinguishing tumor from benign tissue. Given a few thousand examples of each class, the weights of the large foundation model can be adjusted, requiring significantly less data than learning the task from scratch would. This process of adapting the model is called finetuning.
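As an illustration, here is a minimal PyTorch finetuning sketch. The two-class tumor/benign head, the choice of backbone, and the learning rates are illustrative assumptions, not a prescribed recipe.

```python
import torch
from torchvision import models

# Start from a general-purpose pretrained backbone.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # tumor vs. benign head

# Common trick: let the fresh head learn faster than the pretrained weights.
optimizer = torch.optim.AdamW([
    {"params": model.fc.parameters(), "lr": 1e-3},
    {"params": [p for n, p in model.named_parameters()
                if not n.startswith("fc.")], "lr": 1e-5},
])
criterion = torch.nn.CrossEntropyLoss()

def finetune_step(images, labels):
    # One gradient step on a small labeled batch.
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```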

When you finetune a general-purpose vision model on imagery similar to its training data, it converges quickly to a good model for your downstream task. But on dissimilar imagery, your model is much more likely to overfit: it will perform well on your training set but make mistakes on unseen images.
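One way to see this in practice is to track accuracy on a held-out set alongside training accuracy. A quick sketch, assuming `model`, `train_loader`, and `val_loader` come from your own setup:

```python
import torch

@torch.no_grad()
def accuracy(model, loader):
    model.eval()
    correct = total = 0
    for images, labels in loader:
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total

train_acc = accuracy(model, train_loader)
val_acc = accuracy(model, val_loader)
# A high train_acc paired with a much lower val_acc is the overfitting
# signature: the model memorized patterns that don't generalize.
print(f"train {train_acc:.3f} vs. val {val_acc:.3f}")
```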

This is because the large foundation model looks for many different patterns in the images, and some of them may happen to correlate with the downstream task on the small training set. But those same patterns do not hold on unseen data. They are just spurious correlations.

This is much more likely to happen with a large model trained on disparate imagery.

Small Vision Models to the Rescue

How do you solve this? You need to build a model that learns the patterns in your unique imagery: the patterns that are meaningful for downstream tasks on that same modality of images.

You likely don’t have a massive number of images available, so you can’t build a large vision model. But you can build a perfectly good small- or medium-sized vision model.

This domain-specific foundation model will be suitable for various downstream tasks on your imagery with just a little finetuning. It won’t be very helpful for other types of images – but you don’t need it to be.
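For example, with a frozen domain-specific backbone, many downstream tasks reduce to training a small head on top of it. In the sketch below, `domain_encoder`, `feature_dim`, and `num_classes` are placeholders for your own pretrained model and task.

```python
import torch

# Freeze the domain-pretrained backbone; its features already encode
# the patterns that matter for this modality.
for p in domain_encoder.parameters():
    p.requires_grad = False

probe = torch.nn.Linear(feature_dim, num_classes)  # tiny task-specific head
optimizer = torch.optim.AdamW(probe.parameters(), lr=1e-3)
criterion = torch.nn.CrossEntropyLoss()

def probe_step(images, labels):
    with torch.no_grad():                  # backbone stays fixed
        feats = domain_encoder(images)
    optimizer.zero_grad()
    loss = criterion(probe(feats), labels)
    loss.backward()
    optimizer.step()                       # only the head is updated
    return loss.item()
```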

Size does matter, but bigger isn’t necessarily better. For niche applications, adapt your model to the data you have available. A smaller, focused model will get you much farther than a large, clunky one that looks for the wrong patterns.


Does your organization have an image dataset from a unique modality? Start your journey to a domain-specific foundation model with a Foundation Model Assessment. Get a clear perspective on the ROI for your proprietary image data before wasting months of experimentation on the wrong path.