Artificial Intelligence (AI) is revolutionizing computer vision, transforming it from a basic tool of perception into a dynamic engine of visual understanding. With unprecedented precision in object recognition and contextual awareness, AI is unlocking new dimensions of insight from visual data. This leap is propelling the next wave of innovation across industries—from autonomous vehicles that navigate with human-like intuition, to intelligent factories, diagnostic healthcare systems, and the smart cities of tomorrow.
Computer vision made significant strides in its ambitious journey to emulate human perception during the last couple of decades of the 20th century, relying largely on advanced image-processing filters. Since the early 2010s, however, convolutional neural networks (CNNs) have been replacing legacy computer vision algorithms for detection, classification, and segmentation. Vision transformers, known as ViTs, are now quickly making an impact by offering even greater accuracy in some use cases.
ViTs have their roots in the popular transformer architecture (e.g., BERT) that has revolutionized the field of natural language processing (NLP). Vision transformers take a novel approach to tackling computer vision tasks by reusing key NLP concepts such as self-attention and positional encoding. While ViTs usually work better when pretrained on large amounts of data, they can subsequently be fine-tuned with more efficient methods, making them feasible for compute-constrained edge solutions. These qualities generally make them appealing to edge developers.
With these advantages, we think ViT models represent a foundational technology shift that will be widely used for a range of video analytics tasks in the years ahead.
What are vision transformers?
ViTs are AI models that employ a transformer architecture to accomplish popular vision analytics tasks such as classification, detection, and segmentation. ViT models split images into a series of patches, convert each patch into an embedding that carries positional information, and then feed the resulting sequence into a transformer. One could think of this embedding vector as the conceptual equivalent of the manually curated feature vector used by traditional computer vision filters. Just as NLP models learn semantic relations between the words in a prompt using self-attention mechanisms, ViTs train by exploring the semantic relationships between image patches.
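To make the idea concrete, here is a minimal PyTorch sketch of the patch-embedding step, assuming a 224×224 input, 16×16 patches, and a 768-dimensional embedding. The module name and sizes are illustrative assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches and project each patch to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is a common way to cut the image into patches
        # and linearly project each patch in a single step.
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        # Learnable positional embeddings preserve each patch's location in the image.
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, embed_dim))

    def forward(self, x):                    # x: (batch, 3, 224, 224)
        x = self.proj(x)                     # (batch, embed_dim, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (batch, 196, embed_dim)
        return x + self.pos_embed            # patch tokens with position information

tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```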
CNNs are built on the principle of local connectivity and spatial hierarchy. They treat images as structured grids of pixels and use convolutional layers to extract localized features through sliding filters, gradually building up to a global understanding of the image via stacked layers and pooling operations.
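For comparison, here is an equally minimal CNN sketch in PyTorch (layer counts and sizes are illustrative): small filters slide over the image, and pooling gradually widens the view until a global prediction is made.

```python
import torch
import torch.nn as nn

# A tiny CNN classifier: stacked convolutions extract local features,
# while pooling layers build toward a global representation.
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # local 3x3 filters
    nn.ReLU(),
    nn.MaxPool2d(2),                              # downsample, widen the receptive field
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # global pooling only at the very end
    nn.Flatten(),
    nn.Linear(32, 10),                            # e.g., 10 object classes
)

logits = cnn(torch.randn(1, 3, 224, 224))
print(logits.shape)  # torch.Size([1, 10])
```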
Training a convolutional neural network to recognize objects in an image is like teaching a child to identify different types of vehicles. You initially show the child pictures of cars, trucks, bicycles, and motorcycles, explaining the distinguishing features of each. As the child sees more examples and gets more descriptions, they start to recognize these vehicles on their own. Over time, the child becomes adept at identifying vehicles based on their unique shapes, sizes, and other defining characteristics.
Similarly, in training a CNN, you present it with a data set containing images of various objects—cars, animals, household items, etc. The neural network learns to differentiate these objects by recognizing distinctive visual features, just as the child learns to distinguish vehicles based on specific attributes. With continuous exposure and adjustments to its parameters, the network becomes proficient at recognizing and categorizing different objects in images.
In contrast, ViTs discard convolutions entirely. Instead, they treat images as sequences of fixed-size patches—much like words in a sentence—and process them using self-attention mechanisms. This allows ViTs to model long-range dependencies and global context from the very beginning, without relying on spatial locality. The result is a more flexible and scalable architecture that aligns closely with the design of transformer models used in natural language processing.
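Below is a small, illustrative sketch of that global self-attention step in PyTorch, using randomly generated patch tokens with the same shapes as the patch-embedding example above; the head count and dimensions are assumptions.

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196
tokens = torch.randn(1, num_patches, embed_dim)   # stand-in for patch embeddings

# Multi-head self-attention: every patch token attends to every other patch,
# so global context is available from the very first layer.
attn = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)
out, weights = attn(tokens, tokens, tokens)

print(out.shape)      # torch.Size([1, 196, 768]) -- updated patch representations
print(weights.shape)  # torch.Size([1, 196, 196]) -- attention from each patch to all patches
```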
[Figure: Visual representation of a transformer encoder]
A key contrast between CNNs and ViTs is that the former rely on “local connectivity,” building up to global understanding through hierarchical generalization, while the latter use self-attention, a “global” approach that takes in information from the entire image at once. This helps ViTs semantically connect details located “far away” from each other in an image, something CNNs typically fail to do because of their “local” approach.
Imagine an image as a puzzle; each piece holds a fragment of the larger picture. But rather than manually assembling the puzzle piece by piece and row by row, you have a team of experts who specialize in recognizing patterns and similarities among the pieces. Like ViTs, these experts analyze each piece, identify its context and connection to other pieces, and work together to arrange them to form and comprehend the entire image. The fact that ViTs take in the entire image at once, as opposed to CNNs’ sequential sliding-window approach, also makes them better suited to parallel processing. As a result, ViTs generally run more efficiently than CNNs on parallel computer architectures such as graphics processing units (GPUs).
[Figure: A machine assembling a puzzle of a highway]
Why ViTs can accurately identify and analyze images
The global approach used by transformers lets ViT models semantically relate distant image details to each other, which tends to make them more accurate than convolutional neural networks when analyzing images. This is especially true in applications where global dependencies and contextual understanding of images are vital.
You can compare training a vision transformer to teaching a new conductor to lead an orchestra. The conductor needs to understand the nuances of each instrument, or visual element, and how they contribute to the symphony as a whole. During rehearsals, or training, the conductor gradually learns how each instrument contributes to the music, just as a vision transformer learns to interpret different image patches and starts to recognize patterns and relationships between them.
As the conductor and orchestra members rehearse together and trade feedback, the conductor gains a greater understanding of how to make adjustments and interpret the music as well as possible. Similarly, the vision transformer refines its attention mechanisms to improve its understanding of images.
Additionally, because the core of ViTs is the transformer architecture, many techniques developed for transformers can be applied directly to improve ViT models. One such technique is parameter-efficient fine-tuning (PEFT), a popular way to transfer-learn (that is, fine-tune) transformers on new datasets. Using a PEFT technique like low-rank adaptation (LoRA), one can fine-tune a transformer model with a remarkably small training dataset and a less complex training setup, as sketched below.
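The following example is a minimal sketch of applying LoRA to a pretrained ViT classifier using the Hugging Face transformers and peft libraries; the checkpoint name, target modules, and hyperparameters are illustrative assumptions rather than recommended settings.

```python
from transformers import ViTForImageClassification
from peft import LoraConfig, get_peft_model

# Load a pretrained ViT backbone; a new 10-class head is initialized for the target dataset.
model = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k",
    num_labels=10,                      # illustrative: 10 classes in the new dataset
    ignore_mismatched_sizes=True,       # allow replacing any existing classification head
)

# LoRA injects small low-rank update matrices into the attention projections,
# so only a tiny fraction of the parameters is trained during fine-tuning.
lora_config = LoraConfig(
    r=16,                               # rank of the low-rank update
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections inside each ViT block
    modules_to_save=["classifier"],     # also train the new classification head
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()      # typically only a small percentage of the full model

# The wrapped model can now be trained with a standard training loop or the Trainer API.
```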
How ViTs can advance video analytics applications
ViTs bring transformers’ ability to comprehend contextual relationships and learn underlying patterns into the vision world. Therefore, ViTs show great potential to revolutionize computer vision, much as transformers revolutionized NLP. As with any technology shift, those who get in early ride the wave and stand to reap the most benefits.
Greater accuracy, long-term efficiency, and a less complex architecture than present vision technologies make ViTs applicable to many video analytics markets across industries. Developers creating new computer vision solutions, especially for video analytics tasks across a variety of industries, should weigh these benefits of ViTs against other computer vision technologies. As transformer models mature, they should be able to handle many of the tasks currently addressed by vision solutions built on convolutional neural networks.
With impressive performance in analyzing complex images, ViTs seem ideal for use in “smart city” vision applications, where images often do not have the same visual patterns as those in more contained environments. These AI models also could replace older vision technology and potentially improve insights in environments such as critical infrastructure, factories, energy-related or heavy industrial facilities, and more as ViTs and their capabilities evolve.
On a more personal, everyday level, using ViTs to analyze medical images can lead to more accurate, potentially life-saving diagnoses of cancer or other serious conditions. For the visually impaired and other audiences that need image captioning, which describes the contents of an image, popular vision language models such as Phi-3 Vision (which uses a ViT for vision encoding) can provide those descriptions with greater precision and speed. The same precision and speed could help ViTs improve the responsiveness of AI vision-powered applications in retail environments, where accurate object recognition is highly valued.
There are questions that those interested in ViTs should ask before adopting them in vision solutions: What additional technical knowledge is required to incorporate ViTs into your solution compared with more familiar convolutional neural networks? Since ViTs are still relatively new, how available are the supporting libraries and tools needed to implement ViT models in production? How much model pretraining is required for particular uses? What are the requirements for hardware, memory, and power? Additionally, given the non-vision origins of the technology behind ViTs, we are still learning how certain architectural decisions and hyperparameter choices affect the overall efficiency and accuracy of ViTs compared with CNNs in vision solutions.
It is also worth noting that models like YOLO-World have shown that CNN architectures can be extended to vision-language modeling, facilitating interaction between the visual and linguistic domains. In the case of YOLO-World, this allows the model to perform zero-shot, open-vocabulary detection. These innovations open an avenue for familiar CNN architectures to tap into some of the benefits ViT models offer without adopting transformer architectures.
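As an illustration, here is a minimal sketch of zero-shot, open-vocabulary detection with YOLO-World via the ultralytics package; the weights file, class prompts, and image path are placeholder assumptions.

```python
from ultralytics import YOLOWorld

# Load pretrained YOLO-World weights (downloaded by ultralytics if not present locally).
model = YOLOWorld("yolov8s-world.pt")

# Define an open vocabulary at inference time -- no retraining needed.
model.set_classes(["forklift", "hard hat", "pallet"])

# Run zero-shot detection on an image (path is a placeholder).
results = model.predict("warehouse.jpg")
results[0].show()  # visualize boxes and labels for the prompted classes
```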
Learn more about vision transformers
As ViTs are still an emerging technology, we’re discovering exactly how they work and how we can best use their capabilities. At Intel, we have a full portfolio of computer vision technologies even as we help to advance new ones.
Additionally, we can help customers better understand transformer models such as ViTs. We also invite you to dive deeper into the full range of Intel AI solutions and what they can do.