In today’s digital age, with technical constraints and legal regulations constantly increasing, the need to contextualize content in order to personalize ads while preserving user privacy has never been more important. Moreover, advertisers strive to maintain a positive and trustworthy association between their brand and the content in which their ads are displayed. Categorizing text, images and video content can therefore provide a better user experience through more relevant contextual ads, while giving advertisers better performance and stronger guarantees. In particular, using image categorization of a website’s content to place meaningful adjacent ads can benefit all stakeholders.
The ability to classify and understand images has thus become increasingly valuable across various industries. Whether for content organization, targeted advertising, or customer analysis, image classification plays a crucial role in extracting meaningful insights from visual data.
In this blog post, we will explore the process of fine-tuning a Vision Transformer (ViT) model from the Hugging Face library for image classification into the categories of the Interactive Advertising Bureau (IAB) Tech Lab’s Content Taxonomy, the standard in the ad industry. Vision Transformers have recently demonstrated remarkable success in computer vision, making them an intriguing choice for image classification. By leveraging the power of pre-trained models and aligning the predictions with the IAB taxonomy, we can unlock the potential to accurately categorize images.
Throughout this post, we will delve into the fundamental concepts of ViTs, explaining their architecture and advantages over traditional computer vision techniques. Then, we will introduce the IAB taxonomy categories, highlighting their significance in organizing visual content. Afterwards, we will reach the core of the blog post: the fine-tuning process. Starting with selecting a pre-trained model from the Hugging Face model hub, we will quickly implement data preprocessing, model training, and evaluation leveraging the power of Amazon SageMaker Studio.
Vision Transformers, introduced in 2021 by Dosovitskiy et al., have emerged as a powerful alternative to traditional architectures like convolutional neural networks (CNNs) in computer vision tasks. Transformers, initially designed for natural language processing (NLP), have since demonstrated remarkable success in image classification as well.
A pure vision transformer breaks an image down into smaller patches, treating them as tokens (similar to words in NLP). Each patch is then linearly embedded and position embeddings are added; to perform classification, an extra learnable “classification token” ([CLS]) is prepended to the sequence. The resulting sequence of vectors is fed to a standard transformer encoder and processed through multiple layers of self-attention.
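To make this concrete, here is a minimal, simplified sketch of the patch-embedding step in PyTorch. It is our own illustration, not the Hugging Face implementation; the dimensions match the ViT-Base/16 configuration.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into patches, embed them, prepend [CLS] and add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_channels=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # a strided convolution is equivalent to slicing patches + a shared linear projection
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                            # x: (B, 3, 224, 224)
        x = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)               # prepend the [CLS] token
        return x + self.pos_embed                    # add position embeddings
```

The resulting (B, 197, 768) sequence is what the transformer encoder consumes.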
Extensive work has been done to compare ViTs to state-of-the-art CNNs on image classification tasks (Maurício et al., 2023). Let’s briefly cover how the two architectures perform and what distinguishes them.
Compared to CNNs, ViTs offer better scalability and generalization capabilities. When pre-trained on large amounts of data and then transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB), ViT attains excellent results compared to state-of-the-art CNNs, while requiring substantially fewer computational resources to train.
Thus, by pre-training ViTs on large-scale datasets such as ImageNet-21k (14 million images, 21,843 classes), we can leverage the learned representations to adapt to specific image classification tasks through fine-tuning, thereby reducing the need for extensive training from scratch, thus decreasing training time and overall costs.
However, CNNs do achieve excellent results when trained from scratch on smaller data volumes than those required by Vision Transformers. This difference seems to stem from the inductive biases built into CNNs, which help these networks grasp the particularities of the analyzed images more quickly, even though they make it harder to capture global relations. ViTs, by contrast, are free from these biases, which enables them to capture long-range dependencies in images at the cost of needing more training data. Their self-attention mechanism allows the model to capture both global and local dependencies within the image, enabling it to understand the contextual relationships between different regions.
Among other notable differences between the two architectures, ViTs are more resilient than CNNs to images with natural or adversarial disturbances, they are more robust to adversarial attacks, and CNNs are more sensitive to high-frequency features.
Overall, transformer-based architectures, or combinations of ViTs with CNNs, achieve better accuracy than pure CNN networks, largely thanks to the self-attention mechanism. The ViT architecture is also lighter than CNNs, consuming fewer computational resources and taking less time to train, and it is more robust than CNNs for images that are noisy or augmented.
On the other hand, CNNs generalize better and reach higher accuracy than ViTs when trained from scratch on smaller datasets, whereas ViTs have the advantage of learning from fewer images when fine-tuned. It has been noted, however, that ViT performance may struggle to generalize when the model is trained from scratch on smaller image datasets.
The Interactive Advertising Bureau (IAB) has developed a comprehensive taxonomy that plays a pivotal role in organizing and categorizing digital content. Their taxonomy provides a standardized framework for content classification, enabling efficient content targeting, contextual advertising, and audience segmentation.
The IAB taxonomy is structured hierarchically, with several levels of categories and subcategories. At the top level, it covers broad content categories such as Arts & Entertainment, Sports, and more.
Each category is then further divided into more specific subcategories, creating a more precise and granular classification system that can capture a wide range of content types. For example, within the Sports category, there are subcategories like Basketball, Soccer and Tennis.
In the digital advertising industry, the IAB taxonomy serves various purposes. It helps advertisers align their campaigns with specific content categories, ensuring that their advertisements are shown in relevant contexts. Content publishers can use the taxonomy to categorize their content effectively, making it easier for users to discover relevant information. Moreover, the IAB taxonomy facilitates audience segmentation by enabling advertisers to target specific categories that align with their target audience’s interests.
In the following section, we will explore how to fine-tune a ViT model to accurately classify images into the specific IAB taxonomy categories.
Fine-tuning a vision transformer involves taking a model pre-trained on a large-scale dataset and adapting it to a smaller downstream task, such as image classification into the IAB taxonomy categories. This process allows us to leverage the knowledge and representations learned by the pre-trained model, saving time and computational resources. To do so, we remove the pre-trained prediction head and attach a new feedforward layer with K outputs, where K is the number of downstream classes.
To fine-tune the ViT model we will be using the Hugging Face library. This library offers a wide range of pre-trained vision transformer models and provides an easy-to-use interface for fine-tuning and deploying these models. By utilizing the Hugging Face library, we can take advantage of their model availability, easy integration, strong community support, and the benefits of transfer learning.
Since we wanted to tackle this problem with a supervised model using the IAB categories as labels, we needed a properly labeled dataset. For this, we collected and labeled an in-house dataset of images allowed for commercial use. As a baseline, each image is labeled with a single Tier 1 IAB category. The dataset consists of image files organized in one folder per category, so that we can easily load it into the appropriate format using the Hugging Face datasets library.
Using SageMaker Studio’s capabilities, we can easily launch a notebook backed by a GPU instance, such as an ml.g4dn.xlarge, to run our experiment of training a baseline classification model.
To get started, let’s first install the required libraries from Hugging Face and PyTorch into our notebook’s virtual environment:
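A minimal install cell could look like the following; the exact package list and versions are our assumption, so pin versions as needed for your environment:

```python
!pip install transformers datasets evaluate torch torchvision accelerate --quiet
```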
Note that the exclamation mark (!) allows us to run shell commands from inside a Jupyter Notebook code cell.
Now, let’s do the necessary imports for the notebook to run properly:
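The imports below cover everything used in the sketches that follow (loading data, preprocessing, modeling and training):

```python
import numpy as np
import torch

from datasets import load_dataset, load_from_disk
from transformers import (
    AutoProcessor,
    ViTForImageClassification,
    TrainingArguments,
    Trainer,
)
```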
Let’s define some variables:
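The paths below are hypothetical placeholders; adapt them to wherever your images and outputs live:

```python
MODEL_CHECKPOINT = "google/vit-base-patch16-224-in21k"  # pre-trained backbone
DATA_DIR = "data/iab_images"        # local folder with one sub-folder per Tier 1 category
PROCESSED_DIR = "data/processed"    # where the preprocessed datasets are saved
OUTPUT_DIR = "output/vit-iab"       # checkpoints, logs and the final model
SEED = 42
```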
We leverage the datasets library’s image folder feature (“Create an image dataset” in the Hugging Face documentation) to easily load our custom dataset by pointing it at the local folder where the images are stored in one sub-folder per category.
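A sketch of the loading step, assuming the folder layout described above:

```python
# "imagefolder" infers the class label from each image's parent folder name
dataset = load_dataset("imagefolder", data_dir=DATA_DIR, split="train")
labels = dataset.features["label"].names
print(f"{len(dataset)} images across {len(labels)} categories")
```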
Let’s perform some data cleaning and remove from the dataset any images that are not RGB (e.g., single-channel grayscale images).
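One simple way to do this is to filter on the PIL image mode:

```python
# keep only 3-channel RGB images, since the image processor expects three channels
dataset = dataset.filter(lambda example: example["image"].mode == "RGB")
```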
Now we split our dataset into Train, Validation and Test sets.
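One possible split is 80/10/10 with stratification on the label column (stratify_by_column requires a reasonably recent datasets version):

```python
splits = dataset.train_test_split(test_size=0.2, seed=SEED, stratify_by_column="label")
holdout = splits["test"].train_test_split(test_size=0.5, seed=SEED, stratify_by_column="label")

train_ds, val_ds, test_ds = splits["train"], holdout["train"], holdout["test"]
```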
Finally, we can prepare these images for our model.
When ViT models are trained, specific transformations are applied to images being fed into them so they fit the expected input format.
To make sure we apply the correct transformations, we will use an `AutoProcessor` initialized with a configuration that was saved to the Hugging Face Hub along with the pre-trained `google/vit-base-patch16-224-in21k` model we plan to use.
So, we preprocess our datasets with the model’s image AutoProcessor:
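A sketch of the preprocessing step; applying the transform eagerly with `map` is one option, applying it on the fly with `set_transform` is another:

```python
processor = AutoProcessor.from_pretrained(MODEL_CHECKPOINT)

def preprocess(batch):
    # resize, rescale and normalize the images into the 3x224x224 tensors the model expects
    inputs = processor(images=batch["image"], return_tensors="pt")
    batch["pixel_values"] = inputs["pixel_values"]
    return batch

train_ds = train_ds.map(preprocess, batched=True, remove_columns=["image"])
val_ds = val_ds.map(preprocess, batched=True, remove_columns=["image"])
test_ds = test_ds.map(preprocess, batched=True, remove_columns=["image"])
```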
We now save our processed datasets:
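For example, under the processed-data folder defined earlier:

```python
train_ds.save_to_disk(f"{PROCESSED_DIR}/train")
val_ds.save_to_disk(f"{PROCESSED_DIR}/validation")
test_ds.save_to_disk(f"{PROCESSED_DIR}/test")
```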
Let’s proceed to the training step. We begin by loading the train and validation datasets.
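Assuming the datasets were saved as above:

```python
train_ds = load_from_disk(f"{PROCESSED_DIR}/train")
val_ds = load_from_disk(f"{PROCESSED_DIR}/validation")
```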
Our new fine-tuned model will have the following number of output classes:
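```python
labels = train_ds.features["label"].names
num_labels = len(labels)
print(num_labels)  # the number of Tier 1 IAB categories present in our data
```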
Now, let’s define a `ViTForImageClassification`, which places a linear layer (`nn.Linear`) on top of a pre-trained ViT model. The linear layer is placed on top of the last hidden state of the [CLS] token, which serves as a good representation of an entire image.
We also specify the number of output neurons by setting the `num_labels` parameter.
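A sketch of the model definition:

```python
model = ViTForImageClassification.from_pretrained(
    MODEL_CHECKPOINT,
    num_labels=num_labels,  # replaces the pre-trained head with a freshly initialized linear layer
)
```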
Let’s proceed by defining a `compute_metrics` function. This function is used to compute any defined target metrics at every evaluation step.
Here, we use the `accuracy` metric from `datasets`, which can easily be used to compare the predictions with the expected labels on the validation set.
We also define a custom metric to compute the accuracy at “k” for our predictions. The accuracy at “k” is the proportion of instances for which the real label is among the “k” most probable classes.
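A sketch of the metrics code; the top-k helper is our own, and we load the accuracy metric through the `evaluate` package, which now hosts the metrics formerly exposed via `datasets.load_metric`:

```python
import evaluate

accuracy = evaluate.load("accuracy")

def accuracy_at_k(logits, references, k=3):
    # proportion of samples whose true label is among the k highest-scoring classes
    top_k = np.argsort(logits, axis=-1)[:, -k:]
    return float(np.mean([ref in row for ref, row in zip(references, top_k)]))

def compute_metrics(eval_pred):
    logits, references = eval_pred
    predictions = np.argmax(logits, axis=-1)
    metrics = accuracy.compute(predictions=predictions, references=references)
    metrics["accuracy_at_3"] = accuracy_at_k(logits, references, k=3)
    return metrics
```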
As we would like the actual class names as outputs, rather than just integer indices, we set the `id2label` and `label2id` mappings as attributes on the model’s configuration (accessible as `model.config`):
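```python
model.config.id2label = {i: label for i, label in enumerate(labels)}
model.config.label2id = {label: i for i, label in enumerate(labels)}
```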
Next, we specify the output directories of the model and other artifacts, and the set of hyperparameters we’ll use for training.
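The values below are illustrative baseline choices, not tuned ones:

```python
model_dir = f"{OUTPUT_DIR}/checkpoints"
logging_dir = f"{OUTPUT_DIR}/logs"

# baseline hyperparameters; good candidates for later tuning
learning_rate = 2e-4
train_batch_size = 16
eval_batch_size = 16
num_train_epochs = 4
```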
We just need to define two more things before we can start training.
First, the `TrainingArguments`, a class that contains all the attributes used to customize training. There we set evaluation and checkpointing to happen at the end of each epoch, specify the output directories, and set our hyperparameters (such as the learning rate and batch sizes).
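A possible configuration (argument names follow recent transformers versions):

```python
training_args = TrainingArguments(
    output_dir=model_dir,
    logging_dir=logging_dir,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=learning_rate,
    per_device_train_batch_size=train_batch_size,
    per_device_eval_batch_size=eval_batch_size,
    num_train_epochs=num_train_epochs,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=SEED,
)
```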
Then, we create a `Trainer`, to which we pass the model, the `TrainingArguments`, the `compute_metrics` function, the datasets, and the image processor.
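```python
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
    tokenizer=processor,  # the image processor, saved with the model so inference reuses it
)
```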
Start training by calling `trainer.train()`:
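```python
train_results = trainer.train()

trainer.log_metrics("train", train_results.metrics)
trainer.save_metrics("train", train_results.metrics)
```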
By inspecting the training metrics, we can see how loss and accuracy stabilize by the end of training.
Figure 3: Loss and accuracy after fine-tuning
Now that we are satisfied with our results, we just have to save the trained model.
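```python
trainer.save_model(f"{OUTPUT_DIR}/final")  # also writes the processor config passed as `tokenizer`
trainer.save_state()
```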
We finally have our fine-tuned ViT model. Now, we need to verify its performance on the test set. Let’s first reload our model, create a new Trainer instance and then call the `evaluate` method to validate the performance of the model on the test set.
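A sketch of the evaluation step, reusing the same training arguments and metrics function:

```python
model = ViTForImageClassification.from_pretrained(f"{OUTPUT_DIR}/final")
test_ds = load_from_disk(f"{PROCESSED_DIR}/test")

eval_trainer = Trainer(
    model=model,
    args=training_args,
    compute_metrics=compute_metrics,
    tokenizer=processor,
)

test_metrics = eval_trainer.evaluate(test_ds)
print(test_metrics)
```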
Let’s see our model’s performance in the wild. We’ll get a random image from the web and see how the model predicts the most likely labels for it:
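The URL below is a placeholder; swap in any image you like:

```python
import requests
from PIL import Image

url = "https://example.com/some-image.jpg"  # placeholder URL
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

probs = logits.softmax(dim=-1)[0]
top5 = torch.topk(probs, k=5)
for prob, idx in zip(top5.values, top5.indices):
    print(f"{model.config.id2label[idx.item()]}: {prob.item():.2%}")
```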
By following these steps and leveraging fine-tuning and transfer learning with ViTs, we demonstrate that we can achieve a good baseline for image classification into the IAB taxonomy categories. This opens up opportunities for various applications such as content targeting, contextual advertising, content personalization and audience segmentation.
With the availability of pre-trained models and user-friendly libraries like Hugging Face, fine-tuning vision transformers has become more accessible and efficient.
As we conclude, while we are happy with the baseline performance we achieved, it’s essential to consider possible future work and improvements that could further boost the model’s performance. Here are some areas to focus on:
Hyperparameter Tuning: Perform systematic hyperparameter search to identify optimal values, including learning rate, batch size, and weight decay.
Model Architecture: Explore different transformer architectures, such as hybrid models combining transformers with CNNs, to further enhance performance.
Data Augmentation: Apply techniques like random cropping, rotation, flipping, and color jittering to artificially expand the training dataset and improve model generalization.
Dataset Improvement: Train the model on larger and more task-specific datasets. Training sets may vary depending on the task, for example on whether we are categorizing web or video content. In any case, collecting additional annotated images, especially for challenging or underrepresented categories, can boost the model’s performance.
Finally, remember to stay tuned to the latest developments in the field, as new techniques and advancements continue to shape the world of computer vision and artificial intelligence.
Thank you for reading, and if you have any questions or feedback, please feel free to reach out to us. Happy fine-tuning and image classification with ViTs and Hugging Face!
[2010.11929] An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale
An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale | Papers With Code
[2108.08810] Do Vision Transformers See Like Convolutional Neural Networks?