Fine-Tuning a Vision Transformer

Eduardo Simon
June 1, 2023

Fine-Tuning a Vision Transformer Model for Image Classification into IAB Taxonomy Categories

In today’s digital age, with technical constraints and legal regulations constantly tightening, the need to contextualize content so that ads can be personalized while preserving user privacy has never been more important. Moreover, advertisers strive to maintain a positive and trustworthy association between their brand and the content alongside which their ads are displayed. Categorizing text, images and video content can therefore improve the user experience through more relevant contextual ads, while offering better performance and guarantees to advertisers. In particular, using image categorization of a website’s content to place meaningful adjacent ads can benefit all stakeholders.

The ability to classify and understand images has thus become increasingly valuable across various industries. Whether for content organization, targeted advertising, or customer analysis, image classification plays a crucial role in extracting meaningful insights from visual data.

In this blog post, we will explore the process of fine-tuning a Vision Transformer (ViT) model from the Hugging Face library for image classification into the categories of the Interactive Advertising Bureau (IAB) Tech Lab’s Content Taxonomy, the standard in the ad industry. Vision Transformers have recently demonstrated remarkable success in computer vision, making them an intriguing choice for image classification. By leveraging the power of pre-trained models and aligning the predictions with the IAB taxonomy, we can unlock the potential to accurately categorize images.

Throughout this post, we will delve into the fundamental concepts of ViTs, explaining their architecture and advantages over traditional computer vision techniques. Then, we will introduce the IAB taxonomy categories, highlighting their significance in organizing visual content. Afterwards, we will reach the core of the blog post: the fine-tuning process. Starting with selecting a pre-trained model from the Hugging Face model hub, we will quickly implement data preprocessing, model training, and evaluation leveraging the power of Amazon SageMaker Studio.

Understanding Vision Transformers

Vision Transformers, introduced in 2021 by Dosovitskiy et al., have emerged as a powerful alternative to traditional architectures like convolutional neural networks (CNNs) in computer vision tasks. Transformers, which were initially designed for natural language processing (NLP) tasks, have recently demonstrated remarkable success in image classification tasks.

A pure vision transformer breaks an image down into smaller patches and treats them as tokens (similar to words in NLP). Each patch is linearly embedded and position embeddings are added; to perform classification, an extra learnable “classification token” ([CLS]) is prepended to the sequence. The resulting sequence of vectors is fed to a standard Transformer encoder and processed through multiple layers of self-attention.
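To make the patch-embedding step concrete, here is a minimal PyTorch sketch of the idea. The dimensions correspond to ViT-Base with 224×224 inputs and 16×16 patches; this is an illustration of the mechanism, not the exact Hugging Face implementation:

import torch
import torch.nn as nn

# Illustrative dimensions: ViT-Base, 224x224 RGB input, 16x16 patches
image = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, hidden_size = 16, 768
num_patches = (224 // patch_size) ** 2        # 196 patches

# Split the image into non-overlapping patches and flatten each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# Linear projection of the flattened patches ("patch embedding")
patch_embed = nn.Linear(3 * patch_size * patch_size, hidden_size)
tokens = patch_embed(patches)                 # (1, 196, 768)

# Prepend the learnable [CLS] token and add position embeddings
cls_token = nn.Parameter(torch.zeros(1, 1, hidden_size))
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_size))
tokens = torch.cat([cls_token.expand(1, -1, -1), tokens], dim=1) + pos_embed  # (1, 197, 768)

The encoder then processes this sequence of 197 tokens, and the final hidden state of the [CLS] token is later used for classification.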

Figure 1: Model overview (Dosovitskiy, et al. 2021)

Extensive work has been done to compare ViTs with state-of-the-art CNNs on image classification tasks (Maurício, et al. 2023). Let’s briefly cover how the two architectures perform and what distinguishes them.

Compared to CNNs, ViTs offer better scalability and generalization capabilities. When pre-trained on large amounts of data and then transferred to multiple mid-sized or small image recognition benchmarks (ImageNet, CIFAR-100, VTAB), ViT attains excellent results compared to state-of-the-art CNNs, while requiring substantially fewer computational resources to train.

Thus, by pre-training ViTs on large-scale datasets such as ImageNet-21k (14 million images, 21,843 classes), we can leverage the learned representations to adapt to specific image classification tasks through fine-tuning, thereby reducing the need for extensive training from scratch, thus decreasing training time and overall costs.

However, CNNs achieve excellent results when trained from scratch on smaller data volumes than those required by Vision Transformers. This difference seems to stem from the inductive biases built into CNNs (such as locality and translation equivariance), which let these networks pick up the particularities of the analyzed images more quickly, even though the same biases make it harder for them to capture global relations. ViTs, in contrast, are free from these biases, which enables them to capture long-range dependencies in images, at the cost of needing more training data. Their self-attention mechanism lets the model capture both global and local dependencies within the image, so it can understand the contextual relationships between different regions.

Among other notable differences between the two architectures, ViTs are more resilient than CNNs to images with natural or adverse disturbances, ViTs are more robust to adversarial attacks, and CNNs are more sensitive to high-frequency features.

Overall, transformer-based architectures – or combinations of ViTs with CNNs – allow for better accuracy than pure CNN networks, largely thanks to the self-attention mechanism. The ViT architecture is also lighter than CNNs, consuming fewer computational resources and less training time, and it is more robust than CNN networks on images that are noisy or augmented.

On the other hand, CNNs generalize better and reach higher accuracy than ViTs when trained from scratch on smaller datasets, whereas ViTs have the advantage of learning more from fewer images when fine-tuning. In other words, ViT performance may struggle to generalize when trained on small image datasets, but fine-tuning a large-scale pre-trained ViT on a modest amount of task-specific data works very well.

Understanding the IAB Content Taxonomy

The Interactive Advertising Bureau (IAB) has developed a comprehensive taxonomy that plays a pivotal role in organizing and categorizing digital content. Their taxonomy provides a standardized framework for content classification, enabling efficient content targeting, contextual advertising, and audience segmentation.

The IAB taxonomy is structured hierarchically, with several levels of categories and subcategories. At the top level, it covers broad content categories such as Arts & Entertainment, Sports, and more.

Each category is then further divided into more specific subcategories, creating a more precise and granular classification system that can capture a wide range of content types. For example, within the Sports category, there are subcategories like Basketball, Soccer and Tennis.
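To make the hierarchy concrete, a small slice of it could be represented as a simple mapping from Tier 1 categories to Tier 2 subcategories. The entries below are illustrative examples only, not the complete official taxonomy:

# Illustrative slice of the IAB Content Taxonomy hierarchy (not the full official list)
iab_tier1_to_tier2 = {
    "Sports": ["Basketball", "Soccer", "Tennis"],
    "Arts & Entertainment": ["Movies", "Music", "Television"],
    "Travel": ["Beach Travel", "Family Travel"],
}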

In the digital advertising industry, the IAB taxonomy serves various purposes. It helps advertisers align their campaigns with specific content categories, ensuring that their advertisements are shown in relevant contexts. Content publishers can use the taxonomy to categorize their content effectively, making it easier for users to discover relevant information. Moreover, the IAB taxonomy facilitates audience segmentation by enabling advertisers to target specific categories that align with their target audience’s interests.

In the following section, we will explore how to fine-tune a ViT model to accurately classify images into the specific IAB taxonomy categories.

Fine-Tuning a Vision Transformer Model

Fine-tuning a vision transformer model involves taking a model pre-trained on a large-scale dataset and adapting it to a smaller downstream task, such as image classification into the IAB taxonomy categories. This process allows us to leverage the knowledge and representations learned by the pre-trained model, saving time and computational resources. To do so, we remove the pre-trained prediction head and attach a new feedforward layer with K outputs, where K is the number of downstream classes.
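As a rough sketch of what this head replacement amounts to (the numbers below are assumptions: 768 is the ViT-Base hidden size and the class count is a hypothetical Tier 1 label count; in practice ViTForImageClassification handles this for us when we pass num_labels, as shown later):

import torch.nn as nn

# Minimal sketch: a freshly initialized K-way classification head on top of the
# [CLS] representation produced by the pre-trained ViT encoder.
hidden_size = 768          # ViT-Base hidden dimension
num_classes = 26           # K: hypothetical number of Tier 1 IAB categories
classification_head = nn.Linear(hidden_size, num_classes)  # trained during fine-tuning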

To fine-tune the ViT model we will be using the Hugging Face library. This library offers a wide range of pre-trained vision transformer models and provides an easy-to-use interface for fine-tuning and deploying these models. By utilizing the Hugging Face library, we can take advantage of their model availability, easy integration, strong community support, and the benefits of transfer learning.

Since we wanted to tackle this problem with supervised training using the IAB categories as labels, we needed a properly labeled dataset. For this, we collected and labeled an in-house dataset of commercial-use-allowed images. As a baseline, each image is labeled with a single Tier 1 IAB category. The dataset consists of image files organized in folders per category, so that we can easily load it into the appropriate format using the Hugging Face datasets library.
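For reference, the on-disk layout expected by the loader looks something like this (the category folder names below are illustrative):

dataset/
    Automotive/
        img_0001.jpg
        img_0002.jpg
    Sports/
        img_0001.jpg
        ...
    Travel/
        img_0001.jpg
        ...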

Using SageMaker’s Studio capabilities, we can easily launch a notebook backed with a GPU instance, such as a ml.g4dn.xlarge, to conduct our experiment of training a baseline classification model.

To get started, let’s first install the required libraries from Hugging Face and PyTorch into our notebook’s virtual environment:

!pip install transformers==4.26.1 datasets==2.10.1 evaluate==0.4.0 -q
!pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 -q

Note that the exclamation mark (!) allows us to run shell commands from inside a Jupyter Notebook code cell.

Now, let’s do the necessary imports for the notebook to run properly:

import evaluate
import json
import numpy as np
import os
import pandas as pd
import pyarrow as pa
import requests
import torch

from datasets import load_dataset, load_from_disk, Dataset, Features, Array3D
from io import BytesIO
from transformers import AutoProcessor, ViTFeatureExtractor, ViTForImageClassification, Trainer, TrainingArguments, default_data_collator
from typing import Tuple
from PIL import Image

Let’s define some variables:

# The directory where our images are saved in folders by category
images_dir = "./dataset"

# The output directories of the processed datasets
train_save_path = "./processed-datasets/train"
val_save_path = "./processed-datasets/val"
test_save_path = "./processed-datasets/test"

# Sizes of dataset splits
val_size = 0.2
test_size = 0.1

# Name of the model as it appears on the Hugging Face Hub
model_name = "google/vit-base-patch16-224-in21k"

We leverage the datasets library’s image-folder loading feature to easily load our custom dataset, pointing it to the local folder where the images are stored in subfolders by category.

dataset = load_dataset("imagefolder", data_dir=images_dir, split='train')

Let’s perform some data cleaning and remove from the dataset any images that are not RGB (e.g., single-channel grayscale images).

# Remove from the dataset images which are non-RGB (single-channel, grayscale)
condition = lambda data: data['image'].mode == 'RGB'
dataset = dataset.filter(condition)

Now we split our dataset into Train, Validation and Test sets.

def split_dataset(
    dataset: Dataset,
    val_size: float = 0.2,
    test_size: float = 0.1
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Returns a tuple with three random train, validation and test subsets
    obtained by splitting the passed dataset. The sizes of the validation and
    test sets are defined as fractions of 1 with the `val_size` and
    `test_size` arguments.
    """
    print("Splitting dataset into train, validation and test sets...")

    # Split dataset into train and (val + test) sets
    split_size = round(val_size + test_size, 3)
    dataset = dataset.train_test_split(shuffle=True, test_size=split_size)

    # Split (val + test) into val and test sets
    split_ratio = round(test_size / (test_size + val_size), 3)
    val_test_sets = dataset['test'].train_test_split(shuffle=True, test_size=split_ratio)

    train_dataset = dataset["train"]
    val_dataset = val_test_sets["train"]
    test_dataset = val_test_sets["test"]

    return train_dataset, val_dataset, test_dataset


# Split dataset into train, validation and test sets
train_dataset, val_dataset, test_dataset = split_dataset(dataset, val_size, test_size)

Finally, we can prepare these images for our model.

When ViT models are trained, specific transformations are applied to images being fed into them so they fit the expected input format.

To make sure we apply the correct transformations, we will use an ‘AutoProcessor’ initialized with a configuration that was saved to Hugging Face’s Hub along with the pretrained “google/vit-base-patch16-224-in21k” model we plan to use.

So, we preprocess our datasets with the model’s image AutoProcessor:

def process_examples(examples, image_processor):
    """Processor helper function.

    Used to process batches of images using the passed image_processor.

    Parameters
    ----------
    examples
        A batch of image examples.
    image_processor
        A HuggingFace image processor for the selected model.

    Returns
    -------
    examples
        A batch of processed image examples.
    """
    # Get batch of images
    images = examples['image']
    # Preprocess
    inputs = image_processor(images=images)
    # Add pixel_values
    examples['pixel_values'] = inputs['pixel_values']
    return examples


def apply_processing(
    model_name: str,
    train_dataset: Dataset,
    val_dataset: Dataset,
    test_dataset: Dataset
) -> Tuple[Dataset, Dataset, Dataset]:
    """
    Apply the model's image AutoProcessor to transform the train, validation
    and test subsets. Returns train, validation and test datasets with
    `pixel_values` in torch tensor type.
    """
    # Extend the features
    features = Features({
        **train_dataset.features,
        'pixel_values': Array3D(dtype="float32", shape=(3, 224, 224)),
    })

    # Instantiate image_processor
    image_processor = AutoProcessor.from_pretrained(model_name)

    # Preprocess images
    train_dataset = train_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})
    val_dataset = val_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})
    test_dataset = test_dataset.map(process_examples, batched=True, features=features, fn_kwargs={"image_processor": image_processor})

    # Set to torch format for training
    train_dataset.set_format('torch', columns=['pixel_values', 'label'])
    val_dataset.set_format('torch', columns=['pixel_values', 'label'])
    test_dataset.set_format('torch', columns=['pixel_values', 'label'])

    # Remove unused column
    train_dataset = train_dataset.remove_columns("image")
    val_dataset = val_dataset.remove_columns("image")
    test_dataset = test_dataset.remove_columns("image")

    return train_dataset, val_dataset, test_dataset


# Apply AutoProcessor
train_dataset, val_dataset, test_dataset = apply_processing(model_name, train_dataset, val_dataset, test_dataset)


We now save our processed datasets:

# Save train, validation and test preprocessed datasets
train_dataset.save_to_disk(train_save_path, num_shards=1)
val_dataset.save_to_disk(val_save_path, num_shards=1)
test_dataset.save_to_disk(test_save_path, num_shards=1)


Let’s proceed to the training step. We begin by loading the train and validation datasets.

train_dataset = load_from_disk(train_save_path)
val_dataset = load_from_disk(val_save_path)


Our new fine-tuned model will have the following number of output classes:

num_classes = train_dataset.features["label"].num_classes

Now, let’s define a ‘ViTForImageClassification’, which places a linear layer (nn.Linear) on top of a pre-trained ViT model. The linear layer is placed on top of the last hidden state of the [CLS] token, which serves as a good representation of an entire image.

We also specify the number of output neurons by setting the `num_labels` parameter.

# Download model from the model hub
model = ViTForImageClassification.from_pretrained(model_name, num_labels=num_classes)

# Download feature extractor from the hub
feature_extractor = ViTFeatureExtractor.from_pretrained(model_name)

Let’s proceed by defining a `compute_metrics` function. This function is used to compute any defined target metrics at every evaluation step.

Here, we use the `accuracy` metric from the `evaluate` library, which makes it easy to compare the predictions with the expected labels on the validation set.

We also define a custom metric to compute the accuracy at “k” for our predictions. The accuracy at “k” is the fraction of instances where the true label appears among the “k” most probable classes. For instance, with k=3, if the true label is 2 and the model’s three most probable classes are [7, 2, 5], the prediction counts as correct even though the top-1 prediction is wrong.

# K for the top-k accuracy metric
k_for_top_acc = 3

# Accuracy metric for multi-class classification
acc_metric = evaluate.load("accuracy", module_type="metric")

def compute_metrics(eval_pred):
    predicted_probs, labels = eval_pred

    # Accuracy
    predicted_labels = np.argmax(predicted_probs, axis=1)
    acc = acc_metric.compute(predictions=predicted_labels, references=labels)

    # Top-K Accuracy
    top_k_indexes = [np.argpartition(row, -k_for_top_acc)[-k_for_top_acc:] for row in predicted_probs]
    top_k_classes = [top_k_indexes[i][np.argsort(row[top_k_indexes[i]])] for i, row in enumerate(predicted_probs)]
    top_k_classes = np.flip(np.array(top_k_classes), 1)
    acc_k = {
        "accuracy_k": sum([label in predictions for predictions, label in zip(top_k_classes, labels)]) / len(labels)
    }

    # Merge metrics
    acc.update(acc_k)
    return acc

As we would like the outputs to be actual class names rather than just integer indexes, we set the ‘id2label’ and ‘label2id’ mappings as attributes on the model’s configuration (which can be accessed as model.config):

# Change labels
id2label = {key: train_dataset.features["label"].names[index] for index, key in enumerate(model.config.id2label.keys())}
label2id = {train_dataset.features["label"].names[index]: value for index, value in enumerate(model.config.label2id.values())}

model.config.id2label = id2label
model.config.label2id = label2id

Next, we specify the output directories of the model and other artifacts, and the set of hyperparameters we’ll use for training.

model_dir = "./model"
output_data_dir = "./outputs"

# Total number of training epochs to perform
num_train_epochs = 15
# The batch size per GPU/TPU core/CPU for training
per_device_train_batch_size = 32
# The batch size per GPU/TPU core/CPU for evaluation
per_device_eval_batch_size = 64
# The initial learning rate for the AdamW optimizer
learning_rate = 2e-5
# Number of steps used for a linear warmup from 0 to learning_rate
warmup_steps = 500
# The weight decay to apply to all layers except bias and LayerNorm weights in the AdamW optimizer
weight_decay = 0.01

main_metric_for_evaluation = "accuracy"

We just need to define two more things before we can start training.

First, the ‘TrainingArguments’, a class that contains all the attributes to customize the training. There we set evaluation and checkpointing to happen at the end of each epoch, define the output directories and set our hyperparameters (such as the learning rate and batch sizes).

Then, we create a ‘Trainer’, to which we pass the model, the ‘TrainingArguments’, the ‘compute_metrics’ function, the datasets and the feature extractor.

# Define training args
training_args = TrainingArguments(
    output_dir = model_dir,
    num_train_epochs = num_train_epochs,
    per_device_train_batch_size = per_device_train_batch_size,
    per_device_eval_batch_size = per_device_eval_batch_size,
    warmup_steps = warmup_steps,
    weight_decay = weight_decay,
    evaluation_strategy = "epoch",
    save_strategy = "epoch",
    logging_strategy = "epoch",
    logging_dir = f"{output_data_dir}/logs",
    learning_rate = float(learning_rate),
    load_best_model_at_end = True,
    metric_for_best_model = main_metric_for_evaluation,
)

# Create Trainer instance
trainer = Trainer(
    model = model,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = train_dataset,
    eval_dataset = val_dataset,
    data_collator = default_data_collator,
    tokenizer = feature_extractor
)

Start training by calling ‘trainer.train()’:

trainer.train()

By inspecting the training metrics, we can see how loss and accuracy stabilize by the end of training.

log_history = pd.DataFrame(trainer.state.log_history)
log_history = log_history.fillna(0)
log_history = log_history.groupby(['epoch']).sum()
log_history

log_history[["loss", "eval_loss", "eval_accuracy", "eval_accuracy_k"]].plot(subplots=True)

Figure 3: Loss and accuracy after fine-tuning

Now that we are satisfied with our results, we just need to save the trained model.

trainer.save_model(model_dir)

We finally have our fine-tuned ViT model. Now, we need to verify its performance on the test set. Let’s first reload our model, create a new Trainer instance and then call the `evaluate` method to validate the performance of the model on the test set.

# Load dataset
test_dataset = load_from_disk(test_save_path)

# Load trained model
model = ViTForImageClassification.from_pretrained('./model')

# Load feature extractor
feature_extractor = ViTFeatureExtractor.from_pretrained('./model')

# Create Trainer instance
trainer = Trainer(
    model=model,
    compute_metrics=compute_metrics,
    data_collator=default_data_collator,
    tokenizer=feature_extractor
)

# Evaluate model
eval_results = trainer.evaluate(eval_dataset=test_dataset)

# Write eval_results to a file which can be accessed later
with open(os.path.join(output_data_dir, "eval_results.json"), "w") as writer:
    print(f"Logging evaluation results at {output_data_dir}/eval_results.json")
    writer.write(json.dumps(eval_results))

print(json.dumps(eval_results, indent=4))

{
    "eval_loss": 2.441051483154297,
    "eval_accuracy": 0.48269581056466304,
    "eval_accuracy_k": 0.7103825136612022,
    "eval_runtime": 1.9207,
    "eval_samples_per_second": 285.83,
    "eval_steps_per_second": 35.924
}

Let’s see our model’s performance in the wild. We’ll get a random image from the web and see how the model predicts the most likely labels for it:

# Get test image from the web
test_image_url = 'https://media.cnn.com/api/v1/images/stellar/prod/111024080409-steve-jobs-book.jpg'
response = requests.get(test_image_url)
test_image = Image.open(BytesIO(response.content))

# Resize and display the image
aspect_ratio = test_image.size[0] / test_image.size[1]
max_height = 250
resized_width = int(max_height * aspect_ratio)
resized_img = test_image.resize((resized_width, max_height))
display(resized_img)

# Predict the top k classes for the test image
inputs = feature_extractor(images=test_image, return_tensors="pt").to("cuda")
outputs = model(**inputs)
logits = outputs.logits

top_classes = torch.topk(logits, k_for_top_acc).indices.flatten().tolist()
for i, class_idx in enumerate(top_classes):
    print(str(i + 1), "- Predicted class:", model.config.id2label[class_idx])

Results

1 - Predicted class: Education
2 - Predicted class: Books and Literature
3 - Predicted class: Productivity

Conclusion & Future Work

By following these steps and leveraging the capabilities of fine-tuning and transfer learning with ViTs, we have shown that we can achieve a good baseline for image classification into the IAB taxonomy categories. This opens up opportunities for various applications such as content targeting, contextual advertising, content personalization and audience segmentation.

With the availability of pre-trained models and user-friendly libraries like Hugging Face, fine-tuning vision transformers has become more accessible and efficient.
As we conclude, while we are happy with the baseline performance we achieved, it’s worth considering future work and improvements that could further boost the model’s performance. Here are some areas to focus on:

Hyperparameter Tuning: Perform systematic hyperparameter search to identify optimal values, including learning rate, batch size, and weight decay.

Model Architecture: Explore different transformer architectures, such as hybrid models combining transformers with CNNs, to further enhance performance.

Data Augmentation: Apply techniques like random cropping, rotation, flipping, and color jittering to artificially expand the training dataset and improve model generalization (a minimal sketch follows after this list).

Dataset Improvement: Train the model on larger and more task-specific datasets. The most suitable training set may vary with the task, for example depending on whether we are categorizing web or video content. In any case, collecting additional annotated images, especially for challenging or underrepresented categories, can boost the model’s performance.
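As an example of the data augmentation idea above, here is a minimal sketch of how random torchvision transforms could be applied on the fly to the raw training split (i.e. the "imagefolder" dataset, before the preprocessing step shown earlier), reusing the model’s image processor. The specific transforms and parameters are illustrative assumptions, not tuned values:

from torchvision import transforms
from transformers import AutoProcessor

# Assumes `model_name` and the raw "imagefolder" training split from earlier in the post
image_processor = AutoProcessor.from_pretrained(model_name)

# Illustrative augmentation pipeline applied to the raw PIL images
train_augmentations = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
])

def augment_and_process(examples):
    # Augment each image, then run the model's image processor as before
    augmented = [train_augmentations(img.convert("RGB")) for img in examples["image"]]
    pixel_values = image_processor(images=augmented, return_tensors="pt")["pixel_values"]
    return {"pixel_values": pixel_values, "label": examples["label"]}

# `with_transform` applies the function lazily at access time, so every epoch
# sees a differently augmented version of each training image
train_dataset = train_dataset.with_transform(augment_and_process)

Because the augmentation happens at access time rather than being baked into the saved dataset, the validation and test splits can keep using the deterministic preprocessing from before.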

Finally, remember to stay tuned to the latest developments in the field, as new techniques and advancements continue to shape the world of computer vision and artificial intelligence.

Thank you for reading, and if you have any questions or feedback, please feel free to reach out to us. Happy fine-tuning and image classification with ViTs and Hugging Face!

Sources

Dosovitskiy et al., 2021 – An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (arXiv:2010.11929)

Raghu et al., 2021 – Do Vision Transformers See Like Convolutional Neural Networks? (arXiv:2108.08810)

Maurício et al., 2023 – Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review
