Computer Vision

Computer Vision is a dynamic branch of artificial intelligence that empowers machines to interpret and analyze images and videos. It employs sophisticated algorithms and deep learning models to execute a variety of tasks including object recognition, action detection, image segmentation, and even scene reconstruction and pose estimation enabling systems to identify patterns and anomalies accurately. Today, computer vision is foundational in diverse applications ranging from autonomous vehicles and medical diagnostics to surveillance, augmented reality, and robotics, continuously evolving to transform how we interact with and understand the visual world.

Implemented Computer Vision Tasks

This repository contains a collection of links to my projects that demonstrate implementations of Computer Vision models in Python. It includes several basic models built from scratch using Convolutional Neural Networks (CNNs). CNNs are a type of neural network designed to process data with a grid-like structure, such as images, and are particularly effective at recognizing visual patterns by capturing local features through convolution operations.

The repository also includes models that use Transfer Learning, a technique that reuses high-performance models such as Vision Transformers (ViT) and YOLO11 pretrained on large datasets. This allows achieving high performance with lower computational costs. Transfer Learning takes advantage of the knowledge acquired by models trained on millions of images or videos and applies it to new, specific tasks, improving both efficiency and accuracy.

The following are the Computer Vision tasks I have implemented so far:

Image Classification: This task involves assigning a label or class to a whole image. The input consists of pixel values that make up an image, whether in grayscale or RGB, and the goal is to predict the class to which the image belongs.

Fine-Grained Image Classification on the CUB-200-2011 Dataset using ConvNeXt V2

Object Detection: Object detection models identify and locate instances of objects, such as cars, people, buildings, animals, etc., in images or videos. They return bounding box coordinates along with class labels and confidence scores for each detected object.

Oriented Object Detection on the DIOR-R Dataset using YOLO11-obb

Image Segmentation: Image segmentation classifies each pixel in an image into a category or a specific instance of a category. This task is divided into three types:
- Semantic Segmentation: Assigns a class label to each pixel in an image without distinguishing between different instances of the same class.
Semantic Segmentation on the LandCover.ai Dataset using SegFormer
- Instance Segmentation: Goes beyond Object Detection by labeling each pixel that belongs to a detected object with a specific class and instance. In this way, the models not only provide the coordinates of the bounding box, along with class labels and confidence scores, but also generate binary masks for each detected instance in an image.
Instance Segmentation on the BDD100K Dataset using YOLO11-seg
- Panoptic Segmentation: Combines semantic segmentation and instance segmentation by assigning each pixel in an image both a class and an instance label. This allows for a detailed segmentation of complex scenes.
Panoptic Segmentation on the LaRS Dataset using Mask2Former