This project aims to evaluate the performance of the Segment Anything Model (SAM) in anomaly segmentation without additional training. The goal is to determine whether a foundation model like SAM can effectively segment anomalies.
Recent advances in anomaly detection have largely followed an unsupervised learning paradigm, primarily due to the scarcity of labeled anomalous data. These methods typically train models using only normal samples, detecting anomalies as deviations from the learned distribution. While effective in controlled settings, this approach suffers from several limitations:
- It generally follows a One-Class One-Model paradigm, requiring a separate model for each category or domain.
- Performance is often highly sensitive to the statistical distribution of the validation data, making generalization difficult.
- Most models struggle with pixel-wise anomaly segmentation, especially in complex or noisy backgrounds.
To overcome these limitations, the field has recently turned toward leveraging foundation models—large-scale models pretrained on diverse datasets—to build unified, generalizable anomaly detection and segmentation frameworks. These models, such as CLIP, DINOv2, and SAM (Segment Anything Model), exhibit strong zero-shot or few-shot capabilities, making them ideal for domains with limited supervision.
In line with this trend, this project aims to explore the feasibility of using SAM for anomaly segmentation tasks. While SAM has demonstrated impressive generalization in natural image segmentation, its ability to localize anomalous regions without supervision remains largely unexplored.
This work investigates whether SAM can effectively segment anomalies when provided with appropriate prompt inputs. Furthermore, we propose a method to automatically generate these prompts from feature-based anomaly maps, simulating real-world scenarios where ground-truth masks are not available.
We hope this exploration can contribute to future research on building foundation model-based anomaly segmentation pipelines that require minimal task-specific tuning.
- The evaluation is conducted on the MVTec-AD dataset, a standard benchmark for industrial anomaly detection.
To explore the potential of the Segment Anything Model (SAM) in anomaly segmentation tasks, we first aimed to answer a fundamental question:
“If we provide well-designed prompts, can SAM accurately segment anomalous regions without additional training?”
To verify this, we conducted controlled experiments using ground-truth (GT) masks from benchmark datasets.
We developed a method to generate different types of SAM prompts based on GT masks:
- Box only: Bounding boxes that tightly cover the anomalous regions.
- Box + 1 point: Bounding boxes with a single point located inside the anomalies.
- Box + multiple points: Bounding boxes with 20 points sampled within the anomalous regions.
This setup allowed us to simulate varying levels of prior information and test SAM’s robustness under different prompt configurations.
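For reference, the sketch below shows one way such GT-based prompts can be built and passed to SAM's predictor. The helper name, point-sampling strategy, and checkpoint path are illustrative assumptions and may differ from what `test_mvtec.py` actually does.

```python
# Minimal sketch: derive SAM prompts from a GT anomaly mask and run SAM's predictor.
# The helper name, point-sampling strategy, and checkpoint path are illustrative
# assumptions; the project's test_mvtec.py may differ in detail.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

def prompts_from_gt(mask, n_points=20, seed=0):
    """mask: (H, W) binary GT mask -> XYXY box, foreground points, point labels."""
    ys, xs = np.nonzero(mask)
    box = np.array([xs.min(), ys.min(), xs.max(), ys.max()])
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=min(n_points, len(xs)), replace=False)
    points = np.stack([xs[idx], ys[idx]], axis=1)    # (N, 2) in (x, y) order, as SAM expects
    labels = np.ones(len(idx), dtype=np.int64)       # 1 = foreground point
    return box, points, labels

# image: (H, W, 3) uint8 RGB test image; gt_mask: its binary GT mask (assumed already loaded).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed checkpoint file
predictor = SamPredictor(sam)
predictor.set_image(image)
box, points, labels = prompts_from_gt(gt_mask, n_points=20)
masks, scores, _ = predictor.predict(point_coords=points, point_labels=labels,
                                     box=box, multimask_output=False)
pred_mask = masks[0]    # (H, W) boolean mask predicted for the prompted region
```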
In real-world applications, GT masks are not available during inference. Therefore, it is necessary to develop a practical method for approximating the location of anomalies without relying on manual annotations.
To address this, we propose using anomaly maps as a proxy. These maps are generated by computing feature similarity between a query image and a few normal reference samples, following a strategy inspired by few-shot anomaly detection. The resulting similarity map highlights regions that deviate from normal patterns and provides a coarse localization of potential anomalies.
In this context, the anomaly map plays a crucial role by serving as a spatial guide that informs SAM where potential anomalies are likely to exist, enabling it to focus its segmentation on those regions.
The anomaly map is then binarized via thresholding, and the resulting binary mask is used to automatically generate prompt inputs for SAM (e.g., points or bounding boxes). This enables anomaly segmentation without GT masks, simulating real-world scenarios where manual annotations are unavailable. Importantly, the rough localization provided by the anomaly map can be refined by SAM, enabling precise segmentation of anomalous regions with minimal supervision.
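As a rough illustration of this pipeline, the sketch below computes an anomaly map from patch-feature cosine similarity against a few normal references and converts it into a box prompt. The feature shapes, interpolation size, and threshold are assumptions; the actual implementation (which uses a Swin Transformer encoder, see below) may differ.

```python
# Minimal sketch: build an anomaly map from patch-feature similarity against a few
# normal reference images, then binarize it into a SAM box prompt. The feature
# shapes, interpolation size, and threshold are assumptions, not the project's code.
import numpy as np
import torch
import torch.nn.functional as F

def anomaly_map_from_features(feat_q, feat_refs, out_size=256):
    """feat_q and each feat_refs[i]: (C, H, W) patch features from any backbone."""
    C, H, W = feat_q.shape
    q = F.normalize(feat_q.reshape(C, -1), dim=0)              # (C, H*W)
    dists = []
    for feat_r in feat_refs:
        r = F.normalize(feat_r.reshape(C, -1), dim=0)          # (C, H*W)
        sim = q.T @ r                                          # cosine similarity, (H*W, H*W)
        dists.append(1.0 - sim.max(dim=1).values)              # distance to closest normal patch
    amap = torch.stack(dists).min(dim=0).values.reshape(1, 1, H, W)
    amap = F.interpolate(amap, size=(out_size, out_size), mode="bilinear", align_corners=False)
    return amap[0, 0].numpy()

def box_prompt_from_map(amap, thresh=0.5):
    """Normalize the map to [0, 1], threshold it, and return an XYXY box for SAM."""
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    ys, xs = np.where(amap > thresh)
    if len(xs) == 0:                                           # nothing above threshold: whole image
        return np.array([0, 0, amap.shape[1] - 1, amap.shape[0] - 1])
    return np.array([xs.min(), ys.min(), xs.max(), ys.max()])

# Toy usage with random tensors standing in for Swin Transformer feature maps.
amap = anomaly_map_from_features(torch.randn(256, 16, 16),
                                 [torch.randn(256, 16, 16) for _ in range(4)])
box = box_prompt_from_map(amap, thresh=0.7)
```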
- Prepare the MVTec-AD dataset.
- Run `test_mvtec.py` to evaluate SAM’s anomaly segmentation.
- Use `anomaly_map.py` to generate anomaly maps for SAM prompts.
`test_mvtec.py` creates SAM's prompt input using GT masks and saves the anomaly segmentation results. It consists of four different prompt settings:

- Creates a box prompt covering the entire image: `python test_mvtec.py --data_dir "DATASET_DIR"`
- Creates a bounding box around the anomalous region: `python test_mvtec.py --data_dir "DATASET_DIR" --mode b`
- Creates a bounding box along with one point inside the anomalous region: `python test_mvtec.py --data_dir "DATASET_DIR" --mode bp`
- Creates a bounding box along with multiple points inside the anomalous region: `python test_mvtec.py --data_dir "DATASET_DIR" --mode bps`

`anomaly_map.py` generates a heatmap-like anomaly localization map based on feature similarity between a query image and normal images.
- Uses Swin Transformer as the image encoder to extract features.
- The generated anomaly map is used to create SAM's prompt input.
`python test_mvtec.py --data_dir "DATASET_DIR" --save_dir "SAVE_DIR"`

The table below compares the anomaly segmentation performance of SAM under different prompt conditions.
- The first three columns show results when prompts are generated using ground-truth (GT) masks.
- The last column shows performance when using prompts generated by our few-shot anomaly map method, which does not rely on GT masks.
Each cell shows results in the format (IoU / P-AUROC), capturing both localization and segmentation quality.
| Type | SAM (Box only) | SAM (Box with 1 point) | SAM (Box with 20 points) | SAM (Few-shot) (Box with 1 point) |
|---|---|---|---|---|
| Bottle | 76.8 / 97.0 | 74.8 / 95.9 | 78.8 / 99.1 | 51.5 / 81.4 |
| Cable | 69.2 / 96.3 | 68.9 / 96.2 | 72.6 / 98.4 | 55.2 / 82.4 |
| Capsule | 57.9 / 97.3 | 59.1 / 97.5 | 58.4 / 99.3 | 54.7 / 87.6 |
| Carpet | 59.3 / 97.7 | 59.3 / 97.8 | 52.7 / 97.1 | 45.9 / 83.0 |
| Grid | 51.9 / 84.0 | 56.4 / 86.1 | 47.4 / 95.2 | 33.4 / 71.5 |
| Hazelnut | 74.9 / 96.6 | 74.6 / 96.4 | 75.5 / 98.3 | 49.0 / 83.8 |
| Leather | 58.1 / 98.4 | 60.0 / 99.0 | 57.6 / 99.5 | 43.5 / 84.5 |
| Metal Nut | 78.4 / 98.0 | 78.9 / 98.1 | 77.4 / 98.9 | 50.2 / 83.3 |
| Pill | 73.4 / 98.1 | 72.6 / 98.7 | 68.7 / 99.4 | 49.8 / 85.8 |
| Screw | 62.0 / 91.2 | 64.9 / 94.6 | 66.8 / 99.3 | 69.9 / 88.4 |
| Tile | 72.8 / 92.1 | 70.0 / 88.0 | 67.3 / 95.6 | 45.5 / 80.7 |
| Toothbrush | 71.7 / 96.0 | 69.8 / 96.6 | 67.6 / 98.1 | 56.2 / 89.9 |
| Transistor | 50.3 / 88.5 | 54.0 / 91.3 | 57.1 / 94.1 | 43.8 / 77.9 |
| Wood | 71.9 / 89.8 | 71.5 / 89.5 | 70.0 / 94.2 | 39.4 / 75.8 |
| Zipper | 60.0 / 92.0 | 64.6 / 93.0 | 69.1 / 97.3 | 46.8 / 80.8 |
| Unified | 65.9 / 94.2 | 66.6 / 94.6 | 65.8 / 97.6 | 49.0 / 82.3 |
- Box only: Bounding boxes that tightly cover the anomalous regions.
- Box with 1 point: Bounding boxes with a single point located inside the anomalies.
- Box with 20 points: Bounding boxes with 20 points sampled within the anomalous regions.
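For clarity, the sketch below shows how the two metrics can be computed for a single image, assuming IoU is taken between the binarized SAM mask and the GT mask, and P-AUROC is the pixel-level AUROC of per-pixel anomaly scores; how these are aggregated into the table (per image vs. pooled per category) is an assumption.

```python
# Sketch of the two metrics, computed per image: IoU between the binarized SAM mask
# and the GT mask, and pixel-level AUROC over per-pixel anomaly scores. How these
# are aggregated into the table (per image vs. pooled per category) is an assumption.
import numpy as np
from sklearn.metrics import roc_auc_score

def iou(pred_mask, gt_mask):
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def pixel_auroc(scores, gt_mask):
    # scores: per-pixel anomaly scores (e.g., SAM mask logits), gt_mask: binary GT mask
    return roc_auc_score(gt_mask.reshape(-1).astype(int), scores.reshape(-1))
```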
- Performance Trend: 20 points > 1 point > Box only
  - In most object types, providing 20 points leads to the highest performance.
  - Significant improvements are observed for complex shapes or fine-grained textures such as `Cable`, `Screw`, `Zipper`, and `Pill`.
  - Average performance (`Unified` row):
    - Box only: 65.9 / 94.2
    - 1 point: 66.6 / 94.6
    - 20 points: 65.8 / 97.6 → notable boost in P-AUROC score
- Single-point supervision already improves results
  - Even adding just one point shows measurable improvements over bounding box alone in categories like `Capsule`, `Carpet`, `Grid`, and `Metal Nut`.
  - This suggests minimal user interaction can meaningfully enhance segmentation accuracy.
- Per-category insights
  - Categories like `Metal Nut`, `Bottle`, and `Pill` perform consistently well under all settings.
  - On the other hand, classes like `Grid`, `Transistor`, and `Carpet` struggle under Box-only supervision and benefit substantially from additional points.
- P-AUROC is more sensitive to refinement
  - P-AUROC tends to show larger improvements than IoU when additional points are provided.
  - For example, in `Leather`, `Capsule`, and `Screw`, P-AUROC reaches 99+ with 20 points.
Through our experiments, we observed that SAM’s performance significantly degrades when inaccurate or suboptimal prompts are provided. The method we used to generate these prompts—based on feature similarity and anomaly maps—has several limitations that must be considered:
- Sensitivity to structural mismatch between query and reference images:
  The anomaly map relies on feature similarity between the query image and a small set of normal reference images. If the reference samples have significantly different shapes or structures compared to the query, the resulting similarity map may fail to accurately highlight the anomalous regions.
- Difficulty in threshold selection during binarization:
  While the goal is to provide a rough localization of potential anomalies to SAM, the process of converting the anomaly map into a binary mask is highly sensitive to the threshold value. An inappropriate threshold can introduce substantial noise or cause important regions to be missed, affecting the quality of the prompts fed into SAM.
These limitations suggest that refining the anomaly map generation pipeline—especially in terms of reference selection and adaptive thresholding—could lead to more robust and generalizable anomaly segmentation using SAM.
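As one concrete example of what such adaptive thresholding could look like, the sketch below swaps the fixed threshold for Otsu's method; this is an illustrative alternative, not part of the current pipeline.

```python
# Illustrative sketch only: replace the fixed threshold with Otsu's method when
# binarizing the anomaly map. This is one possible form of adaptive thresholding,
# not part of the current pipeline.
import numpy as np
from skimage.filters import threshold_otsu

def binarize_adaptive(amap):
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    return (amap > threshold_otsu(amap)).astype(np.uint8)
```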

