98 changes: 85 additions & 13 deletions README.md
@@ -1,15 +1,12 @@
# 811372A-3007 Software Development, Maintenance and Operations 2025 Projects
# 🧩 Project 1: Developer De-duplication

This repository contains example data and scripts showcasing data collection and processing
for projects of the Software Development, Maintenance and Operations.
This project is part of **811372A-3007 Software Development, Maintenance and Operations**, showcasing data collection and processing for developer identity resolution in large open-source repositories.

The three projects are:
---

- Project 1: Developer de-duplication
- Project 2: Temporal centrality and monitoring metrics
- Project 3: Temporal state detection and anomaly detection
## 📁 Repository Used

## Contents
For this project, we used the open-source **Servo** repository, which has more than 1,000 contributors and more than 1,000 commits.

- `project1devs/`: Directory with data for Project 1
- `devs.csv`: List of developers mined from eShopOnContainersProject
@@ -28,12 +25,87 @@ The three projects are:
- `project3anomaly.py`: Script demonstrating temporal state detection on temporal networks of microservice calls for Project 3
- `requirements.txt`: List of used libraries with specified versions

---

## Running the scripts
## 🎯 Objective

The scripts were developed and tested on a Mac (UNIX) environment with Python 3.10.
There should be no compatibility issues with running the scripts on Windows.
The goal was to identify and merge duplicate developer identities that arise due to:
- Different emails used on various devices
- Misspellings, casing, or formatting differences
- Git misconfigurations (usernames instead of names)
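
Variations like these are what a normalized edit-distance comparison is meant to absorb. The sketch below is illustrative and is not the project's implementation: `normalize`, `levenshtein`, and `similarity` are hypothetical helper names, and the cleanup rules (lowercasing, accent stripping, treating punctuation as spaces) are assumptions.

```python
import unicodedata


def normalize(name: str) -> str:
    """Lowercase, strip accents, and turn punctuation into single spaces (assumed cleanup)."""
    decomposed = unicodedata.normalize("NFKD", name)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    cleaned = "".join(ch if ch.isalnum() else " " for ch in without_accents.lower())
    return " ".join(cleaned.split())


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(similarity("John Smith", "john.smith"))
    print(similarity("Jon Smith", "John Smith"))
```

Two identities would then be treated as candidate duplicates when the score reaches a chosen threshold, such as the `t = 0.7` reported for the Bird baseline.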

The versions of imported libraries are provided in `requirements.txt`.
We began with the **Bird heuristic**, implemented its logic, and then designed improved solutions using:
1. **Jaro–Winkler + Domain Weighted Matching**
2. **Semantic Similarity using Sentence Transformers + FAISS**
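
As a rough illustration of the first improved approach, the sketch below combines Jaro–Winkler similarity on names and email local parts with an exact domain check. It is a minimal sketch under stated assumptions: the pure-Python `jaro` and `jaro_winkler` functions stand in for the `jellyfish` library, and `match_score` with its weights (0.6, 0.3, 0.1) is hypothetical, not the weighting used in the project.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: matching characters within a window, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    seq1 = [s1[i] for i in range(len1) if matched1[i]]
    seq2 = [s2[j] for j in range(len2) if matched2[j]]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


def match_score(dev_a, dev_b, w_name=0.6, w_local=0.3, w_domain=0.1):
    """Weighted score over (name, email) pairs; the weights are illustrative assumptions."""
    name_a, email_a = dev_a
    name_b, email_b = dev_b
    local_a, _, domain_a = email_a.lower().partition("@")
    local_b, _, domain_b = email_b.lower().partition("@")
    name_sim = jaro_winkler(name_a.lower(), name_b.lower())
    local_sim = jaro_winkler(local_a, local_b)
    domain_sim = 1.0 if domain_a and domain_a == domain_b else 0.0
    return w_name * name_sim + w_local * local_sim + w_domain * domain_sim


if __name__ == "__main__":
    a = ("Jane Doe", "jane.doe@x.org")
    b = ("jane doe", "jdoe@x.org")
    print(match_score(a, b) >= 0.75)  # compare against a threshold such as t = 0.75
```

Weighting the domain separately lets a shared organization address nudge a borderline name match over the threshold without letting the domain alone decide the outcome.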

It is recommended to create a Python virtual environment and install the exact versions there.
---

## ⚙️ Approaches and Results

| Method | Precision | Recall | F1 Score | Description |
|--------|------------|---------|-----------|--------------|
| **Bird Heuristic (t = 0.7)** | 0.066 | 0.695 | 0.121 | Baseline heuristic using Levenshtein similarity and name-prefix checks. High recall, poor precision. |
| **Improved Heuristic (Jaro–Winkler, t = 0.75)** | 0.301 | 0.375 | 0.334 | Added Jaro–Winkler similarity, domain check, and weighted scoring. Better balance between recall and precision. |
| **FAISS Semantic (t = 0.8)** | **0.411** | **0.606** | **0.490** | Used SentenceTransformer embeddings + FAISS similarity search. Best precision and F1 of the three methods; recall is second only to Bird. |
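
The F1 column follows from the other two columns as the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes it from the reported precision and recall values; `f1_score` and the `reported` dictionary are names introduced here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above
reported = {
    "Bird heuristic (t = 0.7)": (0.066, 0.695, 0.121),
    "Improved heuristic (t = 0.75)": (0.301, 0.375, 0.334),
    "FAISS semantic (t = 0.8)": (0.411, 0.606, 0.490),
}

for method, (p, r, f1) in reported.items():
    print(f"{method}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, Bird's high recall cannot compensate for its very low precision.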

---

## 📊 Visual Comparison

The following chart shows how each method performs in terms of **Precision**, **Recall**, and **F1 Score**:

![Performance Comparison](week2/faiss/metrics/all%20comparisons.png)

> The semantic FAISS model achieved the **highest F1 score (0.490)**. Compared to the Bird baseline, precision rose from 0.066 to 0.411, meaning far fewer false positives, while recall dropped only moderately (0.695 to 0.606).

---

## 🧠 Insights

- **Bird heuristic** → finds most duplicates (recall 0.695) but with very low precision (0.066), so most proposed matches are false positives.
- **Improved heuristic** → raises precision to 0.301 by filtering weak matches, at the cost of lower recall (0.375).
- **FAISS semantic model** → captures name variations and nicknames through embedding similarity, giving the best precision and F1 of the three methods.
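
The core of the semantic approach is a nearest-neighbour search over normalized vectors, where the inner product equals cosine similarity. The sketch below shows that computation by brute force in plain Python. It rests on loud assumptions: `embed` is a toy character-trigram vectorizer standing in for SentenceTransformer embeddings, the pairwise loop stands in for a FAISS index, and `candidate_pairs` is a hypothetical helper, not the project's code.

```python
import math
from collections import Counter


def embed(text: str, n: int = 3) -> dict:
    """Toy stand-in for an embedding: L2-normalized character n-gram counts."""
    padded = f"  {text.lower()} "
    counts = Counter(padded[i:i + n] for i in range(len(padded) - n + 1))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}


def cosine(u: dict, v: dict) -> float:
    """Inner product of two unit-length sparse vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(weight * v.get(gram, 0.0) for gram, weight in u.items())


def candidate_pairs(identities, threshold=0.8):
    """Return every pair of identities whose similarity reaches the threshold."""
    vectors = [embed(identity) for identity in identities]
    pairs = []
    for i in range(len(identities)):
        for j in range(i + 1, len(identities)):
            score = cosine(vectors[i], vectors[j])
            if score >= threshold:
                pairs.append((identities[i], identities[j], round(score, 3)))
    return pairs


if __name__ == "__main__":
    devs = ["jane doe jane@x.org", "Jane Doe jane@x.org", "bob lee bob@y.com"]
    for pair in candidate_pairs(devs):
        print(pair)
```

The brute-force loop compares every pair, which grows quadratically with the number of identities; a vector index such as FAISS is typically used to make this lookup practical on larger contributor lists.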

---

## 📦 Tools & Libraries

- **PyDriller** → commit mining
- **pandas** → data handling
- **python-Levenshtein** → similarity scoring (Bird)
- **jellyfish** → Jaro–Winkler similarity
- **SentenceTransformers** → text embeddings
- **FAISS** → scalable vector similarity search
- **matplotlib** → visualization

---

## 📁 Folder Structure

```text
SDMO2025Project_Group8/
├── week1/
│ ├── devs/ # Bird heuristic outputs
│ ├── metrics/bird/ # Bird metrics and charts
│ └── project1developers.py # Bird heuristic implementation
├── week2/
│ ├── jaro-winkler/ # Improved heuristic (Jaro–Winkler)
│ ├── faiss/ # Semantic + FAISS implementation
│ └── metrics/ # Visual and CSV performance reports
└── README.md

```

---

## 🧾 Conclusion

The **semantic FAISS-based model** achieved the best F1 score (0.490) of the three methods, making it the strongest candidate among them for real-world developer identity de-duplication.
It handles name variations that string heuristics miss and, with far fewer false positives than Bird’s method, reduces the amount of manual validation needed.

---

📈 *Group 8 SDMO2025 – University of Oulu*
255 changes: 0 additions & 255 deletions project1devs/devs.csv

This file was deleted.
