We propose a model that generates typo-squatting domains to which people are prone to mistakenly send email, built from domain typos collected in the real world. To construct this model, we created an email address dataset based on actual domains and ran an experiment in which 300 participants each typed 40 email addresses. This allowed us to gather the domain typos that occur in email addresses.
This repository provides access to the typo domains collected from our study.
The typo domains collected in our study meet the following criteria:
- Domains conforming to RFC 5321 and RFC 5322
- Domains using top-level domains (TLDs) approved by ICANN
- Domains with a Damerau-Levenshtein distance of 1 to 5 from the correct domain
The table below provides examples of typo domains:
Correct Domain | Typo Domain | Description |
---|---|---|
example.com | exampl.com | Deletion |
example.com | eaxmple.com | Transposition |
example.com | example.co | TLD Change |
example.com | example.co.jp | Damerau-Levenshtein distance = 3 |
example.co.jp | exampleco.jp | Deletion of (.) |
The table below presents examples of typo domains not collected in this study:
Correct Domain | Excluded Typo Domain | Description |
---|---|---|
example.com | example..com | Does not conform to RFC |
example.com | example.coom | TLD not approved by ICANN |
example.com | exampletyposquatting.com | Damerau-Levenshtein distance ≥ 6 |
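
The distance criterion and the examples in the two tables above can be checked with a short sketch. It rests on several assumptions: it uses the optimal string alignment variant of the Damerau-Levenshtein distance (this README does not say which variant the study used), the function names `osa_distance`, `has_valid_labels`, and `within_collection_range` are hypothetical, and the label check is only a simplified approximation of the RFC 5321 domain syntax. The ICANN TLD criterion is not covered, so a domain such as example.coom would still pass this sketch.

```python
import re

# Simplified label pattern (assumption): letters, digits, and hyphens,
# starting and ending with a letter or digit.
_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")


def has_valid_labels(domain: str) -> bool:
    """Return True if every dot-separated label is non-empty and well formed."""
    return all(_LABEL.fullmatch(label) for label in domain.split("."))


def osa_distance(a: str, b: str) -> int:
    """Damerau-Levenshtein distance (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def within_collection_range(correct: str, typo: str, low: int = 1, high: int = 5) -> bool:
    """Check the syntax approximation and the distance range of 1 to 5."""
    return has_valid_labels(typo) and low <= osa_distance(correct, typo) <= high


if __name__ == "__main__":
    for typo in ("exampl.com", "eaxmple.com", "example.co.jp", "example..com"):
        print(typo, osa_distance("example.com", typo),
              within_collection_range("example.com", typo))
```

Counting a transposition as a single edit is what gives eaxmple.com a distance of 1 rather than 2, which is why a Damerau-Levenshtein variant is used here instead of plain Levenshtein distance.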
We provide the following datasets:
- `typosquatting_domains_one.csv`: Collected typo domains (Damerau-Levenshtein distance of 1 from the correct domain)
- `typosquatting_domains_two_or_more.csv`: Collected typo domains (Damerau-Levenshtein distance of 2 to 5)
- `domains.csv`: Domains used in the experiment
  - Includes domains with MX records from the TOPIX-listed companies
- `local_parts.csv`: Local parts used in the experiment
  - Generated based on the Ministry of Health, Labour and Welfare’s Vital Statistics Survey using Japanese surnames and given names
  - Surnames: Top 5,000 surnames from Myoji-Yurai.net
  - Given names: Names appearing in the Meiji Yasuda Life Name Rankings from 1958 to 2024
  - Name combinations: Generated based on Interseller’s research on email address patterns
- `correct_email_addresses.csv`: Email addresses used in the experiment
  - Randomly generated by combining domains from `domains.csv` and local parts from `local_parts.csv`
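
As a starting point for working with these files, the sketch below reads a CSV and previews its first rows. The column layout is not documented in this README, so the code makes no assumption about column names and simply returns raw rows. UTF-8 encoding is also an assumption, and `read_csv_rows` and `preview` are hypothetical helper names.

```python
import csv
from pathlib import Path


def read_csv_rows(path):
    """Read every non-empty row of a CSV file as a list of strings."""
    # UTF-8 is an assumption; adjust if the files use another encoding.
    with Path(path).open(newline="", encoding="utf-8") as f:
        return [row for row in csv.reader(f) if row]


def preview(path, limit=5):
    """Return the first `limit` rows so the column layout can be inspected."""
    return read_csv_rows(path)[:limit]


if __name__ == "__main__":
    for name in ("typosquatting_domains_one.csv",
                 "typosquatting_domains_two_or_more.csv"):
        if Path(name).exists():
            print(name)
            for row in preview(name):
                print("  ", row)
```

Inspecting the first rows before any further processing shows whether a header row is present and which column holds the correct domain and which holds the typo.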
If you use our dataset, please cite our paper.
The citation should look like this in a paper written in English (or any non-Japanese language):
Soma Sugahara, Rannosuke Hoshina, Tetsutaro Uehara: "Proposal of a Typosquatting Domain Generation Model based on the analysis of Typographical Error Tendencies", IPSJ SIG Technical Reports, Vol. 2025-IOT-68, No.59, p. 1-8, 2025
The citation should look like this in a paper written in Japanese:
菅原颯真, 星名藍乃介, 上原哲太郎: "タイプミス傾向の分析に基づくタイポスクワッティングドメイン生成モデルの提案", 研究報告インターネットと運用技術(IOT), Vol. 2025-IOT-68, No.59, p. 1-8, 2025
Tetsutaro Uehara (College of Information Science and Engineering, Ritsumeikan University)