Enhancing Subdomain Discovery: Evaluating AI-Based Word Embedding Models for Accurate Prediction
Master of Applied IT
Client company: Tom Broumels and Mark Madsen
Lazar Dimitrovski
Othman Kouhi
Project description
The central question of the project is how AI-based word embedding models, such as Word2Vec, FastText, and Doc2Vec, can enhance subdomain discovery and prediction within the Dutch .nl domain space. The design challenge is identifying hidden or rarely used subdomains, which traditional methods handle poorly because they rely on brute force or static heuristics. This research examines whether these models can reveal meaningful patterns and relationships in subdomain data and compares their performance under various configurations. The goal is to determine which model best balances precision and recall while remaining scalable and practical for organizations. The project also explores how AI-driven approaches can improve cybersecurity by identifying vulnerabilities in DNS structures and by making subdomain enumeration more accurate and efficient.
Context
This project focuses on subdomain discovery within the Domain Name System (DNS), a critical infrastructure for organizing and managing digital assets. Subdomains are widely used to structure services (e.g., mail.example.nl or blog.example.nl), enhance functionality, and improve security. However, discovering hidden or unused subdomains is challenging because the space of possible names is vast and discovery methods rely on incomplete data. Subdomains often host sensitive services, staging environments, or third-party applications, which makes identifying them essential for cybersecurity.
The research emphasizes Dutch domains under the .nl top-level domain (TLD), leveraging AI-driven word embedding models such as Word2Vec, FastText, and Doc2Vec. These models transform textual data (subdomain keywords) into vectors to reveal semantic relationships and patterns, potentially enhancing subdomain prediction accuracy.
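To illustrate the idea of turning subdomain keywords into vectors, the following is a minimal sketch rather than the project's implementation. It swaps the learned embeddings of Word2Vec, FastText, or Doc2Vec for simple co-occurrence counts so that it runs with the Python standard library alone. The toy data and the function names (cooccurrence_vectors, cosine, most_similar) are assumptions made for illustration.

```python
from collections import Counter
from itertools import permutations
from math import sqrt

# Hypothetical toy data: each inner list holds the subdomain labels
# observed under one registered domain.
observed = [
    ["mail", "webmail", "smtp"],
    ["mail", "smtp", "imap"],
    ["webmail", "imap", "smtp"],
    ["blog", "shop", "www"],
    ["www", "shop", "dev"],
]


def cooccurrence_vectors(domains):
    """Map each label to a Counter of the labels it appears alongside."""
    vectors = {}
    for labels in domains:
        for a, b in permutations(set(labels), 2):
            vectors.setdefault(a, Counter())[b] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def most_similar(label, vectors, topn=3):
    """Return the topn labels whose vectors are closest to the given label."""
    scores = {
        other: cosine(vectors[label], vec)
        for other, vec in vectors.items()
        if other != label
    }
    return sorted(scores, key=scores.get, reverse=True)[:topn]


vectors = cooccurrence_vectors(observed)
# Labels that share context with "mail" become candidate predictions
# for a domain where only "mail" is known so far.
candidates = most_similar("mail", vectors, topn=3)
```

In this toy data, the labels closest to "mail" are the other mail-related labels, while "blog" shares no context with it and scores zero. A trained embedding model plays the same role with dense vectors learned from far larger data.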
Results
The project demonstrates that AI-based word embedding models (Word2Vec, FastText, and Doc2Vec) significantly enhance subdomain discovery, addressing gaps in traditional methods such as brute-force enumeration or manual exploration. Doc2Vec proved to be the most effective model, achieving the highest precision (78.8%) and the highest F1-scores across various configurations, especially under stricter filtering criteria. Its ability to identify valid subdomains with fewer false positives makes it valuable for organizations that want to improve cybersecurity while minimizing the effort required for manual validation. Word2Vec stood out for its broad coverage, predicting the largest number of novel subdomains. This makes it particularly useful for exploratory tasks and for mapping uncommon or hidden subdomains, giving organizations a comprehensive view of their digital assets. FastText performed consistently across all configurations, making it a reliable option for identifying subdomains with unique or rare keywords.
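For readers unfamiliar with the metrics above, the sketch below shows how precision, recall, and F1 could be computed for a set of predicted subdomains. It assumes that a set of validated subdomains is available (for example, names confirmed to exist). The evaluate function and the example hostnames are hypothetical, and the project's actual evaluation procedure may differ.

```python
def evaluate(predicted, valid):
    """Return (precision, recall, f1) for predictions against a validated set."""
    predicted, valid = set(predicted), set(valid)
    true_positives = len(predicted & valid)
    # Precision: share of predictions that turned out to be valid.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of valid subdomains that the model predicted.
    recall = true_positives / len(valid) if valid else 0.0
    # F1: harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical predictions and validated subdomains for one domain.
predicted = ["mail.example.nl", "vpn.example.nl", "dev.example.nl", "test.example.nl"]
valid = ["mail.example.nl", "vpn.example.nl", "blog.example.nl"]

precision, recall, f1 = evaluate(predicted, valid)
```

Here two of the four predictions are valid and two of the three valid subdomains are found, so precision is 0.5 and recall is about 0.67. A model with high precision, as reported for Doc2Vec, produces fewer false positives that would otherwise need manual checking.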
The results showed that AI models significantly outperformed baseline methods, such as random subdomain selection or keyword-based predictions. Doc2Vec achieved an accuracy of 74.85%, compared to only 0.76% for the baseline methods. This demonstrates the substantial advantage of AI-driven approaches in predicting meaningful subdomains. These findings validate the use of word embedding models for enhancing cybersecurity through accurate subdomain enumeration. They provide a scalable, automated solution that helps organizations identify vulnerabilities, manage digital assets effectively, and strengthen their overall DNS security practices.