Augmenting CVE Descriptions with Stable Diffusion to Balance Dataset for CWE Classification

Problem Statement:

Mapping Common Vulnerabilities and Exposures (CVEs) to Common Weakness Enumerations (CWEs) is crucial for understanding and mitigating security vulnerabilities. However, the available CVE descriptions are unevenly distributed across CWEs: some classes, such as CWE-79 (Cross-Site Scripting), have significantly more associated CVEs than others. This imbalance is a problem when training machine learning classifiers such as neural networks, which tend to favor majority classes when trained on skewed data. Without rebalancing, the resulting models are likely to be biased toward frequent CWEs and to generalize poorly to underrepresented ones. This thesis proposes a data augmentation method based on stable diffusion to synthetically balance the distribution of CVE descriptions across CWEs.

Research Objectives:

Literature Review
1. Explore data augmentation techniques in machine learning, particularly stable diffusion.
2. Identify previous applications of diffusion models to text augmentation and within the security domain.
Dataset Analysis
1. Analyze the distribution of CVE descriptions across CWEs, identifying the most underrepresented classes.
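
The distribution analysis can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the dataset has already been reduced to one CWE label per CVE description, and the function names, the toy labels, and the choice of the largest class as the balancing target are all assumptions rather than part of the proposal.

```python
from collections import Counter


def class_distribution(labels):
    """Return (cwe_id, count, share) tuples, rarest class first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return sorted(
        ((cwe, n, n / total) for cwe, n in counts.items()),
        key=lambda row: (row[1], row[0]),
    )


def augmentation_targets(labels, target=None):
    """Number of synthetic descriptions each class needs to reach `target`.

    If no target is given, the size of the largest class is used
    (an assumption; a lower cap may be more practical).
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    return {cwe: target - n for cwe, n in counts.items() if n < target}


# Toy labels standing in for a real CVE-to-CWE mapping (hypothetical data).
labels = ["CWE-79"] * 6 + ["CWE-89"] * 3 + ["CWE-22"] * 1

for cwe, n, share in class_distribution(labels):
    print(f"{cwe}: {n} descriptions ({share:.0%})")

print(augmentation_targets(labels))
```

The second function turns the distribution into a per-class augmentation budget, which is the quantity the augmentation step would need as input.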
Application of Stable Diffusion for Augmentation
1. Develop a method for augmenting CVE descriptions using stable diffusion.
2. Ensure that the generated descriptions are semantically relevant to the original CWE class.
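
One simple way to screen generated descriptions for relevance is to compare each candidate with the real descriptions of its target CWE and discard candidates that fall below a similarity threshold. The sketch below uses bag-of-words cosine similarity as a crude lexical stand-in for semantic similarity. The proposal does not specify a relevance check, so the function names, the threshold of 0.3, and the example texts are assumptions, and a real pipeline would more plausibly use sentence embeddings.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_candidates(real_descriptions, candidates, threshold=0.3):
    """Keep candidates whose similarity to the class centroid meets the threshold.

    The centroid is the summed token counts of the real descriptions
    for one CWE class. The threshold is an illustrative value only.
    """
    centroid = Counter()
    for description in real_descriptions:
        centroid.update(bag_of_words(description))
    return [c for c in candidates if cosine(bag_of_words(c), centroid) >= threshold]


# Hypothetical real descriptions for CWE-79 and two generated candidates.
real = [
    "Cross-site scripting in the search form allows remote attackers to inject arbitrary web script",
    "Stored cross-site scripting in the comment field allows attackers to inject HTML",
]
candidates = [
    "Reflected cross-site scripting in the login page allows attackers to inject script",
    "Buffer overflow in the image parser causes a crash",
]

kept = filter_candidates(real, candidates)
print(kept)
```

Because stop words are not removed, unrelated texts still receive a small nonzero score, so the threshold would need tuning on real data.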
Model Training and Evaluation
1. Train a classifier, such as a neural network, on the augmented dataset and evaluate its performance, particularly its ability to generalize to underrepresented CWE classes.
Assessment of Augmentation Impact
1. Measure the impact of the stable diffusion-based augmentation on classification performance relative to a classifier trained on the original, unbalanced data, focusing on improvements in precision, recall, and overall accuracy.
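
The metrics named above can be computed per CWE class, which matters here because overall accuracy can hide poor performance on rare classes. The following is a minimal sketch using only the Python standard library. The function name and the toy predictions are hypothetical, and returning 0.0 for an undefined ratio is a convention chosen for this example.

```python
from collections import Counter


def per_class_report(y_true, y_pred):
    """Return ({cwe: {"precision": p, "recall": r}}, overall_accuracy)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    report = {}
    for cwe in sorted(set(y_true) | set(y_pred)):
        predicted = tp[cwe] + fp[cwe]
        actual = tp[cwe] + fn[cwe]
        report[cwe] = {
            # 0.0 is used when the ratio is undefined (no predictions or no samples).
            "precision": tp[cwe] / predicted if predicted else 0.0,
            "recall": tp[cwe] / actual if actual else 0.0,
        }
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return report, accuracy


# Toy example: a classifier that over-predicts the majority class.
y_true = ["CWE-79", "CWE-79", "CWE-79", "CWE-89", "CWE-22"]
y_pred = ["CWE-79", "CWE-79", "CWE-79", "CWE-79", "CWE-22"]

report, accuracy = per_class_report(y_true, y_pred)
for cwe, scores in report.items():
    print(cwe, scores)
print("accuracy:", accuracy)
```

In this toy case overall accuracy is 0.8 even though recall for CWE-89 is 0.0, which is why the comparison between the baseline and the augmented model is more informative when reported per class.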