Schema-Based Synthetic Knowledge Graph Generation Using AI Techniques

MASTER Assignment

Type : Master M-CS

Period: March - August, 2025

Student : Arzani, R. (Radmehr, Student M-CS)

Date Final project: August 21, 2025

Thesis

Supervisors:

Abstract:

Accessing real-world data is often problematic due to privacy concerns, especially in sensitive domains. As a result, synthetic data is increasingly used as a proxy for real data. However, most current approaches to synthetic data generation still rely heavily on access to real datasets: they typically analyze real data to extract distributions and feature relationships. Synthetic data is used not only to train generative models but also to test infrastructure and tools and to carry out real-world analyses over sensitive data. Common techniques such as domain adaptation mostly treat synthetic data as a source context and apply the trained models to real-world target contexts. These approaches are valuable for augmenting limited datasets or enabling training in virtual environments, yet they remain fundamentally dependent on real data. This dependency can introduce privacy risks: in many cases, even limited exposure to real datasets may lead a model to memorize outliers or unique data points. As a result, the development of domain-independent, privacy-preserving synthetic data generators remains a critical challenge. Recent work has begun exploring synthetic data generation from data schema constraints alone, without access to any real samples. However, these efforts often fail to capture the complex interrelationships and statistical distributions found in real data. This thesis introduces SRDF-GEN, a web-based synthetic RDF data generator, to investigate a novel domain-independent approach to generating high-fidelity synthetic RDF data using AI data generation models. The approach is guided specifically by SHACL constraints, with the goal of maintaining structural and semantic coherence while safeguarding data privacy. Using several deep generative models and knowledge graph ontologies, this work aims to close the gap between realism and privacy in synthetic data generation.
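To make the idea of generating data from schema constraints alone more concrete, the following is a minimal sketch and not the SRDF-GEN implementation. It assumes a simplified, hypothetical dictionary format standing in for a SHACL node shape (cardinality bounds, a datatype, and an integer range), and it uses plain uniform random sampling in place of the deep generative models the thesis investigates. The `http://example.org/` IRIs, `PERSON_SHAPE`, `generate_entities`, and `conforms` are all illustrative names.

```python
import random

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical, simplified stand-in for a SHACL node shape. The keys mirror
# sh:minCount / sh:maxCount and sh:minInclusive / sh:maxInclusive, but this
# is not real SHACL syntax and no real data is consulted.
PERSON_SHAPE = {
    "target_class": "http://example.org/Person",
    "properties": {
        "http://example.org/age": {
            "datatype": "integer", "min_count": 1, "max_count": 1,
            "min_inclusive": 0, "max_inclusive": 120,
        },
        "http://example.org/nickname": {
            "datatype": "string", "min_count": 0, "max_count": 2,
        },
    },
}


def generate_entities(shape, n, seed=0):
    """Return (subject, predicate, object) triples for n synthetic entities."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    triples = []
    for i in range(n):
        subject = f"http://example.org/entity/{i}"
        triples.append((subject, RDF_TYPE, shape["target_class"]))
        for path, c in shape["properties"].items():
            # Pick how many values to emit within the cardinality bounds.
            for _ in range(rng.randint(c["min_count"], c["max_count"])):
                if c["datatype"] == "integer":
                    value = rng.randint(c["min_inclusive"], c["max_inclusive"])
                else:
                    value = "".join(rng.choice("abcdefghij") for _ in range(6))
                triples.append((subject, path, value))
    return triples


def conforms(shape, triples):
    """Check every instance of the target class against the shape."""
    subjects = [s for s, p, o in triples
                if p == RDF_TYPE and o == shape["target_class"]]
    for s in subjects:
        for path, c in shape["properties"].items():
            values = [o for s2, p, o in triples if s2 == s and p == path]
            if not c["min_count"] <= len(values) <= c["max_count"]:
                return False
            if c["datatype"] == "integer":
                if not all(isinstance(v, int)
                           and c["min_inclusive"] <= v <= c["max_inclusive"]
                           for v in values):
                    return False
            elif not all(isinstance(v, str) for v in values):
                return False
    return True


if __name__ == "__main__":
    data = generate_entities(PERSON_SHAPE, 3)
    print(len(data), "triples; conforms:", conforms(PERSON_SHAPE, data))
```

A sampler like this satisfies the structural constraints by construction, but it draws every value independently and uniformly. That is the limitation the abstract attributes to schema-only approaches: the output is valid, yet it carries none of the interrelationships or statistical distributions found in real data.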