SecBERT: Analyzing reports using BERT-like models

MASTER Assignment


Type: Master M-CS

Period: May, 2022 - Dec, 2022

Student: Liberato, M. (Matteo, Student M-CS)

Date Final project: Dec 19, 2022

Thesis

Supervisors:

Abstract:

Natural Language Processing (NLP) is a field of computer science that enables computers to interact with human language through the use of specific software. Generic NLP tools do not work well on domain-specific language, as each domain has unique characteristics that a generic tool is not trained to handle. The domain of cyber security has a variety of unique difficulties, such as the need to understand ever-evolving technical terms and, in the case of Cyber Threat Intelligence (CTI) reports, the extraction of Indicators of Compromise (IoCs) and attack campaigns. After evaluating how existing systems addressed these issues, we created SecBERT by training BERT, a state-of-the-art neural network for NLP tasks, on cyber security data. We evaluated SecBERT on a Masked Language Modeling task, in which parts of sentences from cyber security reports were masked and SecBERT was used to predict the hidden tokens. Models trained on the cyber security language domain improved in precision by 3.4% to 5.2% over the baseline of models trained on general language performing the same task.
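
As an illustration of the evaluation setup described above, the minimal sketch below runs a fill-mask prediction with the Hugging Face transformers library. The checkpoint "path/to/secbert" is a hypothetical identifier for a SecBERT model, and bert-base-uncased stands in for the general-language baseline; neither name is taken from the thesis itself.

# Minimal sketch of a Masked Language Modeling check, assuming
# Hugging Face-compatible checkpoints for both models.
from transformers import pipeline

# Load a fill-mask pipeline for the general-language baseline.
baseline = pipeline("fill-mask", model="bert-base-uncased")
# secbert = pipeline("fill-mask", model="path/to/secbert")  # hypothetical checkpoint

# Mask one token of a sentence in the style of a CTI report and
# ask the model to predict the hidden part.
sentence = "The attackers used a phishing [MASK] to deliver the payload."
for prediction in baseline(sentence, top_k=3):
    print(f"{prediction['token_str']:>12}  score={prediction['score']:.3f}")

A predicted token that matches the original counts as a correct prediction; aggregating such predictions over many masked sentences yields the precision figures compared in the abstract.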