MASTER Assignment
Effects of inserting domain vocabulary and fine-tuning BERT for German legal language
Type: Master M-CS
Period: March 2019 – November 2019
Student: Yeung Tai, C.M. (Chin Man, Student M-CS)
Date final project: November 28, 2019
Supervisors:
Abstract:
In this study, we explore the effects of domain adaptation in NLP using the state-of-the-art pre-trained language model BERT. Starting from its German pre-trained version and a dataset from OpenLegalData containing over 100,000 German court decisions, we fine-tuned the language model and inserted legal domain vocabulary to create a German Legal BERT model. We evaluate this model on downstream tasks including classification, regression and similarity. For each task, we compare simple yet robust machine learning methods such as TF-IDF and FastText against different BERT models, mainly the Multilingual BERT, the German BERT and our fine-tuned German Legal BERT. For the classification task, the results show that all models performed comparably. For the regression task, our German Legal BERT model slightly improved over FastText and the other BERT models but was still considerably outperformed by TF-IDF. In a within-subject study (N=16), we asked participants to evaluate the relevance of documents retrieved by similarity to a reference case law. Our findings indicate that the German Legal BERT was able, to a small degree, to capture legal information better for this comparison. We observed that further fine-tuning a BERT model on legal-domain text yields only marginal performance gains when the pre-trained language model already included legal data.
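As a minimal sketch of the vocabulary-insertion step described above (not the exact pipeline used in this work), the snippet below assumes the Hugging Face transformers library and the public bert-base-german-cased checkpoint; the legal terms shown are illustrative placeholders, not the vocabulary actually added in the thesis.

```python
# Sketch: insert domain vocabulary into a German BERT tokenizer/model
# before masked-language-model fine-tuning on legal text.
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-german-cased")
model = BertForMaskedLM.from_pretrained("bert-base-german-cased")

# Add legal-domain terms so they are no longer split into sub-word pieces.
# (Placeholder terms for illustration only.)
legal_terms = ["Berufungskläger", "Revisionsbeklagte", "Vorinstanz"]
num_added = tokenizer.add_tokens(legal_terms)

# Grow the embedding matrix to cover the newly added vocabulary entries;
# the new rows are randomly initialised and learned during fine-tuning.
model.resize_token_embeddings(len(tokenizer))

# The resulting model would then be fine-tuned with the standard masked
# language modelling objective on the German court-decision corpus.
```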