20181030 Extract fraud offender information from text

Master Assignment

Extract offender information from text

Type: Master HMI

Location: FraudeHelpDesk

Period: Feb, 2018 - Jul, 2018 

Date final project: 30 October 2018

Thesis

Student: Rens, E. (Eduard, Student M-HMI)

Supervisors:

Abstract:

This study investigates the feasibility and reliability of automatically extracting specific target information from unlabeled Dutch text. The text sources consist of stored email exchanges between the Dutch help organization Fraudehelpdesk and the victims and informants who report fraud incidents. The target information concerns the offenders who attempted a fraud. The research describes all the processes needed to detect offender information and to extract it once detected. It starts by assigning Part-of-Speech (POS) and Named Entity Recognition (NER) tags to each word, compares the performance of external Dutch data sources for POS and NER tagging, and shows a reliable way to attain good performance when predicting POS and NER tags on unlabeled data.
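The evaluation idea described above, scoring predicted tags against a small manually annotated sample, can be sketched as follows. This is a minimal illustration and not the thesis implementation: the function name `tag_accuracy` and the example tag labels are assumptions chosen for the example.

```python
from collections import Counter


def tag_accuracy(gold, predicted):
    """Return (overall accuracy, per-tag accuracy) for two aligned tag lists.

    `gold` holds the manually annotated tags, `predicted` the tagger output.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = Counter()    # how often each gold tag occurs
    correct = Counter()  # how often the prediction matches that gold tag
    for g, p in zip(gold, predicted):
        total[g] += 1
        if g == p:
            correct[g] += 1
    overall = sum(correct.values()) / len(gold) if gold else 0.0
    per_tag = {tag: correct[tag] / total[tag] for tag in total}
    return overall, per_tag


# Hypothetical tags for a five-token sentence (labels are illustrative only).
gold = ["N", "WW", "N", "LID", "N"]
pred = ["N", "WW", "ADJ", "LID", "N"]
overall, per_tag = tag_accuracy(gold, pred)
```

Running the same function on the output of each candidate tagger, over the same annotated sample, gives directly comparable scores.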

The research further covers relation extraction based on the Clause Information Extraction (Clause IE) approach, which forms relations from the sentence structure of a text. Clause IE attempts to form each clause as a triplet in the form of a Subject-Verb-Object relation. The research shows that additional information can be created by manually annotating both the actual data and the formed clauses, which helps to distinguish target information from other information. By combining this additional information with the formed clauses, information about offenders is predicted and extracted into a database. The results show that thousands of items of offender information were extracted from 28,400 text entries, structured as the offender's name, organization, IBAN, website, phone, address and email. The research also compares existing offender information with the extracted data: more new information was obtained than information was missing, and only a few percent of the items were found to be identical or similar. The comparison covered all named entities found in the text, with non-offender information stored and compared as well. This result suggests that the provided fraud-incident texts and the existing offender data did not contain all information about the fraud incidents: some missing information may be in the attachments, which were excluded from the research, or in older or newer email exchanges that were not included in the fraud incidents. The performance of each process is measured for each algorithm used by annotating a small part of the unlabeled data or of the newly formed clauses, which also shows how much a given improvement increases performance in extracting offender information. The information extraction application is used periodically, every half year, to extract offender information from all stored text gathered in that half year.
The application saves a lot of time compared to extracting the information manually; the extracted data only needs to be checked for correctness. A help tool could also be used to show, for each fraud-incident text, all extracted named entities and the prediction of whether each one is an offender, so that users can decide for themselves whether the extracted data is correct.
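As a rough illustration of two steps mentioned above, pulling structured fields such as IBANs and email addresses out of text and then comparing them with already known offender data, the following sketch uses simplified regular expressions and set operations. The patterns, function names and sample values are assumptions made for this example; the thesis itself relies on NER and clause-based prediction rather than on these patterns.

```python
import re

# Simplified patterns (assumptions): they do not validate IBAN checksums
# or cover every legal email address form.
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_fields(text):
    """Collect candidate IBANs and email addresses from one text entry."""
    return {
        "iban": set(IBAN_RE.findall(text)),
        "email": set(EMAIL_RE.findall(text)),
    }


def compare(extracted, existing):
    """Split values into new, missing and identical relative to known data."""
    return {
        "new": extracted - existing,        # found in text, not yet stored
        "missing": existing - extracted,    # stored, but not found in text
        "identical": extracted & existing,  # present in both
    }


# Made-up sample entry and made-up existing records.
sample = "Please transfer to NL91ABNA0417164300 or write to info@example.com."
fields = extract_fields(sample)
report = compare(fields["iban"], {"NL91ABNA0417164300", "NL02RABO0123456789"})
```

The three sets in `report` correspond to the categories used in the comparison: new information, missing information, and identical items.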