
[M] Automatic instance-based matching of database schemas of webharvested product data

Master Assignment

Using Machine Learning to automatically build knowledge bases for the harvesting and matching of product data. 

Type: Master M-CS 

Period: December 2016 - August 2017

Student: Classified

Supervisors:

Abstract:

In both web harvesting and the merging of databases, objects often have properties that are identical in concept but use different names. A knowledge base capable of linking the various aliases of a concept together, as well as defining the range and unit of measurement for its values, would go a long way towards increasing the ease and accuracy of such tasks. However, building a knowledge base requires time and domain knowledge, and it still needs to be expanded whenever more data sources are added. In this Final Project we want to develop an automated way to populate a knowledge base with concepts and their aliases, as well as the limits on properties and their values. To achieve this, we intend to use machine learning to compare properties with each other.
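
To make the idea concrete, a single knowledge-base entry of the kind described above could look roughly like the sketch below. The class name `Concept`, its fields, and the sample aliases and limits are illustrative assumptions, not the design used in this project.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class Concept:
    """Hypothetical knowledge-base entry: one concept, its aliases, and value limits."""
    name: str
    aliases: Set[str] = field(default_factory=set)
    unit: Optional[str] = None          # unit of measurement, if any
    min_value: Optional[float] = None   # lower bound of the value range
    max_value: Optional[float] = None   # upper bound of the value range

    def matches(self, property_name: str) -> bool:
        """True if the property name is the concept name or one of its aliases."""
        key = property_name.strip().lower()
        return key == self.name.lower() or key in {a.lower() for a in self.aliases}

    def accepts(self, value: float) -> bool:
        """True if the value lies within the defined range (open bounds are ignored)."""
        if self.min_value is not None and value < self.min_value:
            return False
        if self.max_value is not None and value > self.max_value:
            return False
        return True


# Illustrative values only; not taken from the project data.
weight = Concept(
    name="weight",
    aliases={"Gewicht", "net weight", "mass"},
    unit="kg",
    min_value=0.0,
    max_value=500.0,
)
```

With such an entry, a harvested property called "Gewicht" would resolve to the concept `weight`, and a value outside the stated range could be flagged for inspection.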

Validation and Approach:

We intend to validate our research along two avenues. Our first method of validation is to sample the collection developed using our process and to check the sample for consistency. Our second method of validation is to use our collection in web harvesting and data matching software and, by sampling the results, assess how useful our collection is by calculating precision and recall.
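
As a minimal sketch of the second validation method, precision and recall can be computed by comparing the property matches proposed by the collection with a manually checked sample. The function name `precision_recall` and the (concept, alias) pairs below are assumptions made for illustration, not results from the project.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for predicted matches against a gold-standard sample."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Precision: share of proposed matches that are correct.
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: share of correct matches that were proposed.
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (concept, alias) pairs; illustrative only.
predicted = {("weight", "Gewicht"), ("weight", "mass"), ("width", "Gewicht")}
gold = {("weight", "Gewicht"), ("weight", "mass"),
        ("width", "Breedte"), ("height", "Hoogte")}

precision, recall = precision_recall(predicted, gold)
```

In this made-up sample two of the three proposed matches are correct, and two of the four correct matches were found.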

Main Research Question:

How can we automatically build a collection that links concepts to multiple aliases and defines ranges and units of measurement for the values associated with these concepts, suitable for the harvesting and matching of product data from template-based websites, by using existing data from similar template-based sources and exploiting data commonalities?