
17th Data Science seminar: Nils Witt (Leibniz Information Centre for Economics) - Collection-Document Summaries - Or How To Quickly Assess Whether You Should Read This Paper

Title: Collection-Document Summaries - Or How To Quickly Assess Whether You Should Read This Paper

Abstract: 

When researchers use a search engine to look for literature, the search engine typically does not take into account the knowledge (or lack thereof) of the researcher. The rank of a document in the result list is determined by how well it matches the search query. We mitigate this one-size-fits-all problem by determining the concepts in documents and classifying them as either familiar or new.

We solve this problem by simplifying and reformulating it so that it can be solved with a neural network that learns to associate sentences with keywords. We use keywords as proxies for concepts because they are concise and present in many large-scale datasets.

Our results suggest that our model is capable of predicting from which document a keyword originated. At a collection size of 10, we achieved an F1-score of 65% for collections and documents with a high topical overlap and 78% for randomly assembled collections and documents. An F1-score of 76% shows that the model is also capable of detecting common keywords between two documents.
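For readers unfamiliar with the metric, the F1-score is the harmonic mean of precision and recall. The sketch below computes it for a binary labelling task. The toy labels and the function name `f1_score` are illustrative assumptions only; the abstract does not describe the evaluation protocol, so this is the standard definition of the metric and not necessarily the exact procedure used in the work.

```python
def f1_score(true_labels, predicted_labels):
    """Binary F1: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    if tp == 0:
        return 0.0  # no correct positive predictions
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data (not from the talk):
# 1 = keyword attributed to the document, 0 = attributed to the collection.
true_labels = [1, 1, 1, 0, 0, 0, 1, 0]
predicted_labels = [1, 1, 0, 0, 1, 0, 1, 0]

score = f1_score(true_labels, predicted_labels)
print(f"F1 = {score:.2f}")
```

With these toy labels there are three true positives, one false positive and one false negative, so precision and recall are both 0.75 and the F1-score is 0.75.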

Our model is capable of judging how well suited a keyword is to a given text, which allows it to be applied beyond the scope of this research.