[M] Detect and correct wrong transliterations

Master Assignment

Detect and correct wrong transliterations.

Type: Master CS

Period: TBD

Student: (Unassigned)


Description:

OCLC is a global library cooperative that provides shared technology services, original research and community programs for its membership and the library community at large. Together with its member libraries, OCLC maintains WorldCat, the world’s most comprehensive database of information about library collections. WorldCat now hosts more than 540 million bibliographic records in 483 languages, aggregated from 18,000 libraries in 123 countries.

As WorldCat continues to grow, OCLC is actively exploring data science, advanced machine learning, linked data and visualisation technologies to improve data quality, transform bibliographic descriptions into actionable knowledge, provide more functionality for professional cataloguers and develop more services for library end users.

OCLC is looking for students (for an internship or master thesis) who are enthusiastic about applying AI technologies to the following problems, starting from February 2023:

Problem #1: Detect and correct wrong transliterations

About half of the Russian-language records in WorldCat contain only Latin transliterations, and the Cyrillic is often transliterated incorrectly. For example, the word для should be transliterated as DLI͡A under the Library of Congress transliteration table, but the ligature is often omitted, yielding DLIA.
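To make the ligature issue concrete, here is a minimal sketch of letter-by-letter transliteration. The mapping is only a small, illustrative subset of the Library of Congress table, and it assumes the ligature is encoded as U+0361 (COMBINING DOUBLE INVERTED BREVE); the names `ALA_LC_SUBSET`, `transliterate` and `strip_ties` are hypothetical and not part of any OCLC code.

```python
# Ligature tie; assumed here to be U+0361 (COMBINING DOUBLE INVERTED BREVE).
TIE = "\u0361"

# Partial, illustrative subset of the Library of Congress Russian table
# (lowercase only). A real table covers the full alphabet.
ALA_LC_SUBSET = {
    "а": "a",
    "д": "d",
    "и": "i",
    "л": "l",
    "ц": "t" + TIE + "s",
    "ю": "i" + TIE + "u",
    "я": "i" + TIE + "a",
}


def transliterate(word: str) -> str:
    """Transliterate letter by letter; unknown characters pass through unchanged."""
    return "".join(ALA_LC_SUBSET.get(ch, ch) for ch in word.lower())


def strip_ties(text: str) -> str:
    """Simulate the common error of dropping the ligature."""
    return text.replace(TIE, "")


correct = transliterate("для")
print(correct.upper())              # DLI͡A
print(strip_ties(correct).upper())  # DLIA
```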

Earlier work generates Cyrillic from the Latin transliteration. It currently recognizes bad transliterations and skips the affected records; ideally it would instead correct the transliteration, or correct the generated Cyrillic. The error recognition code is currently rule-based. The task is to create 1) a model to detect bad transliterations and 2) a model to correct them.
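As an illustration of what a rule-based check might look like, and why a learned model could help, the sketch below flags Latin digraphs that the Library of Congress table writes with a ligature (я, ю, ц) when the tie is absent. This is not the existing OCLC code; the function name `flag_untied_digraphs` is hypothetical, and the tie is assumed to be U+0361.

```python
import re

# Digraphs written with a tie in the Library of Congress table:
# я -> i͡a, ю -> i͡u, ц -> t͡s. With the tie (assumed U+0361) present,
# the two letters are not adjacent, so the pattern does not match.
UNTIED = re.compile(r"ia|iu|ts", re.IGNORECASE)


def flag_untied_digraphs(text: str) -> list[tuple[int, str]]:
    """Return (offset, digraph) for every candidate missing ligature."""
    return [(m.start(), m.group()) for m in UNTIED.finditer(text)]


print(flag_untied_digraphs("DLIA"))        # [(2, 'IA')]
print(flag_untied_digraphs("DLI\u0361A"))  # []
# False positive: советский is correctly "sovetskiĭ" (т followed by с, not ц).
print(flag_untied_digraphs("sovetskiĭ"))   # [(4, 'ts')]
```

The last call shows the limit of such a rule: the same letter pair can be either a dropped ligature or two legitimately separate letters, so the rule can only propose candidates, and deciding between them needs context.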

Problem #2: Cluster uncontrolled/unlinked names in WorldCat

Most of the name fields in WorldCat are not linked to an identifier. For those fields, do all occurrences of the same name string refer to the same person? How do we know whether there are multiple “Jenny Toves” in WorldCat? Or whether “J Toves” is the same person as “Jenny Toves”? The situation gets worse when different scripts or transliterations are used to represent the same name.

For example, how can we confidently decide whether the following names from three different bibliographic records refer to the same person:

>100  1 $aDostoyevsky, Fyodor $d1821-1881

>100  1 $aDostoevskij, Fëdor Mihajlovič $d1821-1881

>100  1 $aDostoevskii, Fedor Mikhailovich

The task is to disambiguate person names using contextual clues in the bibliographic records (titles, summaries, subjects, co-authors, etc.), taking into account the different scripts and their transliterations.
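A naive string-based baseline for the three example fields could look like the sketch below. It is only a starting point under stated assumptions: the helper names (`parse_subfields`, `normalize`, `name_similarity`, `dates_compatible`) are hypothetical, only the surname and the `$d` dates are compared, and the contextual clues listed above are not used at all.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from itertools import combinations


def parse_subfields(field: str) -> dict[str, str]:
    """Split a MARC-style field string into {subfield code: value}."""
    return {code: value.strip() for code, value in re.findall(r"\$(\w)([^$]*)", field)}


def normalize(name: str) -> str:
    """Lowercase and strip diacritics so that ë/e and č/c compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch)).lower()


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def dates_compatible(d1, d2) -> bool:
    """Missing dates are treated as compatible with anything."""
    return d1 is None or d2 is None or d1 == d2


fields = [
    "100  1 $aDostoyevsky, Fyodor $d1821-1881",
    "100  1 $aDostoevskij, Fëdor Mihajlovič $d1821-1881",
    "100  1 $aDostoevskii, Fedor Mikhailovich",
]
parsed = [parse_subfields(f) for f in fields]

for x, y in combinations(parsed, 2):
    surname_x = x["a"].split(",")[0]
    surname_y = y["a"].split(",")[0]
    score = name_similarity(surname_x, surname_y)
    print(surname_x, surname_y, round(score, 2), dates_compatible(x.get("d"), y.get("d")))
```

High surname similarity plus compatible dates suggests a match here, but the third field has no dates, and common names would defeat this baseline entirely; that gap is where titles, subjects and co-authors come in.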

The dataset will consist of a couple of million bibliographic records from WorldCat, available via a research agreement between UTwente and OCLC.

If you are interested, please contact Shenghui Wang (shenghui.wang@utwente.nl).