Increasing Accessibility of Electronic Theses and Dissertations (ETDs) Through Chapter-level Classification

Published in VTechWorks: Virginia Tech ETD, 2020

Recommended citation: Palakh Mignonne Jude. June, 2020. Increasing Accessibility of Electronic Theses and Dissertations (ETDs) Through Chapter-level Classification. MS thesis, Computer Science, Virginia Tech (June, 2020). http://hdl.handle.net/10919/99294

Great progress has been made to leverage the improvements made in natural language processing and machine learning to better mine data from journals, conference proceedings, and other digital library documents. However, these advances do not extend well to book-length documents such as electronic theses and dissertations (ETDs). ETDs contain extensive research data; stakeholders – including researchers, librarians, students, and educators – can benefit from increased access to this corpus. Challenges arise while working with this corpus owing to the varied nature of disciplines covered as well as the use of domain-specific language. Prior systems are not tuned to this corpus. This research aims to increase the accessibility of ETDs by the automatic classification of chapters of an ETD using machine learning and deep learning techniques. This work utilizes an ETD-centric target classification system. It demonstrates the use of custom trained word and document embeddings to generate better vector representations of this corpus. It also describes a methodology to leverage extractive summaries of chapters of an ETD to aid in the classification process. Our findings indicate that custom embeddings and the use of summarization techniques can increase the performance of the classifiers. The chapter-level labels generated by this research help to identify the level of interdisciplinarity in the corpus. The automatic classifiers can also be further used in a search engine interface that would help users to find the most appropriate chapters.
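
The summarize-then-classify idea described in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation: a word-frequency sentence scorer stands in for the extractive summarizer, and a bag-of-words nearest-centroid rule stands in for the machine learning and deep learning classifiers and custom embeddings. All function names, labels, and sample texts below are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extractive_summary(text, k=2):
    """Keep the k sentences with the highest average word frequency.

    Assumption: this simple scorer is a stand-in for whichever extractive
    summarizer is actually used; selected sentences keep their original order.
    """
    sentences = split_sentences(text)
    freq = Counter(tokenize(text))

    def score(sentence):
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return " ".join(sentences[i] for i in sorted(ranked[:k]))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_centroids(labeled_chapters):
    """Sum the bag-of-words vectors of chapter summaries per label."""
    centroids = {}
    for text, label in labeled_chapters:
        summary_vec = Counter(tokenize(extractive_summary(text)))
        centroids.setdefault(label, Counter()).update(summary_vec)
    return centroids


def classify(chapter, centroids):
    """Assign the label whose centroid is closest to the chapter summary."""
    vec = Counter(tokenize(extractive_summary(chapter)))
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


# Toy, made-up training chapters (hypothetical labels).
TRAIN = [
    ("Neural networks learn representations from data. "
     "We train a network with gradient descent. "
     "The network improves with more data.", "Computer Science"),
    ("Cells divide during mitosis. "
     "The protein regulates how cells grow. "
     "We observed cells under the microscope.", "Biology"),
]

CHAPTER = ("We train a deep network on text data. "
           "The network uses word embeddings. "
           "Lunch was served at noon.")

if __name__ == "__main__":
    centroids = train_centroids(TRAIN)
    print(extractive_summary(CHAPTER))
    print(classify(CHAPTER, centroids))
```

Summarizing before vectorizing drops sentences that carry little topical signal, which is one plausible reason summaries could help a chapter-level classifier. A real system would replace the count vectors with learned embeddings and the centroid rule with a trained model.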