Segmenting Electronic Theses and Dissertations By Chapters.

Published in VTechWorks: Virginia Tech ETD, 2023

Recommended citation: Javaid Akbar Manzoor. 2023. Segmenting Electronic Theses and Dissertations By Chapters. Thesis. Virginia Tech. https://vtechworks.lib.vt.edu/handle/10919/113246 http://hdl.handle.net/10919/113246

Electronic theses and dissertations (ETDs) are structured documents in which chapters are major components. No existing repository, however, provides chapter boundary details alongside these documents. Exposing chapter boundaries can make ETDs more accessible. This research explores manipulating the LaTeX source of ETDs to generate chapter boundaries. We use this approach to create a data set of 1,459 ETDs and their chapter boundaries. Additionally, for the task of automatically segmenting unseen documents, we prototype three deep learning models trained on this data set. We hope to encourage researchers to adopt LaTeX manipulation techniques to create similar data sets.
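
To give a rough idea of what locating chapter boundaries in LaTeX source can look like, here is a minimal sketch. It is not the pipeline used in the thesis. The function name `find_chapter_boundaries` and the regular expression are illustrative assumptions, and the sketch assumes chapter titles contain no nested braces and that comments start at an unescaped `%`.

```python
import re

# Matches \chapter{Title} and \chapter*{Title}, with an optional [short title].
# Assumption (illustrative only): titles contain no nested braces.
CHAPTER_RE = re.compile(r"\\chapter\*?\s*(?:\[[^\]]*\])?\s*\{([^{}]*)\}")


def find_chapter_boundaries(latex_source):
    """Return (line_number, title) pairs, 1-indexed, for each chapter command."""
    boundaries = []
    for line_number, line in enumerate(latex_source.splitlines(), start=1):
        # Drop LaTeX comments: everything from an unescaped % to end of line.
        code = re.sub(r"(?<!\\)%.*", "", line)
        for match in CHAPTER_RE.finditer(code):
            boundaries.append((line_number, match.group(1).strip()))
    return boundaries


if __name__ == "__main__":
    sample = (
        "\\documentclass{report}\n"
        "\\begin{document}\n"
        "\\chapter{Introduction}\n"
        "Text.\n"
        "% \\chapter{Draft}\n"
        "\\chapter*{Related Work}\n"
        "More text.\n"
        "\\end{document}\n"
    )
    for line_number, title in find_chapter_boundaries(sample):
        print(line_number, title)
```

On the sample above, the commented-out chapter is skipped and the two remaining chapter commands are reported with their line numbers. Mapping such source positions to page boundaries in the compiled PDF is a separate step that this sketch does not attempt.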