Applications of data analysis on scholarly long documents

Published in 2022 IEEE International Conference on Big Data (Big Data), 2022

Recommended citation: Bipasha Banerjee, William A. Ingram, Jian Wu, and Edward A. Fox. 2022. Applications of data analysis on scholarly long documents. In IEEE International Conference on Big Data, Big Data 2022, Osaka, Japan, December 17-20, 2022. IEEE, 2473–2481. https://doi.org/10.1109/BigData55660.2022.10020935

Theses and dissertations record the work of graduate students and are typically required at the culmination of a graduate degree. Thus, they contain important information that reflects a graduate student’s exploration of their research topic. Although print submission was commonplace early on, most universities now require students to submit an electronic version. The electronic thesis or dissertation, henceforth referred to as an ETD, has become the primary way of submitting, storing, and distributing graduate work. Millions of such documents have been created in the past two decades. They are maintained and stored by university libraries, digital repositories, and academic publishing companies. These online repositories have increased access to such documents. Nonetheless, the documents still fail to meet the needs of researchers, who find it challenging to locate and extract knowledge from such long documents. The worldwide ETD collection has grown in volume to become what is known as ‘scholarly big data’. Apart from the body text, these documents contain a myriad of other pieces of knowledge, such as tables, figures, definitions, literature reviews, and references. There is a growing demand among researchers across various domains to make this collection of scholarly documents more computationally driven. We use ideas from natural language processing, information retrieval, and machine learning to excavate knowledge from this rich information source. In this paper, we examine some of the challenges we face, identify some key areas of exploration, and discuss our methods to mitigate the challenges.