Applications of Mining ETDs
Published in ETD conference, 2021
Recommended citation: Bipasha Banerjee, William A. Ingram, Jian Wu, and Edward A. Fox. 2021. Applications of Mining ETDs. In 24th International Symposium on Electronic Theses and Dissertations (ETD 2021), November 15-17, 2021, United Arab Emirates. https://doi.org/10.26226/morressier.614c9b8c87a68d83cb5d59b2
Theses and dissertations contain a wealth of knowledge reflecting graduate students' exploration of a scholarly domain. Although print submission was common practice early on, ETDs have become the predominant format for submitting, archiving, and disseminating graduate work. Over the past 25 years, millions of ETDs have been created, collected, and shared with the world through online digital repositories run by university libraries and scholarly publishing companies. Paper theses and dissertations have been replaced with PDFs, but for the most part, digital collections are not much different from the analog libraries they replaced. Online digital libraries of ETDs have greatly increased the exposure of graduate research; nonetheless, they fail to meet the needs of researchers, who find it hard to discover and access the knowledge buried in these long documents.

The worldwide collection of ETDs has grown to become "scholarly big data" (Giles, 2013), consisting of myriad facts and descriptions of new knowledge, tables and figures, terms and definitions, references and literature reviews. There is a growing demand among researchers for collections of scholarly content to support computationally driven research. This paper describes our efforts to create a computationally amenable corpus of ETDs. We use ideas and techniques from bibliometrics, machine learning, information retrieval, and natural language processing to mine knowledge from this rich information source. We examine some of the challenges we face, discuss our methods, and explore the results.

Mining ETDs can be challenging because they are scattered across countless repositories and digital libraries. Despite efforts to establish standards of interoperability among scholarly repositories, accessing full text on a large scale is surprisingly difficult. We set out with a goal of building a research corpus of at least 200,000 ETDs and their associated metadata from open repositories across the U.S.
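Repository-level metadata harvesting of this kind typically starts from an OAI-PMH interface, the most common interoperability standard among scholarly repositories. As a minimal sketch of how such a harvest request is constructed (the repository endpoint and set name below are hypothetical, and this is illustration only, not our actual harvesting pipeline):

```python
from urllib.parse import urlencode


def build_listrecords_url(base_url, metadata_prefix="oai_dc",
                          set_spec=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL for harvesting metadata."""
    if resumption_token:
        # A resumption token carries all other arguments for paginated harvests.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


# Hypothetical repository endpoint and ETD set, for illustration only.
url = build_listrecords_url("https://repo.example.edu/oai", set_spec="etd")
```

Full-text PDFs, by contrast, usually sit behind repository-specific landing pages, which is why harvesting them requires custom crawling, as discussed next.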
Harvesting full-text PDFs from institutional repositories involves creating ad hoc web crawling scripts, most of which work only for the individual repository they were created for. Once downloaded, full-text representations must be extracted from the PDF documents. This process can vary depending on whether the ETDs were "born digital" or were created by scanning paper documents.

Modern advances in text mining and analytics have equipped researchers with new tools and novel ways to extract knowledge and understanding from text. Most techniques have been developed and tested on shorter documents, such as web pages and news articles. But ETDs are book-length documents. Like books, ETDs are organized into chapters and sections. A key aspect of text mining is establishing structure in unstructured data. For ETDs, this is a non-trivial process because, unlike other digital formats such as XML, PDF is an unstructured data format, so the structure of an ETD (e.g., chapters and sections) is usually not machine-readable. It would be useful to extract single chapters from an ETD so that they can be analyzed individually. Automatic chapter segmentation and extraction enable many downstream applications. For instance, most ETDs contain deep and well-researched literature reviews. Literature review chapters extracted from ETDs could be indexed and made available as useful documents in their own right. We discuss how effective chapter segmentation can be applied algorithmically to a large corpus of ETDs. Our approach to chapter segmentation uses machine learning to predict which lines of text represent chapter headings, based on lexical and syntactic features extracted from the text.

Document classification and categorization is a long-established intellectual practice of libraries that is indispensable to information organization.
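To make the chapter-segmentation idea concrete, the sketch below shows the kind of line-level lexical features such a classifier might consume. The decision rule at the end is a hand-tuned baseline for illustration only; in practice a trained machine learning model replaces it.

```python
import re


def heading_features(line):
    """Extract simple lexical features from one line of extracted text."""
    stripped = line.strip()
    words = stripped.split()
    return {
        # e.g., "Chapter 3" or "CHAPTER IV"
        "starts_with_chapter": bool(
            re.match(r"(?i)^chapter\s+(\d+|[ivxlc]+)\b", stripped)),
        # e.g., "3.1 Data Collection"
        "is_numbered": bool(re.match(r"^\d+(\.\d+)*\.?\s+\S", stripped)),
        "is_short": 0 < len(words) <= 8,
        "ends_with_period": stripped.endswith("."),
    }


def looks_like_heading(line):
    """Hand-tuned baseline; a trained classifier would consume these
    features instead of applying fixed rules."""
    f = heading_features(line)
    if f["starts_with_chapter"]:
        return True
    return f["is_numbered"] and f["is_short"] and not f["ends_with_period"]
```

Running this over every line of extracted text yields candidate chapter boundaries; the text between consecutive predicted headings is then treated as one chapter.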
However, the task of manually classifying millions of ETDs is untenable, so the onus has generally fallen on authors to assign subject categories or keywords to their own work. We demonstrate how subject categories can be generated for ETDs automatically using machine learning. Moreover, we show that classification can be done at the chapter level. The popularity of interdisciplinary research is surging (Millar, 2013). As universities encourage interdisciplinary approaches to research, the trend is borne out in graduate research output, including ETDs. Using techniques from information extraction and natural language processing, we demonstrate how research topics can be mined from the text of ETDs. We explore changes in the popularity of graduate research topics over time and examine the evolution of interdisciplinarity in graduate research. In addition, we show how chapter-level classification can more accurately describe an interdisciplinary ETD, thus increasing its potential for discovery and impact.

In addition to algorithmic classification, we explore ways of automatically summarizing ETDs and their chapters. Automatic summarization aims to identify the most important information in a document and express it to the reader in a concise, factually correct format (Wu et al., 2021). Most ETDs contain an abstract that broadly describes the work. However, for many of the reasons mentioned above, it is useful to provide chapter-level summaries. Despite the wealth of knowledge and information contained in ETDs, the documents are simply too long to be read in full by today's busy researchers, who are already deluged with the vast amount of scholarly literature available to them. Providing a summary for each chapter helps researchers quickly identify individual chapters of interest and provides a point of entry for reading.
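Chapter-level subject classification of the kind described above can be illustrated with a minimal multinomial naive Bayes classifier; the training snippets and labels below are toy stand-ins, not our actual training data or model.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Minimal multinomial naive Bayes with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.priors = {c: math.log(labels.count(c) / len(labels))
                       for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(tokenize(doc))
        self.vocab = set().union(*self.counts.values())
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}

    def predict(self, doc):
        def log_prob(c):
            lp = self.priors[c]
            for w in tokenize(doc):
                lp += math.log((self.counts[c][w] + 1)
                               / (self.totals[c] + len(self.vocab)))
            return lp
        return max(self.classes, key=log_prob)


# Toy labeled chapter snippets standing in for a real training collection.
docs = ["neural network training gradients",
        "deep learning models and data",
        "court ruling statute law",
        "legal contracts and statute"]
labels = ["cs", "cs", "law", "law"]
nb = NaiveBayes()
nb.fit(docs, labels)
```

Applying such a classifier chapter by chapter, rather than once per ETD, lets an interdisciplinary thesis carry several subject labels instead of one.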
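Our summarization work builds on the methods surveyed in Wu et al. (2021); purely as an illustration of extractive chapter-level summarization, the sketch below scores each sentence by the average corpus frequency of its content words and keeps the top-scoring sentences in their original order. The stopword list is an abbreviated stand-in.

```python
import re
from collections import Counter

# Abbreviated stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "are",
             "for", "on", "that", "this", "we", "with"}


def summarize(text, n_sentences=2):
    """Frequency-based extractive baseline: keep the sentences whose
    content words are most frequent in the chapter, in original order."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower())
             if w not in STOPWORDS]
    freq = Counter(words)

    def score(sentence):
        toks = [w for w in re.findall(r"[a-z']+", sentence.lower())
                if w not in STOPWORDS]
        return sum(freq[w] for w in toks) / (len(toks) or 1)

    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return " ".join(s for s in sentences if s in top)
```

A baseline like this makes the trade-off visible: extraction guarantees factual sentences from the source but cannot paraphrase, which is where the neural approaches discussed in the paper come in.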
Scholarly text mining is gaining popularity with researchers as its methods have been shown to identify unseen patterns and uncover new knowledge. Our research explores how a large corpus of ETDs can be made computationally amenable and demonstrates various applications of text mining and information extraction. We believe this work will lead to expanded service offerings by libraries, encourage other researchers to use ETDs for computational analysis, and ultimately raise the impact of graduate research.