Summarizing ETDs with deep learning

Published in ETD Conference 2019, Porto, Portugal, 2020

Recommended citation: William A. Ingram, Bipasha Banerjee, and Edward A. Fox. 2020. Summarizing ETDs with deep learning. Cadernos de Biblioteconomia, Arquivística e Documentação 1 (Mar. 2020), 46–52. https://doi.org/10.48798/cadernosbad.2014 (PDF: https://bad.pt/publicacoes/index.php/cadernos/article/viewFile/2014/pdf)

Inspired by the millions of Electronic Theses and Dissertations (ETDs) openly available online, we describe a novel use of ETDs as data for text summarization. Using a collection of over 30,000 doctoral dissertations and master’s theses, we evaluate state-of-the-art deep learning techniques for generating abstractive summaries. Deep learning requires a large set of training data to produce satisfactory results, and suitable training data is especially hard to find for ETDs because of their domain-specific jargon and the breadth of subject matter a corpus covers. To overcome this limitation, we demonstrate the potential of transfer learning for automatic summarization of ETD chapters. We apply several combinations of deep learning models and training data to the ETD chapter summarization task and compare the outputs of the top performers.
