Opening Books and the National Corpus of Graduate Research

Virginia Tech University Libraries, in collaboration with Virginia Tech Computer Science and Old Dominion University Computer Science, will bring computational access to book-length documents through a research and piloting effort employing Electronic Theses and Dissertations (ETDs). The library and archives fields lack research on extracting and analyzing segments of long documents (chapters, reference lists, tables, figures), as well as methods for summarizing individual chapters of longer texts to enable findability. The project applies cutting-edge computer science and machine learning technologies to advance the discovery, use, and potential reuse of the knowledge hidden in the text of books and book-length documents. By focusing on libraries’ ETD collections, the research will enhance libraries’ ETD programs by devising effective and efficient methods for opening the knowledge currently hidden in the rich body of graduate research and scholarship.

The project is divided broadly into three research areas.

  • RA 1: Document analysis and extraction
  • RA 2: Adding value
  • RA 3: User services

Research Area 1

Research Area 1 attempts to answer the research question “How can we effectively identify and extract key parts (chapters, sections, tables, figures, citations), in both born-digital and page-image formats?”

To answer this research question, we divide the work into three tasks:

  • Task 1: Compiling an ETD sample
  • Task 2: Building ground truth for ETD segmentation and extraction
  • Task 3: Researching ETD segmentation and extraction models

Research Area 2

Research Area 2 attempts to answer the research question “How can we develop effective automatic classification as well as chapter summarization techniques?”

The following tasks will help us accomplish these goals:

  • Task 4: Building ground truth for topical classification of ETDs and their chapters
  • Task 5: Building topical classification models for ETDs and their chapters
  • Task 6: Building ground truth for ETD chapter summarization
  • Task 7: Researching summarization models for ETD chapters

Research Area 3

Research Area 3 attempts to answer the research question “How can our ETD digital library most effectively serve stakeholders?” We primarily focus on investigating new ways to interact with ETD collections and studying which of them best support the needs of the user community.


This project was made possible in part by the Institute of Museum and Library Services (LG-37-19-0078-19).