Ada-Humboldt-Hack

Layering Datasets from the "Viral Texts Project" onto Existing Collections

Topic 

Many Digital Humanities projects, including Viral Texts, produce large datasets derived from existing collections, such as the digitized newspapers in the Library of Congress, Internet Archive, and Staatsbibliothek zu Berlin. It is not scalable to reingest this derived data into those collections. Many cultural heritage institutions, however, have converged on the International Image Interoperability Framework (https://iiif.io/) for providing access to digitized books, images, and other data. IIIF provides facilities to overlay annotations, transcriptions, and other data on top of the original images, audio, and video, or to reassemble documents out of fragments in multiple locations. We propose to build tools to help researchers discover, browse, and search derived data related to IIIF library collections. By creating a tool that overlays research data atop its source collection website, we aim to model how digital humanists might make research data more readily available to scholars, students, and members of the public who lack the technical expertise required to find and make use of the data repositories, free-standing databases, and similar solutions that currently dominate the digital humanities field.
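To make the overlay mechanism concrete, here is a minimal sketch of what such a layer could look like as a W3C Web Annotation attached to a IIIF canvas. Every URI, coordinate, and label below is an invented placeholder, not an identifier from any actual collection:

    import json

    # Hypothetical URIs: the canvas belongs to the host library's IIIF manifest;
    # the annotation is served separately, from the research project's own server.
    CANVAS = "https://example.org/iiif/newspaper-issue/canvas/p1"
    ANNO_SERVER = "https://viral-texts.example.org/annotations"

    def make_annotation(anno_id, text, x, y, w, h):
        """Build a Web Annotation pinning derived data to a canvas region."""
        return {
            "@context": "http://www.w3.org/ns/anno.jsonld",
            "id": f"{ANNO_SERVER}/{anno_id}",
            "type": "Annotation",
            "motivation": "commenting",
            "body": {"type": "TextualBody", "value": text, "format": "text/plain"},
            # the xywh fragment selects the article's pixel region on the page image
            "target": f"{CANVAS}#xywh={x},{y},{w},{h}",
        }

    page = {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": f"{ANNO_SERVER}/page/1",
        "type": "AnnotationPage",
        "items": [make_annotation("cluster-42", "Reprinted in 117 other papers",
                                  120, 340, 800, 600)],
    }

    print(json.dumps(page, indent=2))

Because the annotation only references the library's canvas by URI and lives on the project's own server, the host collection needs no changes: a IIIF viewer simply loads the extra AnnotationPage alongside the original manifest.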

Venue

Freie Universität Berlin,
14195 Berlin

Data

As a concrete example, consider the clusters of reprints detected by the Viral Texts project (https://viraltexts.org) within a large set of OCR'd nineteenth-century newspapers. We would like users browsing a newspaper page on the Library of Congress' Chronicling America website (https://chroniclingamerica.loc.gov/) or the German newspaper portal to be alerted to the fact that some articles on that page were reprinted tens, hundreds, or even thousands of times. Clicking on those annotations would take the user to a list of other versions of that text from various sources. One could also assemble groups of passages as "collection" objects in IIIF. Besides using these clusters or composite documents to display reprints from the Viral Texts project, we could generalize this idea to display groups of related reports on the same event for the Virality of Racial Terror project, groups of mentions of the same entity, groups of related images, etc. Other projects could aggregate passages that correspond to the same (addressable part of a) work. For instance, scholars tracking serial fiction could group all instances of the same chapter of a novel. Scholars of manuscript transmission could trace excerpts copied into composite manuscripts alongside other texts. This sort of sub-volume metadata is often left out of catalogue records.
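A hedged sketch of the "collection" idea, again with invented identifiers: one reprint cluster serialized as a IIIF Presentation 3 Collection whose items point at the manifests of the issues in which the text appears.

    import json

    # Invented example data: each witness pairs a display label with the
    # manifest URI of a digitized issue in some host institution's collection.
    witnesses = [
        ("Daily Eagle, 2 Mar. 1849", "https://example.org/iiif/sn111/1849-03-02/manifest"),
        ("Weekly Courier, 15 Mar. 1849", "https://example.org/iiif/sn222/1849-03-15/manifest"),
    ]

    collection = {
        "@context": "http://iiif.io/api/presentation/3/context.json",
        "id": "https://viral-texts.example.org/clusters/42/collection",
        "type": "Collection",
        "label": {"en": ["Reprint cluster 42"]},
        "items": [
            {"id": uri, "type": "Manifest", "label": {"en": [label]}}
            for label, uri in witnesses
        ],
    }

    print(json.dumps(collection, indent=2))

The same structure would serve the generalizations above: a Collection per event, per entity, or per chapter of a serialized novel, each aggregating manifests (or canvas ranges) from several institutions.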

Use-Cases

Another use case arises with document transcriptions. As Cordell (2016) observes, the output of an OCR model constitutes an edition of a text. The Viral Texts project produced new OCR transcripts of some of the early data in Chronicling America that had low transcription accuracy. Users could be made aware of, and given access to, these new transcripts without the underlying library data having to be updated. This same infrastructure could, for example, be used to disseminate the output of our Arabic-script OCR models on Arabic and Persian manuscripts in the Staatsbibliothek zu Berlin.
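In IIIF Presentation 3 terms, a corrected transcript can be layered on with the "supplementing" motivation, which the specification reserves for transcriptions and similar content derived from the target. A minimal sketch with placeholder URIs:

    import json

    # Placeholder URIs; a real deployment would target the library's actual canvases.
    transcript = {
        "@context": "http://www.w3.org/ns/anno.jsonld",
        "id": "https://viral-texts.example.org/ocr/anno/1",
        "type": "Annotation",
        # "supplementing" marks the body as a transcription of the target
        "motivation": "supplementing",
        "body": {
            "type": "TextualBody",
            "value": "Corrected OCR text of the article ...",
            "format": "text/plain",
            "language": "en",
        },
        "target": "https://example.org/iiif/newspaper-issue/canvas/p1#xywh=0,0,2400,3600",
    }

    print(json.dumps(transcript, indent=2))

A viewer that understands supplementing annotations can then offer the corrected text for display or search without any change to the library's stored OCR.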

Participants

  • Ryan Cordell (University of Illinois Urbana-Champaign, School of Information Science) - Alexander von Humboldt Senior Fellow 
  • David Smith (Northeastern University - Khoury College of Computer Science) 
  • Daniel Evans (University of Illinois Urbana-Champaign, School of Information Science) 
  • Dennis Mischke (Freie Universität Berlin, University Library, Ada Lovelace Center) - Alexander von Humboldt Host
  • ...please register to participate