RRCHNM’s Custom API for Data-Driven Projects

RRCHNM is doing more and more work in computational history and historical data visualization. But we are also first and foremost a web shop: ever since Roy Rosenzweig saw the potential of the internet and left CD-ROMs behind, we’ve been committed to delivering history via people’s web browsers. Those two commitments are becoming increasingly compatible. For example, Ben Schmidt has written persuasively about the next decade of data programming happening in the browser via JavaScript. But combining data analysis and the web takes work. In this blog post, I want to explain how we are solving one aspect of that challenge via our custom data API.

We have a lot of datasets in play for RRCHNM’s projects. Some of the spatial datasets, such as Natural Earth and the Atlas of Historic County Boundaries, we use over and over across projects. AHCB is a critical part of both Mapping Early American Elections and American Religious Ecologies. Some of the datasets are small and intended for display. Others are large text corpora, such as Chronicling America, Gale’s Making of Modern Law, or all of the full text collections from the Library of Congress gathered as part of Computing Cultural Heritage in the Cloud, from which we compute derivative datasets of biblical quotations, legal citations, and the like. Even those derivative datasets can be fairly large and unwieldy. And other datasets are ones that we are transcribing ourselves using our DataScribe tool. These include data about religious congregations from the 1926 Census of Religious Bodies and about London’s bills of mortality.

The version of record for these datasets is typically a PostgreSQL database. We use a relational database for—well—all the reasons everyone else uses relational databases. In particular, we value the strong guarantees a database provides about the data being strongly typed and well structured. We find it useful to be able to access the exact same data via, say, R for data analysis and a web application for display. And of course, there is the ability to query and index the data, combine datasets through joins, provide shared access, and so forth. PostgreSQL is not an exciting choice; it may very well be the least exciting choice imaginable. But rock solid and boring is a great place to be for critical infrastructure. 

A screenshot of a database table
An example of what some of the data looks like from the American Religious Ecologies project. It might not look like much, but we had to reverse engineer an entire federal census in order to create it.

That still leaves the problem of getting the data out of the database and into the user’s browser. We needed a solution that could provide some key features:

  • The data should be delivered in a format easily usable for web visualization, which means JSON or GeoJSON.
  • The data should be reshaped as necessary. Frequently the form that data is stored in, typically some kind of normalized form, is not the way that the data should be structured for display.
  • Large datasets must be queryable. Although browsers can handle more and more data, that does not mean they should be made to do so; ideally, only the minimum amount of data necessary should be sent to the browser.
  • It should be easily extensible as we add new projects, and it should not require us to reinvent the wheel every time we start a new project. Rather, it should let us use existing data and functionality (such as the AHCB dataset I mentioned) across projects.
  • And, if the need arises, it should allow the browser to write back to the database.
A screenshot of the output of our data API
JSON from the data API. It’s not exciting, but if it’s what you need, it’s very useful.

Our solution was to create a custom data API for RRCHNM projects, which we call Apiary. (Yes, we know other projects use that name, but this is just our internal codename.) The API is written in Go, a simple but powerful language well suited to our needs, and is containerized with Docker for ease of deployment. At its core, the API is a thin, fairly minimal application that provides the necessary boilerplate to set up a database connection, a web server, and so forth. Individual endpoints that provide specific datasets are then added as handlers. Adding a new dataset, or a new view on a dataset, is thus as straightforward as writing a new function in Go. And since those handlers fall into a few different types, in most instances the main work of adding a new endpoint is writing a new SQL query.

Our data API is available under an open-source license on GitHub. (You can also take a look at the API’s endpoints.) To be clear, this project is a highly custom application, not a library or a general-purpose application. Nearly all of the handlers would be of no use to non-RRCHNM projects, and you would have to create your own database, queries, endpoints, and so forth. But as we look around at the landscape of digital history and digital humanities projects, we see other projects that have a similar need to store, structure, query, and display data in the browser. Perhaps the general idea of a data API could prove useful to other institutions and scholars.

What Came Before

Episode 2 of The Green Tunnel podcast launched today. In this episode we tell the story of the settler communities that existed along the route of the Appalachian Trail before the AT arrived. We’ve chosen three examples of those communities, each of which helps tell the story of life in the Appalachian mountains before Benton MacKaye dreamed up the AT in 1921. One community was home to people recently emancipated from enslavement, another was a thriving coal mining and railroad town until the mines played out, and the people of the third community had to rally their friends and neighbors to try to find a little boy who had wandered away from his schoolhouse in 1891. What was life like in the Appalachian mountains before the trail? Listen to our episode to learn more.

New Grant from Library of Congress to Create Teaching Resources for Pre-Service Teachers

RRCHNM is excited to announce a new grant award from the Library of Congress Teaching with Primary Sources program. As part of the grant, RRCHNM will develop teaching resources on difficult-to-teach topics, with a focus on teaching the history of religion in the United States in K-12 schools. Teachers, especially those new to the profession, are often reticent to approach topics that might be controversial or give rise to emotional reactions. As a result, important themes for understanding history can receive less attention than they deserve in K-12 education.

One proven way to approach these topics is to incorporate into lessons and activities primary source evidence for students to analyze directly, and the Library of Congress fortunately has a wealth of engaging primary sources related to the theme of religion in the United States freely available as part of its many digital collections. Working from this base of incredible resources, the project team will collaborate with scholar advisors Lincoln Mullen, John Turner, and Jannelle Legg to determine topics for the teaching resources. We will also collaborate with social studies education scholars Mark Helmsing here at Mason, Stewart Waters at the University of Tennessee-Knoxville, and Karen Kortecamp at George Washington University. These scholars will provide feedback and guidance on techniques for teaching emotional topics and will work with pre-service teachers to get their feedback on the resources as well.

These teaching resources will be published on the center’s popular teachinghistory.org website in time for the 2022-2023 school year.

Remembering the Creation of the September 11 Digital Archive

Shortly after the September 11 attacks, the team here at the Center for History and New Media, in collaboration with our partners at the American Social History Project at CUNY, began building a new kind of digital archive, one open to contributions from anyone who wanted a memory, a photograph, an email, or anything else preserved. With the support of the Alfred P. Sloan Foundation, we were able to assemble a team of collaborators committed to collecting and preserving the history of that terrible day. What began as a crazy idea turned into a project that now houses 72,000 personal stories, more than 6,000 images, and more than 900 audio and video files. The September 11 Digital Archive is truly one of the richest collections related to the events of that day and their aftermath.

Because the project became the model for countless open digital archives since 2001, we wanted to pause and look back at the earliest days of the Archive project. Below are the memories of three of the many dozens of faculty, staff, and students here at George Mason University who worked on our part of the project.

Dan Cohen (Dean of Libraries; Vice Provost for Information Collaboration, Northeastern University)
I started working at George Mason University and the Center for History and New Media just nine months before 9/11. I had joined the Center to work on a history of science project called ECHO — an acronym for “Exploring and Collecting History Online” — which, as the name suggests, was attempting to find new ways to acquire, preserve, and present materials and recollections that would be important for future historians to have. This seemed especially relevant for science, since most scientists who have ever lived are still alive today. If we are to write and understand the history of science, we need to take proactive measures right now to ensure a complete record for tomorrow’s researchers.

In the weeks after the extreme shock of 9/11 — I lost a childhood friend on one of the planes — CHNM regrouped and thought about what that tragic day meant in terms of American history, and also whether we should do something in digital media to memorialize or contextualize it. In New York City, the Alfred P. Sloan Foundation was thinking along the same lines, and since the foundation had funded ECHO, we coalesced around the idea of using the web to collect the history of 9/11.

At that point the ECHO project was still relatively new. The techniques and technology we were using and creating were only partially formed and not entirely tested in the field. And yet we were able to quickly turn around a site and a broad outreach strategy based on that relatively narrow forerunner initiative. I’m still amazed at how much we collected, but of course much of that was due to the hard work of my colleagues at CHNM and the American Social History Project at CUNY. The September 11 Digital Archive is now viewed largely as a “digital” project, but it succeeded because of a massive elbow-grease and shoe-leather effort.

Looking back, it is also notable how the crucible of the event and our reaction to it forged so many tools that we have used over the last two decades, leading to important software platforms like Omeka and an appreciation for the active collection of digital files before they drift into inaccessibility. Personally, the collaboration and principles of the September 11 Digital Archive remain with me to this day, and still animate my work in history, libraries, and archives.

Jim Safley (Senior Developer and Metadata Specialist at Digital Scholar)
Less than a year after the September 11, 2001 attacks, the Center for History and New Media hired me to make sense of the growing number of resources they were collecting for the September 11 Digital Archive. By the time I arrived, they had already collected thousands of personal stories, emails, and images related to 9/11, but due to the website’s increasing popularity, there was a demand to expand the collection to include disparate aggregations of resources, hard drives full of media, and even non-digital artifacts. My responsibilities were to familiarize myself with and organize the material, to formulate strategies for digitization and long-term preservation, to act as an intermediary between content creators and people seeking permission to use their material, and to field technical questions about the Archive.

Coming from a background in traditional archives and special collections, I wasn’t fully prepared for the move to digital media, but the fundamentals were the same, like the preeminence of metadata and having a systematic plan for accession, access, and preservation. I learned the nuances of digital technology on the job, which ultimately prepared me for the large-scale migrations from our earlier, primitive database to standards-based content management systems.

Given the Archive’s subject matter, I cannot speak of my early years there without mentioning the emotional impact of the work. There were times I found it challenging to familiarize myself with the material, not only for technical reasons, but because of the emotional toll it imposed. The raw sentimentality and distressing honesty of the material was sometimes difficult to process. I bore witness to everyone’s grief and anger at their most vulnerable moments. Much of the content was poignant and reflective, but some of it was offensive and repugnant, reflecting us at our best and our worst. As the collections specialist my duty was to treat the material objectively, respect its historical value, and preserve it for use in future scholarship.

Mills Kelly (Director, RRCHNM)
I started work at George Mason University and RRCHNM–CHNM in those days–exactly two weeks before the September 11 attacks. I had come to Mason with a list of ideas for digital projects, but like everyone at the Center that September, I was immediately diverted into helping on the September 11 Digital Archive.

Because of my many responsibilities at the university, I was always a bit player on the Archive project, but I still remember how difficult it was for us to obtain the first 300 contributions to the collection. Those first few weeks after the Archive launched were largely taken up by all of us asking friends, family, and students to write something about their experiences. It’s worth remembering that in 2001, almost no one had ever done what we were asking them to do – upload digital objects and personal stories into a database where those items would be archived in perpetuity. More than one person I asked to write up their personal story responded with some version of, “I’m not sure why you would want my story.”

In this sense, we were really plowing new ground for historians—creating an open archive that anyone could contribute to—and new ground for the public—inviting everyone to tell their story, give us their digital photographs, emails, and other digital artifacts. The wide proliferation of similar digital archives focused on the Covid-19 pandemic demonstrates just how normalized that out-of-left-field idea has become.

My biggest contribution to those early days of the project fits into Dan’s emphasis on elbow-grease and shoe-leather efforts. I took on the Shanksville part of the project and drove to the almost-impossible-to-find crash site just outside of that tiny Pennsylvania town. These days there is a national memorial on the site, but when I arrived it was just a scar on the side of a hill with a very informal memorial created by people in the community. I all but went door to door begging people to write up their stories. Only one of the local residents I spoke to on that first trip owned a personal computer, so everyone had a hard time understanding what I was asking them to do. Now they would just pull up Facebook on their phones and post away, but in 2001 I had to explain again and again what it meant to contribute to a digital archive. I’m very glad that we were able to capture the Shanksville memorial in its raw state before it became a formal site of memory.

Models of Argument-Driven Digital History

The Models of Argument-Driven Digital History website launched today: find it here. It contains a set of published journal articles annotated by their authors to highlight the use of digital methods to make historical arguments. The site is part of a larger project on which I have been collaborating with Lincoln Mullen since 2017, with the support of the Andrew W. Mellon Foundation, to encourage argument-driven digital history as a form of digital scholarship.

The first element of the project was a workshop of digital historians that developed the white paper “Digital History and Argument” (2017), which aimed

to help bridge the argumentative practices of digital history and the broader historical profession. On the one hand, it aims to demonstrate to the wider historical discipline how digital history is already making arguments in different forms than analog scholarship. On the other hand, it aims to help digital historians weave the scholarship they produce into historiographical conversations in the discipline.

One conclusion of the white paper was that scholars lack conceptual models of how to apply digital methods to historical questions. The articles on Models of Argument-Driven Digital History are annotated by their authors to serve as models of how to conceive and construct interpretations and arguments using digital history methods and materials, for digital historians to emulate and build on. The texts of the annotated articles are, by necessity, the versions the authors submitted to the journals, as each journal holds the rights to any revised versions and the final published version. Annotations offer an opportunity to explain the methods and decisions behind an article’s form and argument in a way that the current format for journal publication does not allow. It’s past time for journals to begin to reimagine their format to incorporate features such as annotation and to think in terms of what is possible online rather than in print, as it’s the online version with which most readers engage.

The site was conceived as an element of a special section of the Journal of Social History devoted to argument-driven digital history articles, developed in a series of workshops to support eight authors through the process of writing, peer review, and publication. Two articles developed in the workshops completed the process of peer review and appeared in the Summer 2021 issue of the journal and on the site. The authors of those articles, Leonardo Barleta and Rachel Midura, and the editors of the Journal of Social History who provided invaluable contributions to the workshops, Matt Karush and Sam Lebovic, were scheduled to appear in a panel on publishing digital history at the Annual Meeting of the American Historical Association in January 2020. After the cancellation of that meeting due to COVID, Lincoln had the idea of asking the Mellon Foundation’s permission to repurpose those funds to recruit authors of influential published digital history articles to annotate their scholarship with discussions of how they developed arguments using digital methods and responded to the comments of editors and reviewers. We were thrilled that Ruth Ahnert and Sebastian Ahnert, Melodee Beals, Sharon Block, Tim Hitchcock and William Turkel, Jo Guldi, Maeve Kane, and Caroline Winterer accepted our invitations to participate, allowing us to enrich the site with articles published in a wide variety of journals: ELH; Journal of Early American History; Journal of Women’s History; Law and History Review; Modern Intellectual History; and Victorian Periodicals Review. Lincoln and his co-author Kellen Funk also annotated their article “The Spine of American Law: Digital Text Analysis and U.S. Legal Practice,” published in the American Historical Review.

The collection of articles represents the three major digital methods that have been used by historians: spatial visualization; network analysis; and text analysis and topic modeling. Lincoln and I coauthored an introduction for the special section of the Journal of Social History, which also appears on the site, elaborating the patterns we see in the historical arguments made using these different methods: “Arguing with Digital History: Patterns of Historical Interpretation.”

Other methods and combinations of methods are becoming more prominent; it’s worth noting that both of the articles published in the special section, as well as Winterer’s article, use network graphs for spatial analysis. Predictably, articles using 3D for spatial analysis that would have been valuable additions have appeared since we developed this project. If people find this resource useful, perhaps we can expand it in the future.

This post was originally published here.

Transcribing Structured Data with the DataScribe Module for Omeka S

by Janet Hammond

DataScribe is an Omeka S module that helps ease laboriously detailed transcription work. Created at RRCHNM and funded by the NEH, the module supports a two-step process. The first step is to craft transcription forms for structured data, which is particularly useful when transcribing historical forms and other highly structured documents. (Greta Swain’s write-up on the American Religious Ecologies blog goes into detail about this process.) The second step is to use these forms to transcribe data into a format amenable to computational analysis, combining the data creation and data cleaning steps of a project into a single process. This blog post focuses on the second step.

As a Digital History Fellow at the Center, I transcribed the Bills of Mortality for the project Death by Numbers. The bills were weekly London publications that recorded where individuals died and why (Figure 1). They also include other information, such as the price of bread and the number of christenings. Focusing on the years 1603 to 1752, the project aims “to transcribe and publish the information in these bills in a dataset suitable for computational analysis.”

Bill of mortality listing number of deaths in each London parish
Bill of mortality listing all causes of death in a week and number of people who died of each cause
Figure 1. These images are examples of how the Bills of Mortality usually appear. On the left is the front of a bill, which states the city, the week number, the dates, the year, and the parishes. On the right is the back of a bill, which, among other things, shows causes of death and the number of people who died of each. Image sources: Folger Shakespeare Library under CC BY-SA 4.0.

To get the sources into DataScribe, I first created items in Omeka S using the CSV Import module, which converts spreadsheets of the primary source images’ metadata into individual items on the platform. I then uploaded the corresponding images to each item, matching them by their identifying information. (If the images are public online, you can include their URLs in the spreadsheet and skip that last step.) With the sources in place, Dr. Jessica Otis created the DataScribe form that I could use for transcription.

The DataScribe transcription page can be arranged in a variety of ways, and users can choose the layout that best suits their screens, data, and sources. Figure 2 displays the default view, with the image on the left, the entry fields in the middle, and a column for notes or other special notations on the far right. Figure 3 shows “focus mode,” which eliminates the far-right column. Pictured a bit further down, Figure 4 reflects one of my personal preferences, with the image on top and data entry below. With the latter option, I felt that, for this project, I could zoom in closer and pan across the page with greater ease.

screenshot of datascribe interface showing image to be transcribed and data entry boxes
Figure 2. Screenshot of DataScribe with the primary resource on the left, the entry field boxes in the middle, and a special actions column on the far right. Bill of Mortality Image Rights: CC-BY Wellcome Collection.  
image of datascribe screen interface in focus mode showing just image and entry boxes
Figure 3. Screenshot of DataScribe with “focus mode” enabled, which eliminates the far-right column seen in Figure 2. Bill of Mortality Image Rights: CC-BY Wellcome Collection.  

Historians—particularly digital historians—often discuss the frustration of web applications that lack flexibility, since these scholars rarely begin with clean data. As seen in Figure 4, the primary sources themselves are not strictly consistent. The bills could list many causes of death but usually limited the list to those relevant in a given week. For instance, there were no canker deaths in the week of January 19–26, 1668, so I used the “is missing” feature to indicate that the cause was inapplicable that week. When information is present but difficult to understand—whether the handwriting turns too sloppy or water damage occurred—transcribers can provide their best guess and mark “is illegible” to signal that reviewers need to look over that section especially carefully.

screenshot of datascribe interface showing extreme zoom on image
Figure 4. Screenshot of DataScribe with the primary resource listed above and boxes below to indicate whether a cause of death is listed or legible. It also depicts navigation features like “Guidelines.” Bill of Mortality Image Rights: CC-BY Wellcome Collection.  

When the data is unclean, it might seem unrealistic to invite a wide variety of transcribers. With DataScribe, project managers (or the like) can avoid confusion with the “Guidelines” tool (Figure 5). This feature is editable and permits the project lead to add instructions as needed. 

screenshot of transcription guidelines, showing text detailing transcription process
Figure 5. This screenshot displays the portal that appears when individuals click on “Guidelines” in the Death by Numbers project.

If the guidelines lack a necessary instruction, transcribers can leave personalized notes for the reviewer (Figure 6). This feature avoids the need to toggle back and forth between the transcription page and messages in email, Slack, or other systems. Transcribers can also flag a record by checking the box next to “Needs review.”

screenshot of DataScribe interface showing where to insert transcriber notes
Figure 6. This screenshot shows where transcribers can add notes or check the box, “Needs review.”

Successful organization of historical data benefits not only from a tailor-made transcription form but also from continued flexibility for transcribers to respond to the sources’ quirks. DataScribe offers “flexible structure”: project leads craft the model, and participants fill it out.

When transcription is complete, users can export a CSV file of all validated data, ready for computational analysis. This export makes it easy to share the data and gives others a way to readily build on it.
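To suggest what working with such an export looks like, here is a small sketch using Python’s standard csv module. The column names and values are invented stand-ins, not the actual schema of a DataScribe export:

```python
import csv
import io

# A stand-in for a DataScribe CSV export; real exports have many more
# columns, and these headers are invented for illustration.
exported = io.StringIO(
    "week,cause_of_death,count\n"
    "1668-03,plague,0\n"
    "1668-03,consumption,12\n"
    "1668-04,consumption,9\n"
)

# Total the deaths recorded for each cause across all weeks.
totals = {}
for row in csv.DictReader(exported):
    cause = row["cause_of_death"]
    totals[cause] = totals.get(cause, 0) + int(row["count"])

# totals is now {"plague": 0, "consumption": 21}
```

Because the transcription form enforced the structure up front, downstream analysis like this needs no further cleaning step.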

Jessica Otis Receives Major NSF Grant

RRCHNM Professor and Director of Public Projects Jessica Otis has been awarded $443,425 from the NSF to support her digital work on the history of the plague in early modern London. The project, called “Assessing the Arithmetic of Early Modern London’s Bills of Mortality,” involves the creation, publication and computational analysis of a dataset of weekly and annual mortality statistics produced for the city of London between 1603 and 1752.

Plague was one of the most dreaded diseases in early modern England. The city of London alone lost an estimated 225,000 people to plague in the century between 1563 and 1665. As an extension of government attempts to track plague deaths during outbreaks, London officials started publicly distributing a weekly series of mortality statistics called the Bills of Mortality at the turn of the seventeenth century. Jessica’s project uses the Bills of Mortality to investigate how lived experiences of plague outbreaks intersected with an emerging quantitative mentality among the people of early modern England. It examines how ordinary people aggregated, transformed, and interpreted death counts in order to draw conclusions about changes in the early modern use of and trust in numbers over time. In doing so, the project investigates contemporary perceptions of numbers and historicizes a quantitative method of knowledge generation that has become central to twenty-first-century understandings of the world.

The foundation of this project is the Bills of Mortality dataset, created through the digitization of primary sources and their subsequent transcription in DataScribe: specialized software designed at RRCHNM to create validated structured datasets from historical sources. The project deploys custom Python code on this dataset to assess the arithmetical accuracy of the bills’ internal calculations and their summary statistics. It combines this assessment with close reading of historical sources in order to draw conclusions about early modern use of and trust in numbers. Underlying these analyses are two questions: (1) Did people put their trust in the authority of the bills’ internal sums and extracted summary statistics because of the mathematical accuracy of their compilation, reflecting a belief in the importance of correctly quantifying mortality for assessing risk? (2) Did people put their trust in the bills’ numbers because they were numbers, seeing the bills and their mortality statistics as an inherently trustworthy form of knowledge because of its numerical basis?
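The core of such an arithmetic check is straightforward to sketch. The function below is illustrative only, not the project’s actual code, and the hand-entered numbers are hypothetical; the real analysis runs over the full transcribed dataset:

```python
def check_bill_arithmetic(parish_counts, printed_total):
    """Compare the sum of parish-level burial counts to a bill's printed total.

    Returns the recomputed sum, the printed total, whether the bill's own
    arithmetic is internally consistent, and the size of any discrepancy.
    """
    recomputed = sum(parish_counts)
    return {
        "recomputed": recomputed,
        "printed": printed_total,
        "consistent": recomputed == printed_total,
        "discrepancy": printed_total - recomputed,
    }

# A hypothetical bill whose printed total overcounts the parish entries by 2.
result = check_bill_arithmetic([12, 7, 31, 4], 56)
```

Run over thousands of weekly bills, a check like this turns the question of whether contemporaries could have caught the bills’ errors into something measurable.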

The project also supports a variety of secondary and student-driven analyses on the dataset. By including student researchers, the project models interdisciplinary paths for students interested in both historical and STEM research and demonstrates the myriad career and research options available at the intersection of history and STEM.

New Digital Military History Postdoctoral Fellowship (Relisted)

We are pleased to announce that the Roy Rosenzweig Center for History and New Media will be collaborating with the Defense POW/MIA Accounting Agency (DPAA) to host a new postdoctoral fellowship in digital military history here at our center. The DPAA is the Defense Department agency charged with providing the fullest possible accounting for America’s missing military personnel to their families and the nation. Researchers and scientists from DPAA travel to wherever American military personnel have died or gone missing in order to provide accurate and timely information to the families of those who remain unaccounted for. Funding for this three-year fellowship will allow an emerging scholar to work directly with DPAA staff and scientists on a variety of digital military history projects that further the agency’s mission.

The advertisement for this position will take you to the full job posting. If you are a military historian with some digital experience and hold (or are about to hold) a PhD in history, please consider applying to this exciting postdoctoral fellowship. The successful applicant will join our team and a cohort of three other talented postdoctoral fellows working on a diverse array of topics in digital history.

The person hired for this position can begin their employment immediately.

Note: Commonwealth of Virginia policies require candidates to have completed all requirements for their degree prior to their first day of employment. Students with a firm defense date for their degree can apply, but cannot work at George Mason as a postdoctoral fellow unless they have completed all requirements for the degree prior to the first day of employment.

Jessica Mack Receives Grant for Project on Universities and Power

Our postdoctoral fellow, Dr. Jessica Mack, has received funding from the 4VA consortium for her digital project Mapping the University: A Digital Resource for Studying Virginia Campus Histories. Her project is an interdisciplinary, collaborative research project that will analyze the histories of Virginia campuses using university archives, digital mapping, and aerial photographs.

Mapping the University is a timely inquiry into the planning, construction, and expansion of the campuses at George Mason University and Old Dominion University. In a collaborative effort that will prioritize building undergraduate and graduate research skills, students at the two universities will work with university archivists, historians, and digital scholars to better understand student life on campus, academic priorities, historical exclusions, and town-gown relations, and to contextualize recent discussions about building naming, monuments, and memorialization on campus. In collaboration with Mason’s Special Collections Research Center at Fenwick Library, RRCHNM, the Department of History and Art History, the Geography and Geoinformation Science Department, and colleagues at ODU, the project will provide training for students in archival research methods, digital history tools, and spatial analysis of materials such as maps, architectural plans, and aerial photographs. The result will be an open access, interactive website that will provide access to digitized archival documents alongside digital maps and interpretive content.

Lincoln Mullen Selected for Library of Congress Initiative

Professor Lincoln Mullen, Director of Computational History at RRCHNM, will join two other digital humanists at the Library of Congress as fellows working on the Computing Cultural Heritage in the Cloud initiative. Mullen will use this opportunity to extend the work he has done in his award-winning America’s Public Bible project and will be applying lessons from that project to a wide range of digital collections at the Library. The CCHC initiative is a first foray by the Library into using AI tools to transform access to knowledge and we are very proud of our colleague for being selected for this prestigious fellowship.