Transcribing Structured Data with the DataScribe Module for Omeka S
by Janet Hammond
DataScribe is an Omeka S module that eases laboriously detailed transcription work. Created at RRCHNM and funded by the NEH, the module supports a two-step process. The first step is to craft transcription forms for structured data, which is particularly useful when transcribing historical forms and other highly structured documents. (Greta Swain’s write-up on the American Religious Ecologies blog goes into detail about this process.) Transcribers can then use these forms to enter data in a format amenable to computational analysis, combining the data creation and data cleaning steps of a project into a single process. This blog post focuses on the second step.
As a Digital History Fellow at the Center, I transcribed the Bills of Mortality for the project Death by Numbers. The bills were weekly London publications that recorded where individuals died and why (Figure 1); they also noted other information, such as the price of bread and the number of christenings. Focusing on the years 1603 to 1752, the project aims “to transcribe and publish the information in these bills in a dataset suitable for computational analysis.”
Before the sources could appear in DataScribe, I first created items in Omeka S using the CSV Import module. This module converts spreadsheets of metadata about the primary source images into individual items on the platform. Using that identifying information, I then uploaded the corresponding images to each item. (If the images are already public online, you can include their URLs in the spreadsheet and skip this last step.) With the sources in place, Dr. Jessica Otis created the DataScribe form that I could use for transcription.
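As a rough illustration of that first step, the sketch below builds the kind of metadata spreadsheet CSV Import expects. The column names and URLs here are hypothetical: in CSV Import you map each column to a property (such as dcterms:title) during the import itself, so the exact headers are up to the project.

```python
import csv

# Hypothetical metadata spreadsheet for the Omeka S CSV Import module.
# Column names and URLs are illustrative; CSV Import lets you map each
# column to an Omeka S property when you run the import.
rows = [
    {"dcterms:title": "Bill of Mortality, 19-26 January 1668",
     "dcterms:date": "1668-01-26",
     "media_url": "https://example.org/images/bill_1668_01_26.jpg"},
    {"dcterms:title": "Bill of Mortality, 26 January-2 February 1668",
     "dcterms:date": "1668-02-02",
     "media_url": "https://example.org/images/bill_1668_02_02.jpg"},
]

with open("bills_metadata.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["dcterms:title", "dcterms:date", "media_url"])
    writer.writeheader()  # one header row, then one row per item
    writer.writerows(rows)
```

Including a media URL column, as in the parenthetical above, lets the importer attach the images automatically instead of uploading them item by item.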
The DataScribe transcription page can be laid out in a variety of ways, so users can choose the arrangement that best suits their screens, data, and sources. Figure 2 displays the default view, with the image on the left, the entry fields in the middle, and a column for notes and other special notations on the far right. Figure 3 shows “focus mode,” which eliminates the far-right column. Pictured a bit further down, Figure 4 reflects my personal preference, with the image on top and data entry below. For this project, I found that the latter layout let me zoom in closer and pan across the page more easily.
Historians, particularly digital historians, often voice frustration with web applications that lack flexibility, since these scholars rarely work with clean data. As Figure 4 shows, primary historical sources are not strictly consistent. The bills could report many causes of death but usually listed only those relevant to a given week. For instance, there were no canker deaths for January 19-26, 1668, so I used the “is missing” feature to indicate that the field did not apply that week. When information is present but difficult to read, whether because the handwriting is sloppy or the page is water damaged, transcribers can enter their best guess and mark “is illegible” to signal that reviewers should look over that section especially carefully.
When the data is messy, it might seem unrealistic to invite a wide range of transcribers. With DataScribe, project managers can head off confusion with the “Guidelines” tool (Figure 5). This feature is editable, so the project lead can add instructions as needed.
If the Guidelines lack a necessary instruction, transcribers can leave personalized notes for the reviewer (Figure 6). This feature avoids the need to toggle between the transcription page and messages in email, Slack, or other systems. Transcribers can also “flag” a record by checking the box next to “Needs review.”
Successfully organizing historical data benefits not only from a tailor-made transcription form but also from continued flexibility for transcribers to respond to what they see. DataScribe offers a “flexible structure”: project leads craft the model, and participants fill it out.
When transcription is complete, users can export a CSV file of all validated data, ready for computational analysis. This encourages data sharing and gives others a straightforward way to build on the project’s work.
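To show what that analysis step might look like, here is a minimal sketch that reads a CSV shaped like a validated export. The column names (“week,” “plague,” “total_burials”) and the figures are purely illustrative; a real export’s columns come from the fields defined in the project’s transcription form.

```python
import csv
import io

# Illustrative stand-in for a validated DataScribe CSV export.
# Column names and numbers are invented for this example.
sample_export = """week,plague,total_burials
1665-09-12,7165,8297
1665-09-19,5533,6460
1665-09-26,4929,5720
"""

reader = csv.DictReader(io.StringIO(sample_export))
rows = list(reader)

# Simple computation: share of burials attributed to plague each week.
for row in rows:
    share = int(row["plague"]) / int(row["total_burials"])
    print(f'{row["week"]}: {share:.1%} of burials attributed to plague')
```

Because the export is plain CSV, the same file drops directly into pandas, R, or a spreadsheet without any further cleaning, which is the payoff of folding data cleaning into the transcription step.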