No Computer Left Behind

by Daniel J. Cohen and Roy Rosenzweig

February 2006

Chronicle of Higher Education, Feb. 24, 2006.

“I hate Scantron,” one exasperated high-school student wrote on an online bulletin board earlier this year, referring to the ubiquitous multiple-choice forms covered with ovals, named for the corporation that has manufactured them since 1972. An older student replied: “Get used to seeing them. Colleges are all about Scantrons.” Noting that it can take 30 minutes to grade an essay question, the older student explained, “That’s why most instructors use Scantron, or at least multiple choice, for most of their tests.”

But multiple-choice tests not only torment students; they also feature centrally in the increasingly vitriolic debate over standardized testing. Do they adequately measure student learning? Do they simply force teachers to “teach to the test”? In our own discipline of history, policy makers, teachers, and scholars have begun to debate whether history should be added to the list of subjects tested in the schools under the No Child Left Behind Act. And we can safely predict that when the National Assessment of Educational Progress history tests are given again this year, we will see a new round of hand-wringing over “why students don’t know any history.” Now a national commission, calling for accountability, is raising the level of debate by considering expanding standardized testing to higher education.

Such student complaints and adult debates about standardized tests could soon become obsolete – if, as we argue, the digital technology that allows students to share their grievances online undermines the very nature of multiple-choice exams. As the calculator forever altered mathematical education – eventually muscling its way into the test room when it became clear that long division had become a useless relic of the past – what if modern technology is about to make the format of these tests as quaint as a slide rule? What if students will have in their pockets a device that can rapidly and accurately answer, say, multiple-choice questions about history? Would teachers start to face a revolt from (already restive) students, who would wonder why they were being tested on their ability to answer something that they could quickly find out about on that magical device?

It turns out that most students already have such a device in their pockets, and to them it’s less magical than mundane. It’s called a cellphone. That pocket communicator is rapidly becoming a portal to other simultaneously remarkable and commonplace modern technologies that, at least in our field of history, will enable the devices to answer, with a surprisingly high degree of accuracy, the kinds of multiple-choice questions used in thousands of high-school and college history classes, as well as a good portion of the standardized tests that are used to assess whether the schools are properly “educating” our students. Those technological developments are likely to bring the multiple-choice test to the brink of obsolescence, mounting a substantial challenge to the presentation of history – and other disciplines – as a set of facts or one-sentence interpretations and to the rote learning that inevitably goes along with such an approach.

Surprisingly, multiple-choice testing is less than a century old. According to the psychologist Franz Samelson, the multiple-choice question made its first published appearance in 1915 in a “silent reading test” devised by Frederick J. Kelly, the director of the Training School at the State Normal School in Emporia, Kan. Kelly’s innovation responded, in part, to growing complaints about the subjectivity of grading in standardized tests that had become increasingly common at the turn of the century. But equally important, he wanted to make tests cheaper and faster to grade. How could you administer mass standardized tests and establish “objective” test norms without some quick and easy method of grading? The need for easily scoreable exams became even more compelling two years later, when the United States entered World War I, and psychologists convinced military leaders that measuring the “intelligence” of almost two million soldiers would improve military efficiency.

In the mid-1920s, the College Board added multiple-choice questions to its SAT’s, previously just a set of essay questions, and sealed the triumph of the new format. “The multiple-choice test – efficient, quantitative, capable of sampling wide areas of subject matter and easily generating data for complicated statistical analyses,” Samelson writes, became “the symbol … of American education.” Along the way, the technologies of testing became more elaborate – moving from the scoring stencils devised around World War I to the IBM 805 Test Scoring Machine, which appeared in the late 1930s and could read pencil marks, to the Scantron forms and machines that are the bane of today’s high-school and college students.

The IBM 805 and the Scantron were effective and widespread 20th-century technologies. But they pale in comparison to the power and ubiquity of two 21st-century technological developments that may change the debate over multiple-choice testing. The first is the World Wide Web – not only the largest record of human knowledge in the history of our species, but also the most open and available.

We can already hear the snickers from our colleagues: “You want to send students to the wilds of the Web to find the answers to exam questions?” Scholars in history (as well as in other fields) have generally viewed the state of knowledge on the Web with skepticism. In 2004 Leon Botstein, president of Bard College and also a historian, told The New York Times that a Google search of the Web “overwhelms you with too much information, much of which is hopelessly unreliable or beside the point. It’s like looking for a lost ring in a vacuum bag. What you end up with mostly are bagel crumbs and dirt.” Scholars like Botstein – used to the detailed analysis of individual documents for credibility and import – look in horror at the many Web pages with factual errors or outright fictions. Even if students could Google any topic they wanted from their cellphone, they would surely choose some of those errant Web pages, select some “bagel crumbs and dirt,” and flunk their exams.

But what if, as in statistics, the extremes could cancel each other out, and the errors become swamped by the truth? Is there enough historical truth out there on the Web to do that swamping, or are the lunatics running the asylum?

Computer scientists have an optimistic answer for worried scholars. They argue that the enormous scale and linked nature of the Web make it possible for it to be “right” in the aggregate while sometimes very wrong on specific pages. The Web “has enticed millions of users to type in trillions of characters to create billions of Web pages of on average low-quality contents,” write the computer scientists Rudi Cilibrasi and Paul Vitányi in a 2004 essay. Yet, they continue, “the sheer mass of the information available about almost every conceivable topic makes it likely that extremes will cancel and the majority or average is meaningful in a low-quality approximate sense.” In other words, although the Web includes many poorly written and erroneous pages, taken as a whole the medium actually does quite a good job encoding meaningful data.
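The “extremes will cancel” argument can be illustrated with a toy simulation. The sketch below is ours, not Cilibrasi and Vitányi’s: the function names, the 40 percent accuracy figure, and the list of wrong years are all hypothetical, and it assumes that pages err independently and that their errors scatter across several different wrong answers. Under those assumptions, the most common claim can be the true one even when most individual pages are wrong.

```python
import random
from collections import Counter


def simulate_pages(true_value, wrong_values, accuracy, n_pages, seed=0):
    """Return one claim per simulated page (hypothetical model).

    Each page states the true value with probability `accuracy`;
    otherwise it states one of `wrong_values`, chosen uniformly at random.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [
        true_value if rng.random() < accuracy else rng.choice(wrong_values)
        for _ in range(n_pages)
    ]


def aggregate(claims):
    """Return the single most frequently stated claim."""
    return Counter(claims).most_common(1)[0][0]


# Assumed numbers: only 40% of pages are right, but the wrong pages
# disagree with one another, so no single error rivals the truth.
claims = simulate_pages(
    true_value=1927,
    wrong_values=[1926, 1928, 1929, 1937],
    accuracy=0.4,
    n_pages=1000,
)
print(aggregate(claims))
```

The caveat matters as much as the result: if the errors were not scattered (a widely repeated myth, say), the same tally would confidently return the wrong answer.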

At the same time that the Web’s openness allows anyone access, it also allows any machine connected to it to scan those billions of documents, which leads to the second development that puts multiple-choice tests in peril: the means to process and manipulate the Web to produce meaningful information or answer questions. Computer scientists have long dreamed of an adequately large corpus of text to subject to a variety of algorithms that could reveal underlying meaning and linkages. They now have that corpus, more than large enough to perform remarkable new feats through information theory.

For instance, Google researchers have demonstrated (but not yet released to the general public) a powerful method for creating “good enough” translations – not by understanding the grammar of each passage, but by rapidly scanning and comparing similar phrases on countless electronic documents in the original and second languages. Given large enough volumes of words in a variety of languages, machine processing can find parallel phrases and reduce any document into a series of word swaps. Where once it seemed necessary to have a human being aid in a computer’s translating skills, or to teach that machine the basics of language, swift algorithms functioning on unimaginably large amounts of text suffice. Are such new computer translations as good as a skilled, bilingual human being? Of course not. Are they good enough to get the gist of a text? Absolutely. So good, in fact, that the National Security Agency and the Central Intelligence Agency increasingly rely on that kind of technology to scan, sort, and mine gargantuan amounts of text and communications (whether or not the rest of us like it).

As it turns out, “good enough” is precisely what multiple-choice exams are all about. Easy, mechanical grading is made possible by restricting possible answers, akin to a translator’s receiving four possible translations for a sentence. Not only would those four possibilities make the work of the translator much easier, but a smart translator – even one with a novice understanding of the translated language – could home in on the correct answer by recognizing awkward (or proper) sounding pieces in each possible answer. By restricting the answers to certain possibilities, multiple-choice questions provide a circumscribed realm of information, where subtle clues in both the question and the few answers allow shrewd test takers to make helpful associations and rule out certain answers (for decades, test-preparation companies like Kaplan Inc. have made a good living teaching students that trick). The “gaming” of a question can occur even when the test taker doesn’t know the correct answer and is not entirely familiar with the subject matter.

Are there algorithms that might identify connections between a multiple-choice question and the correct answer, thus providing a means of effectively mining those billions of words suddenly accessible free to everyone with an Internet connection – a group that already includes many people with cellphones? To test the ratio of accurate to inaccurate historical information on the Web and to pursue the idea that machine reasoning might, as with the new computational translation services, provide “good enough” answers to historical questions, one of us, Daniel, created a software agent called “H-Bot.” On the Center for History and New Media Web site, we have a public beta test of that software that you can use to answer simple factual questions about history using natural language (http://chnm.gmu.edu/tools/h-bot). For instance, ask it, “When was Nelson Mandela born?” It responds, “Nelson Mandela was born on July 18, 1918.” Although it has a fast mode that looks at “trusted sources” first (i.e., online encyclopedias and dictionaries), it can also use the entire Web to answer questions using algorithms drawn from computer science.

Suppose you want to know when Charles Lindbergh took his famous flight to Paris. Asking H-Bot “When did Charles Lindbergh fly to Paris?” would prompt the software (using its “pure” mode, which does not simply try to find a reliable encyclopedia entry) to query Google for Web pages that include the words “Charles Lindbergh,” “flew,” and “Paris.” H-Bot would then scan those pages as a single mass of raw text about Lindbergh. It would search, in particular, for words that look like years (i.e., positive three- and four-digit numbers), and it would indeed find many instances of “1902” and “1974” (Lindbergh’s birth and death years). But most of all, it would find a statistically indicative spike around “1927,” the year that Lindbergh made his pioneering flight to Paris. By scanning and processing many Web sites – sites like the official Lindbergh Foundation site and the amateur enthusiast Ace Pilots site in the same breath – H-Bot would accurately answer the user’s historical question, disregarding as statistical outliers the few sites that incorrectly state the year of his flight.
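The year-counting step just described can be sketched in a few lines. This is a minimal illustration rather than H-Bot’s actual code: the function name, the regular expression, and the four sample “pages” are our own stand-ins, and a real run would fetch its text from search results instead of a hard-coded list.

```python
import re
from collections import Counter

# Words that "look like years": standalone three- or four-digit numbers.
YEAR_PATTERN = re.compile(r"\b\d{3,4}\b")


def most_likely_year(pages):
    """Tally every year-like number across all pages; return the most common."""
    counts = Counter()
    for text in pages:
        counts.update(YEAR_PATTERN.findall(text))
    if not counts:
        return None  # no year-like numbers found anywhere
    year, _ = counts.most_common(1)[0]
    return int(year)


# Hypothetical page snippets, including one that gets the year wrong.
sample_pages = [
    "Charles Lindbergh (1902-1974) flew from New York to Paris in 1927.",
    "In 1927 Lindbergh landed in Paris after 33 hours aloft.",
    "Lindbergh, born 1902, flew to Paris in 1927.",
    "Some say Lindbergh flew to Paris in 1928.",
]

print(most_likely_year(sample_pages))
```

In this sample the birth and death years appear, as does one erroneous “1928,” but the tally for “1927” is the largest, so the outlier is simply outvoted.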

While simple statistical methods can process the raw material of the Web to answer basic historical questions, more involved algorithms can provide the answers to more complex questions. Using a theory called “normalized information distance,” a special version of H-Bot programmed to take multiple-choice tests can tackle not only question-and-answer pairs similar to the Lindbergh question, but also questions from the NAEP U.S. history exam that supposedly invoke the higher-order processes of historical thinking, and that should be answerable only if you truly understand the subject matter and are able to reason about the past. For example, a 1994 NAEP question asked, “What is the purpose of the Bill of Rights?” It provided the following options:

(a) To say how much Americans should pay in taxes

(b) To protect freedoms like freedom of speech

(c) To describe the jobs of the President and Congress

(d) To make Washington, D.C., the capital of the United States

H-Bot cannot understand the principles of taxation, liberty, or the purviews of the executive and legislative branches. But it need not comprehend those concepts to respond correctly. Instead, to figure out the significance of the Bill of Rights, H-Bot found that Web pages on which the phrase “Bill of Rights” and the word “purpose” appear contain the words “freedom” and “speech” more often than words like “taxes,” “President,” or “Washington.” (To be more precise, H-Bot’s algorithms actually compared the normal frequency of those words on the Web with the frequency of those words on relevant pages.) H-Bot thus correctly surmised that the answer was (b).
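The frequency comparison in that parenthetical can be sketched roughly as follows. This is a simplified stand-in, not H-Bot’s algorithm and not the full “normalized information distance” calculation: it scores each option by how much more often its key words appear on relevant pages than on the Web at large. The sample pages, the background rates, and the keyword lists are invented for illustration.

```python
import re


def lift(word, relevant_pages, background_rate):
    """How over-represented a word is on relevant pages.

    Returns (share of relevant pages containing the word) divided by
    (the word's assumed share of pages on the Web as a whole).
    """
    hits = sum(
        1 for page in relevant_pages
        if word in re.findall(r"[a-z]+", page.lower())
    )
    return (hits / len(relevant_pages)) / background_rate[word]


def pick_answer(options, relevant_pages, background_rate):
    """Choose the option whose keywords have the highest average lift."""
    def score(keywords):
        return sum(lift(w, relevant_pages, background_rate) for w in keywords) / len(keywords)
    return max(options, key=lambda label: score(options[label]))


# Hypothetical pages that mention both "Bill of Rights" and "purpose".
relevant_pages = [
    "The purpose of the Bill of Rights is to protect freedom of speech and religion.",
    "The Bill of Rights: its purpose was to guarantee freedom of speech, press, and assembly.",
    "Congress debated the purpose of the Bill of Rights; taxes were not its subject.",
    "The purpose of the Bill of Rights is to limit government and protect freedom.",
]

# Invented baseline: fraction of all Web pages containing each word.
background_rate = {
    "freedom": 0.05, "speech": 0.04, "taxes": 0.03,
    "president": 0.05, "washington": 0.06, "capital": 0.05,
}

# Keywords per option, taken from the words the essay says were compared.
options = {
    "a": ["taxes"],
    "b": ["freedom", "speech"],
    "c": ["president"],
    "d": ["washington", "capital"],
}

print(pick_answer(options, relevant_pages, background_rate))
```

Dividing by the background rate is the point of the exercise: a word like “Washington” is common everywhere on the Web, so raw counts alone would overstate its connection to any particular question.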

We gave H-Bot that and dozens of other publicly available multiple-choice questions from the fourth-grade NAEP American-history exam, on which such questions composed about two-thirds of the total. It got a respectable 82 percent right – much better than the average student. Moreover, the experimental H-Bot is only a preliminary version programmed by a humble historian of science with help from a (very bright) high-school student, Simon Kornblith. Imagine how well it could do with financing and legions of math Ph.D.’s to attack problems on behalf of search-engine giants like Google.

Before we disdainfully dismiss H-Bot’s test-taking prowess as a parlor gimmick, we need to remember that we have built a good deal of our educational system around such multiple-choice tests. They are ubiquitous even in college classrooms and are widely cited as evidence of national “ignorance” in history and other fields. Moreover, our attachment to these tests (as Frederick Kelly knew well) has more to do with economics and technology than with teaching and learning. “We use these tests,” Sam Wineburg, a cognitive psychologist who teaches at Stanford University’s School of Education, writes in The Journal of American History, “not because they are historically sound or because they predict future engagement with historical study, but because they can be read by machines that produce easy-to-read graphs and bar charts.”

Moreover, we should remember the resistance that accompanied the entry of the calculator into the exam room. Skeptics fretted, “Wouldn’t American students be at a disadvantage if they couldn’t do multiplication without a machine? Doesn’t the ability to do such processes unassisted lead to a deeper understanding of mathematics itself?” But most people quickly realized that providing calculators to students freed them up to work on more complex and important aspects of mathematics, rather than worrying about memorizing multiplication tables.

The combination of the cellphone and the magnificent, if imperfect, collective creation of the Web with some relatively simple mathematical formulas has given us a free version of what our provost and historian, Peter Stearns, proposed to us a couple of years ago – the Cliolator, a play on the muse of history and the calculator. Stearns observed that many educators would resist the adoption of the Cliolator, as they had the calculator. But he also argued, rightly in our view, that it would improve history education by displacing the fetishizing of factual memorization.

Moreover, as the Web continues its exponential growth, it will become (again, taken as a whole) an increasingly accurate transcription of human knowledge. A basic principle of information theory is that the larger the corpus, the more accurately it encodes meaning over all and the more useful it is for data-mining applications. And consider what will happen to the quality of information on the Web after the completion of the vast initiatives of Google and others to digitize the high-caliber information in books.

By the time today’s elementary-school students enter college, it will probably seem as odd to them to be forbidden to use digital devices like cellphones, connected to an Internet service like H-Bot, to find out when Nelson Mandela was born as it would be to tell students now that they can’t use a calculator to do the routine arithmetic in an algebra equation. By providing much more than just an open-ended question, multiple-choice tests give students – and, perhaps more important in the future, their digital assistants – more than enough information to retrieve even a fairly sophisticated answer from the Web. The genie will be out of the bottle, and we will have to start thinking of more meaningful ways to assess historical knowledge or “ignorance.”

At around the same time that Kelly was pioneering the multiple-choice test on the Kansas frontier, the educational psychologists J. Carleton Bell and David F. McCollum, no doubt influenced by the same mania for testing that was sweeping the country, began a study of the “attainments” of history students in Texas. At the outset, they wrote, they surmised that they might, for example, assess students’ “ability to understand present events in the light of the past,” or their “skill in sifting and evaluating a mass of miscellaneous materials” and “constructing … a straightforward and probable account,” or their aptitude at providing “reflective and discriminating replies to ‘thought questions’ on a given historical situation.” Bell and McCollum then noted a final possibility, that “historical ability may be taken as the readiness with which pupils answer questions revealing the range of their historical information,” although “this is perhaps the narrowest, and … the least important type of historical ability.” But, they continued, “it is the one which is the most readily tested, and was, therefore, chosen for study in the present investigation.” As Wineburg observes, “While perhaps the first instance, this was not the last in which ease of measurement – not priority of subject-matter understanding – determined the shape and contour of a research program.”

Of course Bell and McCollum might have had an even easier time if they had gotten word of Kelly’s innovations in testing. Instead they asked students, for example, to write down “the reason for the historic importance of each of 10 representative dates” (like 1789). That required them, to their disappointment, to give partial credit for answers, including some “evaluated quite arbitrarily.” Very soon, however, their factualist approach would be married to the seemingly objective multiple-choice test, and historical understanding would be reduced to a filled-in bubble on a form.

Now that newer technology threatens the humble technology of the multiple-choice exam, we have an opportunity to return to some of the broader and deeper measures of understanding in history – and other subjects – that Bell and McCollum knew quite well before they and others rushed down the path that has led us and our students to Scantron purgatory. As Bell and McCollum knew (like students who complain about Scantrons), it takes considerably more time and effort to grade essay questions that, for example, measure a student’s ability to synthesize historical sources into a complex narrative. But, as the Document Based Questions widely used in Advanced Placement history tests demonstrate, such exams are not incompatible with standardized, national measurements. They just take a little more time to grade. Indeed, the creators of the initial NAEP U.S. history examination worried that “one limitation of many traditional assessments is that they frequently present pieces of information or problems to be solved in isolation.” Yet their response – placing related multiple-choice questions together in “theme blocks” while adding some short “constructed response” questions – only modestly addressed that problem.

Although we tend to believe that “new technology” always saves time and money, the marriage of the Web with the cellphone augurs the demise of the inexpensive technologies of multiple-choice tests and grading machines. But we will not be among the mourners at the funeral of the multiple-choice test. Such exams have fostered a school-based culture of rote memorization that has little to do with true learning. And the resources that it will take to offer and grade more complex and thoughtful exams pale in comparison to those being wasted on pointless approaches to measuring student comprehension. Politicians who insist on raising the “stakes” in standardized testing need to provide the funds for people rather than machines to do the grading. If we are going to continue to insist on having machines grade our students, then we should expect that they are going to insist on being able to answer exam questions using the machines in their pockets.

Author Bio

Daniel J. Cohen is an assistant professor of history and Roy Rosenzweig a professor of history at George Mason University. They are affiliated with the university’s Center for History and New Media and are co-authors of Digital History: A Guide to Gathering, Preserving, and Presenting the Past on the Web (University of Pennsylvania Press, 2005).