Still wondering about the February bar results? I continue that discussion here. As explained in my previous post, NCBE premiered its new Multistate Bar Exam (MBE) in February. That exam covers seven subjects, rather than the six tested on the MBE for more than four decades. Given the type of knowledge tested by the MBE, there is little doubt that the new exam is harder than the old one.
If you have any doubt about that fact, try this experiment: Tell any group of third-year students that the bar examiners have decided to offer them a choice. They may study for and take a version of the MBE covering the original six subjects, or they may choose a version that covers those subjects plus Civil Procedure. Which version do they choose?
After the students have eagerly indicated their preference for the six-subject test, you will have to apologize profusely to them. The examiners are not giving them a choice; they must take the harder seven-subject test.
But can you at least reassure the students that NCBE will account for this increased difficulty when it scales scores? After all, NCBE uses a process of equating and scaling scores that is designed to produce scores with a constant meaning over time. A scaled score of 136 in 2015 is supposed to represent the same level of achievement as a scaled score of 136 in 2012. Is that still true, despite the increased difficulty of the test?
Unfortunately, no. Equating works only for two versions of the same exam. As the word “equating” suggests, the process assumes that the exam drafters attempted to test the same knowledge on both versions of the exam. Equating can account for inadvertent fluctuations in difficulty that arise from constructing new questions that test the same knowledge. It cannot, however, account for changes in the content or scope of an exam.
This distinction is widely recognized in the testing literature–I cite numerous sources at the end of this post. It appears, however, that NCBE has attempted to “equate” the scores of the new MBE (with seven subjects) to older versions of the exam (with just six subjects). That approach treated the February 2015 examinees unfairly, leading to lower scores and pass rates.
To understand the problem, let’s first review the process of equating and scaling.
Equating
First, remember why NCBE equates exams. To avoid security breaches, NCBE must produce a different version of the MBE every February and July. Testing experts call these different versions “forms” of the test. For each of the MBE forms, the designers attempt to create questions that impose the same range of difficulty. Inevitably, however, some forms are harder than others. It would be unfair for examinees one year to get lower scores than examinees the next year, simply because they took a harder form of the test. Equating addresses this problem.
The process of equating begins with a set of “control” questions or “common items.” These are questions that appear on two forms of the same exam. The February 2015 MBE, for example, included a subset of questions that had also appeared on some earlier exam. For this discussion, let’s assume that there were 30 of these common items and 160 new questions that counted toward each examinee’s score. (Each MBE also includes 10 experimental questions that do not count toward the test-taker’s score but that help NCBE assess items for future use.)
When NCBE receives answer sheets from each version of the MBE, it is able to assess the examinees’ performance on the common items and new items. Let’s suppose that, on average, earlier examinees got 25 of the 30 common items correct. If the February 2015 test-takers averaged only 20 correct answers to those common items, NCBE would know that those test-takers were less able than previous examinees. That information would then help NCBE evaluate the February test-takers’ performance on the new test items. If the February examinees also performed poorly on those items, NCBE could conclude that the low scores were due to the test-takers’ abilities rather than to a particularly hard version of the test.
Conversely, if the February test-takers did very well on the new items–while faring poorly on the common ones–NCBE would conclude that the new items were easier than questions on earlier tests. The February examinees racked up points on those questions, not because they were better prepared than earlier test-takers, but because the questions were too easy.
The actual equating process is more complicated than this. NCBE, for example, can account for the difficulty of individual questions rather than just the overall difficulty of the common and new items. The heart of equating, however, lies in this use of “common items” to compare performance over time.
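The comparison described above can be sketched in a few lines of Python. This is a toy illustration, not NCBE’s actual procedure, which works at the level of individual questions. The function name, the tolerance, and the figures for the 160 non-common items are assumptions made for illustration; only the 25-of-30 and 20-of-30 common-item averages come from the example above.

```python
def compare_forms(prev_common, curr_common, prev_unique, curr_new,
                  n_common=30, n_unique=160, tolerance=0.03):
    """Toy common-item comparison (hypothetical, not NCBE's method).

    All score arguments are mean numbers of correct answers.
    prev_unique is the earlier group's mean on the non-common
    items of its own form; curr_new is the current group's mean
    on the new items.
    """
    # Gap in proportion correct on the shared questions: reflects the groups.
    ability_gap = (curr_common - prev_common) / n_common
    # Gap in proportion correct on the questions unique to each form.
    form_gap = (curr_new - prev_unique) / n_unique
    # Whatever the ability gap does not explain is attributed to the form.
    difficulty_shift = form_gap - ability_gap
    if difficulty_shift > tolerance:
        return "new items look easier"
    if difficulty_shift < -tolerance:
        return "new items look harder"
    return "forms look similar; score gap reflects the groups"


# Weak on common items AND on new items: the group, not the form.
print(compare_forms(25, 20, 128, 104))
# Weak on common items but strong on new items: the new items were easy.
print(compare_forms(25, 20, 128, 136))
```

The design choice here is to compare proportions rather than raw counts, because the common and new item sets differ in size.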
Scaling
Once NCBE has compared the most recent batch of exam-takers with earlier examinees, it converts the current raw scores to scaled ones. Think of the scaled scores as a rigid yardstick; these scores have the same meaning over time. Eighteen inches this year is the same as eighteen inches last year. In the same way, a scaled score of 136 has the same meaning this year as last year.
How does NCBE translate raw points to scaled scores? The translation depends upon the results of equating. If a group of test-takers performs well on the common items, but not so well on the new questions, the equating process suggests that the new questions were harder than the ones on previous versions of the test. NCBE will “scale up” the raw scores for this group of exam takers to make them comparable to scores earned on earlier versions of the test.
Conversely, if examinees perform well on new questions but poorly on the common items, the equating process will suggest that the new questions were easier than ones on previous versions of the test. NCBE will then scale down the raw scores for this group of examinees. In the end, the scaled scores will account for small differences in test difficulty across otherwise similar forms.
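The direction of that adjustment can be shown with a short sketch. The linear adjustment and every name below are assumptions for illustration; this post does not describe the actual formula NCBE uses to convert raw scores to scaled ones.

```python
def equated_raw_score(raw_score, difficulty_shift, n_scored_items=190):
    """Adjust a raw score for form difficulty (toy linear adjustment).

    difficulty_shift is the estimated difference in proportion correct
    that is attributable to the form itself: negative means the new
    form was harder, positive means it was easier.
    """
    # A harder form (negative shift) adds points; an easier form removes them.
    adjustment = -difficulty_shift * n_scored_items
    return raw_score + adjustment


harder_form = equated_raw_score(120, difficulty_shift=-0.02)  # scaled up
easier_form = equated_raw_score(120, difficulty_shift=0.02)   # scaled down
print(round(harder_form, 1), round(easier_form, 1))
```

Under these assumed numbers, a raw score of 120 moves up by about four points on a slightly harder form and down by about four points on a slightly easier one.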
Changing the Test
Equating and scaling work well for test forms that are designed to be as similar as possible. The processes break down, however, when test content changes. You can see this by thinking about the data that NCBE had available for equating the February 2015 bar exam. It had a set of common items drawn from earlier tests; these would have covered the six original subjects. It also had answers to the new items that counted toward scores (160 under the assumption above); these would have included both the original subjects and the new one (Civil Procedure).
With these data, NCBE could make two comparisons:
1. It could compare performance on the common items. It undoubtedly found that the February 2015 test-takers performed less well than previous test-takers on these items. That’s a predictable result of having a seventh subject to study. This year’s examinees spread their preparation among seven subjects rather than six. Their mastery of each subject was somewhat lower, and they would have performed less well on the common items testing those subjects.
2. NCBE could also compare performance on the new Civil Procedure items with performance on old and new items in other subjects. NCBE won’t release those comparisons, because it no longer discloses raw scores for subject areas. I predict, however, that performance on Civil Procedure items was the same as on Evidence, Property, or other subjects. Why? Because Civil Procedure is not intrinsically harder than these other subjects, and the examinees studied all seven subjects.
Neither of these comparisons, however, would address the key change in the MBE: Examinees had to prepare seven subjects rather than six. As my previous post suggested, this isn’t just a matter of taking all seven subjects in law school and remembering key concepts for the MBE. Because the MBE is a closed-book exam that requires recall of detailed rules, examinees devote 10 weeks of intense study to this exam. They don’t have more than 10 weeks, because they’re occupied with law school classes, extracurricular activities, and part-time jobs before mid-May or mid-December.
There’s only so much material you can cram into memory during ten weeks. If you try to memorize rules from seven subjects, rather than just six, some rules from each subject will fall by the wayside.
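A small sketch shows why the common-item comparison can mislead when the scope of the test changes. It models one fixed cohort under two tests. Everything here is a hypothetical simplification: the function name, the idea that ten weeks of study buys six subjects’ worth of memorization, and the assumption that accuracy in a subject is proportional to the share of its rules retained. The 25-of-30 baseline is borrowed from the equating example earlier in this post.

```python
def expected_common_correct(n_subjects, n_common=30,
                            capacity_in_subjects=6.0, base_rate=25 / 30):
    """Expected correct answers on common items for ONE fixed cohort.

    Hypothetical model: study time covers `capacity_in_subjects`
    subjects' worth of rules, spread evenly over all tested subjects,
    and accuracy is proportional to that coverage.
    """
    coverage = min(1.0, capacity_in_subjects / n_subjects)
    return n_common * base_rate * coverage


six = expected_common_correct(6)    # same cohort, six-subject test
seven = expected_common_correct(7)  # same cohort, seven-subject test
print(round(six, 1), round(seven, 1))
```

The cohort is identical in both calls, yet its expected score on the common items is lower under the seven-subject test. An equating process that reads common-item performance as a measure of ability would label this cohort “less able,” when the only thing that changed was the test.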
When Equating Doesn’t Work
Equating is not possible for a test like the new MBE, which has changed significantly in content and scope. The test places new demands on examinees, and equating cannot account for those demands. The testing literature is clear that, under these circumstances, equating produces misleading results. As Robert L. Brennan, a distinguished testing expert, wrote in a prominent guide: “When substantial changes in test specifications occur, either scores should be reported on a new scale or a clear statement should be provided to alert users that the scores are not directly comparable with those on earlier versions of the test.” (See p. 174 of Linking and Aligning Scores and Scales, cited more fully below.)
“Substantial changes” is one of those phrases that lawyers love to debate. The hypothetical described at the beginning of this post, however, seems like a common-sense way to identify a “substantial change.” If the vast majority of test-takers would prefer one version of a test over a second one, there is a substantial difference between the two.
As Brennan acknowledges in the chapter I quote above, test administrators dislike re-scaling an exam. Re-scaling is both costly and time-consuming. It can also discomfort test-takers and others who use those scores, because they are uncertain how to compare new scores to old ones. But when a test changes, as the MBE did, re-scaling should take the place of equating.
The second best option, as Brennan also notes, is to provide a “clear statement” to “alert users that the scores are not directly comparable with those on earlier versions of the test.” This is what NCBE should do. By claiming that it has equated the February 2015 results to earlier test results, and that the resulting scaled scores represent a uniform level of achievement, NCBE is failing to give test-takers, bar examiners, and the public the information they need to interpret these scores.
The February 2015 MBE was not the same as previous versions of the test, it cannot be properly equated to those tests, and the resulting scaled scores represent a different level of achievement. The lower scaled scores on the February 2015 MBE reflect, at least in part, a harder test. To the extent that the test-takers also differed from previous examinees, it is impossible to separate that variation from the difference in the tests themselves.
Conclusion
Equating was designed to detect small, unintended differences in test difficulty. It is not appropriate for comparing a revised test to previous versions of that test. In my next post on this issue, I will discuss further ramifications of the recent change in the MBE. Meanwhile, here is an annotated list of sources related to equating:
Michael T. Kane & Andrew Mroch, Equating the MBE, The Bar Examiner, Aug. 2005, at 22. This article, published in NCBE’s magazine, offers an overview of equating and scaling for the MBE.
Neil J. Dorans, et al., Linking and Aligning Scores and Scales (2007). This is one of the classic works on equating and scaling. Chapters 7-9 deal specifically with the problem of test changes. Although I’ve linked to the Amazon page, most university libraries should have this book. My library has the book in electronic form so that it can be read online.
Michael J. Kolen & Robert L. Brennan, Test Equating, Scaling, and Linking: Methods and Practices (3d ed. 2014). This is another standard reference work in the field. Once again, my library has a copy online; check for a similar ebook at your institution.
CCSSO, A Practitioner’s Introduction to Equating. This guide was prepared by the Council of Chief State School Officers to help teachers, principals, and superintendents understand the equating of high-stakes exams. It is written for educated lay people, rather than experts, so it offers a good introduction. The source is publicly available at the link.
States have started to release results of the February 2015 bar exam, and Derek Muller has helpfully compiled the reports to date. Muller also uncovered the national mean scaled score for this February’s MBE, which was just 136.2. That’s a notable drop from last February’s mean of 138.0. It’s also lower than all but one of the means reported during the last decade; Muller has a nice graph of the scores.
The latest drop in MBE scores, unfortunately, was completely predictable–and not primarily because of a change in the test takers. I hope that Jerry Organ will provide further analysis of the latter possibility soon. Meanwhile, the expected drop in the February MBE scores can be summed up in five words: seven subjects instead of six. I don’t know how much the test-takers changed in February, but the test itself did.
MBE Subjects
For reasons I’ve explained in a previous post, the MBE is the central component of the bar exam. In addition to contributing a substantial amount to each test-taker’s score, the MBE is used to scale answers to both essay questions and the Multistate Performance Test (MPT). The scaling process amplifies any drop in MBE scores, leading to substantial drops in pass rates.
In February 2015, the MBE changed. For more than four decades, that test had covered six subjects: Contracts, Torts, Criminal Law and Procedure, Constitutional Law, Property, and Evidence. Starting with the February 2015 exam, the National Conference of Bar Examiners (NCBE) added a seventh subject, Civil Procedure.
Testing examinees’ knowledge of Civil Procedure is not itself problematic; law students study that subject along with the others tested on the exam. In fact, I suspect more students take a course in Civil Procedure than in Criminal Procedure. The difficulty is that it’s harder to memorize rules drawn from seven subjects than to learn the rules for six. For those who like math, that’s an increase of 16.7% in the body of knowledge tested.
Despite occasional claims to the contrary, the MBE requires lots of memorization. It’s not solely a test of memorization; the exam also tests issue spotting, application of law to fact, and other facets of legal reasoning. Test-takers, however, can’t display those reasoning abilities unless they remember the applicable rules: the MBE is a closed-book test.
There is no other context, in school or practice, where we expect lawyers to remember so many legal principles without reference to codes, cases, and other legal materials. Some law school exams are closed-book, but they cover a single subject that has just been studied for a semester. The “closed book” moments in practice are much fewer than many observers assume. I don’t know any trial lawyers who enter the courtroom without a copy of the rules of evidence and a personalized cribsheet reminding them of common objections and responses.
This critique of the bar exam is well known. I repeat it here only to stress the impact of expanding the MBE’s scope. February’s test takers answered the same number of multiple choice questions (190 that counted, plus 10 experimental ones) but they had to remember principles from seven fields of law rather than six.
There’s only so much that the brain can hold in memory–especially when the knowledge is abstract, rather than gained from years of real-client experience. I’ve watched many graduates prepare for the bar over the last decade: they sit in our law library or clinic, poring constantly over flash cards and subject outlines. Since states raised passing scores in the 1990s and early 2000s, examinees have had to memorize many more rules in order to answer enough questions correctly. From my observation, their memory banks were already full to overflowing.
Six to Seven Subjects
What happens, then, when the bar examiners add a seventh subject to an already challenging test? Correct answers will decline, not just in the new subject, but across all subjects. The February 2015 test-takers, I’m sure, studied just as hard as previous examinees. Indeed, they probably studied harder, because they knew that they would have to answer questions drawn from seven bodies of legal knowledge rather than six. But their memories could hold only so much information. Memorized rules of Civil Procedure took the place of some rules of Torts, Contracts, or Property.
Remember that the MBE tests only a fraction of the material that test-takers must learn. It’s not a matter of learning 190 legal principles to answer 190 questions. The universe of testable material is enormous. For Evidence, a subject that I teach, the subject matter outline lists 64 distinct topics. On average, I estimate that each of those topics requires knowledge of three distinct rules to answer questions correctly on the MBE–and that’s my most conservative estimate.
It’s not enough, for example, to know that there’s a hearsay exemption for some prior statements by a witness, and that the exemption allows the fact-finder to use a witness’s out-of-court statements for substantive purposes, rather than merely impeachment. That’s the type of general understanding I would expect a new lawyer to have about Evidence, permitting her to research an issue further if it arose in a case. The MBE, however, requires the test-taker to remember that a grand jury session counts as a “proceeding” for purposes of this exemption (see Q 19). That’s a sub-rule fairly far down the chain. In fact, I confess that I had to check my own book to refresh my recollection.
In any event, if Evidence requires mastering 200 sub-principles of this detail, and the same is true of the other five traditional MBE subjects, that was 1200 very specific rules to memorize and keep in memory–all while trying to apply those rules to new fact patterns. Adding a seventh subject upped the ante to 1400 or more detailed rules. How many things can one test-taker remember without checking a written source? There’s a reason why humanity invented writing, printing, and computers.
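For readers who want to check the arithmetic, here is a short sketch. The variable names are mine; the inputs (64 topics, three rules per topic, a rounded 200 rules per subject) are the estimates given above, and they remain estimates rather than measured counts.

```python
TOPICS_IN_EVIDENCE = 64    # topics listed in the Evidence subject matter outline
RULES_PER_TOPIC = 3        # the conservative per-topic estimate from the text
RULES_PER_SUBJECT = 200    # the rounded per-subject figure used in the text

evidence_rules = TOPICS_IN_EVIDENCE * RULES_PER_TOPIC  # 192, rounded up to 200
rules_six = 6 * RULES_PER_SUBJECT                       # six-subject MBE
rules_seven = 7 * RULES_PER_SUBJECT                     # seven-subject MBE
increase = (rules_seven - rules_six) / rules_six        # relative growth

print(evidence_rules, rules_six, rules_seven, f"{increase:.1%}")
```

The relative growth matches the 16.7% figure mentioned earlier: one extra subject on a base of six.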
But They Already Studied Civil Procedure
Even before February, all jurisdictions (to my knowledge) tested Civil Procedure on their essay exams. So wouldn’t examinees have already studied those Civ Pro principles? No, not in the same manner. Detailed, comprehensive memorization is more necessary for the MBE than for traditional essays.
An essay allows room to display issue spotting and legal reasoning, even if you get one of the sub-rules wrong. In the Evidence example given above, an examinee could display considerable knowledge by identifying the issue, noting the relevant hearsay exemption, and explaining the impact of admissibility (substantive use rather than simply impeachment). If the examinee didn’t remember the correct status of grand jury proceedings under this particular rule, she would lose some points. She wouldn’t, however, get the whole question wrong–as she would on a multiple-choice question.
Adding a new subject to the MBE hit test-takers where they were already hurting: the need to memorize a large number of rules and sub-rules. By expanding the universe of rules to be memorized, NCBE made the exam considerably harder.
Looking Ahead
In upcoming posts, I will explain why NCBE’s equating/scaling process couldn’t account for the increased difficulty of this exam. Indeed, equating and scaling may have made the impact worse. I’ll also explore what this means for the ExamSoft discussion and what (if anything) legal educators might do about the increased difficulty of the MBE. To start the discussion, however, it’s essential to recognize that enhanced level of difficulty.
I recently found a letter that Erica Moeser, President of the National Conference of Bar Examiners (NCBE), wrote to law school deans in mid-December. The letter responds to a formal request, signed by 79 law school deans, that NCBE “facilitate a thorough investigation of the administration and scoring of the July 2014 bar exam.” That exam suffered from the notorious ExamSoft debacle.
Moeser’s letter makes an interesting distinction. She assures the deans that NCBE has “reviewed and re-reviewed” its scoring, equating, and scaling of the July 2014 MBE. Those reviews, Moeser attests, revealed no flaw in NCBE’s process. She then adds that, to the extent the deans are concerned about “administration” of the exam, they should “note that NCBE does not administer the examination; jurisdictions do.”
Moeser doesn’t mention ExamSoft by name, but her message seems clear: If ExamSoft’s massive failure affected examinees’ performance, that’s not our problem. We take the bubble sheets as they come to us, grade them, equate the scores, scale those scores, and return the numbers to the states. It’s all the same to NCBE if examinees miss points because they failed to study, law schools taught them poorly, or they were groggy and stressed from struggling to upload their essay exams. We only score exams, we don’t administer them.
But is the line between administration and scoring so clear?
The Purpose of Equating
In an earlier post, I described the process of equating and scaling that NCBE uses to produce final MBE scores. The elaborate transformation of raw scores has one purpose: “to ensure consistency and fairness across the different MBE forms given on different test dates.”
NCBE thinks of this consistency with respect to its own test questions; it wants to ensure that some test-takers aren’t burdened with an overly difficult set of questions–or conversely, that other examinees don’t benefit from unduly easy questions. But substantial changes in exam conditions, like the ExamSoft crash, can also make an exam more difficult. If they do, NCBE’s equating and scaling process actually amplifies that unfairness.
To remain faithful to its mission, it seems that NCBE should at least explore the possible effects of major blunders in exam administration. This is especially true when a problem affects multiple jurisdictions, rather than a single state. If an incident affects a single jurisdiction, the examining authorities in that state can decide whether to adjust scores for that exam. When the problem is more diffuse, as with the ExamSoft failure, individual states may not have the information necessary to assess the extent of the impact. That’s an even greater concern when nationwide equating will spread the problem to states that did not even contract with ExamSoft.
What Should NCBE Have Done?
NCBE did not cause ExamSoft’s upload problems, but it almost certainly knew about them. Experts in exam scoring also understand that defects in exam administration can interfere with performance. With knowledge of the ExamSoft problem, NCBE had the ability to examine raw scores for the extent of the ExamSoft effect. Exploration would have been most effective with cooperation from ExamSoft itself, revealing which states suffered major upload problems and which ones experienced more minor interference. But even without that information, NCBE could have explored the raw scores for indications of whether test takers were “less able” in ExamSoft states.
If NCBE had found a problem, there would have been time to consult with bar examiners about possible solutions. At the very least, NCBE probably should have adjusted its scaling to reflect the fact that some of the decrease in raw scores stemmed from the software crash rather than from other changes in test-taker ability. With enough data, NCBE might have been able to quantify those effects fairly precisely.
Maybe NCBE did, in fact, do those things. Its public pronouncements, however, have not suggested any such process. On the contrary, Moeser seems to studiously avoid mentioning ExamSoft. This reveals an even deeper problem: we have a high-stakes exam for which responsibility is badly fragmented.
Who Do You Call?
Imagine yourself as a test-taker on July 29, 2014. You’ve been trying for several hours to upload your essay exam, without success. You’ve tried calling ExamSoft’s customer service line, but can’t get through. You’re worried that you’ll fail the exam if you don’t upload the essays on time, and you’re also worried that you won’t be sufficiently rested for the next day’s MBE. Who do you call?
You can’t call the state bar examiners; they don’t have an after-hours call line. If they did, they probably would reassure you on the first question, telling you that they would extend the deadline for submitting essay answers. (This is, in fact, what many affected states did.) But they wouldn’t have much to offer on the second question, about getting back on track for the next day’s MBE. Some state examiners don’t fully understand NCBE’s equating and scaling process; those examiners might even erroneously tell you “not to worry because everyone is in the same boat.”
NCBE wouldn’t be any more help. They, as Moeser pointed out, don’t actually administer exams; they just create and score them.
Many distressed examinees called law school staff members who had helped them prepare for the bar. Those staff members, in turn, called their deans–who contacted NCBE and state bar examiners. As Moeser’s letters indicate, however, bar examiners view deans with some suspicion. The deans, they believe, are too quick to advocate for their graduates and too worried about their own bar pass rates.
As NCBE and bar examiners refused to respond, or shifted responsibility to the other party, we reached a stand-off: no one was willing to take responsibility for flaws in a very high-stakes test administered to more than 50,000 examinees. That is a failure as great as the ExamSoft crash itself.
Cafe Manager & Co-Moderator
Deborah J. Merritt
Cafe Designer & Co-Moderator
Kyle McEntee