The psychometric literature offers a number of acceptable methodologies for standard setting, the process of establishing a cutscore (also known as a passing point). Examples include Angoff, modified-Angoff, Bookmark, Contrasting Groups, and Borderline. The modified-Angoff approach is by far the most commonly used, yet it remains a black box to many professionals in the testing industry, especially non-psychometricians in the credentialing field. This post aims to demystify it. There is some flexibility in how the study is implemented, but this article describes a sound method.
What to Expect with the Modified-Angoff Approach
First of all, do not expect a straightforward, easy process that leads to an unassailably correct cutscore. All standard setting methods involve some degree of subjectivity. The goal of the methods is to reduce that subjectivity as much as possible. Some methods focus on content, others on data, while some try to meld the two.
Step 1: Prepare Your Team
The modified-Angoff process depends on a representative sample of subject matter experts (SMEs), usually 6-20. By “representative” I mean they should represent the various stakeholders. A certification for medical assistants might include experienced medical assistants, nurses, and physicians, from different areas of the country. You must train them about their role and how the process works, so they can understand the end goal and drive toward it.
Step 2: The Minimally Competent Candidate (MCC)
This concept is the core of the Angoff process, though it is known by a range of terms or acronyms, including minimally qualified candidate (MQC) or just barely qualified (JBQ). The reasoning is that we want our exam to separate candidates that are qualified from those that are not. So we ask the SMEs to define what makes someone qualified (or unqualified!) from a perspective of skills and knowledge. This leads to a conceptual definition of an MCC. We then want to estimate what score this borderline candidate would achieve, which is the goal of the remainder of the study. This step can be conducted in person, or via webinar.
Step 3: Round 1 Ratings
Next, ask your SMEs to read through all the items on your test form and estimate the percentage of MCCs that would answer each correctly. A rating of 100 means the item is a slam dunk; it is so easy that every MCC would get it right. A rating of 40 means the item is very difficult. Most ratings are in the 60-90 range if the items are well-developed. The ratings should be gathered independently; if everyone is in the same room, let them work on their own in silence. This step can also easily be conducted remotely.
Step 4: Discussion
This is where it gets fun. Identify the items where there is the most disagreement (as defined by grouped frequency distributions or standard deviation) and have the SMEs discuss them. Maybe two SMEs thought an item was super easy and gave it a 95, and two other SMEs thought it was super hard and gave it a 45. They will try to convince the other side of their folly. Chances are that there will be no shortage of opinions and you, as the facilitator, will find your greatest challenge is keeping the meeting on track. This step can be conducted in person, or via webinar.
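To make the "most disagreement" idea concrete, here is a minimal sketch that ranks items by the standard deviation of their Round 1 ratings. The ratings, the item names, and the threshold of 10 points are all made up for illustration; pick a threshold that suits your own panel.

```python
from statistics import mean, stdev

# Hypothetical Round 1 ratings (0-100): one list per item, one value per SME.
ratings = {
    "item_01": [95, 45, 95, 45, 70, 75],
    "item_02": [80, 85, 75, 80, 85, 80],
    "item_03": [60, 70, 65, 90, 55, 60],
}

def flag_for_discussion(ratings, sd_threshold=10.0):
    """Return (item, mean, sd) for items whose rating SD exceeds the threshold,
    sorted so the most contentious item comes first.
    The threshold is an arbitrary illustrative value, not a published standard."""
    flagged = []
    for item, values in ratings.items():
        sd = stdev(values)  # sample standard deviation across SMEs
        if sd > sd_threshold:
            flagged.append((item, mean(values), sd))
    return sorted(flagged, key=lambda row: row[2], reverse=True)

for item, avg, sd in flag_for_discussion(ratings):
    print(f"{item}: mean={avg:.1f}, sd={sd:.1f}")
```

With these numbers, item_01 (the 95 vs. 45 split) tops the list and item_02 is left alone, which is the kind of triage you want before the discussion starts.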
Step 5: Round 2 Ratings
Raters then re-rate the items based on the discussion. The goal is that there will be greater consensus. In the previous example, it’s not likely that every rater will settle on a 70. But if your raters all end up somewhere between 60 and 80, that’s OK. How do you know there is enough consensus? We recommend the intraclass correlation, a measure of inter-rater reliability described by Shrout and Fleiss (1979).
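As a rough sketch of that reliability check, the function below computes two of the intraclass correlations defined by Shrout and Fleiss (1979) from a two-way ANOVA decomposition: ICC(2,1) for a single rater and ICC(2,k) for the average of k raters. Choosing the two-way random effects form is an assumption on my part, and the ratings are invented.

```python
def icc_two_way(ratings):
    """ICC(2,1) and ICC(2,k), two-way random effects (Shrout & Fleiss, 1979).

    ratings: list of rows, one row per item; each row has one rating per rater.
    Returns (icc_single, icc_average).
    """
    n = len(ratings)        # number of items (targets)
    k = len(ratings[0])     # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for items (rows), raters (columns), and residual error.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-items mean square
    msc = ss_cols / (k - 1)                 # between-raters mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square

    icc_single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc_average = (msr - mse) / (msr + (msc - mse) / n)
    return icc_single, icc_average

# Hypothetical ratings for four items from four SMEs, before and after discussion.
round1 = [[95, 45, 95, 45], [80, 85, 75, 80], [60, 70, 65, 90], [70, 60, 75, 65]]
round2 = [[75, 65, 80, 70], [80, 85, 80, 80], [65, 70, 65, 75], [70, 65, 70, 65]]

for label, data in (("Round 1", round1), ("Round 2", round2)):
    single, average = icc_two_way(data)
    print(f"{label}: ICC(2,1)={single:.2f}, ICC(2,k)={average:.2f}")
```

Comparing the Round 1 and Round 2 values side by side is one way to answer the "did the discussion help?" question with a number rather than a feeling.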
Step 6: Evaluate Results and Final Recommendation
Evaluate the results from Round 2 as well as Round 1. An example of this is below. What is the recommended cutscore? It is the average of the Angoff ratings if you prefer a percent-correct scale, or the sum of the item averages if you prefer a raw-score scale. Did the reliability improve? Estimate the mean and SD of examinee scores (there are several methods for this). What sort of pass rate do you expect? Even better, utilize the Beuk Compromise as a “reality check” between the modified-Angoff approach and actual test data. You should take multiple points of view into account, and the SMEs need to vote on a final recommendation. They, of course, know the material and the candidates, so they have the final say. This means that standard setting is a political process; again, reduce that effect as much as you can.
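The arithmetic behind the recommended cutscore and a ballpark pass rate can be sketched in a few lines. Everything here is illustrative: the ratings are invented, one point per item is assumed, and the pass rate assumes examinee scores are roughly normally distributed, which is only an approximation for a short test.

```python
from statistics import NormalDist, mean

# Hypothetical Round 2 ratings (0-100): one row per item, one column per SME.
round2 = [
    [70, 75, 80, 70],
    [60, 65, 60, 55],
    [90, 85, 90, 95],
    [75, 70, 80, 75],
]

def angoff_cutscore(ratings):
    """Average the ratings within each item, then combine across items.

    Returns (raw_cut, percent_cut): the sum of item averages in raw points
    (assuming one point per item) and the same value as percent correct.
    """
    item_means = [mean(row) for row in ratings]
    raw_cut = sum(m / 100 for m in item_means)
    percent_cut = mean(item_means)
    return raw_cut, percent_cut

def expected_pass_rate(cut, score_mean, score_sd):
    """Share of examinees scoring at or above the cut under a normal approximation."""
    return 1 - NormalDist(score_mean, score_sd).cdf(cut)

raw_cut, percent_cut = angoff_cutscore(round2)
print(f"Raw cutscore: {raw_cut:.2f} of {len(round2)} points ({percent_cut:.1f}%)")

# Hypothetical examinee score distribution on the percent-correct scale.
print(f"Expected pass rate: {expected_pass_rate(percent_cut, 78.0, 9.0):.0%}")
```

If the expected pass rate looks wildly out of line with what the SMEs believe about the candidate population, that is a signal to bring the discrepancy back to the panel before the final vote.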
Step 7: Write Up Your Report
Validity refers to evidence gathered to support test score interpretations. Well, you have lots of relevant evidence here. Document it. If your test gets challenged, you’ll have all this in place. On the other hand, if you just picked 70% as your cutscore because it was a nice round number, you could be in trouble.
In some situations, there are more issues to worry about. Multiple forms? You’ll need to equate in some way. Using item response theory? You’ll have to convert the Angoff-recommended cutscore onto the theta metric using the Test Response Function (TRF). New credential and no data available? That’s a real chicken-and-egg problem there.
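For the IRT case, converting a raw cutscore to theta amounts to finding the theta at which the TRF equals that raw score. The sketch below assumes the three-parameter logistic model with no scaling constant and uses invented item parameters; your calibration software may use a different model or metric.

```python
import math

# Hypothetical 3PL item parameters: (a = discrimination, b = difficulty, c = guessing).
items = [
    (1.0, -1.0, 0.20),
    (0.8, 0.0, 0.25),
    (1.2, 0.5, 0.20),
    (0.9, 1.0, 0.25),
]

def trf(theta, items):
    """Test response function: expected raw score at theta (sum of item probabilities)."""
    return sum(c + (1 - c) / (1 + math.exp(-a * (theta - b))) for a, b, c in items)

def theta_for_raw_cut(raw_cut, items, lo=-6.0, hi=6.0, tol=1e-6):
    """Find the theta where TRF(theta) equals raw_cut, using bisection.

    Relies on the TRF increasing with theta, which holds when every a is positive.
    """
    if not trf(lo, items) <= raw_cut <= trf(hi, items):
        raise ValueError("raw cutscore is outside the range the TRF can reach")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if trf(mid, items) < raw_cut:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Hypothetical Angoff-recommended raw cutscore of 2.6 points out of 4.
theta_cut = theta_for_raw_cut(2.6, items)
print(f"Theta cutscore: {theta_cut:.3f}")
```

Note the range check: because of guessing, the TRF never drops below the sum of the c parameters, so a very low raw cutscore has no theta equivalent at all.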
Where Do I Go From Here?
Ready to take the next step and actually apply the modified-Angoff process to improving your exams? Download our free Angoff Analysis Tool.
Want to go even further and implement automation in your Angoff study? Sign up for a free account in our FastTest item banker.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
A standard-setting study is an official research study conducted by an organization that sponsors tests to determine a cutscore for the test. To be legally defensible in the US, in particular for high-stakes assessments, and to meet the Standards for Educational and Psychological Testing, a cutscore cannot be arbitrarily determined; it must be empirically justified. For example, the organization cannot merely decide that the cutscore will be 70% correct. Instead, a study is conducted to determine what score best differentiates the classifications of examinees, such as competent vs. incompetent. Such studies require considerable resources and involve a number of professionals, particularly those with a psychometric background. Standard-setting studies are for that reason impractical for regular classroom situations, yet standard setting is performed at every level of education, and multiple methods exist.
Standard-setting studies are typically performed using focus groups of 5-15 subject matter experts that represent key stakeholders for the test. For example, in setting cut scores for educational testing, experts might be instructors familiar with the capabilities of the student population for the test.
Types of standard-setting studies
Standard-setting studies fall into two categories, item-centered and person-centered. Examples of item-centered methods include the Angoff, Ebel, Nedelsky, and Bookmark methods, while examples of person-centered methods include the Borderline Survey and Contrasting Groups approaches. These are so categorized by the focus of the analysis; in item-centered studies, the organization evaluates items with respect to a given population of persons, and vice versa for person-centered studies.
Item-centered studies are related to criterion-referenced tests and to norm-referenced tests.
- Angoff Method (item centered): This method requires the assembly of a group of subject matter experts, who are asked to evaluate each item and estimate the proportion of minimally competent examinees that would correctly answer the item. The ratings are averaged across raters for each item and then summed to obtain a panel-recommended raw cutscore. This cutscore then represents the score which the panel estimates a minimally competent candidate would get. This is of course subject to decision biases such as the overconfidence bias. Calibration with other, more objective, sources of data is preferable. Several variants of the method exist.
- Modified Angoff Method (item-centered): Subject matter experts (SMEs) are generally briefed on the Angoff method and allowed to take the test with the performance levels in mind. SMEs are then asked to provide estimates for each question of the proportion of borderline or “minimally acceptable” participants that they would expect to get the question correct. The estimates are generally in p-value type form (e.g., 0.6 for item 1: 60% of borderline passing participants would get this question correct). Several rounds are generally conducted with SMEs allowed to modify their estimates given different types of information (e.g., actual participant performance information on each question, other SME estimates, etc.). The final determination of the cut score is then made (e.g., by averaging estimates or taking the median). This method is generally used with multiple-choice questions.
- Dichotomous Modified Angoff Method (item-centered): In the dichotomous modified Angoff approach, instead of using difficulty level type statistics (typically p-values), SMEs are asked to simply provide a 0/1 for each question (“0” if a borderline acceptable participant would get the question wrong and “1” if a borderline acceptable participant would get the item right).
- Nedelsky Method (item-centered): SMEs make decisions on a question-by-question basis regarding which of the question distracters they feel borderline participants would be able to eliminate as incorrect. This method is generally used with multiple-choice questions only.
- Bookmark Method (item-centered): Items in a test (or a representative subset of items) are ordered by difficulty (e.g., IRT response probability value) from easiest to hardest. SMEs place a "bookmark" in the "ordered item booklet" such that a student at the threshold of a performance level would be expected to respond successfully to the items prior to the bookmark with a likelihood equal to or greater than the specified response probability value (and with a likelihood less than that value for items after the bookmark). For example, for a response probability of .67 (RP67), SMEs would place a bookmark such that an examinee at the threshold of the performance level would have at least a 2/3 likelihood of success on items prior to the bookmark and less than a 2/3 likelihood of success on the items after the bookmark. This method is considered efficient with respect to setting multiple cut scores on a single test and can be used with tests composed of multiple item types (e.g., multiple-choice, constructed response, etc.).
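Of the item-centered methods above, the Nedelsky method has the simplest arithmetic, so it makes a compact illustration. In Nedelsky's procedure, the borderline participant is assumed to guess at random among the options that were not eliminated, so each item contributes the reciprocal of the number of remaining options. The judgments below are hypothetical and come from a single SME; a panel result would typically average these totals across SMEs.

```python
# Hypothetical judgments from one SME: for each item, the total number of answer
# options and the number of distractors a borderline participant could eliminate.
judgments = [
    {"options": 4, "eliminated": 2},
    {"options": 4, "eliminated": 3},
    {"options": 5, "eliminated": 1},
    {"options": 4, "eliminated": 0},
]

def nedelsky_cutscore(judgments):
    """Sum over items of 1 / (options the borderline participant cannot eliminate)."""
    total = 0.0
    for j in judgments:
        remaining = j["options"] - j["eliminated"]
        if remaining < 1:
            raise ValueError("the correct answer can never be eliminated")
        total += 1 / remaining
    return total

# 1/2 + 1/1 + 1/4 + 1/4 = 2.0 expected points out of 4 items.
print(nedelsky_cutscore(judgments))
```

The dichotomous modified Angoff calculation is simpler still: sum each SME's 0/1 judgments and average those sums across the panel.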
Rather than the items that distinguish competent candidates, person-centered studies evaluate the examinees themselves. While this might seem more appropriate, it is often more difficult because examinees are not a captive population, as is a list of items. For example, if a new test comes out regarding new content (as often happens in information technology tests), the test could be given to an initial sample called a beta sample, along with a survey of professional characteristics. The testing organization could then analyze and evaluate the relationship between the test scores and important statistics, such as skills, education, and experience. The cutscore could be set as the score that best differentiates between those examinees characterized as "passing" and those as "failing."
- Borderline groups method (person-centered): A description is prepared for each performance category. SMEs are asked to submit a list of participants whose performance on the test should be close to the performance standard (borderline). The test is administered to these borderline groups and the median test score is used as the cut score. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
- Contrasting groups method (person-centered): SMEs are asked to categorize the participants in their classes according to the performance category descriptions. The test is administered to all of the categorized participants and the test score distributions for each of the categorized groups are compared. Where the distributions of the contrasting groups intersect is where the cut score would be located. This method can be used with virtually any question type (e.g., multiple-choice, multiple response, essay, etc.).
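Both person-centered calculations are short enough to sketch. The borderline cut is the median score of the borderline group. For contrasting groups, this sketch fits a normal distribution to each group and searches between the two means for the score where the fitted density curves cross. Fitting normal curves is an assumption made here for simplicity (smoothed empirical distributions are another option), and all scores are invented.

```python
from statistics import NormalDist, mean, median, stdev

# Hypothetical test scores for each group of categorized participants.
borderline = [68, 71, 70, 74, 66, 72, 69]
failing = [52, 58, 61, 55, 63, 60, 57, 65]
passing = [72, 78, 81, 75, 84, 70, 79, 76]

def borderline_cut(scores):
    """Borderline groups method: the median score of the borderline group."""
    return median(scores)

def contrasting_groups_cut(low_scores, high_scores, tol=1e-6):
    """Score between the group means where the two fitted normal densities intersect."""
    low = NormalDist(mean(low_scores), stdev(low_scores))
    high = NormalDist(mean(high_scores), stdev(high_scores))

    def diff(x):
        return low.pdf(x) - high.pdf(x)

    lo, hi = low.mean, high.mean
    # Bisection needs the low group to dominate at its own mean and the high
    # group to dominate at its own mean; otherwise there is no clean crossing.
    if lo >= hi or diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("groups do not separate cleanly between their means")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(f"Borderline groups cut: {borderline_cut(borderline)}")
print(f"Contrasting groups cut: {contrasting_groups_cut(failing, passing):.1f}")
```

When the two groups have equal spread, the crossing point is simply the midpoint of the two means; unequal spreads pull it toward the tighter group.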
- Nedelsky, L. (1954). Absolute grading standards for objective tests. Educational and Psychological Measurement, 14, 3–19.
- Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting Performance Standards, pp. 19–52. Mahwah, NJ: Lawrence Erlbaum Associates.
- Lewis, D. M., Mitzel, H. C., & Green, D. R. (June, 1996). Standard Setting: A Bookmark Approach. In D. R. Green (Chair), IRT-Based Standard-Setting Procedures Utilizing Behavioral Anchoring. Paper presented at the 1996 Council of Chief State School Officers National Conference on Large Scale Assessment, Phoenix, AZ.
- Mitzel, H. C., Lewis, D. M., Patz, R. J., & Green, D. R. (2000). The Bookmark Procedure: Cognitive Perspectives on Standard Setting. In G. J. Cizek (Ed.), Setting Performance Standards: Concepts, Methods, and Perspectives. Mahwah, NJ: Lawrence Erlbaum Associates.
- Lewis, D. M., Mitzel, H. C., Mercado, R. L., & Schulz, E. M. (2012). The Bookmark Standard Setting Procedure. In G. J. Cizek (Ed.), Setting Performance Standards: Foundations, Methods, and Innovations, Second Edition. Mahwah, NJ: Lawrence Erlbaum Associates.