Issues in Assessment

Assessment results provide insights about what students know and can do. When created thoughtfully, assessments can offer objective, meaningful feedback that supports student learning, and different types of assessments address different types of questions.

Our assessment experts engage in basic and applied research about innovations in testing and measurement to address real-world measurement challenges and produce useful information for guiding educational decisions.

Technological Implications for Assessment Ecosystems: Opportunities for Digital Technology to Advance Assessment

Assessment in the age of the digital revolution requires us to rethink many of our traditional assumptions about what assessment is and does. This paper, written for the Gordon Commission on the Future of Assessment, details three major shifts in thinking about assessment.

Download: "Technological Implications for Assessment Ecosystems: Opportunities for Digital Technology to Advance Assessment"

Assessing 21st Century Skills: Integrating Research Findings

This paper synthesizes research evidence pertaining to several 21st century skills: critical thinking, creativity, collaboration, metacognition, and motivation.

Download: "Assessing 21st Century Skills: Integrating Research Findings"

Considerations for Performance Scoring When Designing and Developing Next Generation Assessments

To truly measure college and career readiness, next generation assessments must include an array of item types; however, item and task complexity will affect the entire assessment process, influencing cost, turnaround times, and ultimately the feasibility of the tests. This white paper explores the interactions between test design and scoring approach, and the implications for performance scoring quality, cost, and efficiency.

Download: "Considerations for Performance Scoring When Designing and Developing Next Generation Assessments"

Assessing Language Performance with the Tablet English Language Learner Assessment (TELL)

New technologies in touch-tablet computing enable integrated combinations of language activities (listening, speaking, reading, watching, writing, manipulating) in engaging environments for students. This paper presents the development and piloting of a formative assessment with 25 different English language activities implemented on an iPad. The performance tasks integrate combinations of the four language skills with nonlinguistic skills.

Download: "Assessing Language Performance with the Tablet English Language Learner Assessment (TELL)"

Evidence-Based Standard Setting: Vertically Aligning Grades 3–8 Assessments

Evidence-Based Standard Setting (EBSS) has been previously used to support alignment of high school assessments to postsecondary expectations. This study presents an extension of the EBSS process to grades 3–8 assessments using the State of Texas Assessments of Academic Readiness (STAAR).

Download: "Evidence-Based Standard Setting: Vertically Aligning Grades 3–8 Assessments"

Lessons Learned: Decision Points for Evidence-Based Standard Setting

This paper discusses the decision points that testing programs face as they implement the evidence-based standard-setting (EBSS) approach to establish performance standards for their assessments. As with any standard-setting process, EBSS requires multiple decisions to be made in order to produce reasonable results, and those decisions may vary depending on the specific needs of the testing program. Lessons learned from implementations of EBSS are synthesized and compared in four key areas.

Download: "Lessons Learned: Decision Points for Evidence-Based Standard Setting"

An Example of Evidence-Based Standard Setting for an English Language Proficiency Test

Evidence-Based Standard Setting (EBSS) uses empirical data as a key component of the standard-setting process. This paper demonstrates how EBSS was used to recommend performance standards on an English language proficiency test, the Texas English Language Proficiency Assessment System (TELPAS) reading test. The empirical data were selected to validate claims about the characteristics of students at each proficiency level. Specifically, studies were chosen to evaluate whether students scoring in the highest proficiency level on the English language proficiency test in reading would be successful on the state academic reading test, the State of Texas Assessments of Academic Readiness (STAAR), after an additional year of instruction.

Download: "An Example of Evidence-Based Standard Setting for an English Language Proficiency Test"

Standard Setting for a Common Core Aligned Assessment

This paper discusses an implementation of Evidence-Based Standard Setting (EBSS) for Common Core-aligned assessments in grades 3–8 English language arts and mathematics using external benchmark data from the SAT, PSAT, and NAEP. Policy considerations took into account the expectation that students leave high school ready for college or career, and data about current levels of college readiness in the state were shown to panelists in support of this intended policy inference. Panelists were also shown impact data associated with the external benchmarks and were asked to provide a range of expected impact data for the proficient cut score. Finally, panelists were given a range of bookmark placements for the proficient cut that would align well with these external benchmarks.

Download: "Standard Setting for a Common Core Aligned Assessment"

Evidence-Based Standard Setting: Establishing Cut Scores by Integrating Research Evidence with Expert Content Judgments

This bulletin describes the processes and practices associated with Evidence-Based Standard Setting (EBSS) for assessments. This standard-setting approach combines expert content judgment with results from empirical research studies.

Download: "Evidence-Based Standard Setting: Establishing Cut Scores by Integrating Research Evidence with Expert Content Judgments"

Conceptual Frameworks for Reporting Results of Assessment Activities

This paper discusses three lenses for understanding the interaction between the goals of assessment developers and those of assessment stakeholders: information communication, social activity, and educational literacy. Information communication frames the designer-stakeholder interaction as a knowledge-transfer problem focused on transmitting undistorted messages from sender to receiver. There has been growing interest and an accumulation of recommendations in this area, and we propose some language for organizing the various recommendations. The social activity lens recommends activity theory as a theoretical framework with which to approach the design of assessment systems. This approach takes into account the range of social roles and epistemic frames that individuals and groups may bring to their interaction with the assessment system.

Download: "Conceptual Frameworks for Reporting Results of Assessment Activities"

A Universal Design for Learning-Based Framework for Designing Accessible Technology-Enhanced Assessments

Digital technologies offer new opportunities to evaluate students’ deeper knowledge and skills, including constructs that are difficult to measure using traditional methods. Such assessments can incorporate tools and interfaces that improve accessibility for diverse students, but they can also inadvertently introduce new accessibility barriers. Designing these technology-enhanced tasks according to universal design principles is one way to address these concerns, but doing so requires a grounded understanding of students’ diverse abilities and the ways they interact with the tasks. A thorough consideration of the factors that affect construct validity, with an emphasis on identifying and eliminating sources of construct-irrelevant variance, is essential to this process. This report proposes a framework based on the principles of Universal Design for Learning (UDL) for defining task design guidelines that are consistent with the goals of universal design and thus accessible to a wide range of students, including students with disabilities and English learners.

Download: "A Universal Design for Learning-Based Framework for Designing Accessible Technology-Enhanced Assessments"

The Role of Formalized Tools in Formative Assessment

Formative assessment does not refer to a type of testing instrument; rather, it is a process for improving instruction dynamically through the use of student data. Given the emphasis on organic teacher-student instructional interactions, the question arises whether formalized tools (standalone testing instruments, item banks, instructional improvement systems, etc.) can meaningfully support the formative assessment process. This paper explores the supporting argument for using formalized tools.

Download: "The Role of Formalized Tools in Formative Assessment"

The Variable-Length Adaptive Diagnostic Testing

Diagnostic assessment, which uses diagnostic classification models (DCMs) to determine mastery or non-mastery of a set of attributes and to identify students' strengths and weaknesses, has recently drawn much attention from practitioners. Such an assessment adapts item selection within a pool of items specifically designated as diagnostic. The goal of this study was to evaluate different adaptive algorithms for variable-length adaptive diagnostic testing. Two new heuristics were proposed and used as part of the algorithms.
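
The two new heuristics are not spelled out in this summary; as a minimal sketch of the general machinery, the code below assumes a DINA-style model and selects each next item to minimize expected posterior entropy over attribute profiles, stopping once one profile dominates. The Q-matrix, slip/guess parameters, stopping threshold, and responses are all invented for illustration.

```python
import itertools
import numpy as np

# Invented DINA setup: Q[j, k] = 1 if item j requires attribute k.
Q = np.array([[1, 0], [0, 1], [1, 1], [1, 0], [0, 1], [1, 1]])
slip, guess = 0.10, 0.15
profiles = np.array(list(itertools.product([0, 1], repeat=Q.shape[1])))

def p_correct(j):
    """P(correct on item j) under DINA, for every attribute profile."""
    eta = np.all(profiles[:, Q[j] == 1] == 1, axis=1)  # has all required attrs?
    return np.where(eta, 1 - slip, guess)

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def next_item(posterior, administered):
    """Choose the unused item minimizing expected posterior entropy."""
    best, best_h = None, np.inf
    for j in range(Q.shape[0]):
        if j in administered:
            continue
        pc = p_correct(j)
        p1 = posterior @ pc                      # marginal P(correct)
        h = (p1 * entropy(posterior * pc / p1)
             + (1 - p1) * entropy(posterior * (1 - pc) / (1 - p1)))
        if h < best_h:
            best, best_h = j, h
    return best

# Variable-length loop: stop when one profile's posterior exceeds 0.90.
posterior = np.full(len(profiles), 1 / len(profiles))
administered, responses = set(), [1, 1, 0, 1, 0, 1]  # stand-in answers
while posterior.max() < 0.90 and len(administered) < Q.shape[0]:
    j = next_item(posterior, administered)
    pc = p_correct(j)
    posterior *= pc if responses[j] == 1 else 1 - pc
    posterior /= posterior.sum()
    administered.add(j)
print("profile:", profiles[posterior.argmax()], " P =", round(posterior.max(), 3))
```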

Download: "The Variable-Length Adaptive Diagnostic Testing"

Dealing with Variability with Item Clones in Computerized Adaptive Testing

The purpose of this study is to examine different means of dealing with the variability of item clones in order to moderate the possible loss of precision in ability estimation in computerized adaptive testing (CAT). To capture possible differences in item parameters among clones, two functions are investigated.
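
This summary does not name the two functions; the sketch below illustrates the underlying problem under one simple assumption, namely that a clone's Rasch difficulty is its parent item's difficulty plus family-level noise, so the response probability should be marginalized over that noise rather than computed at the parent value. All numbers are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

def p_rasch(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

# Assumed clone model (illustrative, not the paper's specific functions):
# b_clone ~ Normal(b_parent, sigma_family).
b_parent, sigma_family, theta = 0.2, 0.3, 0.5

# Naive: treat the clone as if it were the parent item.
p_naive = p_rasch(theta, b_parent)

# Marginal: average the response probability over clone variability
# (Monte Carlo approximation of the integral).
p_marginal = p_rasch(theta, rng.normal(b_parent, sigma_family, 100_000)).mean()

print(f"P(correct | parent b)        : {p_naive:.4f}")
print(f"P(correct | marginal clones) : {p_marginal:.4f}")
# The gap grows with sigma_family; ignoring it distorts item information
# and, in turn, the precision of CAT ability estimates.
```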

Download: "Dealing with Variability with Item Clones in Computerized Adaptive Testing"

Methods for Monitoring Rating Quality: Current Practices and Suggested Changes

This white paper discusses current rater-monitoring practices and suggests several ways they can be improved. It emphasizes that the purpose of rater monitoring should be to identify rater effects so that scoring leaders can provide raters with diagnostic and corrective feedback. It illustrates how several raw-score and latent trait modeling indices can be used to accomplish this goal, and shows that many indices currently used for rater monitoring fail to provide such information.
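
As a concrete illustration of the raw-score side of such monitoring, this sketch computes exact agreement, adjacent agreement, correlation, and signed bias for one rater against expert "validity" scores on seeded responses; the data and the 0-4 rubric are invented.

```python
import numpy as np

# Invented seeded-paper data on a 0-4 rubric: expert validity scores and
# one rater's scores for the same responses.
validity = np.array([3, 2, 4, 1, 3, 0, 2, 4, 3, 1])
rater    = np.array([3, 2, 3, 1, 4, 0, 2, 4, 2, 1])

diff = rater - validity
exact = np.mean(diff == 0)              # identical scores
adjacent = np.mean(np.abs(diff) <= 1)   # within one score point
r = np.corrcoef(rater, validity)[0, 1]  # linear association
bias = diff.mean()                      # signed severity (-) / leniency (+)

print(f"exact: {exact:.2f}  adjacent: {adjacent:.2f}  r: {r:.2f}  bias: {bias:+.2f}")
# High correlation combined with a nonzero bias points to consistent
# severity or leniency -- feedback a scoring leader can act on, which an
# agreement rate alone would mask.
```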

Download: "Methods for Monitoring Rating Quality: Current Practices and Suggested Changes"

Halo Effects and Analytic Scoring: A Summary of Two Analytical Studies

In this research report, we address the issue of whether unique information is provided by analytic scores assigned to student writing, beyond what is depicted by holistic scores, and to what degree multiple analytic scores assigned by a single rater display evidence of a halo effect.

Read: "Halo Effects and Analytic Scoring: A Summary of Two Analytical Studies"

Halo Effects and Analytic Scoring Research Report Summary

This document summarizes the results of a research study that examines rater halo and how much unique information is provided by multiple analytic scores.

Download: "Halo Effects and Analytic Scoring Research Report Summary"

Revisiting the Halo Effect Within a Multitrait, Multimethod Framework

This report summarizes an empirical study of the halo effect. Specifically, it investigates the extent to which multiple analytic scores assigned by a single rater display evidence of a halo effect.
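
To make the multitrait, multimethod (MTMM) logic concrete, the sketch below simulates analytic scores under an invented generating model in which each rater's global impression of an essay bleeds into every trait score, then contrasts the two correlations that separate halo from real trait overlap.

```python
import numpy as np

rng = np.random.default_rng(42)
n_essays, n_traits = 500, 4
true_traits = rng.normal(size=(n_essays, n_traits))  # independent true traits

def rate(halo_strength=1.0):
    impression = rng.normal(size=(n_essays, 1))  # rater's overall impression
    noise = 0.3 * rng.normal(size=(n_essays, n_traits))
    return true_traits + halo_strength * impression + noise

rater_a, rater_b = rate(), rate()

# Heterotrait-monomethod: different traits, same rater (inflated by halo).
within = np.corrcoef(rater_a, rowvar=False)
ht_mm = within[np.triu_indices(n_traits, k=1)].mean()

# Heterotrait-heteromethod: different traits, different raters (halo cancels).
cross = np.corrcoef(np.hstack([rater_a, rater_b]), rowvar=False)
ht_hm = cross[:n_traits, n_traits:][~np.eye(n_traits, dtype=bool)].mean()

print(f"heterotrait-monomethod r : {ht_mm:.2f}")   # driven up by halo
print(f"heterotrait-heteromethod r: {ht_hm:.2f}")  # near zero by construction
```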

Download: "Revisiting the Halo Effect Within a Multitrait, Multimethod Framework"

Distinguishing Several Rater Effects with the Rasch Model

Prior research on psychometric modeling of rater effects has focused on distinct effects in isolation, even though multiple types of rater effects likely exist simultaneously in real data. This simulation study evaluates the performance of several rater-effect indicators in data containing multiple rater effects.

Download: "Distinguishing Several Rater Effects with the Rasch Model"

Evaluation of Pseudo-Scoring as an Extension of Rater Training

This report summarizes the results of a study that sought to determine the potential benefit of engaging raters of essays written by students in a pseudo-scoring process as an extension of rater training. Raters at three grade levels took part in rater training activities, then took a pre-qualifying test, then engaged in pseudo-scoring, then took a qualifying test. Those who achieved qualifying status then proceeded to operational scoring. Results indicate an increase in performance on qualifying sets, an increase in qualifying rate, and an increase in inter-rater correlation following pseudo-scoring.

Download: "Evaluation of Pseudo-Scoring as an Extension of Rater Training"

An Investigation into Statistical Methods to Identify Aberrant Response Patterns

Various test-taker behaviors, including certain forms of misconduct, may result in aberrant response patterns for individual test-takers. This paper investigates four different approaches to computing person-fit. All four methods showed poor detection power in a simulation study. In terms of positive predictive value, or the percentage of correctly identified cheaters among the total number of identified test-takers, the lco difference method generally outperformed the others, although all methods performed poorly when cheating was scarce in the test-taking population.
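
The four approaches compared in the paper are not detailed in this summary; as a reference point, this sketch computes one classic person-fit index, the standardized log-likelihood lz (Drasgow, Levine, & Williams, 1985), under a Rasch model with invented difficulties.

```python
import numpy as np

def lz(x, theta, b):
    """Standardized log-likelihood person-fit index under the Rasch model."""
    p = 1.0 / (1.0 + np.exp(-(theta - b)))
    l0 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    mean = np.sum(p * np.log(p) + (1 - p) * np.log(1 - p))
    var = np.sum(p * (1 - p) * np.log(p / (1 - p)) ** 2)
    return (l0 - mean) / np.sqrt(var)

b = np.linspace(-2, 2, 10)  # invented difficulties, easy -> hard
consistent = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])  # Guttman-like pattern
aberrant   = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])  # misses easy, hits hard

for label, x in [("consistent", consistent), ("aberrant", aberrant)]:
    print(f"{label}: lz = {lz(x, theta=0.0, b=b):+.2f}")
# Strongly negative lz flags improbable patterns; as the paper finds, such
# flags yield poor positive predictive value when aberrance is rare.
```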

Download: "An Investigation into Statistical Methods to Identify Aberrant Response Patterns"

Relationship between Rater Background and Rater Performance

Although several researchers have posited relationships between raters' backgrounds and prior experiences and their performance in operational scoring projects, very little empirical research addresses this topic. This report summarizes how the background characteristics (demographics and professional experiences) of two samples of raters (one that scored a writing prompt and one that scored a science prompt) relate to measures of rater performance during the scoring project. Results suggest that small differences may exist between rater groups, but (a) these differences are not consistent across rater samples and (b) missing data preclude conclusions about whether important trends exist in these results.

Download: "Relationship between Rater Background and Rater Performance"

A Comparison of Trend Scoring and IRT Linking in Mixed-Format Test Equating

Characteristics of constructed-response (CR) items bring complications to the equating of mixed-format tests. Variations in rater severity across scoring cycles, if not adjusted for, become a potential source of error in mixed-format equating.
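
As a minimal illustration of the IRT-linking side of this comparison, the sketch below applies mean/sigma linking to invented common-item Rasch difficulty estimates from two administrations; trend scoring instead re-rates prior-cycle responses so that rater drift is measured directly rather than absorbed into the link.

```python
import numpy as np

# Invented common-item difficulty estimates from two calibrations.
b_old = np.array([-1.2, -0.4, 0.1, 0.8, 1.5])  # base-scale form
b_new = np.array([-1.0, -0.1, 0.4, 1.2, 1.9])  # new-form calibration

# Mean/sigma method: find A, B such that A * b_new + B matches the old scale.
A = b_old.std(ddof=1) / b_new.std(ddof=1)
B = b_old.mean() - A * b_new.mean()

print(f"A = {A:.3f}, B = {B:.3f}")
print("rescaled difficulties:", np.round(A * b_new + B, 2))
# The same (A, B) transform puts new-form abilities on the base scale. If
# rater severity drifted between cycles, the CR item estimates -- and hence
# A and B -- absorb that drift, which is the hazard the paper examines.
```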

Download: "A Comparison of Trend Scoring and IRT Linking in Mixed-Format Test Equating"

Online Scoring vs. Materials Scoring for Portfolio Assessments: An Exploration of Score Stability

Given the advantages and flexibility of online scoring, transitioning to it is an increasingly common trend in portfolio assessment and a worthwhile effort. This study investigates whether scores assigned to portfolio submissions are comparable between materials-based and online scoring conditions, and evaluates how scorers perceive the online platform's ease of use and its effectiveness in facilitating the scoring process.

Download: "Online Scoring vs. Materials Scoring for Portfolio Assessments: An Exploration of Score Stability"

Ability Estimation in the Presence of Aberrant Responses

This paper compares three alternative estimation methods with maximum likelihood estimation (MLE) for estimating ability parameters, using simulated item response data with varying degrees of disturbance, such as guessing and carelessness.
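
To sketch why estimator choice matters, the code below contrasts Newton-Raphson MLE with an illustrative Huber-weighted robust variant (not necessarily one of the paper's three methods) on a Rasch response pattern containing a single lucky guess; all values are invented.

```python
import numpy as np

def p_rasch(theta, b):
    return 1.0 / (1.0 + np.exp(-(theta - b)))

def mle_theta(x, b, iters=25):
    """Newton-Raphson maximum likelihood estimate of Rasch ability."""
    theta = 0.0
    for _ in range(iters):
        p = p_rasch(theta, b)
        theta += np.sum(x - p) / np.sum(p * (1 - p))
    return theta

def robust_theta(x, b, c=1.5, iters=25):
    """Illustrative robust variant: downweight responses whose standardized
    residuals exceed c, so guesses and slips pull less on the estimate."""
    theta = 0.0
    for _ in range(iters):
        p = p_rasch(theta, b)
        z = np.abs(x - p) / np.sqrt(p * (1 - p))
        w = np.where(z <= c, 1.0, c / z)  # Huber-type weights
        theta += np.sum(w * (x - p)) / np.sum(w * p * (1 - p))
    return theta

b = np.linspace(-2, 2, 12)   # invented item difficulties
x = (b < 0).astype(float)    # ability near 0: easy items answered correctly
x[-1] = 1.0                  # plus one lucky guess on the hardest item

print(f"MLE theta   : {mle_theta(x, b):+.3f}")
print(f"robust theta: {robust_theta(x, b):+.3f}")  # moves less toward the guess
```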

Download: "Ability Estimation in the Presence of Aberrant Responses"

The Impact of Ignoring Cross-Classified Multiple Membership Data Structures

This study compared the use of a three-level growth-curve model with that of a cross-classified growth curve model and a cross-classified multiple membership growth-curve model for handling cross-classified multiple membership data structures.
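
For readers unfamiliar with the third model: a multiple-membership growth curve lets each student's outcome draw on every school the student attended, weighted (often by time enrolled) so the weights sum to one. In illustrative notation (ours, not necessarily the study's):

```latex
% y_{ti}: score of student i at time t; school(i): set of schools attended.
\begin{aligned}
y_{ti} &= \beta_0 + \beta_1\,\mathrm{time}_{ti}
        + u_{0i} + u_{1i}\,\mathrm{time}_{ti}
        + \sum_{j \in \mathrm{school}(i)} w_{ij}\, v_j + e_{ti},\\
&\qquad \sum_{j \in \mathrm{school}(i)} w_{ij} = 1,\quad
  v_j \sim N(0, \sigma_v^2),\quad e_{ti} \sim N(0, \sigma_e^2).
\end{aligned}
% A strict three-level model forces each student into one school, i.e.,
% w_{ij} = 1 for a single j -- the misspecification the study probes.
```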

Download: "The Impact of Ignoring Cross-Classified Multiple Membership Data Structures"

Profile Classification for Cognitive Diagnostic Assessment: A Simulation Study

Cognitive diagnostic assessment (CDA) is a systematic process that seeks to obtain detailed information about the strengths and weaknesses of students’ knowledge, skills, and abilities (Rupp, Templin, & Henson, 2010). This information is most often organized in the form of cognitive profiles to assist teachers in designing remedial efforts for individual students.
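
As a minimal sketch of how such profiles are assigned, the code below computes the posterior over all 2^K attribute profiles under an invented DINA model and contrasts two common classification rules: maximum a posteriori (MAP) over whole profiles versus attribute-wise thresholding of marginal mastery probabilities.

```python
import itertools
import numpy as np

# Invented DINA setup: Q[j, k] = 1 if item j requires attribute k.
Q = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]])
slip, guess = np.full(6, 0.10), np.full(6, 0.20)
profiles = np.array(list(itertools.product([0, 1], repeat=3)))  # 2^3 = 8

x = np.array([1, 1, 0, 1, 0, 1])  # one examinee's invented responses

# Likelihood of the response vector under each candidate profile.
eta = np.array([np.all(profiles[:, q == 1] == 1, axis=1) for q in Q]).T  # (8, 6)
p = np.where(eta, 1 - slip, guess)
like = np.prod(np.where(x == 1, p, 1 - p), axis=1)
posterior = like / like.sum()  # uniform prior over profiles

map_profile = profiles[posterior.argmax()]  # whole-profile MAP rule
marginal = posterior @ profiles             # P(mastery) for each attribute
thresholded = (marginal > 0.5).astype(int)  # attribute-wise rule

print("MAP profile          :", map_profile)
print("mastery probabilities:", np.round(marginal, 2))
print("thresholded profile  :", thresholded)
# Simulation studies like this one compare such rules on classification
# accuracy across conditions (test length, item quality, profile base rates).
```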

Download: "Profile Classification for Cognitive Diagnostic Assessment: A Simulation Study"

Network-Based Tools for the Visualization and Analysis of Domain Models

A domain model depicts relationships among the important knowledge and skills that students are expected to learn in the subject domain. A graphic representation of a domain model facilitates understanding of the complex relationships in the domain and informs subsequent assessment development. The most important type of relationship in domain models is the prerequisite relationship, which specifies that one knowledge element must be acquired before another can be learned. Visualizing the network of prerequisite relationships, and representing it in ways that help assessment designers gain insight into the domain, is therefore essential to domain modeling.
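
A minimal sketch of such network-based tooling, using networkx and an invented fractions-domain prerequisite structure:

```python
import networkx as nx

# Invented prerequisite relationships: an edge a -> b means "a must be
# acquired before b".
G = nx.DiGraph([
    ("equal parts", "unit fractions"),
    ("unit fractions", "compare fractions"),
    ("unit fractions", "equivalent fractions"),
    ("equivalent fractions", "add unlike denominators"),
    ("compare fractions", "add unlike denominators"),
])

assert nx.is_directed_acyclic_graph(G)  # prerequisites must not cycle

# A topological order is a defensible teaching/assessment sequence.
print("one valid sequence:", list(nx.topological_sort(G)))

# Transitive reduction drops implied edges, keeping only direct
# prerequisites -- usually the cleanest version to visualize.
print("direct prerequisites:", sorted(nx.transitive_reduction(G).edges()))
```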

Download: "Network-Based Tools for the Visualization and Analysis of Domain Models"

Detecting Game Player Goals with Log Data

As gaming researchers attempt to make inferences about players' knowledge, skills, and attributes from their actions in open-ended gaming environments, understanding players' goals can help provide an interpretive lens for those actions. Games are generally far more open-ended than traditional assessments, so inferential evidence must take the context of a player's actions into account. Algorithms can help researchers identify particular goals in large log files of detailed player actions. This research uses Classification and Regression Tree (CART) methodology to develop, and then cross-validate, features of game play and related rules through which player behavior can be used to classify players according to their goals.
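
A minimal sketch of the develop-then-cross-validate design using scikit-learn's CART implementation; the log-derived features, the goal labels, and the relationship between them are all invented.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
n = 400

# Invented per-player features aggregated from raw log events.
X = np.column_stack([
    rng.poisson(5, n),        # resources collected
    rng.exponential(30, n),   # seconds spent exploring
    rng.integers(0, 10, n),   # quest actions taken
]).astype(float)
# Invented labels (0 = explorer, 1 = achiever), loosely tied to one feature
# so the tree has signal to find.
y = (X[:, 2] + rng.normal(0, 1.5, n) > 5).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
print("cross-validated accuracy:", cross_val_score(tree, X, y, cv=5).round(2))

# The fitted tree reads as interpretable goal-classification rules.
tree.fit(X, y)
print(export_text(tree, feature_names=["resources", "explore_secs", "quests"]))
```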

Download: "Detecting Game Player Goals with Log Data"

Bayesian Networks for Skill Diagnosis and Model Validation

Domain models depict relationships among the important knowledge and skills that students are expected to learn in a subject domain. Even in a highly focused and well-defined content area, there may be different domain models capturing different assumptions, perspectives, and hypotheses regarding what students should learn and how they learn it. This study illustrated how Bayesian networks can be used to diagnose specific subskills in the domain of fractions, and how different domain models compare with each other in terms of the extent to which they accurately predict students’ mastery of different subskills.
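
As a toy illustration of the diagnostic machinery, the sketch below performs exact inference by enumeration in a two-subskill network with invented conditional probabilities; real applications use larger graphs and dedicated Bayesian-network libraries.

```python
import itertools

# Invented two-subskill network: S1 -> S2 (mastering S1 raises the chance of
# mastering S2), with one item tapping each subskill.
P_S1 = 0.6                        # prior P(S1 mastered)
P_S2_given_S1 = {1: 0.7, 0: 0.2}  # P(S2 mastered | S1)
P_X1_given_S1 = {1: 0.9, 0: 0.2}  # P(item 1 correct | S1)
P_X2_given_S2 = {1: 0.85, 0: 0.25}

def joint(s1, s2, x1, x2):
    p = P_S1 if s1 else 1 - P_S1
    p *= P_S2_given_S1[s1] if s2 else 1 - P_S2_given_S1[s1]
    p *= P_X1_given_S1[s1] if x1 else 1 - P_X1_given_S1[s1]
    p *= P_X2_given_S2[s2] if x2 else 1 - P_X2_given_S2[s2]
    return p

x1, x2 = 1, 0  # observed: item 1 correct, item 2 incorrect
norm = sum(joint(s1, s2, x1, x2) for s1, s2 in itertools.product([0, 1], repeat=2))
print(f"P(S1 | data) = {sum(joint(1, s2, x1, x2) for s2 in (0, 1)) / norm:.2f}")
print(f"P(S2 | data) = {sum(joint(s1, 1, x1, x2) for s1 in (0, 1)) / norm:.2f}")
# Competing domain models correspond to alternative graphs/CPTs; each can be
# scored by how well it predicts students' held-out responses.
```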

Download: "Bayesian Networks for Skill Diagnosis and Model Validation"

A Literature Review of Gaming in Education

The use of simulations and digital games in learning and assessment is expected to increase over the next several years. Although there is much theoretical support for the benefits of digital games in learning and education, there is mixed empirical support. This research report provides an overview of the theoretical and empirical evidence behind five key claims about the use of digital games in education. The claims are that digital games (1) are built on sound learning principles, (2) provide more engagement for the learner, (3) provide personalized learning opportunities, (4) teach 21st century skills, and (5) provide an environment for authentic and relevant assessment. The evidence for each claim is presented and directions for future research are discussed.

Download: "A Literature Review of Gaming in Education"

Harnessing the Currents of the Digital Ocean

This paper extends the discussion begun by DiCerbo & Behrens (2012), who outlined how the societal shift related to the digital revolution can be understood as a move from a pre-digital “digital desert” to a post-digital “digital ocean.” Using the framework of Evidence-Centered Design, they suggest that the core processes of assessment delivery can be rethought in terms of the new capabilities of computing devices and large amounts of data, and that many of our original categories of educational activity represent views limited by their origination in the digital desert.

Download: "Harnessing the Currents of the Digital Ocean"

Evaluating the Use of Growth Prediction Models to Inform Instruction

This study examines the performance of five growth prediction models (trajectory, transition table, projection, Student Growth Percentile, and logistic regression) using data from two cohorts of students (elementary and middle school), in two subjects (reading and mathematics), with two proficiency cut scores (low and high rigor), on a vertically scaled summative assessment from a large U.S. state.
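
As a sketch of the simplest of the five, the code below implements a trajectory-style projection with invented vertical-scale scores and an invented proficiency cut; the other models replace this naive growth line with transition tables or regressions estimated from earlier cohorts.

```python
import numpy as np

# Invented vertical-scale scores for one student and an invented grade-8 cut.
grades = np.array([4, 5, 6])
scores = np.array([420.0, 450.0, 474.0])
target_grade, cut = 8, 540.0

# Trajectory model: extend the student's average observed growth per grade.
growth = np.diff(scores).mean()
projected = scores[-1] + growth * (target_grade - grades[-1])

print(f"avg growth/yr: {growth:.1f}   projected grade-8 score: {projected:.1f}")
print("on track for proficiency" if projected >= cut else "not on track")
```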

Download: "Evaluating the Use of Growth Prediction Models to Inform Instruction"