Implementation and validation of an item response theory scale for formative assessment
Stéphanie Berger is a PhD student in the department of Research Methodology, Measurement and Data Analysis (OMD). Her supervisors are prof.dr.ir. T.J.H.M. Eggen from the faculty of Behavioural, Management and Social sciences (BMS) and prof.dr. U. Moser of the University of Zürich, Switzerland. Her co-supervisor is dr.ir. A.J. Verschoor from the faculty of Behavioural, Management and Social sciences (BMS).
A vertical measurement scale is the basis for repeatedly assessing and monitoring students’ abilities throughout their school years. Advanced computer technology serves as a foundation for implementing vertical scales and related complex measurement models, such as item response theory (IRT) models, within computer-based assessment systems. Such innovative assessment systems have immense potential for formative assessment because they can assist students and teachers to evaluate students’ progress toward their learning goals; to adapt learning goals, learning strategies, or teaching; and to define new learning goals that are appropriate for each student’s learning path. This thesis was motivated by practical challenges related to implementing and validating a vertical Rasch scale to measure students’ mathematics abilities throughout compulsory school in Northwestern Switzerland. The goal of this vertical scale is to provide third through ninth grade students with objective, reliable, and valid assessment reports based on two different assessment instruments: (1) a set of four standardized tests and (2) an online item bank for formative assessment. This thesis contributes to the existing literature about IRT, vertical scaling, and formative assessment by evaluating practical solutions for practical challenges, which can complicate the implementation and validation of a vertical Rasch scale. This thesis elaborated on practical constraints, including time and financial resources; willingness of schools, teachers, and students to participate in calibration studies; and the number of new items that need to be calibrated as a basis for repeatedly assessing students over multiple school years.
Chapter 1, the general introduction, provided an overview of educational assessment in Northwestern Switzerland as the practical context for this thesis, and introduced vertical scaling and efficient testing based on IRT methods as the common theoretical themes of the studies presented in this thesis. Chapter 1 also outlined the research objectives and research questions.
In Chapter 2, a theoretical concept was proposed to implement IRT-related calibration procedures and data-collection designs in order to establish the vertical mathematics scale for Northwestern Switzerland. Specifically, Chapter 2 described the four standardized tests and the online item bank for formative assessment in more detail, evaluating their similarities and differences regarding target population, assessment types and purposes, content specifications, and measurement conditions. Chapter 2 also provided an overview of different IRT-related calibration procedures and data-collection designs for both horizontal and vertical scaling. This chapter elaborated on the idea of targeted and adaptive testing based on the Rasch model to increase measurement efficiency under the practical constraints of limited student or item samples. By integrating the two instruments’ similarities and differences with the theoretical background on data-collection designs and item calibration within a Rasch framework, a four-step item calibration process was proposed to establish a vertical scale and link the two instruments. The concluding discussion pointed out a need for empirical research into efficient calibration and test designs under practical contexts and constraints, and stressed the need to validate the final scale from a psychometric, as well as a content, perspective.
Chapter 3 focused on the fact that calibration of an item bank for computerized adaptive testing, such as the online item bank for formative assessment, requires substantial resources, and addressed the need for empirical research into efficient calibration designs. This chapter presented a study that investigated whether calibration efficiency under the Rasch model could be enhanced through targeted multistage calibration designs, which consider ability-related background variables and performance for assigning suitable items to students. This chapter also investigated whether uncertainty about item difficulty could impair assembly of an efficient design. The results indicated that targeted multistage calibration designs were more efficient than ordinary targeted designs under optimal conditions. Limited knowledge about item difficulty reduced the efficiency of one of the two investigated targeted multistage calibration designs, whereas the targeted design was more robust.
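Why the match between students and items matters for calibration can be sketched as follows. The example approximates the standard error of a Rasch item difficulty estimate as the inverse square root of the summed response variances P(1 - P), under the simplifying assumption that student abilities are known. This is an illustration of the general principle, not the efficiency criterion or simulation setup used in the study; the sample sizes and logit values are hypothetical.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """Rasch probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def difficulty_standard_error(abilities: list[float], b: float) -> float:
    """Approximate SE of the difficulty estimate for one item.

    Assumption: abilities are treated as known, so the information about b
    is the sum of P * (1 - P) over the students who answered the item.
    """
    info = sum(p_correct(t, b) * (1.0 - p_correct(t, b)) for t in abilities)
    return 1.0 / math.sqrt(info)


# Hypothetical difficult item (b = 1.5 logits) answered by 100 students.
b_hard = 1.5
targeted_sample = [1.5] * 100      # students whose ability matches the item
untargeted_sample = [-1.5] * 100   # students far below the item difficulty

se_targeted = difficulty_standard_error(targeted_sample, b_hard)
se_untargeted = difficulty_standard_error(untargeted_sample, b_hard)

# The same number of responses yields a more precise difficulty estimate
# when the item is administered to well-matched students.
print(se_targeted < se_untargeted)
```

The sketch also hints at why uncertainty about item difficulty can hurt: if the assumed difficulty used to assign the item is wrong, the students who receive it are no longer well matched, and the precision gain shrinks.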
Chapter 4 further investigated the idea of combining targeted and multistage testing and addressed the fact that neither targeted testing by means of ability-related background variables nor adaptive testing through multistage tests (MSTs) can ensure that all students receive items that completely match their abilities. Targeted designs do not consider that student abilities might significantly differ within each group. MST designs usually include a starting module of moderate difficulty, which does not account for differences in student abilities. This chapter investigated whether measurement efficiency can be improved through targeted multistage test (TMST) designs that consider ability-related background information for a targeted assignment at the beginning of the test, as well as performance during test-taking, for selecting matching test modules. Through simulations, the efficiency of the TMST design for estimating student ability was compared to that of the traditional targeted test design and the MST design. Chapter 4 further analyzed the extent to which each design’s efficiency depends on the correlation between the ability-related background variable and students’ true abilities, each student’s ability level and categorization into an ability group, and the length of the starting module compared to the total test length. The results indicated that TMST designs were generally more efficient than targeted and MST designs, especially if the ability-related background variable was highly correlated with students’ true abilities. TMST designs were also particularly efficient for estimating the abilities of low- and high-ability students within a given population. Finally, longer starting modules resulted in a less efficient estimation of low and high abilities than did shorter starting modules. However, this finding was more prominent for MST than for TMST designs. This chapter concluded by recommending TMST designs to assess students with a wide range of abilities when a reliable ability-related background variable is available.
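The two decisions that distinguish a TMST design, a targeted start followed by performance-based routing, can be sketched in a few lines. The group labels, module difficulties, routing thresholds, and step size below are hypothetical choices made for illustration; the summary does not specify the module structure or routing rules used in the simulations.

```python
# Hypothetical starting modules: mean item difficulty in logits per
# ability-related background group (e.g., derived from grade level).
START_MODULES = {"low": -1.0, "medium": 0.0, "high": 1.0}

# Hypothetical shift in difficulty between adjacent second-stage modules.
STEP = 0.75


def select_start_module(background_group: str) -> float:
    """Targeted assignment: pick the starting module from background data."""
    return START_MODULES[background_group]


def route_next_module(start_difficulty: float, n_correct: int, n_items: int) -> float:
    """Multistage routing: pick the next module from observed performance.

    Fewer than one third correct routes to an easier module, more than
    two thirds correct routes to a harder one, otherwise difficulty is kept.
    Integer comparisons avoid floating-point edge cases at the cut scores.
    """
    if 3 * n_correct < n_items:
        return start_difficulty - STEP
    if 3 * n_correct > 2 * n_items:
        return start_difficulty + STEP
    return start_difficulty


# A student from the "low" group who solves 1 of 6 starting items is routed
# to an easier module than any student could reach from a moderate start.
start = select_start_module("low")
second_stage = route_next_module(start, n_correct=1, n_items=6)
print(start, second_stage)  # -1.0 -1.75
```

In this sketch, an ordinary MST design corresponds to always starting at the "medium" module, and an ordinary targeted design corresponds to skipping the routing step, which is why the combined design can reach students at the extremes of the ability range more quickly.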
Chapter 5 returned to and expanded upon the conclusion of Chapter 2 that vertical scales require validation from a psychometric, as well as a content, perspective. Specifically, this chapter described the actual implementation of the calibration assessments, proposed in Chapter 2, as one of four calibration steps to establish a vertical Rasch scale for assessing the mathematics abilities of students in third through ninth grade. The psychometric properties of the vertical scale were examined through item analysis, as well as by comparing two different calibration procedures: concurrent and grade-by-grade calibration. The content-related validity of the scale was evaluated by contrasting the empirical item difficulty estimates with the theoretical, content-related item difficulties reflected in the underlying competence levels of the curriculum. Through correlation and multiple regression analyses, this chapter explored whether the match between empirical and content-related item difficulty differed for items related to different curriculum cycles (i.e., primary vs. secondary school), domains, or competencies within mathematics. The results indicated a satisfactory item fit and a close match between the outcomes of the concurrent and grade-by-grade calibration procedures, supporting the scale’s unidimensionality and stability from a psychometric perspective. In addition, strong correlations between the empirical and content-related item difficulties were found, supporting the scale’s content validity. Further analysis showed a higher correlation between empirical and content-related item difficulty at the primary level when compared with the secondary school level. Correlations were comparable across the different curriculum domains and competencies, implying that the scale is a good indicator of students’ math abilities, as outlined in the curriculum.
The Epilogue, Chapter 6, reviewed the primary research questions answered by the four studies presented in this thesis, summarized the related theoretical and empirical findings, and discussed their implications for future research.