In this study, various statistical indexes of agreement were calculated using empirical data from a group of evaluators (n = 45) of early childhood teachers. The group of evaluators rated ten fictitious teacher profiles using the North Carolina Teacher Evaluation Process (NCTEP) rubric. The exact and adjacent agreement percentages were calculated for the group of evaluators. Kappa, weighted Kappa, Gwet’s AC1, Gwet’s AC2, and ICCs were used to interpret the level of agreement between the group of raters and a panel of expert raters. Similar to previous studies, Kappa statistics were low in the presence of high levels of agreement. Weighted Kappa and Gwet’s AC1 were less conservative than Kappa values. Gwet’s AC2 statistic was not defined for most evaluators, as there was an issue found with the statistic when raters do not use each category on the rating scale a minimum number of times. Overall, summary statistics for exact agreement were 68.7% and 87.6% for adjacent agreement across 2,250 ratings (45 evaluators ratings of ten profiles across five NCTEP Standards). Inter-rater agreement coefficients varied from .486 for Kappa, .563 for Gwet’s AC1, .667 for weighted Kappa, and .706 for Gwet’s AC2. While each statistic yielded different results for the same data, the inter-rater reliability of evaluators of early childhood teachers was acceptable or higher for the majority of this group of raters when described with summary statistics and using precise measures of inter-rater reliability.
Holcomb, T. S., Lambert, R., & Bottoms, B. L. (2022). Reliability Evidence for the NC Teacher Evaluation Process Using a Variety of Indicators of Inter-Rater Agreement. Journal of Educational Supervision, 5 (1). https://doi.org/10.31045/jes.5.1.2