Number of Operative Performance Ratings Needed to Reliably Assess the Difficulty of Surgical Procedures

Kenneth L. Abbott MS, Xilin Chen MPH, Michael Clark PhD, Nikki L. Bibler Zaidi PhD, David B. Swanson PhD, Brian C. George MD, MAEd

Journal of Surgical Education, Vol. 76, Issue 6



The profession of surgery is entering a new era of “big data,” where analyses of longitudinal trainee assessment data will be used to inform ongoing efforts to improve surgical education. Given the high-stakes implications of these types of analyses, researchers must define the conditions under which estimates derived from these large datasets remain valid. With this study, we determine the number of assessments of residents’ performances needed to reliably assess the difficulty of “Core” surgical procedures.


Using the SIMPL smartphone application from the Procedural Learning and Safety Collaborative, 402 attending surgeons directly observed and provided workplace-based assessments for 488 categorical residents after 5259 performances of 87 Core surgical procedures at 14 institutions. We used these faculty ratings to construct a linear mixed model with resident performance as the outcome variable and multiple predictors, the most important of which was the operative procedure, modeled as a random effect. We interpreted the variance in performance ratings attributable to the procedure, after controlling for other variables, as the “difficulty” of performing the procedure. We conducted a generalizability analysis and decision study to estimate the number of SIMPL performance ratings needed to reliably estimate the difficulty of a typical Core procedure.
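The logic of a decision study can be illustrated with a minimal sketch. It assumes a simplified one-facet design in which the dependability coefficient for the mean of n ratings is the procedure variance divided by the sum of the procedure variance and the remaining (error) variance divided by n. This is not the study's full mixed model, which included additional predictors. The function names and the variance components below are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
def dependability(var_procedure: float, var_error: float, n_ratings: int) -> float:
    """Dependability coefficient for the mean of n ratings of one procedure.

    Assumes a simplified one-facet design (an assumption of this sketch):
    phi = var_procedure / (var_procedure + var_error / n_ratings).
    """
    if n_ratings < 1:
        raise ValueError("n_ratings must be at least 1")
    return var_procedure / (var_procedure + var_error / n_ratings)


def ratings_needed(var_procedure: float, var_error: float,
                   target: float = 0.80, max_n: int = 10_000) -> int:
    """Smallest number of ratings whose dependability reaches the target."""
    for n in range(1, max_n + 1):
        # A small tolerance guards against floating-point rounding at the threshold.
        if dependability(var_procedure, var_error, n) >= target - 1e-12:
            return n
    raise ValueError("target not reached within max_n ratings")


# Hypothetical variance components, for illustration only.
example_n = ratings_needed(var_procedure=0.10, var_error=0.50, target=0.80)
print(example_n)
```

Under this simplification, the required number of ratings grows in proportion to the ratio of error variance to procedure variance, so procedures whose ratings are noisier relative to their difficulty signal need more observations to reach the same dependability.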


Twenty-four faculty ratings of resident operative performance were necessary to reliably estimate the difficulty of a typical Core surgical procedure (mean dependability coefficient 0.80, 95% confidence interval 0.73-0.87).


At least 24 operative performance ratings are required to reliably estimate the difficulty of a typical Core surgical procedure. Future research using performance ratings to establish procedure difficulty should include adequate numbers of ratings given the high-stakes implications of those results for curriculum design and policy.

