An evaluation of the inter-rater reliability in a clinical skills objective structured clinical examination

V de Beer
J Nel
FP Pieterse
A Snyman
G Joubert
MJ Labuschagne


Background. An objective structured clinical examination (OSCE) is a performance-based examination used to assess health sciences students and is a
well-recognised tool for assessing clinical skills, with or without the use of real patients.
Objectives. To determine the inter-rater reliability of experienced and novice assessors from different clinical backgrounds on the final mark allocations
during assessment of third-year medical students’ final OSCE at the University of the Free State.
Methods. This cross-sectional analytical study included 24 assessors and 145 students. After training and written instructions, two assessors per station
(urology history taking, respiratory examination and gynaecology skills assessment) each independently assessed the same student for the same skill by
completing their individual checklists. At each station, assessors could also give a global rating mark (from 1 to 5) as an overall impression.
Results. The urology history-taking station had the lowest mean score (53.4%) and the gynaecology skills station the highest (71.1%). Seven (58.3%) of
the 12 assessor pairs differed by >5% regarding the final mark, with differences ranging from 5.2% to 12.2%. For two pairs the entire confidence interval
(CI) was within the 5% range, whereas for five pairs the entire CI was outside the 5% range. Only one pair achieved substantial agreement (weighted
kappa statistic 0.74 ‒ urology history taking). There was no consistency within or across stations regarding whether the experienced or novice assessor
gave higher marks. For the respiratory examination and gynaecology skills stations, all pairs differed for the majority of students regarding the global
rating mark. Weighted kappa statistics indicated that no pair achieved substantial agreement regarding this mark.
Conclusion. Despite previous experience, written instructions and training in the use of the checklists, differences between assessors were found in
most cases.
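The agreement measure reported in the Results, the weighted kappa statistic, can be illustrated with a short sketch. The function below is a minimal pure-Python implementation of weighted Cohen's kappa for two raters scoring the same students on an ordinal scale (such as the 1 to 5 global rating mark); the assessor scores shown are hypothetical examples, not data from this study.

```python
from collections import Counter

def weighted_kappa(rater_a, rater_b, categories, weights="linear"):
    """Weighted Cohen's kappa for two raters over ordinal categories.

    `weights` selects the disagreement penalty between category
    indices i and j: "linear" (|i - j|) or "quadratic" ((i - j)**2).
    """
    n = len(rater_a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    power = 1 if weights == "linear" else 2
    w = [[abs(i - j) ** power for j in range(k)] for i in range(k)]
    # Observed joint distribution of the two raters' scores.
    obs = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1
    # Expected disagreement under chance, from each rater's marginals.
    ca, cb = Counter(rater_a), Counter(rater_b)
    observed = sum(w[i][j] * obs[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(w[i][j] * (ca[categories[i]] / n) * (cb[categories[j]] / n)
                   for i in range(k) for j in range(k))
    return 1.0 - observed / expected

# Hypothetical global-rating marks (1-5) from two assessors at one station.
assessor_1 = [3, 4, 2, 5, 3, 4, 3, 2]
assessor_2 = [3, 3, 2, 4, 3, 5, 4, 2]
print(weighted_kappa(assessor_1, assessor_2, [1, 2, 3, 4, 5]))
```

A value of 1 indicates perfect agreement and 0 indicates agreement no better than chance. By the commonly used Landis and Koch benchmarks, 0.61 to 0.80 counts as substantial agreement, which is the criterion met by the single pair with a weighted kappa of 0.74.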


Article Details

How to Cite
An evaluation of the inter-rater reliability in a clinical skills objective structured clinical examination. (2023). African Journal of Health Professions Education, 15(2), 13-17.
