Enhancing the reliability of graduation project assessment

Guide: Assessing graduation projects – how to enhance reliability by way of investigation, verification, and calibration activities

1. Introduction
The graduation project, as the concluding part of a (university) education programme, is an important tool for determining whether the student meets the established final learning outcomes and can produce work at an academic level. Graduation projects account for a significant part of a programme's volume (usually 15-30 EC), and their assessment has serious consequences for the student.

Graduation projects assess the final competencies of a programme, and guidelines and an assessment framework are usually in place for this purpose. But even then, the assessment remains a difficult affair. The final assignment and the process can differ due to, among other things, the complexity of the assignment, the degree and quality of supervision, the qualities the student brings at the start, the attitude and behavior of the student, and the extent to which the student manages to develop during the process. The assessment is done by different teachers, each bringing their own views, methods, qualities, and experiences.

All these factors make a reliable assessment of a graduation project extremely important but at the same time complex. How can you ensure that differences in grades are meaningful? That the grade unambiguously represents the quality of the work? That the grade does not depend on who happened to be grading (inter-rater reliability)?

Investigation, verification, and calibration activities can help explore and increase the reliability of these complex assessments. These activities can be initiated by programme management (from a quality assurance point of view), an Examination Board (from a safeguarding point of view), or both parties jointly. A programme committee can also play a role in this respect.
   
This page offers suggestions for this purpose. First of all, we explain what investigation, verification, and calibration activities can contribute to.
The activities can be designed in various ways and serve multiple purposes. Chapter 5 describes the set-up of calibration sessions, for example in the form of so-called “thesis carousels”. To gain better insight into supervision and assessment, other activities can also be carried out, even without or with only limited direct teacher involvement; these are described in Chapter 3.
In addition to these investigative activities, other actions can help increase inter-rater reliability and consistency in assessment. For these, see Chapter 4. 

Chapter 2. What investigation, verification, and calibration activities can contribute to
The activities described in the following chapters can contribute to:
> Consistency of assessments: equitable and meaningful grading, and an increase in inter-rater reliability.
> Verification that the academic standards of the programme, as reflected in the final qualifications, are adhered to and achieved.
> Professional development of assessors. By discussing assessment practices, criteria, considerations, etc., assessors learn from each other.
   This is especially important for newly employed, inexperienced assessors or assessors who have little or no experience with the Dutch assessment system.
> Adequately following up on the procedures and guidelines for the graduation process. For example, performing a check for plagiarism.
> Identifying and detecting bottlenecks and points of concern in the assessment or in procedures, processes, and instruments.
> (Discussing and) improving procedures and processes.
> (Discussing and) improving assessment instrumentation.


  • Chapter 3.1 Investigative activities without direct involvement of assessors

    A number of investigative activities to better understand the processes surrounding graduation and assessments and the resulting outcomes can be conducted without or with limited direct involvement of faculty members. These activities can provide input to sessions with faculty staff or provide useful evaluative information for programme managers and/or Examination Boards.

    Ideas for investigative activities:

    • Examine the quality and follow-up of processes and procedures. Are processes and procedures being implemented as intended? For example, are forms being completed as expected so that it is clear how a grade was arrived at? For this investigation, it may be sufficient to look at a sample of completed forms.
    • Examine time spent on supervision. NB: this will require limited input from teachers unless timekeeping systems are in use. Questions might include: how many hours on average are spent supervising students? What is the spread, and can it be explained? Does it vary by type of assignment, chair group, supervisor, and educational background of the students? Can differences lead to differences in the quality of work and grading?
    • Screening of the entire graduation process. An independent assessment expert or a small committee can be asked to take a critical look at the entire graduation process (including all instruments). Is the assessment done validly, reliably, and transparently? 
    • Investigating certain themes or bottlenecks. For example: Does the “design character” of the programme emerge sufficiently in the graduation projects? A domain expert can look at a selection of graduation projects for this purpose. Or a question could be: Are projects done internally (within the programme) assessed differently from assignments done externally? After categorization, grades (at the criteria level) can be compared.  
  • Chapter 3.2 Investigations by way of quantitative analyses

    Quantitative analyses can be conducted regularly, especially to signal areas of concern and notable developments. For example (a minimal analysis sketch follows after this list):

    • How long do students take to graduate? Does this vary over the years? Does it differ by chair group, teacher, type of assignment, or the students' educational background?
    • What grades are obtained? Does this vary over the years? Does it differ by chair group, teacher, type of assignment, or the students' educational background?
    • If scores or grades are recorded at the criteria level: what stands out? For example, it may turn out that the Methodology section of the final report receives low marks on average.
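    Most of these questions can be answered with a short analysis script once the grade administration has been exported to a table. The following is a minimal sketch in Python/pandas; the file name and column names ("year", "chair_group", "final_grade", and per-criterion columns prefixed with "crit_") are illustrative assumptions, not fields prescribed by any particular grade administration system.

    ```python
    import pandas as pd

    # Hypothetical export of the grade administration (one row per graduation project).
    grades = pd.read_csv("graduation_grades.csv")

    # Grade distribution per academic year: mean, spread, and number of projects.
    print(grades.groupby("year")["final_grade"].agg(["mean", "std", "count"]))

    # Mean final grade per chair group per year, to spot diverging trends.
    print(grades.pivot_table(index="chair_group", columns="year",
                             values="final_grade", aggfunc="mean").round(1))

    # Criteria-level averages: which criterion scores lowest on average?
    criterion_cols = [c for c in grades.columns if c.startswith("crit_")]
    print(grades[criterion_cols].mean().sort_values())
    ```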

  • Chapter 4. Other quality assurance or safeguarding actions

    In addition to the investigation possibilities described above and the calibration activities described below, there are more ways to promote and ensure the quality of the assessment of thesis projects, such as:

    • If working with thesis committees: vary the composition of the thesis committees regularly.
    • Combine experienced and less experienced assessors in a committee. NB. 'Inexperienced' may also imply inexperience with the Dutch grading system and standards.
    • Promote a culture of peer review, support, and feedback among assessors. Encourage assessors to consult each other when in doubt or to learn from each other.
    • Provide an assessment manual that describes all important aspects of the assessment process and the assessment instrument (especially the criteria used).
    • Provide an expanded version of the assessment form as background information. For example, if the assessment form only briefly states the criteria, a more extensive version can provide more explanations and examples. Also, the criteria can be combined with “standards” in a more comprehensive analytical or holistic rubric. This indicates, for example: for the grade 6, we expect at least ......, but there may be deficiencies in terms of .....
    • Organize an (annual) meeting to discuss the graduation process, supervision, and the assessment process, especially for new employees.
    • Encourage participation in a “Supervising students” training. In the UTQ track, this training is offered by default. If PhD candidates are involved in the supervision and assessment of students, reference can be made to the Taste of Teaching training (which pays some attention to thesis supervision and assessment), or a customized training for the programme's own PhD candidates can be considered.
    • Administer a questionnaire or organize interviews among graduation assessors to identify any bottlenecks and areas for improvement. What is their impression of the quality of student work? Has this changed over the past few years? What bottlenecks do they encounter (e.g., finding that many students do not master a particular skill)?
  • Chapter 5. Calibration or alignment activities

    Next, some ideas for calibration or alignment activities are presented.

    Important points to consider when organizing a “thesis carousel”:
    > Clearly state the purpose of the reassessments in advance. The essence is “collaborative reflection.” It is important that participants feel safe to reflect on their opinions and ideas and are open to the reactions of others. It is not about judging the original assessors and how they did the work or who is right. It is about discussing the interpretation of criteria and standards and identifying any bottlenecks and concerns in the assessment process.
    > A joint session requires an expert moderator and good structure and organization. Steer for dialogue rather than discussion. Avoid talking about the quality of a specific piece of work; lift the conversation to a somewhat higher level of abstraction. Providing a checklist with questions to address can help steer the conversation in the right direction.

    • 5.1 First action: Determine the purpose of the calibration or alignment activity

      Prior to organizing an activity, it is important to consider which goal or goals are intended. What is the focus of the activity? Ultimately, all of these goals may contribute to improving the quality of the overall graduation process. Nevertheless, the perspective varies, and this may determine the design and implementation of an activity.

      Possible goals are, for instance:

      • To investigate whether the assessments do justice to the quality of the work done; possibly also in comparison with the quality of graduation projects at similar programmes elsewhere (national or international).
      • To investigate whether the assessors involved in a project, or the assessors of a programme as a whole, are on the same wavelength and, if there are differences, to find out what leads to them.
      • To calibrate: having thesis assessors discuss with each other how they assess, in order to promote consistency in assessment. This can be especially important when there is a lot of change in who is assessing and to “induct” new assessors. Reflecting on one's assessment methods can also be valuable for experienced assessors.
      • To examine ambiguous situations and circumstances. Are there aspects that make an assessment especially difficult? If a student takes much longer on a task than planned, do you factor this into the assessment? In what way? Does everyone do so equally?
      • To check whether the assessment criteria and assessment forms used are adequate and provide a good guide for the assessments. Are there any ambiguities? Differences in the interpretation of certain criteria? Differences in the weight assigned to different criteria? Are adjustments needed?
      • To discuss the assessment criteria and assessment forms used, possibly also the vision regarding thesis work, the supervision process and important conditions (e.g. how much time is envisaged for supervision), procedures and support (e.g. documentation for supervisors and assessors, but also the manuals for the students), in order to arrive at a common frame of reference, similar working methods and sufficient support for the processes.
      • To discuss bottlenecks experienced by thesis supervisors and assessors and possible measures, or to discuss measures planned by the programme (for example, the need to regulate the number of supervision hours) and the bottlenecks or added value expected from these.
    • 5.2 Consideration: Calibration with external involvement

      Depending on the goal(s), it can be interesting to involve external parties, for example when you want to investigate whether the level and quality of your own students' graduation work is comparable to that of students from similar programmes in the Netherlands or abroad, and whether you give comparable grades for comparable quality.
      An external party that is not familiar with the student population, the educational context, and local traditions and customs looks at the final works more objectively. Externals with specific expertise can be involved to further examine certain areas of concern. For example: is the methodological part of the research that students conduct in order and up to standard?

      How can you proceed?
      The involvement of an external party can be organized on the basis of mutual exchange. This is not strictly necessary, but mutual exchange does make it possible to work on a reciprocal, “closed exchange” basis. On both sides, for example, a few faculty members with sufficient knowledge of the discipline and experience in grading theses are asked to look at a few theses (varying in quality and grades) and give an assessment of them using a form.
      A separate form can be prepared for the exchange so that both parties assess on the basis of the same, pre-agreed assessment criteria. Alternatively, the assessment form used by the particular programme can be used; the assessment forms can then also be discussed afterwards.
      The assessments are subsequently compared and discussed at a meeting. The process and results are documented and provide input for possible improvement measures. Good documentation is important for demonstrating the measures taken in the context of quality assurance.

      Variants:

      • Even without looking at specific theses, it may be interesting to check with a similar programme elsewhere to see how they design the thesis process and assess the work delivered. This can be done through desk research based on requested or publicly available information or by organizing an (online) meeting with one or more external parties.  
      • A calibration session can be organized by taking the same thesis (or two or three) and having both parties review it, with or without using their own assessment tools. Are there different judgments for the same thesis? What causes this?
      • A single expert external assessor (familiar with the discipline and experienced in assessing theses) may be asked to assess a number of theses.
    • 5.3.2 Calibration session as such (all review the same thesis)

      Purpose: To verify that teachers interpret criteria and expected standards similarly. Calibrating assessments by learning from each other. Possibly also: to ascertain the extent to which the assessment tools are satisfactory or where improvement is needed.

      In advance: One or more (two or three) theses are chosen, and all reviewers are asked in advance to complete the assessment form (possibly with additional questions) for the same anonymized thesis(es). Items such as “process” and “presentation” can be omitted.
      The judgments can be analyzed in advance to bring out some points of interest for the session. Another option is to omit the analysis beforehand and ask participants to bring the completed form with them and discuss it on the spot.
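      If the completed forms are collected digitally, a short analysis can show in advance on which criteria the assessors diverge most. The following is a minimal sketch in Python/pandas; the file name and column names ("assessor", "criterion", "score") are illustrative assumptions, with one row per assessor-criterion score for the same thesis.

      ```python
      import pandas as pd

      # Hypothetical export: one row per (assessor, criterion) score for the same anonymized thesis.
      forms = pd.read_csv("calibration_forms.csv")

      # Per criterion: lowest, highest, and mean score given, plus the spread between assessors.
      summary = forms.groupby("criterion")["score"].agg(["min", "max", "mean"])
      summary["range"] = summary["max"] - summary["min"]

      # Criteria with the widest spread are natural points of interest for the session.
      print(summary.sort_values("range", ascending=False))
      ```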

      During session: The assessments are discussed (in groups). Participants explain to each other how they arrived at the assessment. Perspectives are compared and differences are discussed on the basis of a number of presented questions.
      Findings are recorded using, for example, the questionnaire (or a PPT presentation, flip chart sheet, etc.).  
      At the end of the session, the main findings are reviewed, together with whether they warrant action. A finding does not always imply a concrete decision; more research may be needed first.

      Variant: Discussion in a “world café” setup. Specific concerns are discussed at different tables, each with a moderator. Participants visit the different tables to give their input, building on the input given earlier.

      After the session: A report is made and actions are determined.

      Variant: This calibration can also be organized in a smaller format, for example for a single chair group. This is particularly useful if new reviewers have joined.

    • 5.3.3 Calibration based on a thesis carousel, to explore standards

      Purpose: To ascertain inter-rater reliability. Are equal grades given by different evaluators? To ascertain whether teachers interpret criteria and expected standards similarly. Calibrating assessments by learning from each other. Possibly: assessing the extent to which assessment tools are adequate or where improvement is needed.

      In advance: A sample of theses is identified, varying in quality (moderate, good, excellent) and coming from different chair groups, possibly with further variation, e.g. internal versus external assignments.
      For the selected theses, two lecturers per thesis (e.g. one belonging to the same chair group as the original assessors and one from a different chair group) are asked to independently assess the particular thesis and to fill out the usual forms for this purpose. Possibly with additional questions, such as: how do you estimate the complexity of the assignment? 
      A set moment (e.g. half a day) and a room in which everyone can work independently can be scheduled for completing the assessments; this helps teachers reserve time for this task. Alternatively, it can be left to those involved, with only a deadline given.
      After the “reassessments”: The completed forms are compared with those completed by the original assessment committee. As far as possible, the comparison is made at the criteria level (a minimal analysis sketch follows after these points). Consideration is given to:
      > Differences between the re-assessors themselves, in terms of the overall assessment and at the criteria level.
      > Differences between the re-assessors and the original assessment committee, in terms of the overall assessment and at the criteria level.
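      If the scores are available digitally, this comparison can be prepared with a short script. The following is a minimal sketch in Python/pandas; the file name, column names ("thesis", "criterion", "original", "reassessor_1", "reassessor_2"), and the threshold of one grade point are illustrative assumptions.

      ```python
      import pandas as pd

      # Hypothetical export: one row per (thesis, criterion) with the original
      # committee's score and the scores of the two re-assessors.
      scores = pd.read_csv("carousel_scores.csv")

      # Disagreement between the two re-assessors.
      scores["reassessor_gap"] = (scores["reassessor_1"] - scores["reassessor_2"]).abs()

      # Deviation of the re-assessors' average from the original committee.
      scores["deviation_from_original"] = (
          scores[["reassessor_1", "reassessor_2"]].mean(axis=1) - scores["original"]
      ).abs()

      # Flag differences above a chosen standard, e.g. one grade point.
      THRESHOLD = 1.0
      flagged = scores[(scores["reassessor_gap"] > THRESHOLD)
                       | (scores["deviation_from_original"] > THRESHOLD)]
      print(flagged[["thesis", "criterion", "reassessor_gap", "deviation_from_original"]])
      ```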

      Discussing results: Several variants are conceivable for this. Among them:

      • An independent interviewer/conversation leader speaks with the assessors in the case of differences greater than a set standard (e.g., one grade point) to explore what explanations there may be for the differences. Interviews may be done with the evaluators individually or with the re-evaluators and the original evaluators together. Findings are noted, and all findings together are recorded in a final report for the client (programme management and/or Examination Board). The findings can form input for a later session with those involved in the graduation process, for example a session during an annual 'study day' for all teachers that focuses specifically on the identified points of attention and points for improvement.
      • A session is organized in which all original evaluators and re-evaluators discuss their judgments in groups. Using a question form, they record the findings for each group. The findings are then discussed in plenary at a somewhat higher level of abstraction, not specific to a particular thesis, but for example: “Group X: we noticed that the process is very decisive for the final grade.”
        Afterwards, a report presents the main findings.
  • Interesting sources, used to some extent
    • Handreiking Kalibreersessies. Author: Daan Andriessen, with contributions from Irene van der Marel, Stijn Bollinger, and Martine Ganzevles. Lectoraat Methodologie van Praktijkgericht Onderzoek, 16 January 2015.
    • Handleiding kalibreren – alternatieve variant (Handleiding-kalibreren-alternatieve-variant.pdf). Project: Je ogen uitkijken, concept, version 1.0, 16 July 2020. Authors: Liesbeth Baartman, Lisette Munneke, Jeroen van der Linden.
    • Calibration Standards: What, Why and How? (Calibration synthesis report.pdf). AdvanceHE.