
Office of the Vice Provost for Academic Innovation (OVPAI)

SETs Guidance to Chairs and Deans

Student evaluations of teaching are an important source of feedback on the student experience and can provide valuable insight for individual faculty members, chairs, and deans looking to improve and reward good teaching at the course, department, and college levels. Used carefully and in conjunction with other inputs, they can form an important part of a more holistic, effective, and equitable approach to teaching assessment, including in key processes like annual reviews and reviews for reappointment and promotion.

Summary of key points and recommendations

  • Adopt a holistic and multidimensional approach that considers several inputs (and doesn’t over-rely on student evaluations of teaching as the primary pillar of the teaching ‘case’)
  • Include regular and systematic faculty peer evaluations that include input on performance, content, and course design (not just unstructured or single-session observations)
  • Account for variations in teaching set-up (size, level, content, pedagogical methods, etc.) and recognize those who take on classes with distinct teaching challenges
  • Consider what student evaluations CAN and CAN’T do well. CAN: report on students’ own experiences; provide feedback on concrete elements like organization, communication, and availability of support; and describe their experience of specific course elements, from lectures to group work to homework and assignments. CAN’T: provide single-question assessments of instructor effectiveness; evaluate instructor mastery of materials; or judge the value and relevance of course design and disciplinary knowledge
  • Collect and review concrete teaching materials – lecture slides, assignments, syllabi, learning goals, exams, samples of student work, etc. (see teaching portfolio recommendations below)
  • Invite instructor reflection on what the instructor is trying to accomplish and how, on challenges and successes, on adaptations made in light of experience and student feedback, and on future teaching goals.
  • Look for evidence of effort: trainings attended; participation in department, college, or university-wide teaching programs; submission of, or success with, teaching-related grants or publications; engagement with teaching research and literature; reflective conversations and questions around teaching with mentors, chair, and colleagues.
  • Look for evidence of continued growth and improvement – revised materials, additional/challenging courses taken on, or movements in student feedback over time (including evidence of early challenges addressed and overcome)
  • Recognize and support pedagogical experiment and innovation – chairs and deans should recognize that changes to long-standing teaching practices may produce mixed or even negative feedback from students in early implementations.

Background

Based on accumulating research evidence, universities and academic associations across the country have urged reform in the way that teaching effectiveness is evaluated. Currently, student evaluations of teaching are often used as primary measures of teaching effectiveness and are regularly used in high-stakes personnel decisions (including around reviews for promotion and tenure) despite research that shows the limits of these instruments (and their frequently poor or problematic design).  In many cases, student evaluations more closely resemble student opinion surveys than valid evaluations of teaching excellence, and mix topics that students are well-equipped to comment on (including individual learning experiences, challenges, and ideas) with assessments of pedagogy or course design they may be less well placed to make. Research suggests that they tend to be only weakly related to other measures of teaching effectiveness (Boring, Ottoboni & Stark, 2016; Uttl, White & Gonzalez, 2017), influenced by factors that have little to do with teaching quality, and biased against women and people of color.  These problems may be compounded by over-reliance on meta-scores (e.g. ‘overall instructor’ or ‘overall course’ scores) in the promotion and tenure process where some of these biases are known to be most pronounced.  Since the move from in-class to online evaluations, many universities have also experienced chronic response rate issues that further challenge the value and representativeness of student evaluations when taken as a primary measure of teaching effectiveness.[1]

This document provides guidance to Deans, Chairs, and faculty on more effective and appropriate use of student evaluation data in faculty assessments (including for purposes of promotion and tenure). Together with improvements in the content and administration of Cornell’s student evaluations of teaching (SET) systems, it is meant to support richer and more useful feedback to individual instructors, fairer and more effective assessments of teaching for purposes of review and promotion, and more effective and innovative teaching practices and discussions at the department, college, and university levels.

Summary of research evidence

  1. Course evaluation forms often ask students to evaluate the overall quality or effectiveness of their instructors, instructor mastery of disciplinary knowledge, or the appropriateness of course content, despite accumulated research evidence showing that students lack the subject-matter expertise to accurately assess these things. However, students are a good source of feedback about their individual experiences of course elements (organization, communication, assignments, group work, etc.) that affect learning, and such feedback can serve as valuable input for instructors seeking to improve their courses (Boysen, Kelly, Raesly & Casner, 2014; Zabaleta, 2007; American Sociological Association, 2019).
  2. Results from meta-analyses of the relationship between student ratings and student learning are inconsistent (Clayson, 2009; Cohen, 1981).
    • Research consistently fails to find evidence of a compelling correlation between student ratings of teaching effectiveness and measures of student learning.
    • Favorable ratings in one course do not necessarily predict student performance in subsequent courses (Carrell & West, 2010)
    • Putting substantial weight on typical student evaluations as a measure of teaching excellence is unwarranted (Stark & Freishtat, 2014).
  3. Ratings can be biased by a number of factors that have little to do with teaching quality or effectiveness (Heffernan, 2022):
    • Instructor factors: gender, perceived attractiveness, ethnicity and race
      • Gender: Students often rate female instructors lower than male instructors (Basow, Codos, & Martin, 2013; Boring, Ottoboni & Stark, 2016; Centra & Gaubatz, 2000; Spooren, Brockx, & Mortelmans, 2013); in online studies, students rate instructors with male names higher than those with female names, even when the names do not align with the true identity of the instructors and all teaching components such as grading standards and speed of responses are identical (MacNell, Driscoll, & Hunt, 2015).
      • Race/ethnicity: instructors of color, such as Black and Asian faculty members, are frequently evaluated less positively than white faculty, especially by white male students (Bavishi, Madera & Hebl, 2010; Reid, 2010; Smith & Hawkins, 2011).
      • Interaction of identity and course content: Bias is strongest in fields in which certain faculty members are underrepresented.
    • Course factors:
      • Course type (required vs. elective): ratings are correlated with student interest in the subject (Ting, 2000).
      • Course level: upper-level courses receive higher ratings (Santhanam & Hicks, 2001).
      • Course size: smaller classes receive slightly higher ratings (Bedard & Kuhn, 2008; McPherson, 2006).
      • Academic discipline: STEM and quantitative courses often receive lower ratings (Basow & Montgomery, 2005; Beran & Violato, 2005).
      • Grading rigor: students’ grade expectations are positively correlated with ratings; the correlation between expected grades and teaching evaluations may contribute to grade inflation, with some evidence that this may be more prevalent among pre-tenure faculty (Remedios & Lieberman, 2008; Beran & Violato, 2005; Langbein, 2008).
      • Intrinsic interest in the subject: students’ intrinsic interest in the subject influences ratings (Griffin, 2004).
    • Type of question:
      • Abstract questions, omnibus questions (e.g., ‘overall, this is an excellent instructor’), and questions requiring holistic judgments of effectiveness (e.g., how effective was the instructor?) are particularly problematic because they are less grounded in specific and concrete experience, and more prone to the influence of biases.

Improving use of SET data in annual reviews, instructor evaluation, and promotion and tenure decisions

A first set of recommendations involves the more careful and appropriate use of SET data in annual reviews, instructor evaluations, and promotion and tenure decisions.  We encourage deans, chairs, and faculty to:

  1. Refrain from relying on numerical teaching evaluation scores as the sole standard for assessing teaching quality. When scores are used, focus on patterns in an instructor’s feedback over time rather than on comparisons with other faculty members, departmental averages, or other courses; average ratings should not be compared across courses, instructors, levels, or modes of instruction. Such comparisons are especially problematic when course factors that affect ratings (e.g., size, level) correlate with demographics (e.g., female faculty teaching more sophomore classes in ENG).
  2. Keep in mind that, by definition, at least half of the faculty in any department will have scores at or below the departmental norm or median, so requiring or expecting ratings above departmental norms or medians as evidence of teaching effectiveness is inappropriate (Stark & Freishtat, 2014).
  3. When reporting numerical scores, focus on the distribution of scores more than (or instead of) the averages, and report both the sample size and the response rate for each course (Stark & Freishtat, 2014). Items that have low response rates should be given little weight in the evaluation process. (A simple illustration of such a distribution-oriented summary follows this list.)
  4. Exercise caution in comparing and interpreting means, especially where scales are ordinal rather than interval-based, with non-uniform distances between ratings (e.g., the distance between ‘good’ and ‘fair’ versus between ‘fair’ and ‘poor’). Do not overemphasize small differences in rating results – focus instead on whether the instructor is excellent (and should be nominated for teaching awards), meets expectations, or might benefit from further work and focus.
  5. When interpreting numerical scores, take into account factors that are known to influence student ratings (see summary of research evidence above; in addition, adoption of new teaching methods or other pedagogically sound course changes has sometimes resulted in mixed or negative short-term effects in certain quantitative or qualitative scores and comments).
  6. Pay attention to student comments while understanding their limitations. Students typically are not reliable evaluators of pedagogy. Also, unless response rates are very high, it is impossible to know the nature of response bias (i.e., what motivated some students to respond and others not to).
  7. Make every effort to interpret numerical scores within the context of other available information, including peer reviews, qualitative comments from students, instructor self-reflections and teaching statements, and a review of teaching materials. Overall assessments should focus on the extent to which the following conditions of teaching excellence are consistently met:
    • Assessment of and feedback on student learning:
      1. Evaluation of student learning that is linked to explicitly stated goals for student learning, using evaluation methods and principles that are transparent to students.
      2. Provision of timely and useful feedback to students, including about their progress in a course (to be evaluated in light of contextual factors like course size, staffing levels, etc.)
    • Professionalism:
      1. Creation, maintenance and updating of coherently organized course materials, including syllabi that state the course learning objectives, outline course policies and expectations, and provide a clear picture of workload and deadlines.
      2. Respectful communication with students
      3. Awareness, sensitivity and responsiveness to student accommodations (to be evaluated in light of contextual factors like course size, staffing levels, etc.)
    • Engaged and inclusive teaching:
      1. Instruction that engages, challenges, and supports students
      2. Use of pedagogical strategies that encourage active and meaningful participation by all students (to be evaluated in light of contextual factors like course size, staffing levels, etc.)
      3. Course content that reflects diversity of voices, perspectives, issues and applications in a field, including efforts to broaden disciplinary and professional communities.
      4. Openness to engaging contested and evolving knowledge, including as disciplines, areas, and perspectives on issues evolve.
    • Reflective teaching:
      1. Regular revision of courses, both in content and pedagogy; evidence of using student input to inform course refinement to improve student learning.
    • Teaching that supports learning and growth in student thinking by:
      1. Providing context and connections across the material, including where relevant to other courses
      2. Explicitly helping students organize autonomous thinking about the material, developing more expert-like models or approaches while supporting student curiosity, innovation, and critical thinking skills
      3. Reflecting and supporting department, college, or university learning outcomes
  8. Read and discuss student evaluations with faculty (especially early-career faculty) during the annual review process. Chairs can encourage faculty reflection and learning by asking what faculty members take away from the evaluations (from general patterns to specific ideas); focusing on questions and feedback beyond overall course and instructor rating scores; pointing to particularly positive or noteworthy elements (including positive trends over time, in cases where evaluations have shown significant improvement); asking about specific pedagogical experiments (‘I see you added / tried out X – how did that go?’); observing any distinct challenges (and asking about the instructor’s ideas for addressing them); and asking about any ideas or techniques the instructor is thinking about trying out in future versions of the class. Chairs can also use positive evaluation results (including significant improvements in results, from whatever starting point) to guide recognition and praise, whether in the form of a nomination for a teaching award or a simple email acknowledgement (‘hey, nice job in Class X this year, those were really interesting comments’).
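As a purely illustrative sketch of recommendations 3 and 4 above, the short Python snippet below shows one way to summarize a single course’s ratings as a distribution rather than a single mean. The data, names, and structure here are invented for the example (no Cornell SET system or export format is implied); the 60% response-rate threshold echoes the target discussed later in this document.

```python
from collections import Counter

# Hypothetical ratings for one course on a 5-point ordinal scale
# (5 = "excellent" ... 1 = "poor"); real data would come from a SET export.
ratings = [5, 4, 4, 5, 3, 4, 2, 5, 4, 3, 5, 4]
enrolled = 25  # number of students enrolled in the course

def summarize_ratings(ratings, enrolled):
    """Summarize SET results as a distribution rather than a single mean."""
    n = len(ratings)
    counts = Counter(ratings)
    # The median is meaningful for an ordinal scale; a mean assumes equal
    # distances between categories, which SET scales do not guarantee.
    median = sorted(ratings)[(n - 1) // 2]  # lower median for even n
    return {
        "n_responses": n,
        "response_rate": n / enrolled,
        "counts": {score: counts.get(score, 0) for score in range(5, 0, -1)},
        "median": median,
    }

summary = summarize_ratings(ratings, enrolled)
print(f"Responses: {summary['n_responses']}/{enrolled} "
      f"({summary['response_rate']:.0%} response rate)")
for score, count in summary["counts"].items():
    print(f"  {score}: {'#' * count}")  # text histogram of the distribution
print(f"Median rating: {summary['median']}")
if summary["response_rate"] < 0.60:  # low response rates deserve little weight
    print("Caution: low response rate; interpret with care.")
```

Presented this way, the spread of responses and the representativeness of the sample are visible at a glance, rather than being collapsed into one average that invites over-precise comparisons.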

An improved SET instrument

In addition to more judicious use of SET data in assessments of faculty teaching, the university-wide SET content committee has proposed a revised SET core question set that reflects the experience, research evidence, and common pitfalls summarized above. In general, this redesign has followed the principles below:

  1. Questions should focus on factors about which students are in a good position to provide feedback – their experiences in the course and reflection on their own learning – rather than abstract assessments of teaching effectiveness (e.g., the instructor’s command of the content, the appropriateness of the course structure). Students are reliable sources of data about the extent to which an instructor communicates clearly, stimulates interest, treats students with respect, and appears prepared for class. They may also be a source of insight into their own challenges and experiences, and a source of good ideas for future revisions to course content and activities. This (non-evaluative) ‘feedback’ function is essential to ongoing instructor learning and development, in particular for early-stage faculty and faculty seeking to experiment with, improve, update, and extend their pedagogical practice and technique.
  2. Avoid abstract questions, omnibus questions, and questions requiring disciplinary or content matter expertise that students may lack, as they are particularly subject to biases related to instructor gender, race, and other characteristics.
  3. Shift attention away from the overall performance of individual instructors and towards students’ learning and experiences in the course. Focus on factors that affect learning and the learning experience (e.g., clarity of course organization, effectiveness of assignments, experiences of group work, use of in-class activities to enhance learning, etc.).
  4. Every effort should be made to encourage response rates above 60%. Potential strategies to improve response rates are discussed in the separate ‘SET Response Rate Strategies’ document.

Supporting more holistic and multidimensional assessments of teaching – sources beyond student evaluations

Given the inherent limits of SET data (even in revised form), we urge deans, chairs, and faculty to adopt more holistic and multidimensional assessments of teaching that draw on multiple sources of evidence and input. Beyond more judicious use of student evaluation data, such inputs should include, at minimum:

  • Instructor self-reflections – asking candidates to reflect on their teaching practice and their growth as teachers provides important insight into their development and allows them to contextualize their teaching experience. Candidates should describe their teaching goals and how they align their syllabi, learning outcomes, and assignments to accomplish those goals. When used to assess longer developmental trajectories (for example, for purposes of promotion and reappointment), this should include reflection on how the candidate’s teaching methods and approaches have developed over time, including documentation of efforts to experiment or draw on outside resources in developing the candidate’s teaching practice.
  • Peer observations – to be most effective, peer observations should be conducted with some regularity (not just in the run-up to key personnel decisions); involve reflections on both classroom visit(s) and pedagogical design (review of syllabus, learning goals, course and assignment structure, etc.); and follow some kind of common structure, template or guidance to the evaluator.  Potential examples and additional guidance on effective peer evaluations of teaching can be found here.
  • Review/discussion of teaching materials (syllabi, assignments, classroom activities, etc.) – in addition to classroom observations, during key promotion reviews (e.g., review for tenure, or promotion from lecturer to senior lecturer) we encourage faculty to review and discuss teaching materials supplied by the candidate that document and sample their teaching practice. Such materials should reflect the full range of the candidate’s teaching efforts, including across course types, sizes, and levels, and demonstrate alignment between the candidate’s teaching goals and their practices. Such materials could also include samples of student work where these are connected to specific pedagogical aims and practices.

We encourage departments to adopt a common general form and provide clear guidance to candidates on the requirements and framing of these materials; departments may also wish to encourage sharing of examples (with individual faculty permission) from recent candidates with current candidates for promotion, to encourage learning, clarity of expectations, and efficiency in assembling this part of the promotion portfolio. (Beyond its importance for holistic faculty review, such sharing may also encourage borrowing, learning, and richer department-level teaching discussions, and support wider understanding of curricular structure and content above the individual course level.) Departments looking for guidance or potential models of this approach are encouraged to consider the portfolio approach recommended by the Center for Teaching Innovation as documented here.

In addition to the resources and materials above, more holistic assessments of teaching should consider other evidence or indicators that the faculty member has met or exceeded expectations and demonstrated particular engagement with and commitment to teaching, for example by:

  • Contributing to teaching and advising in courses or programs that require additional (and otherwise unrecognized) effort. Examples might include: first-year experience seminars; student success initiatives; independent studies; Learning Where You Live courses; active participation in West Campus or North Campus House Fellows or Faculty Fellows programs; courses that contain an engaged and/or global experiential learning opportunity;
  • Receiving or being nominated for awards, grants, fellowships, and/or other forms of recognition for teaching excellence and innovation;
  • Serving on a higher-than-average number of graduate student committees and/or advising a higher-than-average number of undergraduate students;
  • Carrying a higher-than-normal ‘shadow’ advising load (for example, women or faculty of color who may bear a higher-than-standard load of unofficial support or advising, including during periods of stress or tension in campus climate);
  • Participation and leadership in departmental teaching initiatives and discussions;
  • Participating in programs or initiatives oriented to pedagogical experimentation or improvement (e.g., active or engaged learning programs, and Center for Teaching Innovation or college-level programs, workshops, and initiatives such as the McCormick Teaching Excellence Institute in Engineering);
  • Demonstrating leadership in teaching through teaching-centered contributions to disciplinary and professional societies;
  • Demonstrating engagement with pedagogical theory and research, including through attending conferences and/or publishing teaching-related research in disciplinary and/or educational fields.

[1] See separate ‘SET Response Rate Strategies’ document for more detail and recommendations.

 


Cornell SETs Guidance to Chairs and Deans – May 2024

References

American Sociological Association. (2019). Statement on student evaluations of teaching.

Basow, S. A., & Montgomery, S. (2005). Student ratings and professor self-ratings of college teaching: Effects of gender and divisional affiliation. Journal of Personnel Evaluation in Education, 18, 91–106.

Basow, S., Codos, S., & Martin, J. (2013). The effects of professors’ race and gender on student evaluations and performance. College Student Journal, 47(2), 352–363.

Bavishi, A., Madera, J. M., & Hebl, M. R. (2010). The effect of professor ethnicity and gender on student evaluations: Judged before met. Journal of Diversity in Higher Education, 3(4), 245.

Bedard, K., & Kuhn, P. (2008). Where class size really matters: Class size and student ratings of instructor effectiveness. Economics of Education Review, 27, 253–265.

Beran, T., & Violato, C. (2005). Ratings of university teacher instruction: How much do student and course characteristics really matter? Assessment and Evaluation in Higher Education, 30, 593–601.

Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching (mostly) do not measure teaching effectiveness. ScienceOpen Research.

Boring, A., Ottoboni, K., & Stark, P. B. (2016). Student evaluations of teaching are not only unreliable, they are significantly biased against female instructors. Impact of Social Sciences Blog.

Boysen, G. A., Kelly, T. J., Raesly, H. N., & Casner, R. W. (2014). The (mis)interpretation of teaching evaluations by college faculty and administrators. Assessment & Evaluation in Higher Education, 39(6), 641–656.

Carrell, S. E., & West, J. E. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3), 409–432.

Centra, J. A., & Gaubatz, N. B. (2000). Is there gender bias in student evaluations of teaching? The Journal of Higher Education, 71(1), 17–33.

Clayson, D. E. (2009). Student evaluations of teaching: Are they related to what students learn? A meta-analysis and review of the literature. Journal of Marketing Education, 31(1), 16–30.

Cohen, P. A. (1981). Student ratings of instruction and student achievement: A meta-analysis of multisection validity studies. Review of Educational Research, 51(3), 281–309.

Griffin, B. W. (2004). Grading leniency, grade discrepancy, and student ratings of instruction. Contemporary Educational Psychology, 29, 410–425.

Heffernan, T. (2022). Sexism, racism, prejudice, and bias: A literature review and synthesis of research surrounding student evaluations of courses and teaching. Assessment & Evaluation in Higher Education, 47(1), 144–154.

Langbein, L. (2008). Management by results: Student evaluation of faculty teaching and the mis-measurement of performance. Economics of Education Review, 27, 417–428.

MacNell, L., Driscoll, A., & Hunt, A. N. (2015). What’s in a name: Exposing gender bias in student ratings of teaching. Innovative Higher Education, 40, 291–303.

McPherson, M. A. (2006). Determinants of how students evaluate teachers. Journal of Economic Education, 37, 3–20.

Reid, K. (2010). An evaluation of an internal audit on student feedback within a British university: A quality enhancement process. Quality Assurance in Education, 18(1), 47–63.

Remedios, R., & Lieberman, D. A. (2008). I liked your course because you taught me well: The influence of grades, workload, expectations and goals on students’ evaluations of teaching. British Educational Research Journal, 34, 91–115.

Rivera, L. A., & Tilcsik, A. (2019). Scaling down inequality: Rating scales, gender bias, and the architecture of evaluation. American Sociological Review, 84(2), 248–274.

Santhanam, E., & Hicks, O. (2001). Disciplinary, gender and course year influences on student perceptions of teaching: Explorations and implications. Teaching in Higher Education, 7, 17–31.

Smith, B. P., & Hawkins, B. (2011). Examining student evaluations of Black college faculty: Does race matter? The Journal of Negro Education, 149–162.

Spooren, P., Brockx, B., & Mortelmans, D. (2013). On the validity of student evaluation of teaching: The state of the art. Review of Educational Research, 83(4), 598–642.

Stark, P., & Freishtat, R. (2014). An evaluation of course evaluations. ScienceOpen Research. Center for Teaching and Learning, University of California, Berkeley.

Ting, K. (2000). A multilevel perspective on student ratings of instruction: Lessons from the Chinese experience. Research in Higher Education, 41, 637–661.

Uttl, B., White, C. A., & Gonzalez, D. W. (2017). Meta-analysis of faculty’s teaching effectiveness: Student evaluation of teaching ratings and student learning are not related. Studies in Educational Evaluation, 54, 22–42.

Zabaleta, F. (2007). The use and misuse of student evaluations of teaching. Teaching in Higher Education, 12(1), 55–76.