Among the recommendations contained in the report that Secretary of Education Margaret Spellings’ Commission on the Future of Higher Education issued last September were these:

  • The results of student learning assessments, including value added measurements that indicate how students’ skills have improved over time, should be made available to students and reported in the aggregate publicly.
  • The collection of data from public institutions allowing meaningful interstate comparison of student learning should be encouraged and implemented in all states.

I appreciate the commission’s focus on student learning and its assessment. But my experience and my reading and conduct of research on these topics lead me to argue against the use of standardized tests of general intellectual skills to compare the effectiveness of colleges and universities.

Secretary Spellings currently is undertaking a variety of initiatives designed to implement the commission’s recommendations. In addition, several national organizations, including the Educational Testing Service and a partnership involving the National Association of State Universities and Land Grant Colleges and the American Association of State Colleges and Universities, are working to identify or develop “student learning assessments, including value added measurements” that will facilitate “meaningful interstate comparison.” 

I have devoted much of my career to helping faculty identify and develop ways to assess student learning and institutional effectiveness, and then use assessment findings to improve students’ learning and educational experiences. I have conducted my own research on assessment, have studied that of many others, and have established a reputation as an advocate of appropriate (i.e., valid and reliable) assessment that can improve student learning. Thus I have more than a passing interest in these current developments.

For a decade beginning in the mid-1980s I coordinated the University of Tennessee at Knoxville’s response to Tennessee’s Performance Funding initiative, which required us to test thousands of freshmen and seniors and calculate gain, or "value added." Given the large numbers of students involved, we were able to try out several standardized tests of general intellectual skills (ACT’s COMP and CAAP; CBASE; and the Academic Profile, the ETS precursor to MAPP) as well as to test seniors who had taken the same exam as freshmen. In addition, my associate Gary Pike and I, along with other colleagues in various disciplines at Tennessee, undertook a program of research on the reliability and validity of the tests and on the reliability of value added calculations.

Our research confirmed findings and conclusions dating to the 1960s reached by such respected measurement scholars as Lee Cronbach, Frederic Lord, Robert Linn, and Robert Thorndike. Some generalizations based on these findings may be helpful to others as we confront once again the challenge to find valid measures of college students’ learning and score gain that permit institutional comparisons.

While standardized tests can be helpful in initiating faculty conversations about assessment, our research casts serious doubt on the validity of using standardized tests of general intellectual skills for assessing individual students, then aggregating their scores for the purpose of comparing institutions.

Standardized tests of general intellectual skills (writing, critical thinking, etc.):

  • test primarily entering ability (e.g., when the institution is the unit of analysis, the correlation between scores on these tests and entering ACT/SAT scores is quite high, ranging from .7 to .9); differences in test scores therefore reflect differences among the individual students taking the test more than they reflect differences in the quality of education offered at different institutions.
  • are not content neutral and thus disadvantage students specializing in some disciplines.
  • contain questions and problems that do not match the learning experiences of all students at any given institution.
  • measure at best 30% of the knowledge and skills faculty want students to develop in the course of their general education experiences.
  • cannot be given to samples of volunteers if scores are to be generalized to all students and used in making important decisions such as the ranking of institutions on the basis of presumed quality.
  • cannot be required of some students at an institution and not of others—yet making the test a requirement is the only way to ensure participation by a sample over time.

If standardized tests of general intellectual skills are required of all students,

  • and if an institution’s ranking is at stake, faculty may narrow the curriculum to focus on test content.
  • student motivation to perform conscientiously becomes a significant concern.
  • extrinsic incentives (pizza, stipends) do not ensure conscientious performance over time.
  • ultimately, a requirement to achieve a minimum score on the test, with consequences, is needed to ensure conscientious performance.  And if a senior achieves less than the minimum score, does that student fail to graduate despite meeting other requirements?

For nearly 50 years measurement scholars have warned against pursuing the blind alley of value added assessment.  Our research has demonstrated yet again that the reliability of gain scores and residual scores -- the two chief methods of calculating value added -- is negligible (i.e., 0.1).
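
Why is gain-score reliability so low? The classical test theory formula for the reliability of a difference score makes the problem concrete. The short sketch below works through that formula with purely hypothetical numbers (they are illustrative assumptions, not figures from our Tennessee studies): even when each test is respectably reliable on its own, a high freshman-senior correlation leaves almost no reliable gain variance.

  # Reliability of a simple gain (difference) score under classical test theory,
  # assuming equal score variances on the freshman and senior administrations:
  #   r_gain = (r_xx + r_yy - 2*r_xy) / (2 - 2*r_xy)
  # r_xx, r_yy: reliabilities of the two tests; r_xy: freshman-senior correlation.
  # All numbers below are hypothetical illustrations.

  def gain_reliability(r_xx, r_yy, r_xy):
      return (r_xx + r_yy - 2 * r_xy) / (2 - 2 * r_xy)

  r_xx = r_yy = 0.85  # each test fairly reliable on its own
  for r_xy in (0.70, 0.80, 0.84):  # increasingly strong freshman-senior correlation
      print(round(gain_reliability(r_xx, r_yy, r_xy), 2))
  # prints 0.5, 0.25, then 0.06: the gain score becomes essentially noise

Residual (regression-adjusted) scores run into an analogous problem: once entering ability is statistically removed, little systematic variance remains to be measured reliably, which is the intuition behind the negligible values we observed.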
  
We conclude that standardized tests of generic intellectual skills do not provide valid evidence of institutional differences in the quality of education provided to students.  Moreover, we see no virtue in attempting to compare institutions, since by design they are pursuing diverse missions and thus attracting students with different interests, abilities, levels of motivation, and career aspirations. 

If it is imperative that those of us concerned about assessment in higher education identify standardized methods of assessing student learning that permit institutional comparisons, I propose two alternatives:

1. electronic portfolios that can illustrate growth over time in generic as well as discipline-based skills and are not distorted by a student having a bad day and performing poorly on a 3-hour snapshot of what s/he has learned in college.  Portfolios can be scored reliably using rubrics developed by groups of faculty. Then scores can be aggregated to provide the numbers decision-makers want to compare.

2. measures based in academic disciplines that show how students can use discipline-based knowledge, as well as generic skills, in their chosen fields and as informed citizens with specialized expertise.

In short, a substantial and credible body of measurement research tells us that standardized tests of general intellectual skills cannot furnish meaningful information on the value added by a college education, nor can they provide a sound basis for inter-institutional comparisons. In fact, the use of test scores to make comparisons can lead to a number of negative consequences, not the least of which is homogenization of educational experiences and institutions. The wide variety of educational opportunities has heretofore been one of the great strengths of higher education in the United States.
