I’m republishing instead of writing up something else (I have a couple in the works) because my head is full of mucus, I can’t hear, I can’t breathe, and I have to make supper (after I figure out what we’re going to have) and read (and comment on) about 25 essays.

This is long and detailed. The rest is below the fold.

Pissed Off has an interesting article about partial credit:

My AP is against giving too much partial credit on exams. At Friday’s welcome meeting he passed out copies of a final exam from summer school and asked us what was wrong with it. Not one person in the department could come up with an answer. The questions were well written in mathematically precise language. They were typed. They covered the entire range of the course and they were not too easy or too hard. He was in shock that we could find nothing to complain about.

After a few minutes he yelled “Don’t you see it? The multiple choice section is worth 24 points (4 points each) and the rest of the exam consists of 4 and 6 point questions where partial credit is given. This is ridiculous. An exam should have no more than 30 or 40 points of questions with partial credit. Besides, don’t you have better things to do than to mark papers?”

I thought when I first read this that the AP was being stupid, but upon reflection (and it may well be that I’m giving the AP too much credit), I can understand where this came from: Partial credit not only can be abused, but frequently is, in my experience, and I don’t mean by students. But the question of partial credit, what is abuse and what is appropriate, led to assessment in general, and here we are.

I’d like to talk about assessment issues and, remembering that a commenter on an earlier article said his administration largely controlled the weighting of graded items, to address this to both faculty and administrators. If you’re into assigning macaroni art projects, you’re probably wasting your time here. Just sayin.

Categories of assessment

Not all graded items are created equal. The distinctions between different types of assessment must be taken into account if we are to give students the fairest, most honest and accurate assessment possible.

Knowledge assessment v. still-warm-still-breathing assessment. Knowledge assessment obviously assesses a student’s knowledge, preferably of the topic at hand (and not something else). Still-warm-still-breathing assessment has nothing to do with knowledge, but may assess attitude or some other tangential property. Giving points for attendance is one example of still-warm-still-breathing assessment. After all, a student may not be in class because he already knows what a differential equation is. I probably don’t need to tell you that I’m not a big fan of the latter. I also don’t understand it. In primary and secondary school, attendance is mandated by law, so why base part of the grade on it? In the university, most faculty don’t grade on attendance because 1) it’s tedious and takes up too much valuable time, and 2) it treats students like children.

The only thing I have to say about still-warm-still-breathing assessment is this: If you absolutely must grade on something that has no direct relationship to knowledge, find some way to tie knowledge to it. So if you feel you have to grade on attendance, instead of wasting time calling roll at the beginning of class, give pop quizzes. You get a partial attendance check that bears some relationship to the topic, even if it is what I call the “Were you awake through class” type of quiz.

In-class v. out-of-class assessment. The crucial distinction between the two is that the former is given in a controlled environment, whereas the latter is not. This points up a weakness of out-of-class assessment: You can never be sure whether the assignment reflects the student’s knowledge or that of some other person (or source), be it another student, well-meaning but seriously misguided and unethical parents, or Wikipedia. Of course, students can cheat on in-class assignments, but that’s a different topic (see here).

Another important distinction is time. I suspect that if I had had days in which to do an assignment, I might eventually have come up with an answer that was more or less correct, but if it took me days to think of it, then I didn’t know it very well. Out-of-class assignments therefore tend to overstate the student’s knowledge, which in-class assignments, such as exams, represent more accurately.

I do not mean that out-of-class assignments are bad. But out-of-class assignments should be designed so that they assess knowledge and skills that are not easily assessed in a timed environment, or they should be designed less as assessment than reinforcement.

Contextual v. discrete assessment. Because “integrated” has been perverted (sorry, redefined) to mean “anything that doesn’t assess knowledge related to this course,” I’m reverting to the older term, contextual. A final paper is an example of a contextual assessment, because it assesses the student’s overall knowledge of the course material. Case-based assignments, those in which the assessments are embedded within the context of a case, or assignments that assess multiple, interrelated knowledge domains are also contextual.

Assessments in which the items bear no necessary relationship to one another except that they all relate to the general topic are discrete assessments. There is a grey area between the two, where the individual items may be discrete, but the items themselves are contextualized. Story problems on a math or science exam would be an example.

Also at issue are individual and group assessments, which need no definition.

In my experience, assessment is too often given very little thought. I have, on too many occasions, sat by while a colleague said something like, “We’ll have two exams at 45 points each and 10 assignments at 1 point each, that adds up to 100 so it will be easy to calculate final grades,” and left it at that. In my humble opinion, one should invest a great deal of care and thought when designing an assessment system.

Ideally, an assessment system will use a combination of assessment types to best evaluate the student’s knowledge across the spectrum. Exams are an excellent assessment for detailed knowledge, and quizzes can be excellent checks along the way (both for the instructor and the student). Every class should implement some kind of in-class assessment, if only to guard against academic misconduct.

Group projects (and group work in class) certainly have their advantages, but they also have distinct disadvantages. Before any instructor implements group projects, he should decide how he will guard against unfair work distribution within the group. Too often, students who like group work do so because they can sit back and let everybody else do their work for them.

However, group projects are excellent for contextualized work that is too complex to be assessed on an exam. Otherwise, there is little point in assigning group work. I am aware of the maxim about teaching to others being the best way to learn, but that doesn’t imply that those others learn anything.

Controlling for fairness when assessing group work is extremely difficult. I found that assigning students to one group which they worked with all semester long helped a great deal; students are willing to let others slack on the first project, but are much less so inclined on succeeding projects. I also used a system of contracts. When the project was ready to turn in, the students as a group had to allot to each student in the group what everyone agreed was an accurate percentage of the work done. Each student had to sign the contract, consenting to the percentage given him, and I assigned no grade to anyone in the group until the contract was signed and submitted. Once signed, a percentage score could not be appealed (if there was a conflict, we resolved it before any grades were assigned). So if the project was worth 50 points, the project grade was 45 points, and Johnny got a 90% on the contract, he would get 90% of the 45 points, or 40.5 points for that project.
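The contract arithmetic can be sketched in a few lines. This is only an illustration of the calculation described above, not the spreadsheet I actually used. The function name `contract_scores` and the second student, "Teammate", are hypothetical, and I am assuming each student's contract percentage is out of 100 on its own rather than a share that must sum to 100 across the group.

```python
def contract_scores(project_grade, contract_shares):
    """Scale a group's project grade by each student's signed contract share.

    project_grade: points the group project earned (e.g. 45 out of 50).
    contract_shares: mapping of student name -> agreed percentage (0-100).
    Assumption: each percentage stands alone; shares need not sum to 100.
    """
    scores = {}
    for student, share in contract_shares.items():
        if not 0 <= share <= 100:
            raise ValueError(f"share for {student} must be between 0 and 100")
        scores[student] = project_grade * share / 100
    return scores


# Johnny's case from the text: 90% of a 45-point project grade.
scores = contract_scores(45, {"Johnny": 90, "Teammate": 100})
print(scores["Johnny"])  # 40.5
```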

Group projects were highly complex, far more so than what could be assessed in class. They were also case-based, and therefore contextual, which leads us to the next issue.

Contextual assignments, individual or group, out-of-class or in-class, are often designed so that the output or answers from one section feed into following questions. There is nothing wrong with this, except that it raises the issue of cascading error. In other words, if Mary gets the first question wrong and her answer for succeeding questions depends on getting the first question correct, should she be docked for those succeeding questions?

My position is that it depends on the purpose of the assessment. If the purpose of the assessment is to judge Mary’s knowledge of each of those tasks covered, then Mary should only be docked for what she did not learn. If, however, Mary has already been assessed on those tasks, and the project is analogous to, say, a comprehensive final exam, then I would be more comfortable with cascading errors — provided that she is given partial credit.

And that brings us back around to the article that started this. My first reaction, as I said, was that this AP was spouting nonsense. Partial credit may not be applicable in the real world, but it’s a necessity for teaching, even in a math class. Carl may not have gotten the right answer because he veered off in the wrong direction, but he started out right, and he should get credit for that. Partial credit reflects partial learning, and as such, partial credit is more accurate assessment.

But partial credit should never be given unless it reflects partial knowledge. “You get five points just for writing something down for the question” isn’t partial credit. It’s educational welfare. Such practices should be forbidden, and the assessments of teachers who employ such practices cannot be trusted.

The final question, of course, is weighting assessments. How an instructor weights assessments depends on what type of course he teaches and what types of assessments he uses. But the weights assigned should be designed so that the advantages and disadvantages balance one another. So for example, if I am teaching the same course with others and assessments must be decided by the group, I insist on a 60-75% range for in-class exams so that at least that much of the grade assesses the student and not someone else.

But I do not give still-warm-still-breathing assessments. If such work is assigned, it should never form enough of the total score to raise a student’s grade more than a grade sign. That’s no more than 3% of the total grade. Anything more distorts the assessment to the point that it no longer reflects how well the student has learned.

So what does a fair, accurate, performance-based system look like?

When you teach a high-enrollment, mandatory course to highly competitive, goal-oriented students, fairness is paramount. One student can cause a great deal of pain by going to the proper university office, and we all want to treat our students fairly. For those who teach smaller courses and have as few as 40 or so students, personal integrity and professionalism should be enough. In our case, it was not, for reasons that will become clear in a moment.

The course (actually, a two-semester course sequence, but the assessment is identical in both) is a mandatory data-analysis course, whose two primary purposes are to weed out those students who lack the necessary mathematical abilities and to teach students the problem-solving skills they will need in order to succeed in their future classes in the school. In the first semester course, enrollment averages 1700 in the fall semester and 1200 in the spring semester (because a student must get a C or better in the first semester course to take the second, enrollments in the second-semester course average 1200 and 800 for the fall and spring semesters, respectively). There are one fifty-minute lecture and two fifty-minute labs per week. Lectures are held in one of the several lecture halls, and average around 240 students per section (maximum seating for each room on campus is set by the fire marshal, so enrollment depends on the room). Labs are held in computer clusters, and range from 24 to 45 students per section, depending on the size of the computer cluster in which the lab section is held (do the math: that’s a lot of computer clusters across campus).

Because the course has such a large enrollment, and because so many lab sections must be taught, over half of the instructors are graduate students, nearly all of whom have no teaching experience, and no interest or investment in teaching. Also, since this is a university and not the real world, we — meaning those of us in charge of the two-semester sequence — have no control over which graduate students teach the class or whether they are allowed to continue to teach, no matter how poorly they perform (the university sees it as financial aid instead of employment). This presents a set of problems, the relevant ones here being how we control both the lack of teaching (assessment) experience and teacher (assessment) subjectivity.

Students are the ultimate control for fairness. Even in such a large course, when one instructor doesn’t cover something other instructors do, it not only gets around, but it gets back to us, usually within 24 hours, and often, sooner — and when this happens, there is hell to pay. If one instructor were grading a project one way and another were grading another way, or if one instructor bumped up Jeannie’s grade because he felt sorry for her, or if one student got credit for an answer while other students did not, students would be lined up in front of our offices to raise hell — and justifiably so. Subjectivity cannot be justified because it is inherently unfair, and autonomy does not protect the unfair, inconsistent assessment of students.

The solution was to develop a system of assessment that was wholly consistent and wholly fair. This is how we did that.

The assessment system is exclusively performance-based, and no criteria are subjective. Here is the grade breakdown for the class (1000 points possible):

Assessments          Total Points   Total Weight
Written Exams (2)    400            40%
Practical Exams (2)  400            40%
Quizzes (10)         50             5%
Projects (10)        150            15%

Each written exam is 100 multiple-choice questions (five answer choices), and is written to test abstract concepts covered in the lecture portion of the class, as well as the student’s ability to extract abstract information from a problem (as are the ten unannounced quizzes, each of which is five multiple-choice questions with five answer choices). Anything covered in any of the materials for class is fair game for assessment. This seems to be a foreign concept to students, who believe for some reason that there is only a small subset of what the course covers that will appear on an exam. The questions most students miss are those taken from the reading assignments but not overtly covered in the lecture (in other words, students don’t do the assigned reading, and don’t believe you when you tell them that anything from the material can appear on the exam). An example question which tests the student’s ability to analyze a problem and abstract crucial information from it:

“SomeCorp, Inc. produces widgets. Each widget costs $1.1942 to produce, and markup is 113%. Customer demand is 8498, 7742, 9023, and 8936 for the next four weeks, no backlogging is allowed, and excess widgets must be warehoused. Warehousing costs are $0.0943 per widget per day, and maximum warehouse capacity is 2652. What crucial piece of information do we need to supply to SomeCorp, Inc so they can minimize their total costs over the next four months?”

A. The number of customer orders over the next four months
B. The cost of widgets produced over the next four months
C. The gross profit margin over the next four months
D. The number of widgets to produce over the next four months
E. The warehousing costs over the next four months

(D, by the way, is the correct answer.)

Practical exams, like the projects, assess problem-solving skills in Excel (the ability of a student to solve a problem similar to the above is assessed on the practical exam). Students are given the Excel exam file on a flash drive and a printed copy of the questions, and have two hours to work through the exam. All exams are administered in a closely monitored, timed environment (any student who does not hand in his written or practical exam when time is up is given a zero). Testing administrations are tightly monitored, with at least two proctors in the small rooms, and as many as six in large lecture halls. No cell phones or electronic devices are allowed in the room, and students are staggered in every other seat of every other row and given alternating versions of the exam. The administrations of practical exams are similarly secure. The only application that can be running on any student’s machine is Excel. All infractions of testing policy are considered academic dishonesty. The student’s materials are collected, he is ejected from the room, and academic dishonesty charges are filed against him with the Dean of Students. The projects are the only component of the course worked outside of class or a supervised environment.

No criterion is subjective. There is no participation component, no attendance component, no self-esteem component. Of course, grading itself, even on objective criteria, can be subjectivized by the instructor, but even that has been purged from the assessment system. (In a sense, the quizzes can be thought of as partially an attendance grade, since they are given at the end of lectures and only those who attend can take the quiz, but only partially, since the quizzes are scored. Also, while students are not required to attend class, they are expected to attend, and know that if they skip lecture, they may be skipping a quiz. No makeup exams or quizzes are given, and no late project submissions are graded.)

Included in each practical exam and project Excel file are two VBA modules. One grades the file, logs in to the central gradebook, and uploads the score (grading is a process of downloading a zip file of student files, opening Excel and running the grading module for that project or exam). The other secures the file against cheating. If any instance of cheating (copied and pasted cell contents, one student turning in another student’s file, changed times and dates in files, etc.) is caught by the module, the instructor is alerted, the file is flagged, and the student is automatically emailed a message including the university statement on academic dishonesty, stating that the student’s file was flagged for possible academic dishonesty, and “requesting” that the student make an appointment to see the coordinator within the next five days (if the incident includes some other student or student’s file, for example if one student turns in another student’s file, both students’ files are flagged, and both are notified). There are numerous beartraps in the file in case a resourceful (ahem) student cracks the VBA password and hacks into the code. In the case of practical exams, the security module also prohibits a student from opening a file not on the flash drive, or copying it to another location.
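To make one of those checks concrete, here is a rough sketch of the simplest case, one student turning in another student's file. It is written in Python rather than VBA, and it is not the actual security module: it only catches byte-identical files, and the function name `flag_identical_submissions` and the input format are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict


def flag_identical_submissions(submissions):
    """Return groups of students whose submitted files are byte-identical.

    submissions: mapping of student id -> file contents as bytes
    (a hypothetical input format, assumed for this sketch).
    """
    by_digest = defaultdict(list)
    for student, data in submissions.items():
        digest = hashlib.sha256(data).hexdigest()
        by_digest[digest].append(student)
    # Any digest shared by two or more students flags all of them.
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)


flagged = flag_identical_submissions(
    {"s001": b"workbook-a", "s002": b"workbook-a", "s003": b"workbook-b"}
)
print(flagged)  # [['s001', 's002']]
```

Note that both students in a flagged group are reported, which matches the policy above of flagging and notifying everyone involved.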

Every point of every Excel assignment is graded by program. Instructors are locked out of the system. At no time and in no way does any instructor have the ability or opportunity to add, subtract, or change points for any student. All the instructor can do is comment.

Certainly, errors creep into written exams, even though each exam goes through an extensive review process before it goes to be printed and copied. For this reason, written exam results are rigorously analyzed statistically before exam scores are recorded in the gradebook. Any bad question is discarded from the exam before student scores are recorded.
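I won't reproduce the full analysis here, but a common form of item analysis gives the flavor. The sketch below is an assumption about what such a check might look like, not a description of our procedure: it computes each question's difficulty (proportion of students answering correctly) and an upper-lower discrimination index (how much better the top half of the class did on the question than the bottom half). The function names and the cutoff of 0.0 are illustrative choices.

```python
def item_analysis(responses):
    """Compute (difficulty, discrimination) for each exam question.

    responses: one list per student of 0/1 values (1 = correct),
    all of the same length. Needs at least two students.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    # Rank students by total score, best first, then split into halves.
    ranked = sorted(responses, key=sum, reverse=True)
    half = n_students // 2
    top, bottom = ranked[:half], ranked[-half:]
    results = []
    for q in range(n_items):
        difficulty = sum(r[q] for r in responses) / n_students
        discrimination = (sum(r[q] for r in top) - sum(r[q] for r in bottom)) / half
        results.append((difficulty, discrimination))
    return results


def suspect_items(responses, min_discrimination=0.0):
    """Indices of questions the weaker half answered better than the stronger half."""
    return [
        q
        for q, (_, disc) in enumerate(item_analysis(responses))
        if disc < min_discrimination
    ]
```

A question that strong students miss and weak students get right is usually miskeyed or ambiguous, which is the kind of question that gets tossed.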

Here is the grading scale (the total points can vary if exam questions are tossed, in which case the total point grade cutoffs are recalculated):

%-age   Total Points   Grade
93%     930            A
90%     900            A-
87%     870            B+
83%     830            B
80%     800            B-
77%     770            C+
73%     730            C
70%     700            C-
67%     670            D+
63%     630            D
60%     600            D-

The scale is strictly points-based with no rounding. If Susie totals 729.5 points at the end of the semester, she gets a C-, even if she is only 0.5 points away from a C. Letter grades are calculated and reported centrally. The instructor has no access to the central grades, and cannot add or subtract points, or otherwise adjust grades. The instructor may, of course, petition the faculty in charge of the course to adjust a grade, which new grad student instructors sometimes do, but the answer is always no. (Students do have emergencies, and they are dealt with as they arise, but no student’s grade is adjusted at the instructor’s whim.)
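For the curious, the no-rounding rule and the recalculated cutoffs can be sketched as follows. This is an illustration in Python, not the central gradebook code. The name `letter_grade` is hypothetical, and the "F" returned below the D- cutoff is an assumption, since the scale above stops at D-. Exact fractions are used so that a score like 729.5 is never nudged over a cutoff by floating-point error.

```python
from fractions import Fraction

# (minimum percentage, letter grade), highest first, as in the scale above.
GRADE_SCALE = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def letter_grade(points, total_points=1000, below_scale="F"):
    """Return the letter grade for a point total, with no rounding.

    total_points drops below 1000 when exam questions are tossed; the
    cutoffs are then recalculated as percentage * total_points.
    below_scale is an assumption: the published scale stops at D-.
    """
    earned = Fraction(str(points))
    for pct, letter in GRADE_SCALE:
        cutoff = Fraction(pct, 100) * total_points
        if earned >= cutoff:
            return letter
    return below_scale


print(letter_grade(729.5))       # C-
print(letter_grade(730))         # C
print(letter_grade(729.5, 998))  # C
```

The last line shows the recalculation at work: if one two-point question is tossed, the total becomes 998, the C cutoff becomes 728.54, and Susie's 729.5 is a C.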

Education is not golf. There are no handicaps. Each week, if the student does an additional exercise, goes online and logs in, uploads his file, and answers a five-question quiz about the exercise, he can earn one bonus point, up to a maximum of fifteen bonus points. In order to get bonus credit, the Excel file uploaded must be worked, the file must be uploaded and the quiz worked, both from the same machine in the correct computer cluster during the student’s class, and three out of the five quiz questions must be correct (if the student tries to upload the file or answer the quiz from a computer other than one in his lab section cluster or at a time other than his assigned lab, he will get a message that tells him he may only upload files or answer the quiz during his lab section). So while there are bonus points available (1.5%), there is no way to give Marjorie a handicap to boost her self-esteem, or because she’s a disadvantaged, pigeon-toed, transgendered lesbian of color, or because she was abducted and anally-probed by space aliens. Likewise, there is no way for an instructor to adjust Billy’s grade down because he didn’t like him, or because Billy asked questions in class that made the instructor feel uncomfortable, or because the instructor didn’t like Billy’s politics. Each student is assessed exclusively on his performance and nothing else. And because each assessment is graded by program and not by a human instructor, every item on every assessment for every student is graded in exactly the same way and given exactly the same weight.
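The bonus-point rule is mechanical enough to write down as code. The sketch below restates the conditions from the paragraph above in Python; it is not the upload system itself, and the `BonusAttempt` fields and function names are hypothetical labels for those conditions.

```python
from dataclasses import dataclass

MAX_BONUS_POINTS = 15  # cap stated above (1.5% of 1000 points)
MIN_CORRECT = 3        # of the five quiz questions


@dataclass
class BonusAttempt:
    """One week's bonus attempt; field names are illustrative only."""
    file_worked: bool      # the uploaded Excel file was actually worked
    upload_machine: str    # machine the file was uploaded from
    quiz_machine: str      # machine the quiz was answered from
    in_lab_cluster: bool   # machine is in the student's lab section cluster
    during_lab_time: bool  # done during the student's assigned lab
    quiz_correct: int      # number of quiz questions answered correctly


def earns_bonus(attempt):
    """True only if every condition for the weekly bonus point is met."""
    return (
        attempt.file_worked
        and attempt.upload_machine == attempt.quiz_machine
        and attempt.in_lab_cluster
        and attempt.during_lab_time
        and attempt.quiz_correct >= MIN_CORRECT
    )


def total_bonus(attempts):
    """One point per qualifying week, capped at MAX_BONUS_POINTS."""
    return min(sum(1 for a in attempts if earns_bonus(a)), MAX_BONUS_POINTS)
```

Nothing in the rule refers to who the student is, which is the point: the only inputs are what the student did.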

The only way to remove subjectivity from assessment is to remove all opportunities for subjective assessment. And the only way to remove instructor bias from assessment is to take assessment entirely out of the hands of the instructor. The system is “cold” in the sense that it makes no allowances for circumstance (note that if a student really does have a death in the family, or some other excusable problem, we do make allowances for it — but not in the grading system, and not by accepting late projects) and students do sometimes complain about that, but no student has ever complained that the system is unfair.

Just as importantly, every element of reward has been removed from the assessment. Students are assessed based solely on how they perform, and are not “given” a grade to reward them for being responsible students. We don’t reward students for being responsible; we expect students to be responsible. Whether or not they went to class or participated has no bearing on their assessment. Whether they liked or were liked by their instructor has no bearing on their assessment. Nothing but how well they learned the material determines their assessment, and that’s just as it should be.

We do not “teach to the test.” Assuming that students have the arithmetic and algebraic fluency required, teaching the material prepares students for the test. Practical exams follow the same format as the problems we do in class and the project problems. The last lab before the administration of the practical exam is a review day, and students are supplied with review files to work in class (there is too much material to cover to allow any more than one review day). We used to do reviews for the written exams, but dropped them. Students did not prepare for the reviews, and review days became wastes of valuable time.

Is this high stakes testing? Given that this is a mandatory class, and given that the exams comprise 80% of the class grade, yes. Does this testing or assessment somehow get in the way of learning? So far, no student or faculty member has come to us to complain that anyone’s creativity or “higher-order thinking” was impaired, but we have had many students thank us later for teaching them the skills they need, and quite a few faculty colleagues tell us how well prepared for their classes the students are. Teachers who think that the only way to prepare students for an exam is to emulate it in class lack imagination or intelligence or most likely both. Bear that in mind the next time you hear some teacher whining about having to “teach to the test.”

The best way to ensure fairness is to use performance-based assessment. While implementing the VBA code in every project and exam is extremely work intensive, removing all instructor subjectivity and catching all occurrences of cheating make it well worth the trouble. But a subjective assessment system can only return a subjective (and unfair) assessment.

4 Responses to “Fair, Honest, Accurate Assessment”
  1. […] Read the rest of this great post here […]

  2. […] Rolfe Schmidt: […]

  3. […] Fair, Honest, Accurate Assessment (Right Wing Nation) […]

  4. Professor, nice post, but I have to pick a nit. In your Widget example, the question presents the demand in weeks, but the answers all address months. I thought it was a trick question or something.