Stats For Teachers, Revisited

I’ve reworked this to make it a bit more readable (particularly to the non-math folks), and here and there, I’ve clarified or added material. I hope it helps. I’ve also added a section for those of you who use mostly written assessments, like papers or essay questions (yes, you can use statistics to help your teaching). I realize not everybody has SPSS or SAS (I should say only stats geeks have SPSS or SAS), but everybody has Excel, and that’s all you need. But first, you have to enable the Data Analysis ToolPak.

  1. Run Excel. Yes, now.
  2. Click Tools.
  3. Click Add-ins.
  4. Check Analysis ToolPak and Analysis ToolPak - VBA.
  5. Click OK.

Click Tools again, and Data Analysis will appear in the menu (it will stay there — you only have to enable it once). That’s what you’ll use.

Ready? Here we go.

Statistics are an invaluable tool for improving your teaching and making your class fairer for your students. With statistics, you can identify bad test questions and throw them out. You can identify questions many students should have gotten right but did not, and determine what went wrong. You can determine how well the assignments you give your students work, and you can determine how well you are preparing students for those exams.

I’m assuming two things. First, I assume that you’re teaching the material, and not doing test-prep in class. Teaching the material is teaching to the test. There is no excuse for doing test prep instead of covering the material. Second, I assume that your school curriculum matches the state curriculum (and therefore the test) fairly closely. If not, well, there’s a serious problem with the school curriculum, and the administration needs to fix it.

Let’s start with that 100-point test you just gave (assuming your data are in an Excel file). Open your gradebook in Excel, and (pay attention here) note that one thing that sucks about Excel’s Data Analysis ToolPak is that (usually) data have to be in adjacent columns, so prepare to do some copying and pasting. Get your exam scores in one column, then click Tools, Data Analysis. Select Descriptive Statistics from the list, then click OK. Click in the Input Range box, then select the exam scores. If you have a label (text, you know, like "first exam") in the first cell, make sure you check Labels in First Row. Click in the New Worksheet Ply box, and type Exam 1 Descriptive Stats (or something like that), then click OK. Excel will calculate the descriptive stats and put them in a new worksheet called Exam 1 Descriptive Stats (or whatever you chose to call it).

Here are your descriptive stats:

Exam Score
Mean 73.88
SE 2.49
Median 73.60
Mode 100.00
Stdev 24.94
Sample Variance 621.80
Kurtosis -1.66
Skewness -0.17
Range 69.30
Minimum 30.70
Maximum 100.00
Sum 7388.10
Count 100
95% CL 4.95

Your mean is just a little low (ideally, it should be in the mid 70s), but not low enough for concern. Your mode (most frequently occurring score) is 100, and that’s always good, but your standard deviation is large: Each score on average varied 24.94 points from the mean, and that’s a lot of spread. Your kurtosis is a bit low, too, and along with the large standard deviation, it looks like you have a lot of scores in the tails. It’s not a bad exam from just looking at the descriptive stats, though you would have liked to have had more students clustering around the mean — that is, a lower standard deviation.

Now, we’re going to calculate correlations to test the effectiveness of individual questions. For this, you need a column for each question that records, for each student, whether the student got the question correct or not (say, 1 for correct and 0 for incorrect).

Next, look at the correlations between the individual questions and the total score. Statisticians will immediately hear warning bells go off, and they’re correct. The individual questions are part of the exam score, so normally, we would not do this (this is known as collinearity). But here, we’re using collinearity to our advantage.

We need the question data and the exam scores in adjacent cells, or a contiguous block (here, we’ll say we’re looking at questions 11 and 35). Again pull up the Data Analysis ToolPak, but this time, select Correlation from the list. Click in the Input Range box, then select the columns with the question and exam score data, making sure to select the labels (and making sure to check Labels in First Row). Again, in New Worksheet Ply type a worksheet name (Exam question-total correlations), then click OK.

Switch to the Exam question-total correlations worksheet, and this is what you see:

 
              Question 11   Question 35   Exam Score
Question 11   1
Question 35   0             1
Exam Score    0.02          0.94          1

Here’s what we’re doing. Both questions are part of the exam score, so we should see the effects of collinearity — that is, we should see a significant correlation between the question and the score, where significant here means greater than or equal to 0.3. When we don’t see the effects of collinearity, there’s a problem. Note that the correlation coefficient between Question 11 and the Exam Score is 0.02! There is something bad wrong with that question. First, pull the exam and read the question; usually, when this happens, it’s pretty obvious what went wrong, often a typo or a badly worded question, but sometimes a question that goes beyond the scope of what you covered. Once in a great while there will be nothing wrong with the question. If that is the case, leave it in, but otherwise, delete it from the exam and the exam results. Note that Question 35 highly correlates with the Exam Score. Leave it in. Do this for all the questions, deleting any that have suspiciously low correlations after you read the question and determine if there is anything wrong with it.
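If your question data ever lives outside Excel, the same check can be sketched in a few lines of Python. This is only an illustration, not part of the Excel workflow: the student data and the names `pearson` and `flag_questions` are made up for the example, and the 0.3 cutoff is the one used above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # No variation (e.g. everyone got the question right), so the
        # correlation is undefined. Returning 0.0 is an assumption of this
        # sketch; it means the question still gets flagged for a read.
        return 0.0
    return cov / sqrt(var_x * var_y)


def flag_questions(questions, totals, threshold=0.3):
    """Names of questions whose question-total correlation is below threshold."""
    return [name for name, answers in questions.items()
            if pearson(answers, totals) < threshold]


# Hypothetical data: six students, 1 = correct, 0 = incorrect.
totals = [90, 80, 70, 60, 50, 40]
questions = {
    "Question A": [1, 1, 1, 0, 0, 0],  # the stronger students get it right
    "Question B": [1, 0, 0, 1, 0, 1],  # unrelated to overall performance
}

flagged = flag_questions(questions, totals)
print(flagged)  # ['Question B']
```

A flagged question is only a candidate for deletion. You still have to pull the exam and read it.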

As you’re doing this, you will usually notice that there are questions on topics you covered in class that students should have gotten right, but did not. This is a pedagogical red flag (when this happens with questions only from reading assignments, it indicates that students didn’t do the reading, and I find that these are the questions most students miss). How did you cover those topics? How can you change your presentation to make it clearer to your students?

Go through all the questions fewer than half the students got correct, and run them through the same process. Compare the questions on similar topics. If students missed many of the questions on the same topic, that’s a sign that there’s a problem with the way you present the topic.

Use statistics to tell you how well you’re presenting the material.

If you spot a problem, bring it up in class. Tell your students that a lot of them had trouble with whatever it was, and ask them how you could have made it clearer for them. Never underestimate the value of student feedback. Ask your colleagues. If you have trouble, always get help.

You can also use statistics to determine how effective those assignments you give your students are. Let’s say you’ve just given your first 200-point exam, and before that, you had given several assignments (we’ll look at three). Your data look like this (the table represents part, not all, of your data):

Assignment 1  Assignment 2  Assignment 3  Exam Score
3.50 3.50 8.36 78.40
34.30 34.30 6.40 200.00
34.80 34.80 9.42 183.20
12.80 12.80 22.95 149.60
29.30 29.30 14.33 200.00
27.20 27.20 24.51 133.20
6.35 6.35 6.59 117.80
0.20 0.20 3.27 89.20
7.25 7.25 21.55 109.40
31.60 31.60 4.29 200.00
17.30 17.30 1.54 88.80
26.15 26.15 19.34 109.80
33.70 33.70 4.65 200.00
6.10 6.10 9.68 77.60
39.50 39.50 17.78 200.00
25.70 25.70 6.33 112.40
13.05 13.05 0.89 82.60
7.15 7.15 0.45 79.40
18.25 18.25 17.45 105.80
19.60 19.60 16.11 96.40
26.75 26.75 2.18 187.40
22.60 22.60 3.95 120.80
42.90 42.90 17.68 200.00
29.70 29.70 13.34 200.00

Run correlations on the assignments and exam:

  Assignment 1 Assignment 2 Assignment 3 Exam Score
Assignment 1 1
Assignment 2 0.99 1
Assignment 3 -0.07 -0.05 1
Exam Score 0.86 0.86 0.02 1

If your assignments are effective (and if they cover the same skills covered on the exam), you should get at least a 0.3 (better a 0.5) correlation coefficient between the assignments and the exam score. Assignments 1 and 2 correlate pretty highly with the exam score, but note the third assignment. There is nearly no correlation between it and the exam score. This is a great big red flag, so compare the three assignments. It’s not enough just to ditch the third assignment and replace it with something else; you need to figure out what is wrong with the third assignment. What is different about the third one? How are the first two similar, and how is the third different from the first two? Whatever it is, it’s not working.

Note that you can use exactly the same method to determine how well your assignments and exams are teaching students what they need to know by running correlations on your students’ class scores and their standardized exam scores. If you’re a coordinator or administrator, you can also determine which teachers are better preparing their students. Here are two teachers’ 100-point final exam scores and the standardized exam scores (only part of the data are represented):

T1 Exam Score  T2 Exam Score  Standardized Exam Score
50.80 100.00 93.32
44.60 17.00 67.77
46.70 51.00 93.64
54.00 100.00 95.86
49.00 100.00 64.67
100.00 99.00 100.00
39.70 33.00 86.63
73.80 57.00 100.00
44.00 100.00 68.95
43.30 100.00 72.85
100.00 10.00 100.00
90.60 100.00 100.00
100.00 96.00 100.00
100.00 51.00 100.00
54.10 100.00 96.49
37.30 37.00 64.80
100.00 30.00 100.00
46.20 15.00 63.13
100.00 100.00 100.00
40.90 100.00 56.34
68.70 20.00 99.02
100.00 100.00 100.00
100.00 10.00 100.00
100.00 72.00 100.00

First, let’s look at the descriptive stats:

                  T1 Exam Score   T2 Exam Score   Standardized Exam Score
Mean              73.88           64.93           88.83
SE                2.49            3.58            1.65
Median            73.60           81.00           100.00
Mode              100.00          100.00          100.00
Stdev             24.94           35.84           16.53
Sample Variance   621.80          1284.39         273.26
Kurtosis          -1.66           -1.53           0.36
Skewness          -0.17           -0.38           -1.31
Range             69.30           97.00           58.94
Minimum           30.70           13.00           41.06
Maximum           100.00          100.00          100.00
Sum               7388.10         6493.00         8882.69
Count             100.00          100.00          100.00
95% CL            4.95            7.11            3.28

Both teachers’ scores are lower than the standardized exam scores, and this can be a good thing, provided that the class exams are covering the right material. Both have fairly high standard deviations, though the second teacher’s is higher than the first’s. Both have a low kurtosis, usually indicating more data in the tails, and both are slightly left skewed, indicating more data in the left tail (low scores) than the right. Note that the second teacher’s minimum score is 13/100! From only looking at the descriptive stats, it looks like the second teacher probably has a more difficult class than the first. But difficulty isn’t the issue; how well the teacher’s class matches the curriculum is the issue. To check that, we run correlations:

 
                          T1 Exam Score   T2 Exam Score   Standardized Exam Score
T1 Exam Score             1
T2 Exam Score             0.06            1
Standardized Exam Score   0.75            0.17            1

We see a vast difference between the two teachers. The first teacher’s scores correlate highly with the standardized exam score, at 0.75. This means the first teacher’s curriculum fairly closely matches what the state prescribes. But the second teacher’s scores don’t correlate highly with the standardized exam scores at all, at only 0.17. The second teacher should sit down with the first and compare what they do, to see where the second teacher’s class is going astray from the curriculum.

Universities often give departmental exams to large undergraduate classes. The same method can be used if you teach one of those classes to see how well you are teaching what you’re supposed to be teaching.

If you teach a course with mostly written assessments, statistics are still a valuable tool, although not as precise. You have to have some kind of anchor, even if it’s only in your head — you cannot meaningfully read a stack of papers and assign grades without some idea of what an A or a C paper is. The problem is grade creep, or rather, consistency. I know all about reading papers — I coordinated an ESL writing program and taught many writing courses. Assume that you’re going to grade inconsistently, and to guard against it, take frequent breaks from grading those papers. Grade creep arises in part from fatigue. As you tire, you tend to read less closely and grade less accurately.

Use a pencil to grade the papers. First check your curve. Let’s say you have 30 students. If 10 or 15 got As, ask yourself if you’re grading too leniently. You may not be, of course. You may just have a good class. But you should always check.

After you’ve gone through them, pick out a couple of A papers, B papers, etc., from the top and bottom of the stack. Compare them. Is that A paper from the bottom of the stack of comparable quality to the A paper from the top? If your papers check, write the grades in ink.

For you folks, stats are more valuable used over time, to judge the consistency of your grading over time. If you only use letter grades, convert them to numbers (there are lots of ways to do this, but a standard university GPA system — 4=A, 3.7=A-, 3.3=B+, 3=B, etc. — is the simplest). After you’ve graded a few assignments, run descriptive statistics on them. If the assignments accurately reflect what you’re covering in class, the mean scores for the assignments should not vary wildly — that is, if your class mean for your first assignment was 2.7, and your third assignment mean was 1.3 or 3.4, there’s probably a problem.
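Here is a rough sketch of that conversion and comparison in Python, in case your grades aren’t in Excel. The post only spells out the scale down to B, so the rest of the `GRADE_POINTS` table is an assumed continuation. The grades, the function names, and the 0.5 cutoff for "varying wildly" are all hypothetical choices for illustration.

```python
from statistics import mean

# 4=A, 3.7=A-, 3.3=B+, 3=B come from the text; everything below B is the
# usual continuation of that scale and is an assumption of this sketch.
GRADE_POINTS = {
    "A": 4.0, "A-": 3.7, "B+": 3.3, "B": 3.0, "B-": 2.7,
    "C+": 2.3, "C": 2.0, "C-": 1.7, "D+": 1.3, "D": 1.0, "F": 0.0,
}


def assignment_means(assignments):
    """Convert each assignment's letter grades to points and average them."""
    return {name: mean(GRADE_POINTS[grade] for grade in grades)
            for name, grades in assignments.items()}


def drifting(means, max_gap=0.5):
    """Assignments whose mean is more than max_gap away from the first one's."""
    names = list(means)
    baseline = means[names[0]]
    return [name for name in names[1:] if abs(means[name] - baseline) > max_gap]


# Hypothetical letter grades for three assignments.
assignments = {
    "Essay 1": ["B", "B-", "C+", "A-"],
    "Essay 2": ["C", "D+", "C-", "B"],
    "Essay 3": ["B", "B", "C+", "B+"],
}

means = assignment_means(assignments)
print(drifting(means))  # ['Essay 2']
```

Comparing against the first assignment is just one way to set a baseline; comparing each assignment against the one before it would work as well.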

After a few assignments, pull up your gradebook and make a line chart of your students’ scores. This is a quick and easy way to spot potential problems. If, say, Mary’s scores are steadily dropping whereas most of the other students’ scores are level, or rising, you need to look into it. Go over her work. Is her work really degrading over time? Or (and it happens) did you inaccurately grade her? If the problem is Mary’s work, then you need to talk to the appropriate administrator. If the problem is your grading, then you need to correct it, and talk to Mary, explain and apologize. We all make mistakes. Students, like anyone else, respect people who admit and correct their mistakes.
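A line chart is the quickest check, but the same idea can be put into numbers by fitting a trend line to each student’s scores. The sketch below is an assumption-heavy illustration: the gradebook is invented, and the cutoff of 3 points lost per assignment is an arbitrary stand-in for "steadily dropping".

```python
def slope(scores):
    """Least-squares slope of scores against assignment number (0, 1, 2, ...)."""
    n = len(scores)
    mean_x = (n - 1) / 2
    mean_y = sum(scores) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(scores))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    return sxy / sxx


def dropping(gradebook, cutoff=-3.0):
    """Students whose scores fall faster than cutoff points per assignment."""
    return [student for student, scores in gradebook.items()
            if slope(scores) < cutoff]


# Hypothetical gradebook: four assignments per student.
gradebook = {
    "Mary": [88, 82, 75, 69],
    "Sam": [70, 72, 71, 74],
    "Lee": [80, 80, 81, 79],
}

print(dropping(gradebook))  # ['Mary']
```

The slope only tells you whom to look at. Whether the problem is the student’s work or your grading is still something you find out by rereading the papers.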

After your students have taken the standardized exam, run correlations on your assignment scores and the exam scores. Expect a lower correlation, perhaps, but still, there should be a correlation. If there isn’t, something is wrong. Let’s say you teach history, and you (appropriately) want students to focus on historical context and issues in their assignments, but place little emphasis on dates. If the exam tests students on dates, you have done them a disservice. Don’t abandon historical context, but place more emphasis in your assignments on what the exam tests. If you send the message that something isn’t important, students are not going to learn it.

If you teach multiple sections of the same class, then you should compare your sections’ scores to check yourself for consistency (this is particularly valuable for those whose courses do not allow much discrete-point assessment). Yes, different classes develop different personalities, and there will be variation between sections, but too much variation indicates possible problems.

After every assessment, I run descriptive statistics (see above) on all my sections’ scores. Let’s say you teach two sections of the same course (we’ll call them a and b), with 27 students in one, and 30 in the other. You have just given a 100-point assignment, and the scores look like this:

a      b
87.2   54
48.7   62.5
73.3   42.7
54.8   55.9
78.6   76.8
63.5   87.6
50.2   46
67.8   57.2
83     56.2
71.4   85.1
95.3   89.8
74.6   82.1
47.2   77.5
69.7   50.8
79.6   42.3
68.3   82.9
92.3   53.6
63.9   89.2
44.7   97.2
83.6   74.2
94.8   67.9
95.3   56.7
87.8   65.7
46.6   87.7
60.8   93.6
81.1   74.6
68.3   61.1
       45.7
       70.8
       85.7

When you calculate the descriptive statistics, your results look like this:

                          a         b
Mean                      71.57     69.10
Standard Error            3.06      3.04
Median                    71.40     69.35
Mode                      95.30     #N/A
Standard Deviation        15.90     16.66
Sample Variance           252.88    277.47
Kurtosis                  -1.00     -1.30
Skewness                  -0.18     -0.03
Range                     50.60     54.90
Minimum                   44.70     42.30
Maximum                   95.30     97.20
Sum                       1932.40   2073.10
Count                     27        30
Confidence Level(95.0%)   6.29      6.22

Actually, if you got stats like these (note how close the two means are), you wouldn’t worry about consistency, but for the sake of illustration, we’ll say you want to compare the means.
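If you want to double-check Excel, most of these figures are easy to reproduce elsewhere. Here is a minimal sketch in Python using section a’s scores from the table above. The `describe` function is a made-up helper, not anything from Excel, and it leaves out kurtosis, skewness, mode, and the confidence level to stay short.

```python
from math import sqrt
from statistics import mean, median, stdev, variance

# Section a's 27 scores, copied from the table above.
section_a = [
    87.2, 48.7, 73.3, 54.8, 78.6, 63.5, 50.2, 67.8, 83, 71.4, 95.3, 74.6,
    47.2, 69.7, 79.6, 68.3, 92.3, 63.9, 44.7, 83.6, 94.8, 95.3, 87.8, 46.6,
    60.8, 81.1, 68.3,
]


def describe(scores):
    """A subset of the descriptive statistics Excel reports."""
    n = len(scores)
    sd = stdev(scores)  # sample standard deviation (divides by n - 1)
    return {
        "Mean": mean(scores),
        "Standard Error": sd / sqrt(n),
        "Median": median(scores),
        "Standard Deviation": sd,
        "Sample Variance": variance(scores),
        "Range": max(scores) - min(scores),
        "Minimum": min(scores),
        "Maximum": max(scores),
        "Sum": sum(scores),
        "Count": n,
    }


stats_a = describe(section_a)
for name, value in stats_a.items():
    print(f"{name:<20}{value:>10.2f}")
```

Rounded to two decimals, the output should match the a column of the Excel results for each statistic the sketch computes.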

You need to use statistics to test similarities and differences because of random variation. The difference in two sets of scores may be due to random variation, or it may be due to something else. The standard cutoff is 5%: we treat a difference as real only if a difference that large would turn up less than 5% of the time from random variation alone. With only two classes to compare, we don’t need ANOVA. Instead, we’ll run a t-test.

Run the Data Analysis ToolPak (Tools, Data Analysis), and select t-Test: Two-Sample Assuming Equal Variances. Click OK. Click in the Variable 1 Range box, then select the first class scores. Click in the Variable 2 Range box, and select the second class scores (select the labels at the top). We’re comparing the means to each other, and not to a specific value, so in the Hypothesized Mean Difference box type 0. Check Labels, and type 0.05 in the Alpha box (if it isn’t already there). Select New Worksheet Ply, and in the box, type a name for the results worksheet, like t-test. Click OK.

Here are your results:

t-Test: Two-Sample Assuming Equal Variances
                               a        b
Mean                           71.57    69.10
Variance                       252.88   277.47
Observations                   27       30
Pooled Variance                265.85
Hypothesized Mean Difference   0
df                             55
t Stat                         0.57
P(T<=t) one-tail               0.29
t Critical one-tail            1.67
P(T<=t) two-tail               0.57
t Critical two-tail            2.00

The t-test tells us that if these two sets of scores are from the same population (that is, if they have the same population mean), we would expect the means to differ by at least this much 57% of the time. The difference between the means is therefore consistent with random variation, and is not cause for concern. If the P(T<=t) two-tail value is less than 0.05, then the difference is unlikely to be due to random variation alone, and you should explore further. What works well for one class does not always work well for another.
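For the curious, the arithmetic behind the t Stat line is short enough to sketch in Python. This is an illustration of the equal-variance (pooled) t-test, not a replacement for Excel: `pooled_t` is a made-up name, and because Python’s standard library has no t-distribution, the sketch compares the statistic to the two-tail critical value of 2.00 from the Excel output instead of computing a p-value.

```python
from math import sqrt
from statistics import mean, variance

# Scores for both sections, copied from the table above.
section_a = [
    87.2, 48.7, 73.3, 54.8, 78.6, 63.5, 50.2, 67.8, 83, 71.4, 95.3, 74.6,
    47.2, 69.7, 79.6, 68.3, 92.3, 63.9, 44.7, 83.6, 94.8, 95.3, 87.8, 46.6,
    60.8, 81.1, 68.3,
]
section_b = [
    54, 62.5, 42.7, 55.9, 76.8, 87.6, 46, 57.2, 56.2, 85.1, 89.8, 82.1,
    77.5, 50.8, 42.3, 82.9, 53.6, 89.2, 97.2, 74.2, 67.9, 56.7, 65.7, 87.7,
    93.6, 74.6, 61.1, 45.7, 70.8, 85.7,
]

# Two-tail critical value for df = 55 and alpha = 0.05, taken from the
# Excel output above rather than computed here.
T_CRITICAL_TWO_TAIL = 2.00


def pooled_t(sample_1, sample_2):
    """t statistic and degrees of freedom, assuming equal variances."""
    n1, n2 = len(sample_1), len(sample_2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample_1)
                  + (n2 - 1) * variance(sample_2)) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample_1) - mean(sample_2)) / std_err, df


t_stat, df = pooled_t(section_a, section_b)
print(f"t = {t_stat:.2f}, df = {df}")  # t = 0.57, df = 55
if abs(t_stat) < T_CRITICAL_TWO_TAIL:
    print("Difference is consistent with random variation.")
else:
    print("Difference is larger than chance alone would suggest.")
```

Comparing the t statistic to the critical value gives the same yes-or-no answer as checking whether the two-tail p-value is below 0.05.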

The point I’m trying to make is that statistics are more than just a tool for research (or for some of us, sheer fun). Statistics are an important tool that tells you how well you’re teaching, how well your curriculum matches the state’s, how fair your tests are, and how your classes compare, all using nothing more complicated than Excel. If you’ll forgive the gun reference, statistics are the laser grips that allow you to shoot in the dark.

 
