Ken DeRosa took a journalist to task for inaccuracies in her article about four Reading First schools in Madison, Wisconsin (go here for all the relevant information), then pointed me to the reading proficiency data the state of Wisconsin reported for all schools (the Wisconsin data are here). I downloaded the data, cleaned them up in Excel, and ran the stats, comparing the 98-99 and 04-05 school years. The state reported four proficiency levels: minimal, basic, proficient, and advanced. We are interested in the percentages testing proficient or above (proficient+ in the tables below), so I added the percent proficient and percent advanced for each school, and analyzed those sums.
Before I go on, let me quickly address why we must analyze the data statistically, and cannot just report means. If we gave the same kids the same proficiency exams on two different days, say only a week apart, their scores would be different. Anytime we see a difference between scores, without statistics, we do not know if those differences are due to random variation or not. We cannot without statistics point to two different scores or means and say, "See? The scores increased!"
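To make the point about random variation concrete, here is a minimal simulation sketch. Everything in it is made up for illustration: the number of students, the range of "true" scores, and the size of the day-to-day noise are assumptions, not Wisconsin data. The same simulated students sit the same exam twice, and the two means still differ.

```python
import random
from statistics import mean

def simulate_two_sittings(n_students=100, noise_sd=5.0, seed=0):
    """Give the same simulated students the same exam on two days.

    Each student has a fixed 'true' score; each sitting adds independent
    random noise. All numbers are hypothetical, chosen only for illustration.
    """
    rng = random.Random(seed)
    true_scores = [rng.uniform(40, 90) for _ in range(n_students)]
    day1 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    day2 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return mean(day1), mean(day2)

m1, m2 = simulate_two_sittings()
print(f"Day 1 mean: {m1:.2f}  Day 2 mean: {m2:.2f}  difference: {m2 - m1:+.2f}")
```

Nothing about the students changed between the two sittings, yet the means are not identical. A statistical test is what tells us whether an observed difference is bigger than this kind of noise.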
Also, let me mention a few crucial points.
First, I tackled the reading proficiency scores for the entire state. Here are the descriptive stats:
| Statistic | % Proficient+ 98-99 | % Proficient+ 04-05 |
|---|---|---|
| Mean | 71.04 | 87.54 |
| SE | 0.48 | 0.36 |
| Median | 73.58 | 91.00 |
| Mode | 75.00 | 100.00 |
| Stdev | 16.12 | 12.11 |
| Sample Variance | 259.79 | 146.72 |
| Kurtosis | 1.03 | 4.27 |
| Skewness | -0.92 | -1.86 |
| Range | 100.00 | 83.40 |
| Minimum | 0.00 | 16.70 |
| Maximum | 100.00 | 100.00 |
| Sum | 80559.47 | 100494.30 |
| Count | 1134 | 1148 |
| CL (95.0%) | 0.94 | 0.70 |
The mean increased from 98-99 to 04-05 by 16.5 percentage points. The standard deviation, which we can roughly define as the average amount each school differed in one direction or another from the mean, decreased. The lower the standard deviation, the less "spread out" the data. This suggests that in the 04-05 school year, more schools clustered around the mean.
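As a quick arithmetic check on the statewide table, the sketch below recomputes the standard error from the reported standard deviation and count, and separates the change in percentage points from the relative change. The dictionary layout and function name are my own; the numbers are copied from the table.

```python
from math import sqrt

# Values copied from the statewide descriptive statistics table.
state = {
    "98-99": {"mean": 71.04, "stdev": 16.12, "count": 1134},
    "04-05": {"mean": 87.54, "stdev": 12.11, "count": 1148},
}

def standard_error(stdev, count):
    """Standard error of the mean: stdev / sqrt(count)."""
    return stdev / sqrt(count)

for year, s in state.items():
    print(year, "SE =", round(standard_error(s["stdev"], s["count"]), 2))

# Difference between two percentages is measured in percentage points.
point_change = state["04-05"]["mean"] - state["98-99"]["mean"]
# Relative change expresses that difference as a share of the starting mean.
relative_change = 100 * point_change / state["98-99"]["mean"]
print("change in percentage points:", round(point_change, 2))
print("relative change (%):", round(relative_change, 1))
```

The recomputed standard errors should match the SE row of the table, which is a useful sanity check that the counts and standard deviations were read correctly.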
The kurtoses support this. Kurtosis is the "peakedness" of the data distribution. Visualize a bell curve as a water balloon (sorry no, I am not going to do graphics; you’ll have to use your imagination). Now, if you place your hand on the top and squish it downward, more of it will squish into the tails at either side. This would be a "flattened" curve, with a low kurtosis. Now, instead of squishing the top down, visualize taking either hand and pushing it in from the sides. The water (data) would push the middle of the curve upward, or make it more "peaked." This would be a curve with a higher kurtosis. So the higher the kurtosis, the more data is squished up into the middle, where the mean is. Now note that the kurtosis for 04-05 is higher than that for 98-99. This supports what the standard deviations tell us, that in 04-05, more data were clustered around the mean (and in 98-99, more data were squished into the tails).
So far, everything looks positive, until we look at the skewness. Go back to that bell curve water balloon. If you grab the tail on the right and pull it further to the right, more of the water will spill into that tail, right? That’s what we call a right-skewed curve, and it has a positive skewness factor. If you pull the left tail out, more water spills into that left tail, and we have a left-skewed curve, with a negative skewness factor. When we look at the skewness for the two years, both are left-skewed — that is, in both years, there are more data in the left tails (lower end of proficient) — but the 04-05 curve is more left-skewed than 98-99 (-1.86 and -0.92, respectively). So even though it does look like Wisconsin may have improved the reading proficiency between the two years, they also slightly increased those who were at the low end of proficient (if this seems like a paradox to you, think of the water balloon again, and all will be made clear).
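For readers who want to compute skewness and kurtosis themselves, here is a sketch of the adjusted sample formulas. I am assuming the figures in the table came from Excel's Descriptive Statistics tool, whose SKEW and KURT functions are documented with these formulas; the kurtosis is "excess" kurtosis, so a normal bell curve scores about 0. The two short data lists are made up purely to show the signs.

```python
from math import sqrt

def _mean_and_sd(xs):
    """Sample mean and sample standard deviation (n - 1 denominator)."""
    n = len(xs)
    m = sum(xs) / n
    s = sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return m, s

def sample_skewness(xs):
    """Adjusted sample skewness; needs at least 3 values."""
    n = len(xs)
    m, s = _mean_and_sd(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def sample_excess_kurtosis(xs):
    """Adjusted sample excess kurtosis; needs at least 4 values."""
    n = len(xs)
    m, s = _mean_and_sd(xs)
    fourth = sum(((x - m) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * fourth
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical school percentages, not Wisconsin data.
left_tailed = [40, 85, 88, 90, 91, 92, 93, 95]   # one low straggler
symmetric = [70, 75, 80, 85, 90]

print("left-tailed skewness:", round(sample_skewness(left_tailed), 2))
print("symmetric skewness:  ", round(sample_skewness(symmetric), 2))
print("symmetric kurtosis:  ", round(sample_excess_kurtosis(symmetric), 2))
```

A single low straggler is enough to pull the skewness negative, which is worth remembering when reading the left-skewed values in the table.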
However, we cannot say from the descriptive statistics alone whether these differences are meaningful (statistically significant) or whether they are due to random variation. So I ran a single-factor ANOVA at the 0.05 significance level (alpha=0.05, a 95% confidence level) to test the null hypothesis, that the differences are due to random variation:
Anova: Single Factor

SUMMARY

| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| % Proficient+ 98-99 | 1134 | 80559.47 | 71.04 | 259.79 |
| % Proficient+ 04-05 | 1134 | 99261.70 | 87.53 | 146.58 |

ANOVA

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 154221.06 | 1 | 154221.06 | 759.01 | 2.3E-144 | 3.85 |
| Within Groups | 460420.91 | 2266 | 203.19 | | | |
| Total | 614642 | 2267 | | | | |
The value of p is extremely small (2.3E-144), far smaller than 0.05, so we reject the null hypothesis at the 0.05 significance level. In other words, if only random variation were at work, a difference this large between the percent proficient and above in the two years would almost never show up. Wisconsin can from these data validly claim that they raised their proficiency.
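The F statistic in the ANOVA table can be rebuilt from the SUMMARY rows alone. The sketch below is my own helper, not Excel's code: it takes each group's count, mean, and variance and returns the sums of squares, degrees of freedom, and F. It does not compute the p-value, which needs the F distribution.

```python
def one_way_anova_from_summary(groups):
    """Single-factor ANOVA from (count, mean, variance) for each group.

    Returns (ss_between, ss_within, df_between, df_within, f_stat).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * mean for n, mean, _ in groups) / total_n
    # Between-groups: how far each group mean sits from the grand mean.
    ss_between = sum(n * (mean - grand_mean) ** 2 for n, mean, _ in groups)
    # Within-groups: each sample variance times its own degrees of freedom.
    ss_within = sum((n - 1) * var for n, _, var in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_stat

# Counts, sums, and variances copied from the statewide SUMMARY table;
# means are recomputed as sum / count to limit rounding error.
state_groups = [
    (1134, 80559.47 / 1134, 259.79),   # % Proficient+ 98-99
    (1134, 99261.70 / 1134, 146.58),   # % Proficient+ 04-05
]
ssb, ssw, dfb, dfw, f_stat = one_way_anova_from_summary(state_groups)
print(f"SS between = {ssb:.1f}, SS within = {ssw:.1f}")
print(f"df = ({dfb}, {dfw}), F = {f_stat:.1f}")
```

The result should land very close to the table's F of 759.01; any small gap comes from the variances being rounded to two decimals in the table.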
Now, let’s turn to Madison. First, the descriptive stats:
| Statistic | % Proficient+ 98-99 | % Proficient+ 04-05 |
|---|---|---|
| Mean | 62.82 | 82.46 |
| SE | 2.14 | 1.91 |
| Median | 63.91 | 82.80 |
| Mode | #N/A | 92.30 |
| Stdev | 10.89 | 9.76 |
| Sample Variance | 118.59 | 95.34 |
| Kurtosis | 0.54 | -0.35 |
| Skewness | -0.43 | -0.55 |
| Range | 48.64 | 35.30 |
| Minimum | 36.36 | 60.30 |
| Maximum | 85.00 | 95.60 |
| Sum | 1633.45 | 2143.90 |
| Count | 26 | 26 |
| CL (95.0%) | 4.40 | 3.94 |
We certainly see a difference between the two years in the means, a larger difference than we saw for the whole state (19.64 percentage points versus 16.50). Keep in mind, though, that while we had over a thousand reporting schools for the state, we have 26 reporting schools for Madison, and recall that the more data we have, the more reliable our statistics are. So don’t jump immediately to conclusions. The 04-05 distribution is flatter (lower kurtosis) than the 98-99 one, indicating that there are more data in the tails in 04-05; the skewness is roughly the same in both years.
Again, we have to run ANOVA to test the null hypothesis, that the difference is due to random variation:
Anova: Single Factor

SUMMARY

| Groups | Count | Sum | Average | Variance |
|---|---|---|---|---|
| % Proficient+ 98-99 | 26 | 1633.45 | 62.82 | 118.59 |
| % Proficient+ 04-05 | 26 | 2143.90 | 82.46 | 95.34 |

ANOVA

| Source of Variation | SS | df | MS | F | P-value | F crit |
|---|---|---|---|---|---|---|
| Between Groups | 5010.78 | 1 | 5010.78 | 46.84 | 1.05E-08 | 4.03 |
| Within Groups | 5348.45 | 50 | 106.97 | | | |
| Total | 10359.23 | 51 | | | | |
And again, ANOVA rejects the null hypothesis. The value of p is 1.05E-08, far below alpha=0.05, so a difference this large is very unlikely to be due to random variation alone. Madison can from these data claim that they raised their proficiency levels.
However, the real issue here is the group of Reading First schools in Madison (Glendale, Hawthorne, Lincoln, and Orchard Ridge). But before I go on, let me mention a crucial statistical issue we’re about to encounter, one I listed above and said would become an issue later on.
For the entire state of Wisconsin, we have over 1,000 schools reporting proficiency levels. For Madison, we have 26 schools reporting (this is why the p-value for the Madison ANOVA is larger than the p-value for the whole state, even though Madison reports a larger difference in proficiency than the state). A data set of twenty-six data points isn’t statistically ideal, but we can work with it.
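To see how much the number of schools matters, here is a small sketch that holds Madison's means and variances fixed and changes only the number of schools per year. For two groups of equal size n, the single-factor ANOVA F statistic simplifies to n times the squared difference in means, divided by the sum of the two variances. The function name and the alternative school counts are hypothetical, used only for illustration.

```python
def f_two_equal_groups(n, mean_a, mean_b, var_a, var_b):
    """F statistic for a single-factor ANOVA with two groups of n each."""
    ms_between = n * (mean_a - mean_b) ** 2 / 2   # df_between = 1
    ms_within = (var_a + var_b) / 2               # pooled variance
    return ms_between / ms_within

# Madison means and variances from the tables above, held fixed while the
# (hypothetical) number of schools per year changes.
for n in (13, 26, 1134):
    f_stat = f_two_equal_groups(n, 62.82, 82.46, 118.59, 95.34)
    print(f"n = {n:4d} schools per year -> F = {f_stat:.2f}")
```

With n = 26 this should come out close to the F of 46.84 in the Madison ANOVA table. The same difference in means produces a far larger F, and so a far smaller p-value, when it is backed by over a thousand schools.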
At issue here, however, are the Reading First schools in Madison, and there are only four. We could run ANOVA, but with only four data points the results wouldn’t be reliable. All we can do is calculate the descriptive statistics and interpret them cautiously:
| Statistic | % Proficient+ 98-99 | % Proficient+ 04-05 |
|---|---|---|
| Mean | 50.64 | 66.50 |
| SE | 7.53 | 2.82 |
| Median | 47.14 | 65.90 |
| Mode | #N/A | #N/A |
| Stdev | 15.06 | 5.63 |
| Sample Variance | 226.92 | 31.73 |
| Kurtosis | 2.46 | 1.33 |
| Skewness | 1.29 | 0.61 |
| Range | 35.55 | 13.60 |
| Minimum | 36.36 | 60.30 |
| Maximum | 71.91 | 73.90 |
| Sum | 202.56 | 266.00 |
| Count | 4 | 4 |
| CL (95.0%) | 23.97 | 8.96 |
Did the percentage of students proficient or higher increase? Yes (50.64% to 66.5%). Again, though, we only have four data points here, so even the descriptive stats are questionable. The kurtosis suggests that fewer data were clustered around the mean in 04-05 than in 98-99 (1.33 versus 2.46), though the standard deviations and ranges indicate the reverse, but again, there are only four data points here.
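One rough way to see how little four schools can tell us is to turn the CL (95.0%) row into confidence intervals and check whether they overlap. The sketch below rebuilds those intervals from the means, standard deviations, and counts in the Reading First table. The critical value 3.182 is the standard two-sided 95% Student's t value for 3 degrees of freedom; the function name is my own. Overlap is an informal check, not a substitute for a proper test.

```python
from math import sqrt

# Two-sided 95% critical value of Student's t with 3 degrees of freedom
# (4 schools minus 1), taken from a standard t table.
T_CRIT_95_DF3 = 3.182

def confidence_interval(mean, stdev, count, t_crit):
    """Return (low, high): mean +/- t_crit * stdev / sqrt(count)."""
    half_width = t_crit * stdev / sqrt(count)
    return mean - half_width, mean + half_width

# Means, standard deviations, and counts from the Reading First table.
ci_9899 = confidence_interval(50.64, 15.06, 4, T_CRIT_95_DF3)
ci_0405 = confidence_interval(66.50, 5.63, 4, T_CRIT_95_DF3)

# Two intervals overlap when each one starts before the other ends.
overlap = ci_9899[0] <= ci_0405[1] and ci_0405[0] <= ci_9899[1]

print(f"98-99: {ci_9899[0]:.2f} to {ci_9899[1]:.2f}")
print(f"04-05: {ci_0405[0]:.2f} to {ci_0405[1]:.2f}")
print("intervals overlap:", overlap)
```

The half-widths should agree with the CL (95.0%) row to within rounding. The 98-99 interval is so wide that it swallows most of the 04-05 interval, which is another way of saying that four data points cannot pin the means down.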
From these data, we cannot say that the Reading First schools did or did not increase their proficiencies. There just aren’t enough data. But, and this is crucial, neither can the Reading First schools claim that they raised their proficiencies using these data. The only way we can determine whether these schools did or did not raise their proficiencies is by analyzing the raw data, and not the aggregates by school. In other words, Ken was right, and the journalist was wrong.
Remember that I said we are assuming that the state standards did not change, or if they did, that the scores in the two years we looked at are comparable? And remember how Wisconsin’s percentage of proficient or better went from 71.04% to 87.54% between the 98-99 and 04-05 school years? Ken states:
As the NAEP data clearly shows, the Wisconsin’s proficiency exam standards did change between 1998 and 2005. NAEP scores declined slightly, while the Wisconsin scores magically skyrocketed. Suspiciously so.
So, when RWP says that Wisconsin and Madison can validly claim that scores have increased, this conclusion only holds if Wisconsin’s proficiency exam standards didn’t change between 1998 and 2005. And, as we know from NAEP scores, they did.
Houston, we have a problem. You can write exams that are comparable to earlier exams. That’s why standardized test scores (SAT, GRE, GMAT, LSAT, etc.) are reliable from one administration to another. But if you change the standards, you can’t, because you change the definition of proficiency. Wisconsin has to explain the discrepancy between the NAEP data and their reported data, and if they did change standards, they must explain how, exactly, their scores from year to year are comparable, and how they can claim their reading proficiency increased.
Here endeth the lesson (add cute little smiley face here).
UPDATE: It looks like Wisconsin fudged the data.