Test Item Analysis Using Microsoft Excel Spreadsheet Program
Introduction
This article is written for teachers and researchers whose budgets are limited and who do not have access to purposely designed item analysis software such as Iteman (2003). It describes how to organize a computer spreadsheet such as Microsoft Excel in order to obtain statistical information about a test and the students who took it.
Using a fictitious example for clarity, and also a real example of a personally written University placement test, I will show how the information in a spreadsheet can be used to refine test items and make judicious placement decisions. Included is the web address for accessing the sample Excel files for the class of fictitious students (Elvin, 2003a , 2003b).
Background
I had been teaching in high schools in Japan for many years, and upon receiving my first University appointment, I was eager to make a good impression. My first task was to prepare a norm-referenced placement test to separate approximately one hundred first year medical students into ten relative levels of proficiency and place them into appropriate classes. This would allow teachers to determine appropriate curricular goals and adjust the teaching methodology based more closely on students’ personal needs. It was also hoped that a more congenial classroom atmosphere, with less frustration or boredom, would enhance motivation and engender a true learning environment.
The course was called Oral English Communication, so the placement test needed to measure this construct. However, since time restricted us to no more than half an hour for administering the test, a spoken component for the test was ruled out. It had to be listening only, and in order to ensure reliablity, the questions had to be as many as possible. I decided I could only achieve this by having as many rapid-fire questions as possible within the time constraint. In order for the test to be valid, I focused on points that one might expect to cover in an oral English course for "clever" first year university students. It was not possible to meet the students beforehand, so I estimated their level based on my experience of teaching advanced-level senior high school students.
Organizing The Spreadsheet – Part A
To show briefly how I compiled my students' test scores, I have provided here the results of a fabricated ten-item test taken by nine fictitious students. (see Table 1; to download a copy of this file, see Elvin, 2003a.) The purpose of this section of the spreadsheet is primarily to determine what proportion of the students answered each item, how many answered correctly, and how efficient the distractors were, It also helps the instructor prepare for item discrimination analysis in a separate part of the spreadsheet.
Table 1. Fabricated 10-Item Test - Part A: Actual letter choices
A | B | C | D | E | F | G | H | I | J | K | L | |
1 | ID | ITEM NUMBER | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
2 | 200201 | Arisa | D | A | A | B | C | D | A | A | B | D |
3 | 200202 | Kana | A | C | D | A | D | C | B | C | A | A |
4 | 200203 | Saki | D | B | D | B | C | D | D | A | B | A |
5 | 200204 | Tomomi | A | B | B | A | C | C | C | A | D | D |
6 | 200205 | Natsumi | C | B | A | B | C | D | B | C | C | D |
7 | 200206 | Haruka | C | B | A | B | C | D | A | A | B | C |
8 | 200207 | Momo | * | C | B | D | D | B | A | A | C | B |
9 | 200208 | Yuka | B | D | B | C | C | D | D | B | D | B |
10 | 200209 | Rie | C | B | A | B | C | B | A | A | B | C |
11 | CORRECT ANSWER | C | B | A | B | C | D | A | A | B | C | |
12 | A | 0.22 | 0.11 | 0.44 | 0.22 | 0.00 | 0.00 | 0.44 | 0.67 | 0.11 | 0.22 | |
13 | B | 0.11 | 0.56 | 0.33 | 0.56 | 0.00 | 0.22 | 0.22 | 0.11 | 0.44 | 0.11 | |
14 | C | 0.33 | 0.22 | 0.00 | 0.11 | 0.78 | 0.22 | 0.11 | 0.22 | 0.22 | 0.22 | |
15 | D | 0.22 | 0.11 | 0.22 | 0.11 | 0.22 | 0.56 | 0.22 | 0.00 | 0.22 | 0.44 | |
16 | TOTAL | 0.89 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 | 1.00 |
What proportion of students answered the question?
It may be expected that for a multiple-choice test, all of the students would answer all of the questions. In the real world, this is rarely true. The quality of a test item may be poor, or there may be psychological, environmental, or administrative factors to take into consideration. To try to identify these potential sources of measurement error, I calculate the ratio of students answering each question to students taking the test.
In cell C16 of the Excel file spreadsheet, the formula bar reads “=SUM (C12:C15)”, which adds the proportion of students answering “A”, “B”, “C” and “D” respectively. One student didn’t answer question 1 (cell C8), so the total for this item is eight out of nine, which is 0.89. Perhaps she was feeling a little nervous, or she couldn't hear well because of low speaker volume or noise in the classroom. The point is, if it is possible to determine what was responsible for a student or students not answering, then it may also be possible to rectify it. In some cases, a breakdown of questions on a spreadsheet can contribute to the discovery of such problems.
What proportion of students answered the question correctly?
For question 1, the correct answer is “C”, as shown in cell C11. The proportion of students who chose “C” is shown in cell C14. To calculate this value, we use the COUNTIF function. In cell C14, the formula bar reads “=COUNTIF(C2:C10,"C")/9”, which means that any cell from C2 to C10 which has the answer “C” is counted, and then divided by the number of test takers, which is nine for this test. This value is also the item facility for the question, which will be discussed in more detail later in this paper.
How efficient were the distractors?
We use the same function, COUNTIF, for finding the proportion of students who answered incorrectly. For item 5, cell G12 reads “=COUNTIF(G2:G10,"A")/9” in the formula bar. The two other distractors are “B”, which is shown in cell G13 (“=COUNTIF(G2:G10,”B”/9”) and “D”, which is shown in cell G15 (“=COUNTIF(G2:G10,”D”/9”). For this question, seven students answered correctly (answer “C”), and two students answered incorrectly by choosing “D”. If this test were real, and had many more test takers, I would want to find out why “A” and “B” were not chosen, and I would consider rewriting this question to make all three distractors equally attractive.
Preparing for item discrimination analysis in a separate part of the spreadsheet
Part A of the spreadsheet shows the letter choices students made in answering each question. In Part B of the spreadsheet, I score and rank the students, and analyze the test and test items numerically.
Organizing The Spreadsheet – Part B
The area C22:L30 in part B of the spreadsheet (see Table 2; also Elvin, 2oo3a) correlates to the absolute values of C2:L10 in part A of the spreadsheet in Table 1. This means that even after sorting the students by total score in part B of the spreadsheet, the new positions of the ranked students will still refer to their actual letter choices in part A of the spreadsheet.
Absolute cell references, unlike relative cell references, however, cannot be copied and pasted. They have to be typed in manually. It is therefore much quicker to make a linked worksheet with copied and pasted relative cell references. The linked spreadsheet can then be sorted without fear of automatic recalculation, as would happen if working within the same spreadsheet using relative references. For the actual test, I used a linked spreadsheet. If you would like to see a linked file, a copy of one is available for download from my website (see Elvin, 2003b).
The purposes of part B of the spreadsheet are to
Table 2. Fabricated 10-Item Test - Part B: Scoring and Ranking of Students
A | B | C | D | E | F | G | H | I | J | K | L | M | N | |
21 | ID | ITEM NUMBER | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | TOTAL | |
22 | 200206 | Haruka | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 10 | |
23 | 200209 | Rie | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 9 | |
24 | 200201 | Arisa | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 7 | |
25 | 200203 | Saki | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 6 | |
26 | 200205 | Natsumi | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 6 | |
27 | 200204 | Tomomi | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 3 | |
28 | 200207 | Momo | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2 | |
29 | 200208 | Yuuka | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 2 | |
30 | 200202 | Kana | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | |
31 | IF total | 0.33 | 0.56 | 0.44 | 0.56 | 0.78 | 0.56 | 0.44 | 0.67 | 0.44 | 0.22 | Reliability | 0.87 | |
32 | IF upper | 0.67 | 0.67 | 1.00 | 1.00 | 1.00 | 0.67 | 1.00 | 1.00 | 1.00 | 0.67 | Average | 5.00 | |
33 | IF lower | 0.00 | 0.00 | 0.00 | 0.00 | 0.33 | 0.33 | 0.33 | 0.33 | 0.00 | 0.00 | SD | 3.43 | |
34 | ID | 0.67 | 0.67 | 1.00 | 1.00 | 0.67 | 0.33 | 0.67 | 0.67 | 1.00 | 0.67 | SEM | 1.21 |
a) Converting students’ multiple-choice options to numerals
Cell C22, in this previously sorted part of the spreadsheet, reads “=IF($C$7="C",1,0)” in the formula bar. This means that Haruka has answered “C” for item 1 in cell C7 of part A of the spreadsheet, so she will score one point. If there is anything else in cell C7, she will score zero. (The dollar signs before C and 7 indicate absolute reference.)
b) Calculating total scores
Cell M22 reads “=SUM(C22:L22)”. This calculates one student’s total score by adding up her ones and zeros for all the items on the test from C22 to L22.
c) Sorting students by total scores
The area A22:M30 is selected. Sort is then chosen from the data menu in the menu bar, which brings up a pop-up menu and a choice of two radio buttons. Column M is selected from the pop-up menu, and the descending radio button is clicked. Finally, the OK button is selected. This sorts the students by test score from highest to lowest.
d) Computing item facility and item discrimination values
Item facility (IF) refers to the proportion of students who answered the question correctly. In part A of the spreadsheet, we calculated the IF using the COUNTIF function for the letter corresponding to the correct answer. With these letter answers now converted numerically, we can also calculate the IF using the SUM function. For example, In cell 31, the formula bar reads “=SUM(C22:C30)/9”, which gives us the IF for item 1 by adding all the ones and dividing by the number of test-takers.
The item discrimination (ID) is usually the difference between the IF for the top third of test takers and the IF for the bottom third of test takers for each item on a test (some prefer to use the top and bottom quarters). The IF for the top third is given in cell C32 and reads “=SUM(C22:C24)/3”. Similarly, the IF for the bottom third of test takers is given in cell C33, and reads “=SUM(C28:C30)/3”. The difference between these two scores, shown in cell 34 (“=C32-C33”), gives us the ID. This value is useful in norm-referenced tests such as placement tests because it is an indication of how well the test-takers are being purposefully spread for each item of the test.
e) Calculating the average score and standard deviation of the test
The Excel program has functions for average score and standard deviation, so they are both easy to calculate. Cell N32 reads “=AVERAGE(M22:M30)” in the formula bar, which gives us the average score. The standard deviation is shown in cell N33 and reads “=STDEV(M22:M30)” in the formula bar.
f) Determining the test’s reliability
I use the Kuder-Richardson 21 formula for calculating reliability because it is easy to compute, relying only on the number of test items, and the average and variance of the test scores.
The formula is KR-21 = n/n-1[1-{(X-X2/n)/S2}], where n is the number of test items, X is the average score, and S the standard deviation. (See Hatch and Lazaraton, 1991, p. 538 for information on the Kuder-Richardson 21 formula.)
In cell N31, the formula bar reads “=(10/9)*(1-(N32-N32*N32/10)/(N33*N33))”, which will give us a conservative estimate of the test’s reliability, compared to the more accurate but more complex KR-20 formula.
g) Estimating the standard error of measurement of the test
The true-score model, which was proposed by Spearman (1904), states that an individual’s observed test score is made up of two components, a true component and an error component. The standard error of measurement, according to Dudek (1979), is an estimate of the standard deviation expected for observed scores when the true score is held constant. It is therefore an indication of how much a student’s observed test score would be expected to fluctuate either side of her true test score because of extraneous circumstances. This error estimate uncertainty means that it is not possible to say for sure which side of a cut-off point the true score of a student whose observed score is within one SEM of that cut-off point truly lies. However, since the placement of students into streamed classes within our university is not likely to effect the students lives critically, I calculate SEM not so much to determine the borderline students, who in some circumstances may need further deliberation, but more to give myself a concrete indication of how confident I am that the process of streaming is being done equitably.
To measure SEM, we type in the formula bar for cell N34, “=SQRT(1-N31)*N33”, which gives us a value of 1.21. We can therefore say that students’ true scores will normally be within 1.21 points of their observed scores.
The 2002 Placement Test
A 50-item placement test was administered to 102 first year medical students in April, 2002. To the teachers and students present, it may have appeared to be a typical test, in a pretty booklet, with a nice font face, and the name of the college in bold. But this face value was its only redeeming feature. After statistical analysis, it was clear that its inherent weakness was that it was unacceptably difficult and therefore wholly inappropriate. If the degree to which a test is effective in spreading students out is directly related to the degree to which that test fits the ability levels of the students (Brown, 1996), then my placement test was ineffective because I had greatly overestimated the students’ level based on the naïve assumption that they would be similar to students in my high school teaching experience.
It had a very low average score – not much higher than guesswork, and such a low reliability, and therefore large SEM, that it meant that many students could not be definitively placed. In short, I was resigned to the fact that I’d be teaching mixed ability classes for the next two semesters.
Pilot Procedures for the 2003 Placement Test
Statistical analysis of the 2002 test meant that I had to discard almost all of the items. The good news was that at least I now had the opportunity to pilot some questions with my new students. I discovered that nearly all of them could read and write well, and many had impressive vocabularies. Most had been taught English almost entirely in Japanese, however, and very few of them had had much opportunity to practice English orally. Fewer still had had contact with a native-English speaker on a regular basis.
According to Brown (1996), ideal items of a norm-referenced language test should have an average IF of 0.50, and be in the range of 0.3 to 0.7 to be considered acceptable. Ebel’s guidelines (1979) for item discrimination consider an ID of greater than 0.2 to be satisfactory. These are the criteria I generally abide by when piloting test questions, after, of course, first confirming that these items are valid and also devoid of redundancy.
A Comparison Of The 2002 And 2003 Placement Tests
Table 3: A Comparison of the 2002 and 2003 placement tests
Reliability | Average | SD | SEM | IF<0.3 | 0.3=<IF<0.7 | IF>0.7 | ID>0.2 | |
2002 | 0.57 | 16.09 | 4.95 | 3.26 | 27 | 23 | 0 | 2 |
2003 | 0.74 | 24.8 | 6.71 | 3.44 | 3 | 40 | 7 | 38 |
A statistical analysis of the 2003 50-item test showed a great improvement compared to the previous year (see Table 3), with just ten items now falling outside the criteria guidelines. The average score of the 2003 test was very close to the ideal, but the reliability was still not as good as it should have been. Despite this, we were still able to identify and make provision for the highest and lowest scoring students, and feedback from all classes, thus far, has generally been very positive.
Conclusion
I plan to extend my database of acceptable test items to employ in developing the test for 2004. The reliability should improve once the bad items are replaced with acceptable ones, and distractor efficiency analysis may help to pinpoint which acceptable items can be modified further. My main concern, however, is the very small standard deviation. If it remains stubbornly small, we may have to conclude that our students are simply too homogenous to be streamed effectively, and that may ultimately force us to reconsider establishing mixed-ability classes.
References
Brown, J.D. (1996). Testing in language programs. Upper Saddle River, NJ: Prentice Hall.
Dudek, F.J. (1979). The continuing misinterpretation of the standard error of measurement. Psychological Bulletin, 86, 335-337.
Ebel, R.L. (1979). Essentials of educational measurement (3rd ed.). Englewood Cliffs, NJ: Prentice Hall.
Elvin, C. (2003a). Elvin’sdata.xls [Online]. Retrieved September 26, 2003, from
<www.eflclub.com/elvin/publications/2003/Elvinsdata.xls>.
Elvin, C. (2003b). Elvin’soutput.xls [Online]. Retrieved September 26, 2003, from <www.eflclub.com/elvin/publications/2003/Elvinsoutput.xls>.
Hatch, E. & Lazaraton, A. (1991). The research manual: Design and statistics for applied linguists. Boston, MA: Heinle & Heinle.
Iteman (Version 3.6). (1997). [Computer software]. St. Paul, MN: Assessment Systems Corporation.
Spearman, C. (1904). “General intelligence,” objectively determined and measured. American Journal of Psychology, 15, 201-293.