ITEM ANALYSIS PROCEDURES FOR
NORM-REFERENCED AND CRITERION-REFERENCED MASTERY TESTS
Norms
Meaning: A norm is an average score, standard score, or typical value on a test.
Etymological Meaning: Norms are the minimum criteria required at a particular period of time, while a standard is the criterion lying above or below the norm.
Definitions:
- Frank S. Freeman – “A norm is the average or standard scores on a test made by a specified population”.
- Thorndike & Hagen – “Norms are defined as the average performance on a particular test made by a standardized sample”.
Norm-Referenced: Position in the group or class is referred to as the norm-referenced dimension of performance.
Criterion-Referenced: Attainment of standards is referred to as the criterion-referenced dimension of performance, the criterion being an external, expected standard.
Meaning of the Terms
- Item = question
- Analysis = scrutiny / scrutinization
Item Analysis
Item analysis is the procedure of finding out which questions have discriminatory power and which do not.
Need of Item Analysis
An item analysis is needed to indicate which items are very easy or very difficult and which are not functioning properly. It is not uncommon for an item to appear satisfactory even to an expert while being intrinsically ambiguous, that is, to elicit undesired responses from the students.
The immediate purpose of item analysis is to determine the difficulty and discriminatory power of each item. When an item analysis is performed on a test, one is almost certain to gain additional important insight into the examinees’ thinking, understanding and test-taking behavior.
Objectives of Item Analysis
- To select suitable items for the final draft of the test and reject the poor ones.
- To find the difficulty value of all the items given in the preliminary draft.
- To select positively discriminating items and to reject the items of negative or zero discriminating power.
- To rectify the function of the distracters.
- To make the calculation of the validity and reliability of the test easy.
- To provide a logical basis for selection of items for the final draft.
Procedures / Steps in Item Analysis:
The following information is obtained through item analysis.
After the test has been given, the papers or answer sheets are scored by marking all items incorrectly answered or omitted. Because of the instruction concerning omissions, these should be few. Each student’s score (not corrected for chance) will be the number of items on the test less the number of errors (wrong or omitted items) on his or her paper or answer sheet.
1. Administration of the Test – The test, and later a re-test, is administered to a sample representing the same population. The conditions of administration are kept standard to avoid cheating or faking of responses. Sufficient time is given to attempt the test items, and clear instructions are given to the examinees.
2. Scoring – An answer key is prepared before scoring the items. One mark is given for every correct answer and zero for every wrong one. Scoring machines are also used in this work: if the answers are marked on a separate OMR sheet, a computer can score more than one thousand OMR sheets in an hour. A correlation coefficient is obtained from the test and re-test scores.
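A minimal sketch of this scoring step in Python (the answer key and responses below are hypothetical illustrations, not data from the text):

# Hypothetical answer key and student responses for a 5-item test.
answer_key = ["b", "d", "a", "c", "b"]

responses = {
    "Student1": ["b", "d", "a", "c", "a"],   # 4 correct
    "Student2": ["a", "d", "b", "c", "b"],   # 3 correct
}

def score(answers, key):
    """Give 1 mark for every correct answer and 0 for a wrong or omitted one."""
    return sum(1 for given, correct in zip(answers, key) if given == correct)

scores = {name: score(ans, answer_key) for name, ans in responses.items()}
print(scores)  # {'Student1': 4, 'Student2': 3}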
3. Arranging the answer sheets in descending order of total marks: The answer sheets of all the students (N) are put in descending order, placing the sheet with the highest score on top and continuing sequentially until the paper with the lowest score is at the bottom. Here N = 30.
4. Item analysis chart: After the test and re-test, the items are analysed. A chart is prepared of all the items, with the names of the examinees, as follows:
i) A tick (✓) mark for the items with correct answers (option 1 or 2 circled), that is, as expected by the researcher;
ii) a dash (–) mark for the items with option 3 circled;
iii) an X mark for the items with option 4 or 5 circled; and
iv) an O mark for the items not answered at all by the students.
The items with the highest number of students responding correctly, that is, circling 1 or 2, are selected, and the items with the highest number of students answering wrongly, that is, circling 3, 4 or 5, or not answering at all, are rejected.
5. N, the total number of students, is multiplied by 0.27 and the result is rounded off to the nearest whole number; the number obtained is called ‘n’. If N is 30 (30 x 0.27 = 8.1), ‘n’ would be 8, the rounded figure of 8.1.
6. Take the upper group (top 27% of the cases): The best n papers are counted off from the top of the stack. This is the “upper” / “high” (Ru) group.
7. Take the lower group (bottom 27% of the cases): The poorest n papers are counted off from the bottom of the stack. This is the “lower” (R1) group.
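Steps 3 and 5 to 7 can be sketched in Python as follows; the score list is a made-up illustration for N = 30:

# Hypothetical total scores for N = 30 examinees.
scores = [28, 27, 25, 25, 24, 23, 23, 22, 21, 21, 20, 19, 19, 18, 18,
          17, 17, 16, 15, 15, 14, 13, 13, 12, 11, 10, 9, 8, 7, 5]

N = len(scores)                        # 30
n = round(N * 0.27)                    # 30 x 0.27 = 8.1, rounded to 8

ranked = sorted(scores, reverse=True)  # descending order (step 3)
upper_group = ranked[:n]               # best n papers (step 6)
lower_group = ranked[-n:]              # poorest n papers (step 7)

print(n, upper_group, lower_group)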
8. Calculation of the Difficulty Index: In order to obtain an item difficulty index p, that is, the proportion of the total group who answered the item correctly, Ru and R1 are added and the sum is divided by Nu + N1.
The formula is:
p = (Ru + R1) ÷ (Nu + N1)
Here,
Ru = number of examinees in the upper group answering the item correctly,
R1 = number of examinees in the lower group answering the item correctly,
Nu = number of examinees in the upper group, and
N1 = number of examinees in the lower group.
This must be interpreted with the chance level of the item in mind. For example, p = 0.5 for a two-option item that all examinees mark probably indicates little or no knowledge of the point tested.
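A short Python sketch of the difficulty-index formula above (the counts used are hypothetical, not from the text):

def difficulty_index(Ru, R1, Nu, N1):
    """p = (Ru + R1) / (Nu + N1): proportion of the combined groups answering correctly."""
    return (Ru + R1) / (Nu + N1)

# Hypothetical item: 6 of 8 upper-group and 3 of 8 lower-group examinees answer correctly.
p = difficulty_index(Ru=6, R1=3, Nu=8, N1=8)
print(round(p, 2))  # 0.56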
9. Calculation of the Discrimination (V) Value: It shows how well an item differentiates between high achievers and low achievers. In order to obtain a measure of item discrimination V (that is, how well the item distinguishes between the students who understand the content universe and those who do not), subtract R1 from Ru and divide the difference by Nu (the two groups being of equal size).
The formula is:
V = (Ru – R1) ÷ Nu
For example, V = (17 – 10) ÷ 30 = 0.23 (Ru, R1 and Nu are taken from the distracter example in step 11).
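A minimal Python sketch of the discrimination value, using the same figures as the worked example above (equal-sized upper and lower groups are assumed):

def discrimination_value(Ru, R1, Nu):
    """V = (Ru - R1) / Nu: difference between upper- and lower-group successes
    when both groups contain Nu examinees."""
    return (Ru - R1) / Nu

V = discrimination_value(Ru=17, R1=10, Nu=30)
print(round(V, 2))  # 0.23, as in the example above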
The items that yield a discrimination index (V-value) of 0.4 or more are high in discrimination. Those with V-values below 0.2 are low in discrimination (Ebel, 1954). They deserve careful scrutiny, particularly if they are to be revised for future use. Items that are miskeyed or intrinsically ambiguous will tend to have negative V-values, or other options of the item will have higher V-values than the keyed-correct option.
Such options usually should be double-keyed where the distinction between the best and the next-best options was too fine for even the knowledgeable students to make. Of course, no item should be double-keyed if there is no logical justification in terms of the concept being measured. This logic may not be readily apparent to the test constructor, but can usually be supplied by high-scoring examinees who did not select the keyed-correct option.
10. Evaluation of the item: Evaluation is done with the help of the difficulty index and the discrimination value; items whose difficulty index lies between 40% and 60% and whose discrimination value is above 0.4 can be retained. The other items can be discarded.
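A sketch of this selection rule in Python (the item statistics are invented for illustration only):

# Hypothetical (item, p, V) values from an item-analysis run.
items = [("Q1", 0.55, 0.45), ("Q2", 0.90, 0.10), ("Q3", 0.48, 0.62), ("Q4", 0.35, 0.50)]

retained = [item for item, p, V in items if 0.40 <= p <= 0.60 and V > 0.40]
print(retained)  # ['Q1', 'Q3']; the remaining items are discarded or revised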
11. Effectiveness of the distracters: Item analysis is done for each item in the tool. If one answer is correct, the others are distracters. We can modify the distracters, i.e., we can use plausible distracters. An example is given below:
Options : a    b    c    d
Ru      : 5    17   3    2
R1      : 5    10   6    9
Here, ‘b’ is the keyed response and the others are distracters.
The formula is:
D = [(Ru + R1) ÷ (Nu + N1)] x 100
  = [(17 + 10) ÷ (30 + 30)] x 100 = (27 ÷ 60) x 100 = 45%.
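The same counts can be tabulated per option to judge the distracters; a Python sketch using the figures above:

# Response counts per option from the example; 'b' is the keyed answer.
Ru = {"a": 5, "b": 17, "c": 3, "d": 2}   # upper group, Nu = 30
R1 = {"a": 5, "b": 10, "c": 6, "d": 9}   # lower group, N1 = 30
Nu = N1 = 30

for option in Ru:
    D = (Ru[option] + R1[option]) / (Nu + N1) * 100
    print(option, round(D), "%")
# 'b' gives 45%; a plausible distracter should attract more lower- than upper-group examinees.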
Factors affecting item selection:
The difficulty value of an item (p), the discriminating power of an item (V) and the effectiveness of the distracters are affected by a number of factors, which are as follows:
1. Ambiguity and complexity of an item – Because of this, both the p-value and the V-value may be low. If the distracters are not functioning well, or are evenly distributed across the middle and lower groups, the p-value may be high while the V-value remains low.
2. Unfamiliarity of the examinees with the form of the test – They will commit many mistakes and the p-value will be low.
3. Methods of estimating the indices – There are more than a dozen
methods of estimating p-value and
V-value, and all of them give different results.
4. Techniques of dichotomizing low and high groups – Some people take the top 25% and the bottom 25% for the purpose of dichotomy. Kelley raised this limit to 27%. Some educators divide the whole group in the following way:
Top 33% | Middle 34% | Bottom 33%
All these dichotomies give different p-values and V-values.
5. Correction formula – Different correction formulae are used to reduce the influence of guessing. Students who do not guess while attempting the test suffer a lot. All correction formulae give different p-values and V-values.
Guilford’s formula of correction for guessing is as follows:
S = R – W ÷ (n – 1),
where R is the number of right responses, W the number of wrong responses, and n the number of response options per item.
For example, suppose an item is answered correctly by 300 examinees out of 400, each item having 5 response options. The corrected score will be
S = 300 – 100 ÷ (5 – 1) = 300 – 25 = 275.
So the number of right responses is taken as 275 instead of 300. The p-value, which was 300 ÷ 400 = 0.75 before the correction, goes down to 275 ÷ 400 ≈ 0.69.
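A minimal Python sketch of Guilford’s correction as stated above (the numbers match the worked example):

def corrected_rights(R, W, n_options):
    """S = R - W / (n - 1): rights corrected for chance guessing,
    where n is the number of response options per item."""
    return R - W / (n_options - 1)

S = corrected_rights(R=300, W=100, n_options=5)   # 300 - 100/4 = 275
p_before = 300 / 400                              # 0.75
p_after = S / 400                                 # about 0.69
print(S, p_before, round(p_after, 2))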
6. Ability level of the examinees – If the group of examinees is homogeneous (very talented or very poor), the index of discrimination will be low. Similarly, a high ability level of the examinees increases the p-values of items, and vice versa. On the contrary, if the group is heterogeneous, the V-values will be high and the reliability of the test will also be high due to the high standard deviation.
7. Skill of test construction – The more skillfully the test has been constructed, the more valid and reliable it will be. As a result, the range of p-values will be from 0.25 to 0.75, with the desired degree of V-values. Similarly, if the distracters are very close to the right option, p-values will be very low. All this depends on the skill and experience of the test-maker.
8. Lack of time for attempting the items – If reasonable time is not made available and a power test becomes a speed test, then at least some of the students will not be able to reach the last 10% to 15% of the items. This will lower both the p-values and the V-values.
Difficulties and limitations of Item Analysis
A number of techniques and methods have been developed so far for the purpose of item analysis; they are used by test-makers when a test is standardized. Despite this, some basic problems remain unresolved, and these are discussed below.
1. Problem of spurious correlation in item total correlation –
Whenever an individual item is correlated with the total score for the purpose
of validating the item, the obtained coefficients are spuriously high. This
problem becomes more serious when all the items in the test measure almost the
same function.
Two things may be done to minimize this problem:
a) The number of items in the test should be kept large.
b) Heterogeneous rather than homogeneous items should be included in the test, that is, items with extreme indices should be present.
2. Problem relating to dichotomous items – Items with true/false, agree/disagree, yes/no, etc. responses are generally used in non-ability tests such as personality tests, interest inventories and attitude scales. For example, suppose the test-maker constructs 100 dichotomous items: 50 positive items carrying +1 mark each and 50 negative items carrying a zero score. In this case, all item-total correlations will be close to zero, and all positive statements will correlate negatively with the negative statements.
3. Problem of controlling unwanted factors – In a homogeneous test, all items are said to measure only one factor, yet in practice they do not correlate with that factor alone. For example, an item in an aptitude test intended to measure quantitative ability is likely to correlate with verbal comprehension, an unwanted factor, as well.
4. Problem of guessing – The problem of guessing is very common in all types of objective tests. It inflates the scores and increases both the p-values and the V-values. This problem can be minimized by introducing negative marking in the scoring procedure: for every wrong answer, ¼ mark should be deducted from the total score. For example, if an examinee attempts 20 questions wrongly out of 100, his total score will be:
100 – (20 + ¼ x 20) = 75.
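A small Python sketch of this penalty scoring, using the figures in the example:

def penalized_score(total_items, wrong):
    """Deduct the wrong answers plus a quarter mark per wrong answer from the maximum score."""
    return total_items - (wrong + 0.25 * wrong)

print(penalized_score(total_items=100, wrong=20))  # 75.0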
5. Problem related to time limit – No power test is purely a power test in itself. The examinees are required to complete the test within a reasonable time, so it becomes a speed test for about 25% of the examinees. In this way their scores are affected by the time limit, resulting in low p-values and high V-values.
Conclusion:
An item analysis, comparing the performance of the most and least successful examinees on the total test on each item, identifies items that are non-functional, intrinsically ambiguous or miskeyed, so that they can be revised or removed from the tool. Not only does this procedure usually improve the reliability, and hence the validity, of a particular test, but the experience of studying the students’ responses in depth also helps the instructor in his teaching and in subsequent test construction.
Items of moderate difficulty have the potential for good item discrimination. The theoretical maximum item discrimination V-value (1.0) is possible only when the item difficulty is 0.5.
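A brief illustration of this claim, assuming equal-sized upper and lower groups: for an item of difficulty p, the largest possible V is 2 x min(p, 1 - p), which reaches 1.0 only at p = 0.5.

def max_discrimination(p):
    """Upper bound on V for an item of difficulty p, assuming Nu = N1:
    the best case concentrates successes in the upper group and failures in the lower group."""
    return 2 * min(p, 1 - p)

for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(max_discrimination(p), 2))
# Only p = 0.5 allows the theoretical maximum V of 1.0.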
A chain of relationships exists between certain item and test characteristics. Item difficulty affects the possible item discrimination, which in turn directly determines the variance and the internal-consistency reliability of the test scores. Reliability is necessary, but not sufficient, for validity.
Reference
- Julian C. Stanley & Kenneth D. Hopkins: Educational and Psychological Measurement and Evaluation, Prentice Hall of India Pvt. Ltd. (Item Analysis for Classroom Tests, pp. 267-281).
- M.S. Ansari: UGC-Education, Ramesh Publishing House, New Delhi (Item Analysis, pp. 683-691).
- R.A. Sharma: Technological Foundation of Education, R. Lall Book Depot (Evaluation of Teaching-Learning, pp. 360 & 386-387).