ITEM ANALYSIS - ADVANCED EDUCATIONAL RESEARCH AND STATISTICS



ITEM ANALYSIS PROCEDURES FOR NORM-REFERENCED AND CRITERION-REFERENCED MASTERY TESTS

Norms

Meaning: Norms mean average scores, standard scores, or values.

Etymological Meaning: Norms are the minimum criteria required at a particular point in time, and a standard is a criterion above or below the norms.

Definitions:
  1. Frank S. Freeman – “A norm is the average or standard scores on a test made by a specified population”.
  2. Thorndike & Hagen – “Norms are defined as the average performance on a particular test made by a standardized sample”.

Norm-Referenced: Position in the group or the class is referred to as the norm-referenced dimension of performance.
Criterion-referenced: Attainment of standards is referred to as the criterion-referenced dimension of performance, the criterion being an external, expected standard.

Meaning of the Terms
  1. Item = question
  2. Analysis = Scrutiny / scrutinization

Item Analysis
            Item analysis is the procedure of finding out which questions have discriminatory power and which do not.

Need of Item Analysis
            An item analysis is needed to indicate which items are very easy or very difficult and which are not functioning properly. It is not uncommon for an item to appear satisfactory even to an expert while being intrinsically ambiguous – that is, to elicit undesired responses from the students.

            The immediate purpose of item analysis is to determine the difficulty and discriminatory power of each item. When an item analysis is performed on a test, one is almost certain to gain additional important insight into the examinees’ thinking, understanding and test-taking behavior.

Objectives of Item Analysis

  1. To select suitable items for the final draft of the test and reject the poor ones.
  2. To find the difficulty value of all the items given in the preliminary draft.
  3. To select positively discriminating items and to reject the items of negative or zero discriminating power.
  4. To rectify the functioning of the distracters.
  5. To make the calculation of the validity and reliability of the test easy.
  6. To provide a logical basis for selection of items for the final draft.

Procedures / Steps in Item Analysis:
The following information is obtained through item analysis.
            After the test has been given, the papers or answer sheets are scored by marking all items incorrectly answered or omitted. Because of the instructions concerning omissions, these should be few. Each student's score (not corrected for chance) will be the number of items on the test less the number of errors (wrong or omitted items) on his/her paper or answer sheet.

1. Administration of the Test – After the test, a re-test is administered to a sample representing the same population. The conditions of administration are kept standard to avoid cheating or faking of responses. Sufficient time is given to attempt the test items, and clear instructions are given to the examinees.

2. Scoring – An answer key is prepared before scoring the items. One mark is given for every correct answer and zero for every wrong one. Scoring machines are also used in this work; if the answers are marked on a separate OMR sheet, a computer can score more than one thousand OMR sheets in an hour. A correlation coefficient is obtained by correlating the test and re-test scores.
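
A minimal Python sketch of this scoring step (the three-item answer key and the variable names below are illustrative assumptions, not taken from the source):

answer_key = {1: "b", 2: "d", 3: "a"}   # hypothetical keyed-correct options

def score_sheet(responses, key=answer_key):
    # 1 mark for every correct answer, 0 for every wrong or omitted one
    item_scores = {item: int(responses.get(item) == correct)
                   for item, correct in key.items()}
    return item_scores, sum(item_scores.values())

item_scores, total = score_sheet({1: "b", 2: "c", 3: "a"})
print(item_scores, total)   # {1: 1, 2: 0, 3: 1} 2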

3. Arranging the answer sheets in descending order of total marks: The answer sheets of all the students (N) are put in descending order, placing the sheet with the highest score on top and continuing sequentially until the sheet with the lowest score is at the bottom. Here, N = 30.

4. Item analysis chart: After the test and re-test, the items are analysed. A chart is prepared of all the items with the names of the examinees, using the following marks:
i) Tick (✓) marks for the items answered correctly (1 or 2 circled), that is, as expected by the researcher.
ii) – marks for the items which have 3 circled.
iii) X marks for the items which have 4 or 5 circled, and
iv) O marks for the items not answered at all by the students.

            The items with the highest number of students responding correctly (circling 1 or 2) are selected; items with the highest number of students answering wrongly (circling 3, 4, or 5) or not answering at all are rejected.

5. Calculation of n: N, the total number of students, is multiplied by 0.27 and the result is rounded off to the nearest whole number; the obtained number is called 'n'. If N is 30 (30 x 0.27 = 8.1), n would be 8, the rounded figure of 8.1.

6. Take the upper 27% of the cases: The best n papers are counted off from the top of the stack. This is the "upper" / "high" group (its correct responses per item give Ru).

7. Take the lower 27% of the cases: The poorest n papers are counted off from the bottom of the stack. This is the "lower" group (its correct responses per item give R1).
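
Steps 3 to 7 can be sketched in Python as follows; the list of total scores is invented for illustration, and only the 0.27 proportion comes from the text:

totals = sorted([22, 18, 25, 9, 14, 27, 11, 20, 16, 23], reverse=True)  # invented totals

N = len(totals)
n = round(N * 0.27)          # e.g. 30 x 0.27 = 8.1, rounded to 8 in the text's example

upper_group = totals[:n]     # the best n papers from the top of the stack
lower_group = totals[-n:]    # the poorest n papers from the bottom of the stack
print(n, upper_group, lower_group)   # 3 [27, 25, 23] [14, 11, 9]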

8. Calculation of the Difficulty Index: In order to obtain an item difficulty index 'p', that is, the proportion of the total group who answered the item correctly, Ru and R1 are added and the sum is divided by Nu + N1.

Formula is:
            p = (Ru + R1) ÷ (Nu + N1)

Here,
            Ru = Number of examinees in the upper group answering the items correctly.
            R1 = Number of examinees in the lower group answering the items correctly.
            Nu = Number of examinees in the upper group.
            N1 = Number of examinees in the lower group.

            This must be interpreted with the chance level of the item in mind. For example, p = 0.5 for a two-option item that all examinees mark probably indicates little or no knowledge of the point tested.
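
A short sketch of the difficulty-index calculation, using the counts that appear in the distracter table of step 11 below:

def difficulty_index(Ru, R1, Nu, N1):
    # p = (Ru + R1) / (Nu + N1): proportion of the combined groups answering correctly
    return (Ru + R1) / (Nu + N1)

p = difficulty_index(Ru=17, R1=10, Nu=30, N1=30)
print(round(p, 2))   # 0.45, i.e. 45% as in step 11
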
9. Calculation of the discrimination (V) value: It shows how an item differentiates between high achievers and low achievers. In order to obtain a measure of item discrimination 'V' (that is, how well the item distinguishes between students who understand the content universe and those who do not), subtract R1 from Ru and divide the difference by Nu.

The formula is:
            V = (Ru − R1) ÷ Nu

For example, V = (17 − 10) ÷ 30 = 0.23 (Ru, R1 and Nu are taken from the distracter table in step 11).
            The items that yield a discrimination index (V-value) of 0.4 or more are high in discrimination. Those with V-values below 0.2 are low in discrimination (Ebel, 1954). They deserve careful scrutiny, particularly if they are revised for future use. Items that are miskeyed or intrinsically ambiguous will tend to have negative V-values, or other options of that item will have higher V-values than the keyed-correct option.
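
A matching sketch for the discrimination value, with Ebel's (1954) rough cut-offs quoted above (the counts are again those of the step-11 table):

def discrimination_value(Ru, R1, Nu):
    # V = (Ru - R1) / Nu: upper-group minus lower-group successes, as a proportion
    return (Ru - R1) / Nu

V = discrimination_value(Ru=17, R1=10, Nu=30)
if V >= 0.4:
    label = "high in discrimination"
elif V < 0.2:
    label = "low in discrimination - deserves careful scrutiny"
else:
    label = "moderate"
print(round(V, 2), label)   # 0.23 moderate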

            Such items usually should be double-keyed because the distinction between the best and the next best option was too fine for the knowledgeable students to make. Of course, no item should be double-keyed if there is no logical justification in terms of the concept being measured. This logic may not be readily apparent to the test constructor, but it can usually be supplied by high-scoring examinees who did not select the keyed-correct option.

10. Evaluation of the item: Evaluation is done with the help of the difficulty index and the discrimination value; i.e., items whose difficulty index lies between 40% and 60% and whose discrimination value is above 0.4 can be retained. The other items can be discarded.
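
The retention rule of step 10 amounts to a simple filter; the (item, p, V) triples below are invented for illustration:

items = [(1, 0.45, 0.45), (2, 0.80, 0.10), (3, 0.55, 0.23)]   # invented (item, p, V) triples

retained = [i for i, p, V in items if 0.40 <= p <= 0.60 and V > 0.4]
discarded = [i for i, p, V in items if i not in retained]
print(retained, discarded)   # [1] [2, 3]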

11. Effectiveness of the distracters: Item analysis is done for each item in the tool. If one answer is correct, the others are distracters. We can modify the distracters, i.e., replace weak ones with plausible distracters. An example is given below –

Options           :           a          b          c          d

Ru                   :           5          17        3          2

R1                   :           5          10        6          9

            Here, ‘b’ is the keyed-response and the others are distracters.

Formula is: D = (Ru + R1) ÷ (Nu + N1) x 100
                        = (17 + 10) ÷ (30 + 30) x 100 = 27 ÷ 60 x 100 = 45%.
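
The same tallies can be inspected for every option at once. A sketch, assuming the per-option counts are stored in dictionaries keyed by option letter (the numbers are those of the table above):

Ru_counts = {"a": 5, "b": 17, "c": 3, "d": 2}   # upper-group choices, from the table
R1_counts = {"a": 5, "b": 10, "c": 6, "d": 9}   # lower-group choices, from the table
key = "b"                                        # keyed response
Nu = N1 = 30

# Difficulty of the keyed response, as a percentage
D = (Ru_counts[key] + R1_counts[key]) / (Nu + N1) * 100
print(round(D))   # 45

# A plausible distracter should draw more lower-group than upper-group examinees
for option in Ru_counts:
    if option != key and Ru_counts[option] >= R1_counts[option]:
        print(f"Option {option} may need revision "
              f"(upper {Ru_counts[option]} >= lower {R1_counts[option]})")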

Factors affecting item selection:

            The difficulty value of an item (p), the discriminating power of an item (V), and the effectiveness of the distracters are affected by a number of factors, which are as follows –

1. Ambiguity and complexity of an item – Due to this, both the p-value and the V-value may be low. If the distracters are not functioning well or are not evenly distributed among the middle and the lower groups, the p-value may be high and the V-value low.

2. Non-familiarity of the examinees with the form of the test – They will commit many mistakes and the p-value will be low.


3. Methods of estimating the indices – There are more than a dozen methods of estimating p-value and V-value, and all of them give different results.

4. Techniques of dichotomizing the low and high groups – Some people take the top 25% and the bottom 25% for the purpose of dichotomy. Kelley raised this limit to 27%. Some educators divide the whole group in the following way:

            Top                             Middle                        Bottom

            33%                              34%                           33%

            All the dichotomies will give different p-values and V-values.

5. Correction formula – Different correction formulae are used to reduce the influence of guessing. Students who do not guess while attempting the test suffer a lot. All correction formulae give different p-values and V-values.

            Guilford’s formula of correction for guessing is as follows:

                        S = R – W ÷ (N – 1), where S is the corrected score, R the number of right responses, W the number of wrong responses, and N the number of response options.

            Example: suppose an item is correctly answered by 300 examinees out of 400, each item having 5 response options. The corrected score will be

            S = 300 – 100 ÷ (5 – 1) = 300 – 25 = 275

            So the number of right responses is taken as 275 instead of 300. Now the p-value, which was 300 ÷ 400 = 0.75 before the correction, goes down to 275 ÷ 400 = 0.6875.
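
Guilford's correction, coded directly from the formula (R, W, and the option count come from the example above):

def corrected_rights(R, W, options):
    # S = R - W / (options - 1): Guilford's correction for guessing
    return R - W / (options - 1)

S = corrected_rights(R=300, W=100, options=5)
print(S, 300 / 400, S / 400)   # 275.0 0.75 0.6875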

6. Ability level of the examinees – If the group of examinees is homogeneous (very talented or very poor), the index of discrimination will be low. Similarly, a high ability level of examinees increases the p-values of items and vice versa. Conversely, if the group is heterogeneous, the V-value will be high and the reliability of the test will also be high due to the high standard deviation.

7. Skill of test construction – The more skillfully the test has been constructed, the more valid and reliable it will be. As a result, the range of p-values will be from 0.25 to 0.75 with the desired degree of V-values. Similarly, if the distracters are very close to the right option, the p-values will be very low. All this depends on the skill and experience of the test-maker.

8. Lack of time for attempting the items – If reasonable time is not made available and a power test becomes a speed test, then at least some of the students will not be able to reach the last 10% to 15% of the items. This lowers both the p-values and the V-values.

Difficulties and limitations of Item Analysis

            A number of techniques and methods have been developed for item analysis, which test-makers use while standardizing a test. Despite this, some basic problems still remain unresolved; these are discussed below.

1. Problem of spurious correlation in item-total correlation – Whenever an individual item is correlated with the total score for the purpose of validating the item, the obtained coefficients are spuriously high, because the item itself contributes to the total score. This problem becomes more serious when all the items in the test measure almost the same function.


            Two things may be done for the purpose of minimizing this problem:

a)      The number of items in the test should be kept large.

b)      Heterogeneous rather than homogeneous items should be included in the test; that is, items with extreme indices should be included.

2. Problem relating to dichotomous items – Items with true/false, agree/disagree, yes/no, etc. responses are generally used in non-ability tests, like personality tests, interest inventories, attitude scales, etc. For example, suppose the test maker constructs 100 dichotomous items: 50 positive items carrying +1 mark each and 50 negative items carrying a zero score. In this case, all item-total correlations will be close to zero, and all positive statements will correlate negatively with negative statements.

3. Problem of controlling unwanted factors – In a homogeneous test, all items are supposed to measure only one factor, yet they do not all correlate with each other. For example, an aptitude test measuring quantitative ability and verbal comprehension is likely to correlate with a general factor as well.

4. Problem of guessing – Guessing is very common in all types of objective tests. It inflates the scores and increases both the p-values and the V-values. This problem can be minimized by introducing negative marking in the scoring procedure: for every wrong answer, ¼ mark is deducted from the total score. For example, if an examinee answers 20 questions wrongly out of 100, his total score will be:

            100 – (20 + ¼ x 20) = 75.
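
The same negative-marking rule, as a one-line check of the arithmetic above:

attempted, wrong = 100, 20
score = attempted - (wrong + 0.25 * wrong)   # lose the wrong items plus 1/4 mark for each
print(score)   # 75.0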

5. Problem related to time limit – No power test is purely a power test. The examinees are required to complete the test within a reasonable time, so it becomes a speed test for about 25% of the examinees. Their scores are thus affected by the time limit, resulting in low p-values and high V-values.

Conclusion:

            An item analysis, comparing the performance on each item of the most and least successful examinees on the total test, will identify items that are non-functional, intrinsically ambiguous, or miskeyed, so that they can be revised or removed from the tool. Usually, not only does this procedure improve the reliability and hence the validity of a particular test, but the experience of studying the students' responses in depth also helps the instructor in his teaching and in subsequent test construction.

            Items of moderate difficulty have the potential for good item discrimination. The theoretical maximum item discrimination ‘V’-value (1.0) is possible only when item difficulty is 0.5.

            A chain of relationship exists between certain item and test characteristics. Item difficulty affects possible item discrimination which in turn directly determines the variance and internal-consistency reliability of the test scores. Reliability is necessary, but not sufficient, for validity.

Reference

  1. Julian C. Stanley & Kenneth D. Hopkins: Educational and Psychological Measurement and Evaluation, Prentice Hall of India Pvt. Ltd. (Item Analysis for Classroom Tests, pp. 267-281).
  2. Dr. M.S. Ansari: UGC-Education, Ramesh Publishing House, New Delhi (Item Analysis, pp.683-691).
  3. R.A. Sharma: Technological Foundation of Education, R. Lall Book Depot (Evaluation of Teaching-Learning, pp.360 & 386-387).
