Test Construction


  • Test= systematic method for measuring a sample of behavior
  • Test construction involves:
    • Specifying the test’s purpose
    • Generating test items
    • Administering the items to a sample of examinees for the purpose of item analysis
    • Evaluating the test’s reliability and validity
    • Establishing norms

Item analysis

  • Relevance– extent to which test items contribute to achieving the stated goals of testing

To determine relevance:

  • Content appropriateness: Does the item actually assess the domain the test is designed to evaluate
  • Taxonomic level: Does the item reflect the appropriate ability level?
  • Extraneous abilities: Does the item require knowledge, skills, or abilities outside the domain of interest
  • Ethics– to meet the requirements of privacy, the information asked by test items must be relevant to the stated purpose of testing (Anastasi)

Item Difficulty

  • Determined by the Item Difficulty Index (p)
    • values of p range from 0-1
    • calculated by dividing # who answered correctly by total # of sample
    • larger values indicate easier items
      • p=1, all people answered correctly
      • p=0, no one answered correctly
    • Typically, items with moderate difficulty level (p=.5) are retained
      • increases score variability
      • ensure that scores will be normally distributed
      • provides maximum differentiation between subjects
      • helps maximize test’s reliability
    • Optimal difficulty level is affected by several factors
      • the greater the probability that the correct answer can be selected by guessing, the higher the optimal difficulty level
        • for a true/false item, where chance is .50, the preferred difficulty level is .75
      • if the goal of testing is to choose a certain number of examinees, the preferred difficulty level equals the proportion of examinees to be chosen
        • if only 15% of examinees are to be admitted, the average item difficulty level for the entire test should be .15

People in sample must be representative of population of interest, since difficulty index is affected by nature of tryout sample. Study tip- In most situations p=.50 is optimal, except in T/F test, where p=.75 is optimal.
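The calculation of p described above can be sketched in a few lines of Python. The response data are hypothetical, and the midpoint rule in `optimal_difficulty` is an assumption: it is a common rule of thumb that reproduces the .75 value given above for true/false items, but the notes themselves do not state a general formula.

```python
def item_difficulty(responses):
    """Item difficulty index p: proportion of examinees who answered correctly.

    `responses` holds 1 (correct) or 0 (incorrect) for each examinee.
    """
    return sum(responses) / len(responses)


def optimal_difficulty(chance):
    """Assumed rule of thumb: halfway between chance-level success and 1.0."""
    return (1.0 + chance) / 2


# Hypothetical tryout data for one item, 10 examinees
item = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]
p = item_difficulty(item)            # 7 of 10 correct -> 0.7

true_false = optimal_difficulty(0.50)   # 0.75, matching the study tip
four_choice = optimal_difficulty(0.25)  # 0.625 for four-option multiple choice
```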

Item Discrimination

  • Extent to which an item differentiates between examinees who obtain high vs low scores
  • Item Discrimination Index (D)
    • identify the examinees whose total test scores fall in the upper 27% and the lower 27%
    • for each item, subtract the percent of examinees in the lower-scoring group (L) from the percent of examinees in the upper-scoring (U) group who answered it right
  • D= U-L
    • range from –1 to +1.
      • D= +1 if all in upper group and none in lower group answer right
      • D= -1 if none in upper group and all in lower group answer right
    • Acceptable D= .35 or higher
    • Items with moderate difficulty level (.50) have greatest potential for maximum discrimination
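As a rough illustration of D = U - L, the sketch below ranks examinees by total score, takes the top and bottom 27%, and compares the proportion in each group who answered one item correctly. The function name, the rounding of the group size, and the data are illustrative assumptions.

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """D = U - L for a single item.

    item_correct: 1/0 per examinee for this item
    total_scores: total test score per examinee (same order)
    """
    n = len(total_scores)
    k = max(1, round(n * fraction))            # size of each extreme group (assumed rounding)
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    upper_p = sum(item_correct[i] for i in upper) / k
    lower_p = sum(item_correct[i] for i in lower) / k
    return upper_p - lower_p


# Hypothetical data: 10 examinees, already listed from lowest to highest total score
totals = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
item = [0, 0, 1, 0, 1, 0, 1, 1, 1, 1]
d = discrimination_index(item, totals)   # U = 3/3, L = 1/3, so D is about 0.67
```

With these numbers the item would clear the .35 threshold mentioned above.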

Item Response Theory

Classical Test Theory

  • An obtained test score reflects truth and error
  • Concerned with item difficulty and discrimination, reliability, and validity
  • Shortcomings
    • item and test parameters are sample-dependent
      • item difficulty index, reliability coeff, etc likely to differ between samples
    • Difficult to equate scores on different tests
      • score of 50 on one test doesn’t = 50 on another

Item Response Theory (IRT)

  • Advantages over Classical Test Theory
    • item characteristics are sample invariant- same across samples
    • test scores are reported in terms of examinee’s level of a trait rather than in terms of total test score, making it possible to equate scores from different tests
    • Has indices that help identify item biases
    • Easier to develop computer-adaptive tests, where administration of items is based on examinee’s performance on previous items
  • Item characteristic curve (ICC)-
    • plot the proportion of ppl who answered correctly against the total test score, performance on an external criterion, or mathematically-derived estimate of ability
    • provides info on relationship between an examinee’s level on the trait measured by the test and the probability that he will respond correctly to that item
    • P value: probability of getting item correct based on examinee’s overall level
  • Various ICCs provide information on either 1, 2, or 3 parameters:
    • difficulty
    • difficulty and discrimination
    • difficulty, discrimination, and guessing (probability of answering right by chance)
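To make the three parameters concrete, here is a minimal sketch of an item characteristic curve. The logistic form is an assumption (the notes do not give a functional form), and the parameter names `a`, `b`, `c`, and `theta` are the conventional ones rather than names used above.

```python
import math


def icc(theta, b, a=1.0, c=0.0):
    """Probability of a correct response for an examinee at trait level theta.

    b: item difficulty (trait level where the curve is steepest)
    a: item discrimination (slope of the curve)
    c: guessing parameter (probability of answering right by chance)
    Leaving a and c at their defaults gives a one-parameter curve.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


one_param = icc(theta=0.0, b=0.0)                 # 0.5: ability equals difficulty
three_param = icc(theta=0.0, b=0.0, a=2.0, c=0.2)  # about 0.6: guessing raises the floor
```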

Study tip Link these with IRT- sample invariant, test equating, computer adaptive tests


  • Classical test theory– obtained score (X) composed of true score (T) and error component (E) where X = T + E
  • True score= examinee’s status with regard to attribute measured by test
  • Error component= measurement error
  • Measurement error= random error due to factors that are irrelevant to what is being measured and have an unpredictable effect on test score
  • Reliability– estimate of proportion of variability in score that is due to true differences among examinees on the attribute being measured
    • when a test is reliable, it provides consistent results
    • consistency = reliability

Reliability Coefficient

  • Reliability Index– (in theory)
    • calculated by dividing true score variance by the obtained variance
    • would indicate proportion of variability in test scores that reflects true variability
    • however, true test scores variance not known so reliability must be estimated
  • Ways to estimate a test’s reliability:
  • consistency of response over time
  • consistency across content samples
  • consistency across scorers
  • Variability that is consistent is true score variability
  • Variability that is inconsistent is random error
  • Reliability coefficient– correlation coefficient for estimating reliability
    • ranges from 0-1
    • r = 0, all variability is due to measurement error
    • r = 1, all variability due to true score variability
    • Reliability coefficient symbolized by rxx .
      • Subscript indicates correlation coefficient calculated by correlating test with itself rather than with another measure
    • Coefficient is interpreted directly as the proportion of variability in obtained test scores that reflects true score variability
      • r= .84 means that 84% of variability in scores due to true score differences while 16% (1.0 – .84) is due to measurement error.
      • If double it, reflects twice as much variability
      • Does NOT indicate what is being measured by a test
      • Only indicates whether it is being measured in a consistent precise way

Study tip Unlike other correlations, the reliability coefficient is NEVER squared to interpret it. It is interpreted directly as a measure of true score variability. r=.89 means that 89% of variability in obtained scores is true score variability.
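The verbal definitions above can be written compactly. This is a sketch under the usual classical test theory assumption that true scores and errors are uncorrelated; the variance symbols are introduced here for illustration and do not appear in the notes.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
r_{xx} = \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2}
```

For r = .84, the ratio of true score variance to obtained variance is .84 and the ratio of error variance to obtained variance is 1.0 – .84 = .16, which is why the coefficient is read directly rather than squared.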

Methods for Estimating Reliability

  1. Test-Retest Reliability
  • administering the same test to the same group on two occasions
  • correlating the two sets of scores
  • reliability coefficient indicates degree of stability or consistency of scores over time
    • coefficient of stability
  • Source of error
    • Random Factors-Primary sources of error are random factors related to the time that passes
  • Time sampling factors
    • random fluctuations in examinees over time (anxiety, motivation)
    • random variations in testing situation
    • memory and practice effects when they don’t affect all examinees in the same way
  • Appropriate for tests that measure things that are stable over time and not affected by repeated measurement.
    • good for aptitude
    • bad for mood or creativity
  • Higher coefficient than alternate form because only one source of error
  2. Alternate (Equivalent, Parallel) Forms Reliability
  • two equivalent forms of the test are given to same group and scores are correlated
  • consistency of response to different item samples
  • may be over time, if given on two different occasions
  • alternate forms reliability coefficient
    • coefficient of equivalence when administered at same time
    • coefficient of equivalence and stability when administered at two different times
  • Content Sampling-Primary source of error is content sampling
    • error introduced by an interaction between different examinees knowledge and the different content assessed by the two forms
    • items on form A might be a better match for one examinee’s knowledge, while the opposite is true for another examinee
    • two scores will differ, lowering the reliability coefficient
  • Time sampling can also cause error
    • Not good for
      • when attribute not stable over time
      • when scores can be affected by repeated measurement
      • when same strategies used to solve problems on both forms= practice effect
      • when practice differs for different examinees (are random), it is a source for measurement error
    • Good for speeded tests
    • Considered to be the most rigorous and best method for estimating reliability
    • Often not done because of difficulty of creating two equal forms
    • Less affected by heterogenous items than internal consistency
      • higher coefficient than internal consistency (KR-20) when items are heterogeneous
  3. Internal Consistency Reliability
  • Administering the test once to a single sample
  • Yields the coefficient of internal consistency
  • Split half
    • Test is split into equal halves so that each examinee has two scores
    • Scores are correlated
    • Most common to divide by even and odd numbers
    • Problem- produces reliability coefficient based on test scores derived from one-half of the length of the test
      • reliability tends to decrease as length of test decreases
      • split-half tends to underestimate true reliability
      • however when two halves not = in mean or SD, may either under or overestimate
      • corrected using the Spearman-Brown prophecy formula, which estimates what reliability coefficient would have been
      • S-B used to estimate effects of increasing or decreasing length of test on reliability
    • Cronbach’s coefficient alpha
      • Test administered once to single sample
      • Formula used to determine average degree of inter-item consistency
      • Average reliability that would be obtained from all possible splits of the test
      • Tends to be conservative, considered the lower boundary of test’s reliability
      • Kuder-Richardson Formula 20 (KR-20)- use when test items are scored dichotomously (right or wrong); produces spuriously high reliability coefficients for speeded tests
      • Sources of error
        • Content sampling: 1) split half: error resulting from differences in the two halves of the test (better fit for some examinees). 2) coefficient alpha: differences between individual test items
        • Heterogeneity of content domain: 1) coefficient alpha only. 2) test is heterogeneous when it measures several different domains. 3) greater heterogeneity, lower inter item correlation –> lower magnitude of coefficient alpha
        • Good for: 1) tests measuring a single characteristic. 2) characteristic changes over time. 3) scores likely to be affected by repeated measures
        • Bad for: 1) speed tests, because produce spuriously high coefficients. 2) alternate forms best for speed tests
  4. Inter-rater (Inter-scorer, Inter-observer) Reliability
  • When test scores rely on rater’s judgment
  • Done by
    • calculating a correlation coefficient (kappa coefficient or coefficient of concordance)
    • determining the percent agreement between the raters
      • does not take into account the level of agreement that would have occurred by chance
      • a problem when recording high-frequency behavior because degree of chance agreement is high
    • Sources of error
      • factors related to the raters (motivation, biases)
      • characteristics of the measuring device
        • reliability low when categories are not exhaustive and/or not mutually exclusive and discrete
      • consensual observer drift
        • observers working together influence each other so they score in similar, idiosyncratic ways
        • tends to artificially inflate reliability
      • Improve reliability
        • eliminate drift by having raters work independently or alternate raters
        • tell raters their work will be checked
        • training should emphasize difference between observation and interpretation
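The point about chance agreement under inter-rater reliability can be illustrated with a short sketch comparing percent agreement with a kappa coefficient. The formula used is Cohen's kappa for two raters, which is one common version; the ratings are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Proportion of observations on which the two raters agree."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for chance: (observed - expected) / (1 - expected).

    Undefined (division by zero) if expected chance agreement is 1.0.
    """
    n = len(rater_a)
    observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical high-frequency behavior: both raters record "yes" 9 times out of 10
rater_a = ["yes"] * 9 + ["no"]
rater_b = ["yes"] * 8 + ["no", "yes"]

agreement = percent_agreement(rater_a, rater_b)  # 0.8
kappa = cohens_kappa(rater_a, rater_b)           # about -0.11
```

Here 80% agreement looks respectable, yet chance alone would produce 82% agreement, so kappa falls slightly below zero.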

Study tip Spearman-Brown = split-half reliability; KR-20 = coefficient alpha; alternate forms = most thorough method; internal consistency = not appropriate for speeded tests
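The two internal consistency methods can be sketched as follows: an odd/even split-half estimate corrected with Spearman-Brown, and Cronbach's coefficient alpha. The data set is hypothetical, and because the items are scored 0/1, alpha here coincides with KR-20.

```python
from statistics import mean, pvariance


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half(rows):
    """Odd/even split-half reliability, corrected to full length (Spearman-Brown)."""
    odd = [sum(row[0::2]) for row in rows]    # items 1, 3, 5, ...
    even = [sum(row[1::2]) for row in rows]   # items 2, 4, 6, ...
    r_half = pearson_r(odd, even)
    return 2 * r_half / (1 + r_half)


def cronbach_alpha(rows):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(rows[0])
    item_vars = [pvariance([row[j] for row in rows]) for j in range(k)]
    total_var = pvariance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees (rows) by 4 dichotomous items (columns)
data = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
corrected_split_half = split_half(data)   # 0.88 (up to floating-point rounding)
alpha = cronbach_alpha(data)              # 0.80 (up to floating-point rounding)
```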

Factors that Affect the Reliability Coefficient

  1. Test length– longer the test, larger the reliability coefficient
  • Spearman-Brown can be used to estimate effects of lengthening or shortening a test on its reliability coefficient
  • The Spearman-Brown formula tends to overestimate a test’s true reliability
    • Most likely when the added items do not measure the same content domain
    • When new items are more susceptible to measurement error
  • When mean and SD not equal, can over or underestimate
  2. Range of test scores– reliability coefficient is maximized when range is unrestricted
  • range affected by degree of similarity among sample on attribute measured
    • when heterogeneous, range is maximized
    • reliability will be overestimated if the tryout sample is more heterogeneous than the examinees the test will be used with
  • affected by item difficulty level
    • if all easy or hard, results in restricted range
    • best to choose items in mid-range (p = .50)
  3. Guessing
  • as probability of guessing correct answer increases, reliability coefficient decreases
  • T/F test lower reliability coefficient than multiple choice
  • Multiple choice lower reliability coefficient than free recall
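The effect of test length can be made concrete with the general Spearman-Brown prophecy formula. The sketch below assumes the added or removed items are comparable to the existing ones, which is exactly the condition under which the notes say the estimate can go wrong; the parameter name `n` (the factor by which length changes) is a labeling choice for this example.

```python
def spearman_brown(r, n):
    """Predicted reliability when test length is multiplied by n.

    r: current reliability coefficient
    n: new length / current length (2 doubles the test, 0.5 halves it)
    """
    return n * r / (1 + (n - 1) * r)


doubled = spearman_brown(0.50, 2)    # about 0.67
halved = spearman_brown(0.80, 0.5)   # about 0.67
unchanged = spearman_brown(0.75, 1)  # 0.75
```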

Interpretation of Reliability

  • The Reliability Coefficient
    • Interpreted directly as the proportion of variability in a set of test scores that is attributable to a true score variability
    • r = .84 means that 84% of variability in test scores is due to true score differences among examinees, while 16% is due to error
    • Coefficient of .80 acceptable; achievement and ability is usually .90
    • No single index of reliability for any test
    • Test’s reliability can vary by situation and sample
  • Standard Error of Measurement
    • Assists in interpreting the individuals score
    • Index of error in measurement
    • Construct a confidence interval around the score
      • estimate range examinee’s true score likely to fall in
      • when raw scores converted to percentile ranks, called percentile band
    • Use standard error of measurement
      • index of amount of error expected in obtained scores due to unreliability of test
    • Standard error affected by the standard deviation and the test’s reliability coefficient
    • The lower the standard deviation and the higher the reliability coefficient, the smaller the standard error of measurement (and vice versa)
    • Type of standard deviation, so talk about area under the normal curve
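      • 68% confidence interval: obtained score +/- 1 standard error of measurement
      • 95% confidence interval: obtained score +/- 2 standard errors (approximately)
      • 99% confidence interval: obtained score +/- 3 standard errors (approximately)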
    • Problem- measurement error not equally distributed throughout range of scores
      • use of same standard error to construct confidence intervals for all scores can be misleading
      • manuals report different standard errors for different score intervals
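A short sketch of the standard error of measurement and the confidence intervals built from it. It uses the standard formula, SD times the square root of (1 - reliability), with hypothetical values for the standard deviation, reliability, and obtained score.

```python
import math


def standard_error_of_measurement(sd, rxx):
    """SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - rxx)


def confidence_interval(obtained, sd, rxx, n_sem=1):
    """Interval of obtained score +/- n_sem standard errors of measurement."""
    margin = n_sem * standard_error_of_measurement(sd, rxx)
    return (obtained - margin, obtained + margin)


# Hypothetical test: SD = 15, reliability = .84, obtained score = 100
sem = standard_error_of_measurement(15, 0.84)        # about 6
ci_68 = confidence_interval(100, 15, 0.84, n_sem=1)  # roughly 94 to 106
ci_95 = confidence_interval(100, 15, 0.84, n_sem=2)  # roughly 88 to 112
```

Setting the reliability to 1.0 gives an SEM of 0, and setting it to 0 gives an SEM equal to the standard deviation.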

Study tip The name “standard error of measurement” can help you remember when it’s used: to construct a confidence interval around a measured (obtained) score. Know the difference between standard error of measurement and standard error of estimate.

  • Estimating True Scores from Obtained Scores
    • Because of measurement error, obtained test scores tend to be biased estimates of true scores
      • scores above the mean tend to overestimate the true score
      • scores below the mean tend to underestimate the true score
      • farther from the mean, greater the bias
    • Rather than using confidence interval, can use a formula that estimates true score by taking into account this bias by adjusting the obtained score by using the mean of the distribution and the test’s reliability coefficient.
      • less used
    • Reliability of Difference Scores
      • Compare performance of one person on two test scores (e.g., VIQ and PIQ)
      • Reliability coefficient for the difference score can be no larger than the average of the reliabilities of the two tests
        • If Test A has a reliability coefficient of .95 and Test B has .85, the difference score will have a reliability of .90 or less
      • Reliability coefficient for difference scores depends on degree of correlation between tests
        • more highly correlated, the smaller the reliability coefficient and the larger the standard error of measure
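Both ideas in this part of the notes can be sketched numerically. The true-score estimate below regresses the obtained score toward the mean in proportion to the reliability coefficient, and the difference-score formula is the standard one for two tests with equal standard deviations. Neither formula is spelled out in the notes, so treat both as assumptions; all values are hypothetical.

```python
def estimated_true_score(obtained, mean, rxx):
    """Pull the obtained score toward the mean by a factor of (1 - rxx)."""
    return mean + rxx * (obtained - mean)


def difference_score_reliability(rxx, ryy, rxy):
    """Reliability of (X - Y), assuming the two tests have equal SDs.

    rxx, ryy: reliability coefficients of the two tests
    rxy: correlation between the two tests (must be below 1)
    """
    return ((rxx + ryy) / 2 - rxy) / (1 - rxy)


high = estimated_true_score(130, mean=100, rxx=0.90)  # about 127: obtained score overestimates
low = estimated_true_score(70, mean=100, rxx=0.90)    # about 73: obtained score underestimates

uncorrelated = difference_score_reliability(0.95, 0.85, 0.0)  # about .90, the average
correlated = difference_score_reliability(0.95, 0.85, 0.60)   # about .75
```

The second pair of results shows the pattern described above: the more highly the two tests correlate, the lower the reliability of the difference score.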


  • Validity = a test’s accuracy. A test is valid when it measures what it is intended to measure.
  • Tests have different intended uses, and each use has its own method for establishing validity
    • content validity- for a test used to obtain information about a person’s familiarity with a particular content or behavior domain
    • construct validity- test used to determine the extent to which an examinee possesses a trait
    • criterion related validity- test used to estimate or predict a person’s standing or performance on an external criterion
  • Even when a test is found to be valid, it might not be valid for all people

Study tip When scores are important because they provide info on how much a person knows or on each person’s status with regard to a trait: content and construct validity. When scores are used to predict scores on another measure, and those scores are of most interest: criterion-related validity.

  • Content Validity
    • The extent that a test adequately samples the content or behavior domain it is to measure
    • If items not a good sample, results of test misleading
    • Most associated with achievement tests that measure knowledge and with tests designed to measure a specific behavior
    • Usually “built into” test as it is constructed by identifying domains and creating items
    • Establishment of content validity relies on judgment of subject matter experts
      • If experts agree items are adequate and representative, then test is said to have content validity
    • Supplementary quantitative evidence of content validity
      • coefficient of internal consistency will be large
      • test will correlate highly with other tests of same domain
      • pre-post test evaluations of a program designed to increase familiarity with domain will indicate appropriate changes
    • Don’t confuse with face validity
    • Content validity = systematic evaluation of tests by experts
    • Face validity = whether or not a test looks like it measures what it’s supposed to
    • If lacks face validity, ppl may not be motivated
  • Construct Validity
    • A test has construct validity when it has been found to measure the trait (construct) that it is intended to measure
    • A construct is an abstract characteristic that cannot be observed directly but must be inferred by observing its effects.
    • No single way to measure
    • Accumulation of evidence that test is actually measuring what it was designed to
      • Assessing internal consistency: 1) do scores on individual items correlate highly with overall score 2) are all items measuring same construct
      • Studying group differences: 1) Do scores accurately distinguish between people known to have different levels of the construct
      • Conducting research to test hypotheses about the construct: 1) Do test scores change, following experimental manipulation, in the expected direction
      • Assessing convergent and discriminant validity: 1) does it have high correlations with measures of the same trait (convergent) 2) does it have low correlations with measures of different traits (discriminant)
      • Assessing factorial validity: 1) Does it have the factorial composition expected
    • Most theory laden of the methods of test validation
      • begin with a theory about the nature of the construct
      • guides selection of test items and choosing a method for establishing validity
      • example: if you want to develop a creativity test and believe that creativity is unrelated to intelligence, is innate, and that creative people generate more solutions, you would determine the correlation between scores on the creativity test and IQ tests, see whether a course in creativity affects scores, and see whether test scores distinguish between people who differ in the number of solutions they generate
      • Most basic form of validity because techniques involved overlap those used for content and criterion-related validity

“all validation is one, and in a sense all is construct validation” Cronbach

  • Convergent and Discriminant Validity
    • Correlate test scores with scores on other measures
    • Convergent = high corr with measures of same trait
    • Discriminant = low corr with measures of unrelated trait
    • Multitrait-multimethod matrix- used to assess convergent and discriminant
      • table of correlation coefficients
      • provides info about degree of association between 2 or more traits that have been assessed using 2 or more measures
      • see if the correlations between different methods measuring the same trait are larger than the correlations between the same and different methods measuring different traits
    • You need two traits that are unrelated (assertiveness and aggressiveness) and each trait measured by different methods (self and other rating)
    • Calculate correlation coefficient for each pair and put in matrix
    • Four types of correlation coefficients
    • Monotrait-monomethod coefficient = same trait-same method
      • reliability coefficients
      • indicate correlation between a measure and itself
      • not directly relevant to validity, but need to be large
    • Monotrait-heteromethod coefficient = same trait-different method
      • correlation between different measures of the same trait
      • provide evidence of convergent validity when large
    • Heterotrait-monomethod coefficient = different traits-same method
      • correlation between different traits measured by same method
      • provide evidence of discriminant validity when small
    • Heterotrait-heteromethod coefficient = different traits- different method
      • correlation between different traits measured by different methods
      • provide evidence of discriminant validity when small
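The four coefficient types can be told apart mechanically by comparing the trait and method of the two measures being correlated. The sketch below does that for the assertiveness/aggressiveness example; the tuple representation and the function name are illustrative assumptions.

```python
def mtmm_type(measure_a, measure_b):
    """Name the type of coefficient for two (trait, method) measures."""
    trait = "monotrait" if measure_a[0] == measure_b[0] else "heterotrait"
    method = "monomethod" if measure_a[1] == measure_b[1] else "heteromethod"
    return f"{trait}-{method}"


# What each coefficient type is evidence of, as summarized in the notes
EVIDENCE = {
    "monotrait-monomethod": "reliability (should be large)",
    "monotrait-heteromethod": "convergent validity when large",
    "heterotrait-monomethod": "discriminant validity when small",
    "heterotrait-heteromethod": "discriminant validity when small",
}

kind = mtmm_type(("assertiveness", "self-rating"), ("assertiveness", "other-rating"))
# kind == "monotrait-heteromethod"; EVIDENCE[kind] == "convergent validity when large"
```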

Factor Analysis

  • Identify the minimum number of common factors (dimensions) required to account for the intercorrelations among a set of tests, subtests, or test items.
  • Construct validity when it has high correlations with factors it would be expected to correlate with and low correlations with factors it wouldn’t be expected to correlate with (another way for convergent and discriminant validity)
  • Five steps
    • Administer several tests to a sample:
      • administer test in question along with some that measure same construct and some that measure different construct
    • Correlate scores on each test with scores on every other test to obtain a correlation [R] matrix
      • high correlations suggest measuring same construct
      • low correlations suggest measuring different constructs
      • pattern of correlations determines how many factors will be extracted
    • Convert correlation matrix into factor matrix using one of several factor analytic techniques
      • Data in correlation matrix used to derive a factor matrix
      • Factor matrix contains correlation coefficients (“factor loadings”) which indicate the degree of association between each test and each factor
    • Simplify the interpretation of the factors by “rotating” them
      • pattern of factor loadings in original matrix is difficult to interpret, so factors are rotated to obtain clearer pattern of correlation
      • rotation can produce orthogonal or oblique factors
    • Interpret and name factors in rotated matrix
      • names determined by looking at tests that do and do not correlate with each factor
    • Factor loadings– correlation coefficients indicate the degree of association between each test and each factor
      • square it and determine the amount of variability in test scores explained by the factor
    • Communality– “common variance”
      • amount of variability in test scores that is due to the factors that the test shares in common with the other tests in the analysis
      • total amount of variability in test scores explained by the identified factors
      • communality = .64 means that 64% of the variability in those test scores is explained by a combination of the factors
    • A test’s reliability (true score variability) consists of two components
      • Communality– variability due to factors that the test shares in common with other tests in the factor analysis
      • Specificity– variability due to factors that are specific and unique to the test and are not measured by other tests in the factor analysis
        • portion of true test score variability not explained by the factor analysis
      • Communality is a lower limit estimate of a test’s reliability coefficient
        • a test’s reliability will always be at least as large as its communality
      • Naming of factor done by inspecting pattern of factor loadings
      • Rotated matrix: redividing the communality of each test included in the analysis
        • as a result, each factor accounts for a different proportion of a test’s variability than it did before the rotation
        • makes it easier to interpret the factor loadings
        • Two types
          • orthogonal- resulting factors are uncorrelated: 1) attribute measured by one factor is independent from the attribute measured by the other factor 2) choose if think constructs are unrelated 3) types include varimax, quartimax, equimax
          • oblique- factors are correlated: 1) attributes measured are not independent 2) choose if think constructs may be related 3) types include quartimin, oblimin, oblimax
          • When factors are orthogonal, test’s communality can be calculated from its factor loadings
          • Communality equals the sum of the squared factor loadings
          • When factors are oblique, the sum of the squared factor loadings exceeds the communality

Study tip

  • squared factor loading provides measure of shared variability
  • when orthogonal, test’s communality can be calculated by squaring and adding the test’s factor loading
  • orthogonal factors are uncorrelated, while oblique factors are correlated
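For orthogonal factors, the arithmetic in the study tip can be sketched as follows. The loadings and the reliability value are hypothetical, and specificity is computed as reliability minus communality, following the two-component breakdown in the notes.

```python
def variance_explained(loading):
    """Proportion of a test's variability explained by one factor."""
    return loading ** 2


def communality(loadings):
    """Sum of squared factor loadings (valid for orthogonal factors only)."""
    return sum(loading ** 2 for loading in loadings)


def specificity(reliability, loadings):
    """True score variability not explained by the common factors."""
    return reliability - communality(loadings)


# Hypothetical test loading .70 on Factor I and .30 on Factor II, reliability .80
loadings = [0.70, 0.30]
factor_one = variance_explained(0.70)  # about .49: Factor I explains 49% of variability
common = communality(loadings)         # about .58
unique = specificity(0.80, loadings)   # about .22
```

The communality (.58) stays below the reliability coefficient (.80), consistent with communality being a lower-limit estimate of reliability.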


  • Criterion-related validity: relevant when test scores are used to draw conclusions about an examinee’s standing or performance on another measure.
  • Predictor– the test used to predict performance
  • Criterion– other measure that is being predicted
  • Established by correlating the scores of a sample on the predictor with their scores on the criterion.

A large criterion-related validity coefficient confirms that the predictor has criterion-related validity.

  • Concurrent vs. Predictive Validity
    • Concurrent- criterion data collected prior to or at same time as predictor data
      • preferred when predictor used to estimate current status
      • examples: estimate mental status, predict immediate job performance
    • Predictive- criterion data collected after predictor data
      • preferred when purpose of testing is to predict future performance on the criterion
      • examples: predict future job performance, predict mental illness

Study tip Convergent and divergent (discriminant) are associated with construct validity. Concurrent and predictive are associated with criterion-related validity.

  • Interpretation of the Criterion-Related Validity Coefficient
    • Rarely exceed .60
    • .20 or .30 might be acceptable if alternative predictors are unavailable or have lower coefficients or if test administered in conjunction with others
    • Shared variability- squaring the coefficient gives you the variability that is accounted for by the measure
    • Expectancy table- scores on predictor used to predict scores on criterion

Study tip You can square a correlation coefficient to interpret it only when it represents the correlation between two different tests.

  • When squared, it gives a measure of shared variability
  • Terms that suggest shared variability include “accounted for by” and “explained by”
  • If a question asks how much variability in Y is explained by X, square the correlation coefficient
  • Standard Error of Estimate
    • Derive a regression equation used to estimate criterion score from obtained predictor score
    • There will be error unless correlation is 1.0
    • Standard error of estimate used to construct confidence interval around estimated criterion score.
    • Affected by two factors: 1) standard deviation of the criterion scores and 2) predictor’s criterion related validity coefficient
    • Standard error of estimate smaller with smaller standard deviation and larger validity coefficient
    • Larger SD, larger standard error of estimate
    • When validity coefficient = +/- 1, standard error of estimate = 0
    • When validity coefficient = 0, standard error of estimate = standard deviation
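The relationships listed above follow from the standard formula for the standard error of estimate, the criterion's standard deviation times the square root of (1 - r squared). A minimal sketch with hypothetical values:

```python
import math


def shared_variability(r):
    """Proportion of criterion variability accounted for by the predictor."""
    return r ** 2


def standard_error_of_estimate(sd_criterion, r):
    """SD of the criterion times sqrt(1 - squared validity coefficient)."""
    return sd_criterion * math.sqrt(1 - r ** 2)


# Hypothetical predictor with validity coefficient .60; criterion SD = 10
explained = shared_variability(0.60)               # about .36, i.e. 36%
see = standard_error_of_estimate(10, 0.60)         # about 8
see_perfect = standard_error_of_estimate(10, 1.0)  # 0.0
see_useless = standard_error_of_estimate(10, 0.0)  # 10.0, equal to the SD
```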

Study tip Know the difference between standard error of estimate and standard error of measurement. Standard error of estimate is used to construct a confidence interval around an estimated (predicted) score.

Incremental Validity

  • Increase in correct decisions that can be expected if the predictor is used as a decision-making tool
  • Important- even when a predictor’s validity coefficient is large, use of the predictor might not result in a larger proportion of correct decisions
  • Scatterplot
  • To use, criterion and predictor cutoff scores must be set
  • Criterion cutoff score- provides cutoff for your criterion being predicted, i.e. successful and unsuccessful
  • Predictor cutoff score- provides score that would have been used to hire or not hire
    • divides into positives and negatives
    • positives= those who scored above the cutoff
    • negatives= those who scored below the cutoff
  • Four quadrants of the scatterplot
  • True positives– predicted to succeed by the predictor and are successful on the criterion
  • False positives– predicted to succeed by the predictor and are not successful on the criterion
  • True negatives– predicted to be unsuccessful by predictor and are unsuccessful on the criterion
  • False negatives– predicted to be unsuccessful by predictor and are successful on the criterion
  • If the predictor cutoff score is lowered, the number of positives would increase and the number of negatives would decrease.
  • If the predictor cutoff score is raised, the number of positives would decrease and the number of negatives would increase (false positives decrease).
  • Selection of optimal predictor cutoff based on:
    • number of people in the four quadrants
    • goal of testing
      • if the goal is to maximize the proportion of true positives, a high predictor cutoff is set because it reduces the number of false positives
    • Criterion cutoff can also be raised or lowered, but might not be feasible
      • Low scores may not be acceptable
    • Incremental validity calculated by subtracting the base rate from the positive hit rate
      • Incremental validity = Positive Hit Rate – Base Rate
    • Base rate is proportion of people who were selected without use of the predictor and who are currently considered successful on the criterion
    • Positive hit rate is proportion of people who would have been selected on the basis of their predictor scores and who are successful on the criterion
    • Incremental validity is greatest when:
      • validity coefficient is high
      • base rate is moderate
      • selection ratio is low
    • Incremental validity is also used in determining whether a screening test is an adequate substitute for a lengthier evaluation
      • Positives- people who are identified as having the disorder by the predictor
      • Negatives- people who are not identified as having the disorder by the predictor
    • Criterion cutoff divides people into those who have and those who have not been diagnosed with the disorder by the lengthier evaluation
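
The quadrant counts translate directly into the two rates used for incremental validity. A minimal Python sketch, assuming every person in the sample has both a predictor score and a criterion outcome; the function names and the counts are made up for illustration and do not come from the notes:

```python
def base_rate(tp, fp, tn, fn):
    """Proportion of the whole sample that is successful on the criterion."""
    return (tp + fn) / (tp + fp + tn + fn)


def positive_hit_rate(tp, fp):
    """Proportion of predictor positives who are successful on the criterion."""
    return tp / (tp + fp)


def incremental_validity(tp, fp, tn, fn):
    """Positive hit rate minus base rate."""
    return positive_hit_rate(tp, fp) - base_rate(tp, fp, tn, fn)


# Illustrative counts only (not from the notes).
tp, fp, tn, fn = 30, 10, 40, 20
print(base_rate(tp, fp, tn, fn))             # 0.5
print(positive_hit_rate(tp, fp))             # 0.75
print(incremental_validity(tp, fp, tn, fn))  # 0.25
```

With these counts, using the predictor raises the proportion of successful selections from 50% to 75%.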

Study tip

  • Predictor determines whether a person is positive or negative
  • Criterion determines whether a person is true or false
  • The scatterplot on the test may not have the same quadrant labels

Relationship between Reliability and Validity

  • Reliability places a ceiling on validity
    • when a test has low reliability, it cannot have a high degree of validity
  • High reliability does not guarantee validity
    • test can be free of measurement error but still not test what it’s supposed to
  • Reliability is necessary but not sufficient for validity
  • Predictor’s criterion related validity cannot exceed the square root of its reliability coefficient
    • If reliability coefficient of predictor is .81, validity coefficient cannot exceed .90 (the square root of .81)
  • Validity is limited by reliability of predictor and criterion
    • To obtain a high validity coefficient, reliability of both must be high
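
The ceiling can be checked with a few lines of Python. This is a sketch with a hypothetical function name. The single-argument case is the square-root rule from the notes. The two-argument case uses the standard bound (square root of the product of both reliabilities), which the notes imply but do not spell out:

```python
import math


def max_validity(predictor_reliability, criterion_reliability=1.0):
    """Upper limit on the validity coefficient given the reliabilities.

    With the default criterion reliability of 1.0 this reduces to the
    square root of the predictor's reliability coefficient.
    """
    return math.sqrt(predictor_reliability * criterion_reliability)


print(round(max_validity(0.81), 2))        # 0.9
print(round(max_validity(0.81, 0.64), 2))  # 0.72 (illustrative criterion reliability)
```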

Correction of Attenuation Formula

  • Estimate what a predictor’s validity coefficient would be if the predictor and the criterion were perfectly reliable (r=1.0)
  • Requires:
    • the predictor’s current reliability coefficient
    • the criterion’s current reliability coefficient
    • the criterion-related validity coefficient
  • Tends to overestimate the actual validity coefficient
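
The notes list the three inputs but not the formula itself. The standard correction for attenuation divides the observed validity coefficient by the square root of the product of the two reliabilities. A minimal Python sketch, with a hypothetical function name and made-up coefficients:

```python
import math


def correct_for_attenuation(validity, predictor_reliability, criterion_reliability):
    """Estimated validity if predictor and criterion were perfectly reliable."""
    return validity / math.sqrt(predictor_reliability * criterion_reliability)


# Illustrative values only (not from the notes).
corrected = correct_for_attenuation(0.40, 0.80, 0.50)
print(round(corrected, 2))  # 0.63
```

Here an observed coefficient of .40 is estimated at about .63 under perfect reliability. Given the tendency to overestimate, treat the result as an upper estimate.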

Criterion Contamination

  • Accuracy of a criterion measure can be contaminated by the way in which scores on the criterion measure are determined.
  • Tends to inflate the relationship between the predictor and a criterion, resulting in artificially high criterion-related validity coefficient
  • If the contamination is eliminated, the coefficient decreases
  • To prevent it, make sure that the individual rating people on the criterion measure is not familiar with their performance on the predictor.


Cross-Validation

  • When a predictor is developed, the items retained for the final version are those that correlate highly with the criterion.
  • However, the high correlations can be due to unique characteristics of the tryout sample
  • The predictor is “made” for that sample, so if the same sample is used to validate the test, the criterion-related validity coefficient will be high
  • The predictor must be cross-validated on another sample.
  • The cross-validation coefficient tends to shrink and be smaller than the original coefficient. The smaller the initial validation sample, the greater the shrinkage of the validity coefficient


  • Norm Referenced Interpretation- comparing to scores obtained by people in a normative sample
    • Raw score is converted to another score that indicates standing in norm group
    • Emphasis is on identifying individual differences
    • Adequacy depends on how closely the person’s characteristics match those of the normative sample
    • Weaknesses
      • Finding norms that match person
      • Norms quickly become obsolete

Percentile Ranks

  • raw score expressed in terms of percentage of sample who obtained lower scores.
  • Advantage: Easy to interpret
  • Distribution is always flat (rectangular) regardless of the shape of the raw score distribution, because percentile ranks are evenly distributed throughout the range of scores (same # of ppl between 20 and 30 as between 40 and 50)
    • Nonlinear transformation: distribution of percentile ranks differs in shape from the distribution of raw scores
  • Disadvantage: ordinal scale of measurement
    • Indicate relative position in distribution, do not provide info on absolute differences between subjects
    • Maximizes differences between those in middle of distribution while minimizing differences between those at extremes
    • Can’t perform many mathematical calculations on percentiles
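
A percentile rank as defined above (percentage of the sample with lower scores) can be computed as follows. This is a minimal Python sketch with a hypothetical function name and made-up scores. Some textbooks also count half of the tied scores; that variant is not shown here:

```python
def percentile_rank(score, sample):
    """Percentage of the sample that scored below `score`."""
    below = sum(1 for s in sample if s < score)
    return 100 * below / len(sample)


# Illustrative raw scores only (not from the notes).
sample = [2, 4, 4, 5, 7, 8, 9, 10, 12, 15]
print(percentile_rank(8, sample))   # 50.0
print(percentile_rank(15, sample))  # 90.0
```

The result only says how many people scored lower. It says nothing about how far apart the raw scores are, which is why percentile ranks are ordinal.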

Study tip

  • Linear = distributions look alike
  • Nonlinear = distributions look different

Standard Scores

  • indicates position in normative sample in terms of standard deviations from the mean
  • Advantage: permits comparison of scores across tests
  • Z score: subtracting mean of distribution from raw score to obtain a deviation score, then dividing by distribution’s standard deviation
    • Z = (X – M)/SD
    • Has the following properties:
      • mean = 0
      • SD = 1
      • All raw scores below mean are negative, all above mean are positive
      • Unless it is normalized, distribution has same shape as raw scores

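The z-score conversion is a linear transformation. A distribution can be normalized by a special procedure when the characteristic is believed to be normally distributed in the population and the non-normality is attributed to error.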

  • T score: mean of 50, standard deviation of 10 (essentially, multiply z by 10 and add 50)
  • In a normal distribution:
    • about 68% of scores fall within one standard deviation of the mean
    • about 95% of scores fall within two standard deviations of the mean
    • about 99.7% of scores fall within three standard deviations of the mean

Study tip

  • Can calculate percentile rank from a z-score using the area under the normal curve.
  • A z score of 1 = percentile rank of 84: 50% fall below the mean and 34% (half of 68%) fall between the mean and 1 SD, so 50 + 34 = 84
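
The z-score, T-score, and percentile-rank conversions can be sketched in Python using the standard library's normal distribution. The function names and the example values (raw score 115, mean 100, SD 15) are made up for illustration. The percentile step assumes the scores are normally distributed:

```python
from statistics import NormalDist


def z_score(raw, mean, sd):
    """Deviation from the mean in standard deviation units."""
    return (raw - mean) / sd


def t_score(z):
    """Rescale a z score to a mean of 50 and an SD of 10."""
    return 50 + 10 * z


def percentile_from_z(z):
    """Area under the standard normal curve below z, as a percentage."""
    return 100 * NormalDist().cdf(z)


z = z_score(115, 100, 15)
print(z)                            # 1.0
print(t_score(z))                   # 60.0
print(round(percentile_from_z(z)))  # 84
```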

Age and grade equivalent scores

  • score that permits comparison of performance to that of average examinees at different ages or grades
  • easily misinterpreted
    • highly sensitive to minor changes in raw scores
    • do not represent equal intervals of measurement
    • do not uniformly correspond to norm referenced scores


Criterion Referenced Interpretation

  • Interpreting scores in terms of a prespecified standard.
  • Percentage score: indicates percentage of questions answered correctly.
    • A cutoff score is then set so that those above pass and those below fail
  • Mastery (criterion referenced) testing: specifying terminal level of performance required for all learners and periodically administering test to assess degree of mastery
    • used to see whether performance improves as a result of a program
    • if deficiencies are seen, the learner is given remedial instruction and the process is repeated until the learner passes
    • goal not to identify differences between examinees, but to make sure all examinees reach same level of performance
  • Regression equation/ expectancy tables: interpreting scores in terms of likely status on an external criterion

Study tip

  • Link percentile ranks, standard scores, and age/grade equivalents with norm referenced interpretation
  • Link percentage scores, regression equations, and expectancy tables with criterion referenced interpretation


Score Adjustment

  • Ensure that use of measures does not result in adverse impact for members of minority groups
  • Used to:
    • alleviate test bias
    • achieve business or societal goals (“make up” for past discrimination, allow for greater diversity in workplace)
    • increase fairness in selection system by ensuring that score on single instrument not overemphasized
  • Only first justification has been widely accepted, and only under certain circumstances (when predictive bias demonstrated)
  • Involve taking group differences into account when assigning or interpreting scores
    • Bonus points- adding a constant number of points to all ppl of a certain group
    • Separate cutoff scores- different cutoff scores for different groups
    • Within-group norming- raw score converted to standard score within group
    • Top-down selection from separate lists- ranking candidates separately within groups and then selecting top-down from within each group in accordance with a prespecified rule about number of openings. Often used by affirmative action programs to overselect from excluded groups
    • Banding: considering ppl within a range as having identical scores. May be combined with another method to ensure minority representation
    • Section 106 of the Civil Rights Act of 1991 prohibits score adjustment or use of different cutoff scores in employment based on race, color, religion, sex or national origin
      • Sackett and Wilk: may take group membership into account if doing so can be shown to increase accuracy of measurement or accuracy of prediction without increasing the adverse impact of the test
        • banding found legal in one case
        • banding with minority preference may be useful for balancing competing goals of diversity and productivity and legal under the Act

Correction for Guessing

  • Ensure examinee does not benefit from guessing
  • When a test is corrected for guessing, it is best to omit an item rather than guess
  • Corrections involve calculating a corrected score by taking into account:
    • the number of correct responses
    • the number of incorrect responses
    • the number of alternatives for each item
      • Corrected score = R – W/(n – 1), where
        • R = number of right answers
        • W = number of wrong answers
        • n = number of answer choices per item
    • When the correction involves subtracting points from scores, the resulting distribution will have a lower mean and a larger SD than the original distribution
    • Experts generally discourage correction for guessing
      • unless considerable number of items omitted by some of the examinees, relative position of examinees will be same regardless of correction
      • correction has little or no effect on test’s validity

The only time the correction is justified is for speeded tests
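
The correction formula can be applied as follows. This is a minimal Python sketch with a hypothetical function name and made-up numbers. It assumes every item has the same number of answer choices and that omitted items count as neither right nor wrong:

```python
def corrected_score(right, wrong, n_choices):
    """Corrected score = R - W / (n - 1)."""
    return right - wrong / (n_choices - 1)


# Illustrative: 40 right, 12 wrong, 8 omitted on a 60-item, 4-choice test.
print(corrected_score(40, 12, 4))  # 36.0

# Same examinee had they guessed on the 8 omitted items and missed all 8.
print(corrected_score(40, 20, 4))
```

Each wrong answer costs 1/(n – 1) of a point, so wrong guesses lower the corrected score while omitted items do not.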