Design and Stats


Internal validity

  • permits the conclusion that there is a causal relationship btwn IV and DV
  • extraneous Fs (factors other than the IV that might account for observed differences in the DV)
  • threats: hx (external event occurring during the study), maturation (biological or psychological changes in Ss over time, e.g., fatigue, boredom, phys/intellectual dev), testing (exposure to a pre-test affects post-test performance), instrumentation (change in nature of instrument), regression toward the mean (extreme scores tend to move toward the mean on retesting), selection biases, differential mortality (drop-outs), experimenter bias (Rosenthal effect/Pygmalion effect; experimenter expectancy effect- beh of Ss changes b/c of the experimenter’s expectancies)
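    • ctrl methods: random assignment of Ss (most powerful method; tends to equate grps on all extraneous Vs), matching Ss (ensures grps are equivalent on an extraneous V), blocking (studying the effects of an extraneous S characteristic by treating it as an IV), ANCOVA (statistically removes the effects of an extraneous V)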
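Random assignment can be sketched as shuffling the list of Ss and dealing them out to grps. This is a minimal Python sketch, not a prescribed procedure; the function name `randomly_assign` and the S labels are hypothetical, and a seed is used only so the example is reproducible.

```python
import random


def randomly_assign(subjects, n_groups, seed=None):
    """Shuffle subjects, then deal them round-robin into n_groups groups."""
    rng = random.Random(seed)   # seeded generator makes the assignment reproducible
    pool = list(subjects)
    rng.shuffle(pool)           # every ordering of Ss is equally likely
    groups = [[] for _ in range(n_groups)]
    for i, subject in enumerate(pool):
        groups[i % n_groups].append(subject)  # round-robin keeps grp sizes within 1
    return groups


# Hypothetical Ss S1..S20 split into a txt grp and a ctrl grp
txt_grp, ctrl_grp = randomly_assign([f"S{i}" for i in range(1, 21)], n_groups=2, seed=1)
print(len(txt_grp), len(ctrl_grp))
```

Because grp membership depends only on chance, extraneous S characteristics tend to be spread evenly across grps; matching and blocking instead balance a specific, named characteristic.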

External validity

  • generalizability of results
  • threats: interaction btwn selection and txt (txt effects don’t generalize to other members of the target pop), interaction btwn testing and txt (txt effects don’t generalize to individuals who don’t take a pre-test; related to demand Cs and the Hawthorne Effect), interaction btwn hx and txt (txt effects don’t generalize beyond the setting or time period in which the exp was done), demand Cs (cues in the research setting allow Ss to guess the rsch Hs), Hawthorne Effect (tendency of Ss to beh differently when they know they are being observed in a rsch setting), order effects (in repeated measures studies, exposure to earlier txts affects responses to later ones)
    • ctrl methods: random selection of Ss from the pop of interest, conducting naturalistic/field research, using single- or double-blind research designs, counterbalancing (ctrls order effects)
    • stratified random sampling (taking a random sample from each of several subgroups of the total target population)
    • cluster sampling (unit of sampling is a naturally occurring group of individuals, rather than the individual)
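Stratified and cluster sampling differ in what gets randomly selected: individuals within each subgroup vs. whole naturally occurring groups. A minimal Python sketch, assuming hypothetical student records and classrooms; `stratified_sample` and `cluster_sample` are illustrative names, not a library API.

```python
import random
from collections import defaultdict


def stratified_sample(population, stratum_of, n_per_stratum, seed=None):
    """Simple random sample of n_per_stratum people from every subgroup (stratum)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:
        strata[stratum_of(person)].append(person)  # sort people into subgroups
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))  # random draw within each subgroup
    return sample


def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole groups; every member of a chosen group is included."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)  # the unit sampled is the group
    return [person for name in chosen for person in clusters[name]]


# Hypothetical data: 30 students, 10 in each grade; 3 intact classrooms
students = [{"id": i, "grade": g} for i, g in enumerate(["9th", "10th", "11th"] * 10)]
strat = stratified_sample(students, lambda p: p["grade"], n_per_stratum=3, seed=0)
classrooms = {"A": [1, 2, 3], "B": [4, 5], "C": [6, 7, 8, 9]}
clus = cluster_sample(classrooms, n_clusters=2, seed=0)
print(len(strat), sorted(clus))
```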

Types of Designs

True experimental design

  • Ss randomly assigned to levels of the IV; permits greatest experimenter ctrl and highest internal validity

Quasi-experimental design

  • IV is manipulated, but Ss are not randomly assigned to grps (often pre-existing grps, e.g., pt grps); b/c the IV is manipulated, there is some experimenter ctrl, but less than in a true experiment

Correlational design

  • Vs not manipulated and no causal relationship btwn Vs can be assumed
  • measure the degree of relationship btwn Vs

Developmental research

  • assessing Vs as a fx of dev over time (e.g., effects of aging on IQ scores)
  • longitudinal: same Ss assessed repeatedly over time; pblms: high cost, attrition, practice effects; tends to underestimate true age-related change b/c of attrition and practice effects
  • cross-sectional: Ss of different ages assessed at the same time; pblm: cohort effects (intergenerational effects; differences might be due to experience rather than age); tends to overestimate true age-related declines in performance
  • cross-sequential: Ss of different age grps studied over a short period of time; combines the two

Time-Series Design

  • DV is measured several times, at regular intervals, both before and after txt is administered
  • ctrls for maturation, regression, and testing, but hx remains a threat to internal validity

Single-Subject Designs

  • designs involving a single S (or one group treated as a single S), with at least one baseline (pre-txt) phase and one txt phase; DV measured several times during both phases
    • simple AB (baseline-txt) design
    • reversal designs (txt withdrawn to determine whether beh reverts to baseline levels; ABA or ABAB)
    • multiple baseline designs (txt is sequentially administered across different behs, settings, or Ss)

Qualitative (Descriptive) Research

  • collect data in order to arrive at a theory about “how things are” (data fishing)
  • observation, interviews, surveys, case studies
  • often used in pilot studies to ID Vs and Hs for future studies

Scales of Measurement

  • nominal (unordered categories; gender)
  • ordinal (ordered/rank data; Likert scale)
  • interval (successive pnt =, but no absolute 0 pnt; IQ)
  • ratio (has absolute 0; weight, time)

Parametric Statistics

  • significant one-way ANOVA: at least two of the population means differ (not necessarily all of them)

Parametric tests: interval/ratio data

  • assumes: normal distrib, homogeneity of variance (variance of all grps =), indep of observations
  • many parametric tests are robust to violations of the 1st 2 assumptions (esp with equal grp sizes), but not to violations of indep of observations

t-test: comparison of 2 means (Student’s t-test)

  • one sample t-test (comparing sample mean to known population mean) df=N-1
  • t-test for indep samples (compare means from 2 indep samples) df=N-2 (N is total number of Ss)
  • t-test for correlated samples (compare means of 2 correlated samples) df=N-1 (N is number of pairs of scores); e.g., matched samples and pretest-posttest (before/after) designs
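The three t-tests differ mainly in how the standard error and df are computed. The sketch below (Python standard library only) returns the t statistic and df for each; it does not compute p-values, which would require the t distribution (e.g., SciPy’s `scipy.stats` t-test functions). Function names and data are hypothetical, and the indep-samples version uses the pooled-variance (Student’s) formula, which assumes homogeneity of variance.

```python
from math import sqrt
from statistics import mean, stdev, variance


def one_sample_t(x, mu):
    """Sample mean vs. a known pop mean; df = N - 1."""
    n = len(x)
    t = (mean(x) - mu) / (stdev(x) / sqrt(n))
    return t, n - 1


def independent_t(x, y):
    """Two indep grps, pooled variance (Student's t); df = N - 2, N = total Ss."""
    n1, n2 = len(x), len(y)
    # pooled variance: each grp's sample variance weighted by its df
    pooled = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    t = (mean(x) - mean(y)) / sqrt(pooled * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def paired_t(before, after):
    """Correlated samples: a one-sample t on the difference scores; df = pairs - 1."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)  # number of pairs
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical scores
print(one_sample_t([5, 6, 7, 8, 9], mu=5))
print(independent_t([5, 6, 7, 8, 9], [3, 4, 5, 6, 7]))
print(paired_t(before=[10, 12, 14], after=[12, 15, 15]))
```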

One-way ANOVA

  • one IV with more than 2 grps (levels) and one DV
  • yields an F ratio: btwn-grp variance (differences btwn grp means; also known as txt variance) divided by within-grp variance (differences btwn scores within each grp; error variance); want B > W
  • sum of squares (SS) is a measure of the variability of a set of data; B df=k-1 (k is number of grps); W df=N-k (N is total number of Ss); dividing each SS by its df gives a mean square (MS), and F = MSB/MSW
  • post-hoc tests (Tukey, Scheffe): pinpoint which means differ b/c a significant F alone doesn’t indicate the exact pattern of differences among the means
  • doing multiple comparisons increases the Type I error rate (experimentwise error)
  • other post-hoc tests: Newman-Keuls test, Duncan’s multiple range test, Fisher’s Least Significant Difference test
  • Scheffe is most conservative of all (greatest protection against Type I error; also highest Type II error rate)
  • Tukey appropriate if conducting pairwise comparisons
  • can also conduct planned pairwise or complex comparisons (Hs specified a priori)
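The F ratio described above can be computed directly from the sums of squares. A minimal Python sketch, assuming hypothetical scores for 3 grps; `one_way_anova` is an illustrative name (SciPy offers `scipy.stats.f_oneway` for real analyses). It returns F and both df, but no p-value or post-hoc comparisons.

```python
from statistics import mean


def one_way_anova(*groups):
    """Return (F, df_between, df_within) for k indep grps."""
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    group_means = [mean(g) for g in groups]
    k, n_total = len(groups), len(scores)

    # SS between: each grp mean's squared distance from the grand mean, weighted by grp size
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # SS within: each score's squared distance from its own grp mean
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)

    df_between, df_within = k - 1, n_total - k
    ms_between = ss_between / df_between  # MSB: txt variance
    ms_within = ss_within / df_within     # MSW: error variance
    return ms_between / ms_within, df_between, df_within


# Hypothetical scores for 3 grps
F, df_b, df_w = one_way_anova([1, 2, 3], [4, 5, 6], [7, 8, 9])
print(F, df_b, df_w)
```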

Factorial ANOVA

  • study involves 2/more IVs and one DV
  • allows assessment of both main effects (effect of each IV on its own; compare marginal means) and interaction effects (effect of one IV differs at different levels of another IV)
  • when there is a significant interaction, interpret main effects with caution b/c they don’t generalize to all situations (an IV acts differently at different levels of another IV)

MANOVA

  • used to analyze data from studies with multiple DVs and at least one IV
  • can instead conduct separate ANOVAs, one for each DV, but MANOVA has the overall advantage of reducing the Type I error rate

Nonparametric Statistics

  • nonparametrics test nominal or ordinal data; distribution-free tests; less powerful than parametric tests
  • both parametrics and nonparametrics assume samples are rep of pop

Chi-Square Test

  • used to analyze nominal data (compares observed freqs of observations within nominal categories to freqs that would be expected under the null H)
  • cautions in using chi-sq:
  1. all observations must be indep (so not appropriate for before/after studies)
  2. each observation must be classifiable into only one category/cell
  3. percentages of observations within categories can’t be analyzed (raw freq data required)
  • expected frequency for each cell = (row total)(column total)/N
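The expected-frequency formula and the chi-square statistic, chi-sq = sum of (O - E)^2/E over all cells with df = (rows - 1)(columns - 1), can be sketched as follows. The 2 x 2 table of counts is hypothetical, the function names are illustrative, and no continuity correction is applied.

```python
def expected_frequencies(table):
    """Expected count for each cell under the null: (row total)(column total)/N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square statistic and df for a table of raw frequencies (not %)."""
    expected = expected_frequencies(table)
    chi2 = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


# Hypothetical 2 x 2 table of counts (rows: grp; columns: yes/no response)
observed = [[20, 30],
            [30, 20]]
print(expected_frequencies(observed))
print(chi_square(observed))
```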

Mann-Whitney U

  • compare two indep grps on a DV measured with rank-ordered data
  • alternative to t-test for indep samples (when parametric assumptions aren’t met or data are ordinal)
  • can convert ratio/interval data to ordinal rank if assumptions of parametric tests not met
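The conversion to ranks is the core of the Mann-Whitney U: both grps are ranked together, and grp 1’s rank sum R1 is compared with its smallest possible value, giving U1 = R1 - n1(n1 + 1)/2. A minimal sketch with hypothetical data and illustrative function names; it computes U only (for p-values, a routine such as SciPy’s `scipy.stats.mannwhitneyu` would be used).

```python
def average_ranks(values):
    """Rank scores 1..n (lowest = 1); tied scores share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of tied values starting at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def mann_whitney_u(group1, group2):
    """U statistic: rank both grps together, then compare grp 1's rank sum to its minimum."""
    combined_ranks = average_ranks(list(group1) + list(group2))
    n1, n2 = len(group1), len(group2)
    r1 = sum(combined_ranks[:n1])  # rank sum for grp 1
    u1 = r1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)             # the smaller U is the value checked against critical-value tables


# Hypothetical ratings from two indep grps
print(mann_whitney_u([1, 2, 3], [4, 5, 6]))  # no overlap: U = 0
print(average_ranks([10, 20, 20, 30]))       # tied 20s share rank 2.5
```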

Wilcoxon Matched Pairs Test

  • compare 2 correlated grps on a DV measured with rank-ordered data
  • alternative to t-test for correlated samples (when parametric assumptions aren’t met or data are ordinal)

Kruskal-Wallis Test

  • compares 2/more indep grps on a DV with rank-ordered data
  • alternative to one-way ANOVA (when parametric assumptions aren’t met or data are ordinal)

Distribution issues

  • descriptive stats describe data set
  • inferential stats make inferences about an entire pop based on sample data
  • negative skewed: most scores are high (to the right), but a few extreme low scores
    • mean is lower than the median, median lower than the mode
    • easy test; ceiling effects
  • positive skewed: most scores are low (to the left), but a few extreme high scores
    • mean is higher than the median, median higher than the mode
    • difficult test; floor effects
  • z-scores have the same distribution shape as the raw score distribution from which they were derived
  • standard deviation most commonly used measure of variability
    • normal curve: ~68% of scores fall btwn ±1 sd; ~95% btwn ±2 sd; ~99.7% btwn ±3 sd
    • ~16% of scores fall above +1 sd (and ~16% below -1 sd); ~2% above +2 sd (and ~2% below -2 sd)
    • percentiles: 0.1% (z = -3); 2% (z = -2); 16% (z = -1); 84% (z = 1); 98% (z = 2); 99.9% (z = 3)
  • variance: average of the sq differences of each observation from the mean
  • sd: square root of the variance
  • stanine: divides the distribution into 9 intervals, 1 lowest, 9 highest (mean = 5, sd ≈ 2; stanines 2-8 are each 1/2 sd wide)
  • % score refers to the % of test items answered correctly; % rank refers to the % of scores in the distribution that are lower than a given score
    • % ranks have a flat (rectangular) distribution- within a given range of % ranks there is the same number of scores- compared to a normal distribution, where most scores fall at the ctr and few at the extremes; converting scores to % ranks is a non-linear transformation
    • in a normal distribution,
      • z-score of 1.0 is equivalent to a % rank of 84 (top 16%) and z-score of -1.0 is equivalent to 16% (bottom 16%)
      • z-score of 2.0 = PR 98 (top 2%); -2.0 = PR 2 (bottom 2%)
    • change in raw score in middle of normal distribution results in much greater change in PR than same raw score change at the distribution’s extreme
  • homogeneity of variance: parametric tests are robust to violations of this assumption unless the exp grps have unequal numbers of Ss (then Type I error increases)
  • sampling distribution of means: distrib of the means of all possible same-sized samples drawn (with replacement) from a pop
  1. has less variability than the pop distrib
  2. its sd is the standard error of the mean
  • central limit theorem:
  1. as sample size increases, shape of sampling distrib of means approaches normal shape (even if pop distrib of scores not normally distrib)
  2. mean of sampling distrib of means is equal to mean of pop
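The z-score and percentile-rank relationships above follow from the normal curve’s cumulative proportions. This Python sketch uses the standard library’s `statistics.NormalDist`; the IQ-style scale (mean 100, sd 15) is an assumed example, and the function names are illustrative. Exact normal-curve values differ slightly from the rounded figures in the notes (e.g., z = 2 gives a PR of about 97.7, rounded to 98).

```python
from statistics import NormalDist


def z_score(x, mean, sd):
    """Distance of a raw score from the mean, in sd units."""
    return (x - mean) / sd


def percentile_rank(z):
    """Percent of a normal distribution falling below z."""
    return 100 * NormalDist().cdf(z)


# Hypothetical IQ-style scale with mean 100 and sd 15
for iq in (55, 70, 85, 100, 115, 130, 145):
    z = z_score(iq, 100, 15)
    print(f"IQ {iq}: z = {z:+.1f}, PR = {percentile_rank(z):.1f}")

# The same half-sd raw change moves PR much more near the middle than at the extreme
middle_gain = percentile_rank(0.5) - percentile_rank(0.0)
extreme_gain = percentile_rank(2.5) - percentile_rank(2.0)
print(f"{middle_gain:.1f} vs {extreme_gain:.1f}")
```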

Samples, Populations, Sampling Error

  • sample values are limited b/c they are only ests of pop values
  • sample value = stat; pop value = parameter
  • sampling error is always expected (b/c a sample mean rarely equals the pop mean exactly)

Standard Error of the Mean

  • provides an index of the expected inaccuracy of a sample mean as an est of the true pop mean
  • the difference btwn a sample mean and the pop mean is an e.g. of sampling error

SEM = σ/√N, where σ = sd of the pop and N = sample size

  • decreasing the pop sd or increasing N results in a smaller SEM, and v-v
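A small simulation can illustrate the SEM formula together with the sampling distribution of means and the central limit theorem: draw many same-sized samples with replacement, record each mean, and compare the spread of those means with σ/√N. The skewed pop, sample size, and number of samples are arbitrary assumptions, the function names are illustrative, and results vary slightly with the seed.

```python
import random
from math import sqrt
from statistics import mean, pstdev


def standard_error(sigma, n):
    """SEM = pop sd / sqrt(sample size)."""
    return sigma / sqrt(n)


def sampling_distribution_of_means(population, n, n_samples, seed=None):
    """Means of many same-sized samples drawn with replacement from the pop."""
    rng = random.Random(seed)
    return [mean(rng.choices(population, k=n)) for _ in range(n_samples)]


# Hypothetical positively skewed (non-normal) pop of 100 scores
population = [1] * 60 + [2] * 25 + [5] * 10 + [20] * 5
sigma = pstdev(population)  # pop sd (divides by N, not N - 1)
means = sampling_distribution_of_means(population, n=25, n_samples=5000, seed=42)

print(mean(population), mean(means))             # mean of sample means is close to the pop mean
print(standard_error(sigma, 25), pstdev(means))  # sd of sample means is close to the SEM
```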

Statistical Decision Making

  • null H: IV has no effect on DV
  • alternative H: IV has effect on DV
  • possibilities:
  1. true null retained (correct; concluding no effect when the IV has none)
  2. true null rejected (incorrect; Type I error; probability = alpha; concluding there is an effect when there isn’t)
  3. false null rejected (correct; detecting an effect that does exist)
  4. false null retained (incorrect; Type II error; probability = beta, usually unknown; missing an effect that does exist)
  • one-tailed H: predict the direction that means differ
  • two-tailed H: don’t predict direction
  • power: probability of rejecting the null H when it’s false (probability of not making a Type II error); power = 1-beta
    • increases with: larger N; larger alpha; 1-tailed test (when the direction is predicted correctly); greater difference btwn pop means under study
  • to decide whether to reject the null, compare the obtained stat value to a critical value, which depends on 2 Fs:
    • pre-set alpha level
    • degrees of freedom for stat test
  • if the obtained value doesn’t exceed the critical value, the null is retained; if it exceeds it, the null is rejected
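The decision rule and the factors that raise power can be sketched for the simplest case, a one-sample z test with known pop sd. This is an illustration under that assumption, not a general power-analysis tool; the function names are hypothetical, and the one-tailed version assumes the predicted direction is positive.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def critical_z(alpha, tails):
    """Critical z: alpha split across both tails (tails=2) or all in one (tails=1)."""
    return STD_NORMAL.inv_cdf(1 - alpha / tails)


def decide(z_obtained, alpha=0.05, tails=2):
    """Reject the null only if the obtained z exceeds the critical z.

    The one-tailed version assumes the H predicted a positive difference.
    """
    stat = abs(z_obtained) if tails == 2 else z_obtained
    return "reject null" if stat > critical_z(alpha, tails) else "retain null"


def power_z(effect_size, n, alpha=0.05, tails=2):
    """Power (1 - beta) of a one-sample z test; effect_size = (mu_true - mu_null) / sigma."""
    shift = effect_size * sqrt(n)  # where the obtained z centers when the null is false
    crit = critical_z(alpha, tails)
    power = 1 - STD_NORMAL.cdf(crit - shift)
    if tails == 2:
        power += STD_NORMAL.cdf(-crit - shift)  # rejections in the unpredicted tail
    return power


print(decide(1.80), decide(1.80, tails=1))  # 'retain null' (two-tailed), 'reject null' (one-tailed)
print(power_z(0.5, 25), power_z(0.5, 50), power_z(0.5, 25, tails=1))
```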

Correlation

  • correlation (relationship of 2 Vs); correlation coefficient (stat index of relationship)- tells you magnitude and direction
  • use of scattergram to pictorially represent this
  • Pearson r (correlation btwn 2 continuous Vs): Pearson Product Moment (PPM)
    • Fs that can affect it: linearity (assumes a linear relationship btwn the 2 Vs); homoscedasticity (scores are equally dispersed around the regression line across the full range of X); range of scores (the wider the range of sample beh, the more accurate the est of the correlation; a restricted range tends to lower r)
    • sq Pearson r: percentage of variability in one measure that is accounted for by variability in the other measure (coefficient of determination)
    • nonlinear relationship can be measured by coefficient eta.
  • point-biserial coefficient (correlates one continuous V with one dichotomous V)
  • phi coefficient (correlate 2 dichotomized Vs)
  • contingency coefficient (correlate 2 nominally scaled Vs)
  • Spearman’s rho (correlate 2 rank-ordered Vs)
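Pearson r, its square (the coefficient of determination), and Spearman’s rho can be sketched in a few lines of Python. The data are hypothetical, the function names are illustrative, and the ranking helper ignores ties for brevity; Python 3.10+ also provides `statistics.correlation` for Pearson r.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation: covariance scaled by both variabilities."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


def to_ranks(values):
    """1-based ranks; this simple version assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman's rho: Pearson r computed on the ranks of each V."""
    return pearson_r(to_ranks(x), to_ranks(y))


# Hypothetical data: hours studied (X) and exam score (Y)
hours = [2, 4, 6, 8, 10]
score = [60, 65, 75, 80, 90]
r = pearson_r(hours, score)
print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}")  # r^2 = coefficient of determination
```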

Regression

  • when 2 Vs are correlated, an equation can be constructed to est the value of a “criterion” (outcome) V on the basis of scores on a “predictor” (input) V; multiple regression uses 2 or more predictor Vs to predict scores on one criterion
  • assumptions: linear relationship btwn X and Y; regression line is picture of the overall relationship btwn 2 Vs
  • error score: difference btwn predicted and actual criterion score; error scores assumed to be normally distributed with a mean of zero; correlation btwn error scores and predicted criterion scores assumed to be zero; relationship btwn error scores and predicted criterion scores must be homoscedastic (equal error variance at all levels)
  • location of regression line determined using the least sqrs criterion (line placed where the sum of the squared errors in predicting Y scores from X scores is smallest)
  • relationship btwn 2/more predictor Vs and one criterion V (multiple correlation coefficient; multiple R)
  • multiple correlation coefficient (predictive power of MR equation) is highest when predictor Vs each have high correlations with the criterion but low correlations with each other; avoid multicollinearity (high correlations among predictor Vs) b/c a predictor that correlates highly with the others adds little predictive power
  • multiple R is never lower than the highest simple correlation btwn an individual predictor and the criterion; adding predictors can’t lower R, but a predictor that is collinear with the others adds little
  • multiple R can never be negative (ranges from 0 to 1)
  • like Pearson r, multiple R can be squared (R²); this is called the coefficient of multiple determination (proportion of variance in criterion V accounted for by the combination of predictor Vs)
  • goal of stepwise regression is to come up with smallest set of predictors that maximizes predictive power (retain predictors that have high multiple correlation with criterion)
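For a single predictor, the least squares criterion gives a closed-form slope and intercept. A minimal sketch with hypothetical data and an illustrative function name; it also computes the error scores and shows that they sum to zero and that 1 - SSE/SST equals r² in this one-predictor case. Multiple regression would need matrix methods and is not shown.

```python
from statistics import mean


def least_squares_line(x, y):
    """Intercept a and slope b of Y' = a + bX that minimize the sum of squared errors."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx  # the line passes through (mean X, mean Y)
    return a, b


# Hypothetical data: hours studied (predictor X) and exam score (criterion Y)
hours = [2, 4, 6, 8, 10]
score = [60, 65, 75, 80, 90]
a, b = least_squares_line(hours, score)
predicted = [a + b * xi for xi in hours]
errors = [yi - yp for yi, yp in zip(score, predicted)]  # error score = actual - predicted

sse = sum(e ** 2 for e in errors)
sst = sum((yi - mean(score)) ** 2 for yi in score)
print(a, b)          # intercept and slope
print(sum(errors))   # error scores sum to zero
print(1 - sse / sst) # proportion of criterion variance explained (r^2)
```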

Other Correlation Techniques

  • Canonical Correlation: used to calculate relationship btwn 2/more predictors and 2/more criterion Vs
  • Discriminant Function Analysis: used when goal is to classify individuals into groups based on their scores on multiple predictors
  • Multiple Cutoff: setting a minimum cutoff score on a series of predictors; if cutoff score not achieved on even one of the predictors, person isn’t selected
  • Partial Correlation: used to assess relationship btwn 2 Vs with the effects of another V “partialled out” (statistically removed); converse to this is zero-order correlation (correlation btwn 2 Vs determined without regard for any other Vs); can have a suppressor V (a V that correlates with the predictor but not the criterion and suppresses irrelevant variance in the predictor; partialling it out can increase the predictor-criterion correlation)
  • Structural Equation Modeling: general term for techs that involve calculating pairwise correlations btwn multiple Vs; used for causal modeling, i.e., testing a H that posits a causal relationship among multiple (3/more) Vs (e.g., path analysis, LISREL)
    • assumes linear relationship btwn Vs
    • path assumes one-way causal flow; LISREL one-way and/or 2-way; Path used in models that include observed Vs only; LISREL used when model specifies both latent and observed Vs
    • basic steps:
  1. specifying a structural model involving many different Vs
  2. conducting stat analysis
  3. interpreting results of analysis
  • Trend Analysis: stat tech used to examine trend of change (linear, quadratic, cubic, quartic) in DV, as opposed to whether or not DV changes at all
  • logistic regression (LR) is like discriminant fx analysis (DFA) b/c it makes predictions about which criterion grp a person belongs to; mostly used with a dichotomous DV
    • differences btwn LR and DFA:
  1. DFA has 2 assumptions (multivariate normal distrib, homogeneity of variance/covariance) that LR doesn’t require
  2. LR can use nominal or continuous predictors; DFA requires continuous (interval/ratio) predictors
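For one partialled-out V (Z), the partial correlation can be computed from the three zero-order correlations: r_xy.z = (r_xy - r_xz r_yz) / √((1 - r_xz²)(1 - r_yz²)). A minimal sketch; the function name and correlation values are hypothetical.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of X and Y with Z partialled out of both."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations
print(partial_r(0.50, 0.60, 0.70))  # much of the X-Y relationship is shared with Z
print(partial_r(0.50, 0.00, 0.00))  # Z unrelated to X and Y: partial r equals zero-order r
```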

Advanced statistics

  • autocorrelation: correlation btwn observations at given lags (e.g., every 2nd observation); used in time-series analysis
  • Bayes’ theorem: formula for obtaining special type of conditional probability; revise conditional probabilities based on additional info
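Both advanced topics reduce to short formulas. The sketch applies Bayes’ theorem to a hypothetical screening test (10% base rate, 90% hit rate, 20% false-positive rate), where a positive result raises the probability of the condition to only about one-third. It also computes a lag-k autocorrelation using deviations from the overall series mean, which is one common convention. Function names and data are illustrative.

```python
from statistics import mean


def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' theorem: P(H | E) = P(E | H) P(H) / P(E)."""
    p_evidence = p_evidence_if_true * prior + p_evidence_if_false * (1 - prior)
    return p_evidence_if_true * prior / p_evidence


def autocorrelation(series, lag):
    """Lag-k autocorrelation of a time series (deviations from the overall series mean)."""
    m = mean(series)
    num = sum((series[t] - m) * (series[t + lag] - m) for t in range(len(series) - lag))
    den = sum((x - m) ** 2 for x in series)
    return num / den


# Hypothetical screening test: 10% base rate, 90% hit rate, 20% false-positive rate
print(posterior(0.10, 0.90, 0.20))  # P(condition | positive result)
# Hypothetical series alternating high/low: strong negative lag-1 autocorrelation
print(autocorrelation([3, 1, 3, 1, 3, 1], lag=1))
```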