Tuesday 7 February 2012

Did Hattie get his statistics wrong?

[Update 11 February 2012: After I wrote this post, both John Hattie and Arne Kåre Topphol have commented on the debate. You can read more about this here.]

For reasons unfathomable to the average Norwegian, the entire rest of the global community has heretofore failed to recognize Norway as the undisputed apogee of the civilized world, and has thus made only feeble and inconsequential attempts to learn to speak and understand our language. Hence, considering the following topic to have some interest beyond the confines of Norwegiandom, this post will be written in English.
By now, it is probably impossible to overestimate the impact John Hattie, and specifically his 2009 publication Visible learning: a synthesis of over 800 meta-analyses relating to achievement, has had on policy-makers, researchers, academics, and educators concerned with improving school systems and teacher practices across the globe. References to Hattie's effect study are prolific in teacher-training curricula, in white papers, and in the general public discussion of education (at least as seen from Norway, from the teacher-training course at the University of Oslo, and through my own admittedly subjective assessment of it all).
Now, however, an article has emerged, titled "Can we trust the use of statistics in educational research?" ("Kan vi stole på statistikkbruken i utdanningsforskinga?") (Topphol, 2011), published in the Norwegian Journal of Pedagogy (Norsk pedagogisk tidsskrift), questioning Hattie's use of one of the statistical measures in Visible learning, namely the Common Language Effect size (CLE).
(Unfortunately, the article is in Norwegian only; you must be connected to the university server to gain access.) 
First introduced by McGraw and Wong (1992), the CLE is, as I understand it, an effect measure intended to be more intuitively understandable for the average non-statistician reader. In short, the CLE is the probability that one randomly selected score from one distribution (e.g. of scores on a test) will be greater than one randomly selected score from another distribution. This definition is consistent with Hattie's own explanation of the measure (2009, p. 9), and also with the accompanying example, first used by McGraw and Wong, and revisited by Hattie in Visible learning:
Consider the difference in male (5'10"/177.5 cm) and female (5'4"/162.5 cm) average height. The effect size measured in Cohen's d (for an explanation of this measure, I must refer to Hattie, or to the work of Jacob Cohen himself) is 2.0, considered very large, and the calculated CLE is 92 per cent. The CLE percentage then says that, if you pick a random man and a random woman from the two respective populations, there is a 92 per cent chance that the man will be taller than the woman. Thus, in 92 out of 100 couples selected for blind dates, the man would be taller than the woman. (The "effect" in this case would be something like the effect of gender on height.) Note that it says nothing about how much taller, only the probability of a difference in heights in favor of one of the populations compared with the other, when you compare randomly selected samples from the two.
(And if we considered the bizarro-world alternative that men and women had the exact same average height, what would the probability be of picking a random sample from each group in which the man was taller than the woman? Think for a second, and click "Les resten av dette innlegget -->" ("Read the rest of this post") for the answer.)
(If men and women had the exact same average height, then in a random selection from each group there would be a 50 per cent chance of the man being taller, and a 50 per cent chance of the woman being taller. This very pedagogical instructional method is inspired by this excellent introduction to statistics.)
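Under McGraw and Wong's normal-theory formulation, the CLE follows directly from Cohen's d: for two normal distributions with equal variance, CLE = Φ(d/√2), where Φ is the standard normal cumulative distribution function. Here is a minimal sketch in Python (the function name is my own):

```python
from statistics import NormalDist

def cle_from_d(d: float) -> float:
    """Common Language Effect size for a given Cohen's d, assuming two
    normal distributions with equal variance (McGraw & Wong, 1992):
    CLE = P(X1 > X2) = Phi(d / sqrt(2))."""
    return NormalDist().cdf(d / 2 ** 0.5)

print(f"d = 2.0 -> CLE = {cle_from_d(2.0):.0%}")  # ~92%: the height example
print(f"d = 0.0 -> CLE = {cle_from_d(0.0):.0%}")  # 50%: identical populations
```

For d = 2.0 this gives the 92 per cent of the height example; for d = 0 it gives exactly 50 per cent. And since Φ only takes values between 0 and 1, no d can push the CLE outside the 0–100 per cent range.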
Thus, any CLE has to be a percentage between 0 and 100. A positive effect gives a percentage above 50; a negative effect gives a percentage below 50; and no effect, i.e. no difference between the populations, gives a CLE of exactly 50 per cent.
However, when Hattie goes on to explain this measure further (2009, p. 9), he gives the following example: “Consider the [Cohen’s] d = 0.29 from introducing homework […]. The CLE is 21 per cent so that in 21 times out of 100, introducing homework into schools will make a positive difference, or 21 per cent of students will gain in achievement compared to those not having homework.”
This does not make sense, for several reasons: First, if introducing homework has an effect, albeit a small one, the CLE must be above 50 per cent. A CLE of 21 per cent signifies a negative effect: if you have two groups, one given homework and the other not, and there is (only) a 21 per cent chance that a randomly chosen pupil from the homework group has a better outcome than a random pupil from the no-homework group, homework clearly has a negative effect (and is probably better avoided altogether).
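As a sanity check, the same normal-theory formula as above gives the CLE that a d of 0.29 should actually translate into:

```python
from statistics import NormalDist

d = 0.29  # Hattie's d for homework
cle = NormalDist().cdf(d / 2 ** 0.5)
print(f"CLE for d = {d}: {cle:.0%}")  # ~58%: small but positive, above 50%
```

That is roughly 58 per cent, on the positive side of 50, where Hattie's reported 21 per cent cannot be.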
Second, if the effect of introducing homework in schools is positive (as Hattie has already stated, given d = 0.29), it will have that positive effect every time, i.e. 100 out of 100 times[1], and not, as Hattie says, "if you take two classes, the one using homework will be more effective 21 out of a 100 times" (ibid.).
And third, Hattie draws a conclusion on the group level based on an individually measured effect size: although the mean of class (i.e. group) scores will be identical to the mean of individual scores, the standard deviation of class means will be much smaller, and thus the CLE will increase considerably (Topphol, 2011, pp. 469–70). (If you are interested in the math behind this argument, please refer to the article, or send me an email and I will give you the underlying calculations.)
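A small simulation can illustrate Topphol's point; the class size of 25 here is hypothetical, chosen only for illustration. Comparing class means instead of individual pupils shrinks the standard deviation by a factor of √n, so the same underlying d separates the two groups far more sharply:

```python
import random
from statistics import mean

random.seed(0)
d = 0.29        # individual-level effect size (Hattie's homework example)
n_class = 25    # hypothetical class size, for illustration only
trials = 20_000

# Individual level: chance that a random pupil with homework beats
# a random pupil without (scores drawn from N(d, 1) and N(0, 1)).
wins = sum(random.gauss(d, 1) > random.gauss(0, 1) for _ in range(trials))
print(f"CLE, pupil vs pupil: {wins / trials:.0%}")   # roughly 58%

# Group level: compare means of whole classes. The spread of class means
# is sigma / sqrt(n_class), so the same d separates the groups far more.
wins = sum(
    mean(random.gauss(d, 1) for _ in range(n_class))
    > mean(random.gauss(0, 1) for _ in range(n_class))
    for _ in range(trials)
)
print(f"CLE, class vs class: {wins / trials:.0%}")   # roughly 85%
```

With classes of 25, the CLE climbs from roughly 58 to roughly 85 per cent, so a probability statement about individual pupils cannot simply be transferred to whole classes.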
Throughout the book, Hattie continues to use the CLE to indicate effect sizes for all the factors he considers. Remarkably, Hattie's CLEs range from -49 per cent to 219 per cent (cf. e.g. Hattie, 2009, p. 263ff), values which, as seen from the above discussion, are well outside the possible range of the CLE. Moreover, the concrete ways in which Hattie uses the CLE bring about some strange reasoning. For example, on page 42, in a discussion of the effect of prior achievement, Hattie says that
the overall effect size of [Cohen’s d=]0.67 is among the highest effect sizes in this synthesis of meta-analysis, although the common language estimate [CLE] should remind us that, on average, prior achievement will lead to gains in achievement on 48 per cent of the occasions, although there is much that is unexplained beyond prior achievement (100-48=52 per cent that is unexplained) and so there is much that schools can influence beyond what the student brings from prior experiences and attainments.
Again, (i) a percentage of 48 would actually indicate a negative effect; (ii) any effect would be an effect always, and not only in a selection of cases[1]; and (iii) it makes no sense to subtract 48 from 100 if you operate on a scale from -49 to +219.
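For completeness, the same normal-theory formula as before puts the CLE for d = 0.67 at roughly 68 per cent, above the 50 per cent mark as any positive effect must be:

```python
from statistics import NormalDist

d = 0.67  # Hattie's d for prior achievement
print(f"CLE for d = {d}: {NormalDist().cdf(d / 2 ** 0.5):.0%}")  # ~68%, not 48%
```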
This is remarkable in its own right. To my knowledge, the other means Hattie employs to present effect sizes have not been questioned; but then again, the average stakeholder in educational questions rarely commands the level of statistical knowledge needed to do so.
However, and much more importantly, especially considering that Visible learning has been the focus of immense attention over the last couple of years without any of this being noticed, in my opinion this opens up at least two far more pressing conclusions beyond the question of the use or misuse of statistics in Visible learning:
Most people who are supposed to consider, and sometimes act upon, statistics-based research have insufficient knowledge of statistics; and most people do not actually sit down and read the literature they use and refer to (reading here meaning something beyond the superficial skimming of abstracts, summaries, tables, and conclusions). Lots of people write, but do we read? Not so much.
As to the first point, the conclusion of Topphol's article, and one I fully support, is a call for more knowledge of statistical methodology among publishers and peer reviewers, and a suggestion that researchers not go beyond the limits of their statistical competency when doing research and presenting their material. For my own part, being a student in a teacher-training program, and at the receiving end of much of the research and subsequent literature relevant to our object of study, I would add that quantitative methodology should be given a much more prominent place in at least our curriculum, and probably in similar curricula in general.
As to the second point, my humble suggestion is this: we should introduce a blanket ban on all educational research for the foreseeable future, that is, until we have actually read, understood, and, if deemed necessary, implemented in our professional practices and educational curricula all the educational research and literature that has piled up so far.
And in the meantime, we could redirect the immense amount of freed-up resources to the things we all already know that we actually need and want: More supplementary education and training for teachers; better school buildings and school infrastructure; early intervention for struggling pupils; eradication of bullying; qualified teachers for all pupils; qualified substitutes, and a budget to get them; and so on (and you can add to this list at your own discretion. You know perfectly well what should be included).
Finally, a disclaimer of sorts: I am by absolutely no standards an accomplished statistician. If you see anything in this post that is wrong, please let me know, and I will adjust it accordingly.

[1] Not considering the variance: in a sample with a large dispersion, some individual samples may of course come out with a negative effect.
Cited sources
Hattie, J. (2009). Visible learning: a synthesis of over 800 meta-analyses relating to achievement. London: Routledge.
McGraw, K. O. & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111(2), 361–365.
Topphol, A. K. (2011). Kan vi stole på statistikkbruken i utdanningsforskinga? Norsk pedagogisk tidsskrift, 6/2011, 460–471.

Comments:

  1. Hi

    I am terribly bad at English, which is why I am asking you (even though you may have answered this in the text): If Hattie draws conclusions in this meta-analysis, does he draw incorrect conclusions because of his incorrect use of statistical tools?

  2. Well, that is actually a very good question, really the most central one in this whole matter, and not easy to answer unambiguously.

    On a more general level, it is probably fair to say that Hattie's most central purpose with "Visible learning" is precisely about drawing conclusions: he wanted to collect results from an enormous body of research (according to Hattie, over 50,000 individual studies in total, which in turn are summarized in over 800 meta-studies), and then find a general and comparable measure of the learning effect of each of a set of factors, 138 in all, which influence learning to a greater or lesser degree.

    The effect measure most central to Hattie's presentation is the so-called Cohen's d. So for each individual factor that influences learning, Hattie gives it a Cohen's d value, something we absolutely have to call a conclusion in each individual case. These Cohen's d values are, for example, collected in a ranking list at the back of the book, and used throughout the discussions. I have no reason (and absolutely no competence) to question the conclusions at this level, and I do not know of anyone else who has done so either.

    The problem is that he also mixes in another effect measure, the CLE, whose use Arne Kåre Topphol criticized in an article published in Norsk pedagogisk tidsskrift before Christmas, and which I have written about here. I am by now quite convinced that Hattie has not used it correctly, especially since his use does not agree with how he himself introduced and explained it.

    So here is what I consider the crucial point: do all of Hattie's conclusions become acceptable if you simply remove the incorrectly used CLE values from the book?

    The ranking list at the back of the book, and similar presentations inside it, still seem to be valid. But, as I showed in the post above with my example from page 42 of Hattie's book, in this specific case, where he starts from the CLE value of an effect in a more detailed discussion of the implications of a factor that influences learning, his conclusions become utterly strange, and to me meaningless.

    He says, for example, that the factor he is discussing only has an effect in 48 out of 100 cases (which must be wrong; an effect always has an effect), that you can subtract 48 from 100 (which is very odd when he is operating on a scale from -49 to +219) and thereby get the share of what influences learning that can be influenced by something other than the factor he is discussing (which makes no sense to me), and some other strange things.

    The same misleading conclusions also apply to the further elaboration of what the CLE is on page 9, after he has (correctly, according to Topphol) explained and exemplified what it is; beyond that, I have not had time to comb through the book with exactly this in mind, so I do not know how pervasive this unfortunate way of reasoning is.

    So, to sum up? On a general level, the overview of effect sizes may perhaps be trusted, even though this is really almost impossible to assess, since we have not seen all the calculations that preceded publication: we actually have to trust that he has done the math correctly, and then it is not reassuring that he has fumbled one of the statistical tools he used. But, as noted, Hattie's use of Cohen's d has not been called into question.

    As for the more detailed discussions inside the book, my conclusion so far is that each of them should be taken with a large pinch of salt, and that one should go into each individual discussion and check which statistical tools he has used, and how he has used them.

    I am also a member of the student committee of the program I attend; we sent a letter to Hattie asking what he thought of Topphol's criticism, and in his reply he said, among other things, that he would not be using the CLE in his next book, which is a kind of "for the teacher" version of "Visible learning".

    I hope that provided some answers!

  3. Thanks, that was clarifying and interesting. I do not have access to the UiO network, but I can always take a trip up to UB (the university library) to read Topphol's article. Keep up the good work ;-)

  4. My pleasure, and good luck with Topphol's article; it is quite a mouthful!

  5. Thanks for your post, very well written and explained. I always had doubts about Hattie's research; it seems to conflict with my own and many others' experience.
    To see that he has made a major mistake in the CLE calculation and interpretation is informative. I have read other critiques, and it seems he has made many more errors, from misrepresenting studies to mixing up the x and y axes on graphs.
