Many researchers have reported that, after about age 20, scores on some subtests of intelligence tests decline, while scores on other subtests rise or stay the same and do not decline until much later. Subtests that fit the former pattern are typically measures of fluid intelligence, the dimension of intelligence that reflects the ability to solve problems without applying outside information, while subtests with the latter trajectory are generally measures of crystalized intelligence, the ability to solve problems that do require outside information. Here are some pertinent quotes from the literature:
[F]or each variable [i.e., subtest] there is an increase until about the late teens or early 20s, followed either by a decrease for the process variables [similar to fluid intelligence], or by a period of stability and then a decline at around age 50 or 60 for product variables [similar to crystalized intelligence]. (Salthouse, 2010, p. 27)
Scores for fluid intelligence (e.g., short-term memory) peak early in adulthood, whereas measures of crystalized intelligence (e.g., vocabulary) peak in middle age… (Hartshorne & Germine, 2015, p. 433)
The growth of gf [i.e., fluid intelligence] is steady but relatively rapid and reaches a maximum in the late teens or early twenties, after which it shows a gradual decline…Scores highly loaded in gc [i.e., crystalized intelligence] on the other hand, show a more gradual increase from infancy to maturity, and…decline does not set in until relatively late in life, usually not until about 60 to 70 years of age. (Jensen, 1980, p. 235).
The truth, however, is that intelligence actually grows into at least middle adulthood (45-60), irrespective of whether it is measured using subtests categorized as being related to fluid or crystalized intelligence.
Issues with Cross-Sectional Studies
All of the research summaries cited above relied entirely on cross-sectional studies for their conclusions: IQ was assessed at a single point in time for people of various ages, and the age-related pattern was examined. The problem with this approach is that, within such a sample, age is perfectly correlated with year of birth, and year of birth is itself negatively correlated with IQ when samples of the same age are measured at different times. This rise in scores across birth cohorts is known as the Flynn effect.
The bias that the Flynn effect introduces into cross-sectional research is perhaps best illustrated using this graph (Raven, 2008):
It shows the results of two standardization samples for the Standard Progressive Matrices (SPM), one from “circa 1942” and the other from 1992. Those born in 1887, and therefore aged approximately 65 at the time of testing, had a median score of about 24, while those born in 1927 and aged 65 at the latter date obtained a median score of approximately 48! Among the youngest participants included in this figure, twenty-year-olds born in either 1922 or 1972, the difference in median scores is about 8 points, or about one standard deviation.1 Furthermore, comparing those of the same birth year shows surprisingly little change over time. As the note on the graph says, those born in 1922 obtained approximately the same score at age 20, when tested in the first study, as participants in the second study who were also born in 1922 but were then aged 70; the difference is less than one point. This means that, when year of birth is the same, a seventy-year-old performs as well on this test as a twenty-year-old, despite the fact that the SPM is considered an almost pure measure of fluid intelligence.
Merrill Hiscock (2007), using essentially the same data as presented above, calculated that the Flynn effect produced an increase of 0.36 points per year on the SPM, while the cross-sectional difference by age is 0.51 points per year; this means that 71% of the age difference between the youngest and oldest groups (25 and 65 years old) can be explained by bias from the Flynn effect. His estimate suggests that the real decline between ages 25 and 65 is only about 6 points, rather than the roughly 20 points implied before adjustment; that is, less than one standard deviation, rather than about 2.5 SDs.2
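Hiscock's arithmetic is easy to reproduce; the two rates below come from the source, and the 40-year span corresponds to the 25- and 65-year-old groups:

```python
flynn_rate = 0.36   # SPM points gained per year across cohorts (Hiscock, 2007)
cross_rate = 0.51   # apparent SPM points lost per year of age, cross-sectionally
years = 65 - 25     # span between the youngest and oldest groups

share_explained = flynn_rate / cross_rate          # portion of the age gap
unadjusted_gap = cross_rate * years                # raw cross-sectional gap
adjusted_gap = (cross_rate - flynn_rate) * years   # gap net of the Flynn effect

print(round(share_explained, 2))  # 0.71
print(round(unadjusted_gap, 1))   # 20.4
print(round(adjusted_gap, 1))     # 6.0
```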
Later, Dickinson and Hiscock (2010) attempted to recreate this analysis with the WAIS tests. They did this for the normalization samples of three editions of the test—the WAIS-R, WAIS-III, and WAIS-IV. The bias resulting from the Flynn effect was calculated as follows:
For the WAIS-R, the IQ difference between those who were administered both the original WAIS and the WAIS-R (n=78) was used to calculate the rise in IQ between the normalizations of these two tests (~24.5 years), and used that rate to estimate a rise over 50 years. Because an IQ of 100 corresponds to the average of the normalization sample, the extent to which the average scores of the same sample is lower on the more recently normed test is a good estimate of the average rise in IQ during the intervening years.
For the WAIS-III, the same was done using a sample of those who took both the WAIS-III and WAIS-R (n=192). The norms for these two tests were made about seventeen years apart.
Lastly, for the WAIS-IV, the prior estimates of the WAIS-R and WAIS-III were used.
These researchers then reduced the difference between 20- and 70-year-olds on each test by the amount that the Flynn effect was estimated to bias it upward. They found that, on the WAIS-R, the 70-year-old sample had an IQ only 2.5 points lower than the 20-year-old sample after the adjustment, compared to 16.5 points without it. The same adjusted difference of 2.5 points was found for the WAIS-III, while, on the WAIS-IV, the Flynn effect completely explained the difference. When verbal subtests (more indicative of crystalized intelligence) and performance subtests (more indicative of fluid intelligence) are separated, an interesting pattern emerges. On the WAIS-R, verbal scores increased by 7 points while performance scores decreased by 13.5 points between ages 20 and 70 (after adjusting for the Flynn effect). These numbers were +4.5 and -11 for the WAIS-III, and +8 and -9.5 for the WAIS-IV. This suggests that crystalized intelligence increases very slowly between ages 20 and 70, while fluid intelligence very slowly decreases. These results are very different from the typical estimates from cross-sectional data: that crystalized IQ doesn’t change between about ages 20 and 50-60 and then declines (see above), or that fluid IQ decreases by about 1.5 to 2 standard deviations, or about 22-30 points, “from the 20s to the 70s” (Salthouse, 2010, p. 17).
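As a rough check on the WAIS-R numbers above, the implied size of the Flynn adjustment can be backed out directly; note that the per-year rate at the end is my inference from these two figures, not a number quoted by Dickinson and Hiscock:

```python
# WAIS-R figures reported above (Dickinson & Hiscock, 2010): the 20- vs
# 70-year-old IQ gap before and after the Flynn-effect adjustment.
unadjusted_gap = 16.5   # IQ points, raw cross-sectional difference
adjusted_gap = 2.5      # IQ points, after adjustment

# The adjustment itself is the Flynn effect accumulated over the 50-year
# age span; the per-year rate is implied by these two numbers.
flynn_over_span = unadjusted_gap - adjusted_gap
flynn_per_year = flynn_over_span / (70 - 20)

print(flynn_over_span)           # 14.0
print(round(flynn_per_year, 2))  # 0.28
```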
While the above studies tell us quite a bit, they do not say when cognitive ability begins to decline, and they also suffer from some methodological problems, such as the likely false assumption that the Flynn effect proceeded at a constant rate across each fifty-year period, even though that rate was estimated from only the last 24 or 17 years of the period (for the WAIS-R and WAIS-III, respectively). Luckily, there is a different type of study, which allows us to obtain more precise data and does not require any estimate of the Flynn effect.
Longitudinal Studies
Longitudinal studies of intelligence are ones in which the same sample is tested multiple times on the same test, in order to estimate how IQ changes over time without needing to worry about cohort-level effects. The largest effort in this regard has been made by K. Warner Schaie, whose famous Seattle Longitudinal Study comprises thousands of participants, some of whom have been tested with the Thurstone Primary Mental Abilities test in multiple cycles since 1956, with a new cohort of participants added every seven years. In total, six cohorts have been tested after 7 years, five after 14 years, and so on; the original participants (i.e., members of the first cohort who have not yet dropped out) have been tested 42 years apart. All of this data is summarized in Schaie’s (2005) book on the topic. Here, I can only quote from Schaie, as detailing all of the follow-ups for each cohort would take up too much room (all quotes are from chapter 5).
7-year follow-ups:
One can observe immediately that statistically significant cumulative age decrements from any previous age do not occur for any variable [i.e., subtest] prior to age 60. Several variables were found to have modest increments in young adulthood and middle age.
14-year follow-ups:
When age changes are examined over 14-year segments, such change becomes statistically significant for Number as early as age 53, for Word Fluency at age 60, and for the remaining three abilities at age 67.
21-year follow-ups:
[M]odest but significant decrements are noted for Number and Index of Intellectual Ability by age 60 and for remaining variables by age 67. Cumulative decrements estimated from the three samples that cover the entire age range from 25 to 88 years amount to 0.5 SD for Verbal Meaning, 0.7 SD for Spatial Orientation and Inductive Reasoning, 1.0 SD for Number, and 0.9 SD for Word Fluency.
28-year follow-ups:
[S]ignificant decrements over the 28-year segments are first observed for Number and the Index of Intellectual Ability by age 60; for Inductive Reasoning, Word Fluency, and the Index of Educational Aptitude by age 67; and for Verbal Meaning and Spatial Orientation by age 74.
35-year follow-ups:
In these data, significant decline is only observed for Number, Word Fluency, and the Index of Intellectual Ability only by age 67 and for the remaining variables by age 74.
42-year follow-ups:
Findings are quite similar to those for individuals followed for 28 years [but the sample size is only 37].
All of these data converge on the conclusion that there is no decline until at least about age 60, but that the decline after around age 70 is very steep, so that by the late eighties IQ has fallen by something like 10 points relative to age 25. Though this is by far the biggest longitudinal study in terms of sample size and number of waves, these results should be supplemented with other data. Below is a table summarizing several of the studies I was able to find, with change expressed in + or - SDs.3 (Note: the table below may not load properly on the Substack app.)
Every study that includes a follow-up at or before age 50 agrees that full-scale IQ rose considerably up to that age. Studies examining change in overall IQ after age 50 agree that it declines very slowly or does not change at all: Owens (1966) found a tiny, statistically insignificant decline between ages 50 and 61, while Mortensen and Kleven (1993) found a decline of a tenth of a standard deviation between ages 50 and 60, and of a fifth between 60 and 70. The two studies examining decline after age 70 found it to be fairly steep: a quarter of a standard deviation between 75 and 79, and a third of a standard deviation between 70 and 82. To summarize, these data suggest that even by age 60, and perhaps even by age 70, the average person is more intelligent than they were at age 20: IQ increases until age fifty, begins to decline only after about age 60, and declines quickly only after age 70. It should perhaps be emphasized that the two studies with the longest follow-up periods found that IQ increased by approximately one standard deviation between ages 11 and 77!
What about fluid versus crystalized intelligence? The image below (Salthouse, 2010) shows the change per year by subtest in longitudinal studies, based on data from McArdle and colleagues (2002), a study I did not include in the summary above because the number of follow-up tests, and the time between administrations, differed for each participant.
In the longitudinal data (the black circles and triangles), scores on all Woodcock-Johnson (W-J) subtests rise with age in the under-50 data, and all but one (long-term retrieval) rise in the over-50 data. This means that, in general, scores on all types of intelligence, whether fluid or crystalized, increase with age until at least around 50, and, for the most part, even after age 50, though it is unclear for how long. In fact, the greatest growth occurs on the fluid reasoning and visual processing subtests of the W-J, which are measures of fluid intelligence. Similarly, the Seattle Longitudinal Study data reviewed earlier show essentially no decline on any subtest before around age 60, irrespective of which cognitive dimension it most measures. A recent study by Tucker-Drob and colleagues (2022) found that individual-level declines in crystalized and fluid IQ were highly correlated, suggesting that the two variables shouldn’t be contrasted in aging research, though they did find slightly more decline in fluid IQ between ages 50 and 60 (0.09 SDs instead of 0.04 SDs). To summarize, intelligence increases at least until 50; one’s IQ at age 70 is, on average, probably not much lower than it was at age 20, and perhaps even higher; and both crystalized and fluid intelligence show this pattern, with the distinction between them having been overemphasized in the study of cognitive aging. Most likely, the reason for this overemphasis has been the focus on cross-sectional studies, which are biased by the Flynn effect, an effect that also happens to be much stronger on tests of fluid intelligence (Pietschnig & Voracek, 2015).
(Non-)Issues with Longitudinal Studies
Unfortunately, even longitudinal studies do not provide unbiased estimates. The most commonly cited flaw is that, in longitudinal studies, participants take the same test multiple times, which biases scores upward at the follow-up waves due to the practice effect (e.g., Salthouse, 2010, pp. 42-43). This issue is difficult to resolve, as, by definition, participants in a longitudinal study must be tested at least twice, and the testing must use the same test throughout, as otherwise comparison is essentially impossible.4 I am aware of only one study that compared members of the same birth cohort who did not participate in initial testing but did participate in the follow-up, making it possible to see whether those who had never taken the test scored better or worse, without any concern about the Flynn effect. This was done by Larsen and colleagues (2008), who found no evidence for a practice effect (i.e., those who had taken the test at initial testing had the same scores as those who hadn’t). However, this result may actually be illusory; they noted,
Unfortunately we could not definitively rule out the possibility that the subjects who appeared to have been tested only once…in fact also took the tests at the time of induction [i.e., at initial testing], but that the results somehow got lost. The archival nature of the data did not allow us to dig further into this possibility. (p. 33)
Another way to estimate the bias from the practice effect is to use a quasi-longitudinal design, in which multiple cross-sectional studies are conducted with a significant amount of time separating each one (e.g., five years). These data are then used to estimate IQ at a given age independent of cohort (since the cross-sectional studies are not all from the same year, age is not completely determined by birth year). This design is similar to the studies cited above in which the bias caused by the Flynn effect was calculated and used to estimate the true IQ gap between young adults and those in later middle age. Timothy Salthouse (2019) estimated longitudinal, cross-sectional, and quasi-longitudinal age-related changes in a large sample collected over a long span of time. Participants in the longitudinal studies differed in the time between first and final testing (at the third wave), but this interval averaged six years; the quasi-longitudinal estimates were therefore also based on individuals born in the same year and tested six years apart. The data are displayed in the following graph:
Here, I will focus on Reasoning and Vocabulary, because these were measured using tests most similar to traditional IQ tests.5 It can be seen that, while the longitudinal data show an increase in both measures until the 60s, this is not the case for the quasi-longitudinal data, where Reasoning starts declining in the twenties and Vocabulary scores begin to decrease in the forties. More precise estimates are provided by Salthouse by age decade (Table 2). He estimates that the annual decline in Reasoning varies between 0.003 and 0.025 SDs from the twenties through the eighties, but with some weird results: the lowest decline is in the eighties (0.003 SDs per year) and the second highest is in the twenties (0.019 SDs per year). Still, the overall pattern matches expectations. Between ages 25 and 45, the decline averaged 0.012 SDs per year, while the figure was 0.025 between ages 65 and 85. For Vocabulary, there is no significant decline (using an alpha of 0.01) in any age decade, but the majority of the signs are negative. For ages 25-45, there is on average a decline of 0.003 SDs per year, and for ages 65-85, of 0.012 SDs per year; neither number is significant.
Salthouse’s data show a slow decrease in Reasoning ability until middle age, and one roughly twice as fast in late adulthood. In total, the decline between ages 25 and 45 would equal approximately 0.24 standard deviations, or 3.6 IQ points; between ages 65 and 85, the loss would be 7.5 points. The Vocabulary data show no statistically significant decline, but, if they are nonetheless taken at face value, they imply a decline close to zero between ages 25 and 45 and of 3.6 points between ages 65 and 85. No aggregate figure is given for ages 45 to 65, but these data probably suggest a loss of around one standard deviation in reasoning ability between ages 25 and 85, and little loss in Vocabulary. In terms of fluid and crystalized intelligence, this means a loss of 1 SD of fluid IQ but of at most perhaps a quarter of a standard deviation in crystalized IQ between young adulthood and old age. The decline in fluid IQ is somewhat similar to the adjusted estimates from the SPM (a little under one SD) and from the WAIS tests (9.5 to 13.5 points) between ages 20 and 70. However, the slight loss, or perhaps no change, in crystalized IQ is very different from the WAIS data, which show a sizeable increase (4.5-8 points). Finally, it is worth comparing the quasi-longitudinal results to the cross-sectional ones. For Reasoning, the cross-sectional decline was twice as high in both age ranges (i.e., 25-45 and 65-85), while the numbers were almost the same for Vocabulary, though comparison in this case is less informative given the lack of statistical significance.
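The conversions in this paragraph can be verified in a few lines; the annual rates come from the summary of Salthouse's Table 2 above, and a conventional IQ standard deviation of 15 is assumed:

```python
SD_IQ = 15.0  # conventional IQ standard deviation

def cumulative_loss(sd_per_year, start_age, end_age):
    """Total decline, in SDs and IQ points, at a constant annual rate."""
    sd = sd_per_year * (end_age - start_age)
    return round(sd, 2), round(sd * SD_IQ, 1)

print(cumulative_loss(0.012, 25, 45))  # Reasoning, early:  (0.24, 3.6)
print(cumulative_loss(0.025, 65, 85))  # Reasoning, late:   (0.5, 7.5)
print(cumulative_loss(0.012, 65, 85))  # Vocabulary, late:  (0.24, 3.6)
```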
Schaie (2005) also conducted a quasi-longitudinal analysis of IQ change within the Seattle Longitudinal Study. Just like Salthouse, he tried to eliminate practice effects by comparing those born in the same year but tested at different times; recall that, in this study, a new cohort was introduced every seven years, giving him six different cross-sectional datasets collected over 42 years to work with. However, unlike Salthouse, Schaie adjusted the longitudinal data for attrition. After this adjustment was made, none of the practice effects were significant (with an alpha of 0.05, one-tailed) “except for Verbal Meaning in Sample 1 from T₁ to T₂ and from T₆ to T₇” (p. 201). Practice effects therefore appear to have been minimal, and there is little reason to worry about them.
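The logic of the quasi-longitudinal comparisons used by both Salthouse and Schaie can be sketched with made-up numbers: two cross-sections run six years apart, with change estimated within the same birth cohort, so that no one is retested (no practice effect) and cohort is held constant (no Flynn-effect confound):

```python
# Mean scores by birth year in two hypothetical cross-sections run six
# years apart (all numbers invented for illustration).
wave_2000 = {1950: 102.0, 1960: 104.0, 1970: 106.0}
wave_2006 = {1950: 101.5, 1960: 103.8, 1970: 106.1}

# Within each birth cohort, the difference between waves estimates the
# effect of six years of aging at that cohort's age.
for birth_year in sorted(wave_2000):
    age_then = 2000 - birth_year
    change = wave_2006[birth_year] - wave_2000[birth_year]
    print(f"born {birth_year}: age {age_then} -> {age_then + 6}, change {change:+.1f}")
```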
Another way to check whether the practice effect had a major influence is to see whether results from different waves pass a test of measurement invariance. (For a description of what that means, see this endnote.)6 This is because practice effects are perfectly negatively correlated with g loading (Nijenhuis et al., 2007); so, if a small decline, or no change, is masked by a practice-effect-driven increase, each subtest’s observed score is shifted upward more the less g loaded it is, which changes the pattern of factor loadings across waves and thus violates metric invariance. This is intuitive, but should nonetheless probably be proved mathematically, because I have never seen it directly stated. However, to quote Keynes, “[t]hose who (rightly) dislike algebra will lose little by omitting…[this] section” (1936/1964, p. 280, fn. 1); for most people, this should be intuitive.
The traditional factor model equation looks like this:

$$X_i(t) = \lambda_i g(t) + \epsilon_i(t)$$
Where Xi(t) refers to the observed score on subtest i at testing wave t and g(t) refers to latent general intelligence at wave t, while λi refers to the factor loading of subtest i on g, and the residual is represented by ϵi(t). For waves after the first, there will be a practice effect, which is perfectly negatively correlated with subtest factor loading, so that

$$X_i(t+1) = \lambda_i g(t+1) + P_i + \epsilon_i(t+1)$$
And

$$P_i = -k\lambda_i$$
Where Pi refers to the change caused by the practice effect, and t+1 refers to the wave that succeeds the initial testing; k is a scalar denoting the magnitude of the practice effect between waves t and t+1. The second equation makes the practice effect on subtest i proportional both to the overall magnitude of the practice effect (k) and, negatively, to subtest i’s factor loading (λi), so that the change in a subtest’s observed score due to the practice effect is perfectly negatively related to its factor loading: more g-loaded subtests experience less gain, and vice versa. Substituting -kλi for Pi, we get

$$X_i(t+1) = \lambda_i g(t+1) - k\lambda_i + \epsilon_i(t+1)$$
This way of expressing the equation simply emphasizes that the practice effect shifts the observed score on a given subtest by an amount determined by that subtest’s factor loading. Factoring out λi, this can be rewritten as:

$$X_i(t+1) = \lambda_i \left(g(t+1) - k\right) + \epsilon_i(t+1)$$
This clearly shows that metric invariance will be violated: because the practice effect enters each subtest in proportion to its loading, the loadings recovered from the retest data will be distorted by k, the overall magnitude of the practice effect.7 By definition, any change in subtest loadings between waves, irrespective of its cause, makes measurement invariance untenable. If the practice effect is very small, this might not be a serious issue, but those arguing against longitudinal studies claim that the sizeable increase in IQ between young adulthood and age 50 or perhaps even 60 is, in reality, a sizeable decrease masked by a practice effect. An effect of such a magnitude would have to be huge (see below), and would certainly be identifiable in a test of measurement invariance.
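The algebra above is simple enough to verify numerically; this sketch just confirms that the substituted and factored forms are the same function:

```python
import random

# Check that λ·g(t+1) + P_i + ε, with P_i = -k·λ, equals λ·(g(t+1) - k) + ε
# for arbitrary values of the parameters.
random.seed(0)
for _ in range(1000):
    lam, g_next, k, eps = (random.uniform(-2, 2) for _ in range(4))
    substituted = lam * g_next - k * lam + eps
    factored = lam * (g_next - k) + eps
    assert abs(substituted - factored) < 1e-12
print("identity verified")
```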
Here was Schaie’s (2005) conclusion from his examination of the measurement invariance issue in his data: “invariance within groups across time can be accepted” (p. 212); in other words, measurement invariance held for comparisons of the same cohort across different waves. Specifically, keeping the factor loadings the same for all waves “resulted in a slight but statistically nonsignificant reduction in fit” (p. 212). While tests of the same cohort across time were measurement invariant, Schaie found that this was not true when comparing different cohorts at the same or at different times (pp. 212-15). This is not surprising, given that the Flynn effect is moderately negatively correlated with g loading, at approximately -.38 (Nijenhuis & Flier, 2013)8, and that measurement invariance has been found not to hold between representative samples tested at different times (Wicherts et al., 2004). This means that cross-sectional studies should probably be dismissed out of hand.
Before concluding this section, it should be noted that the longitudinal data, taken at face value, simply do not look like an artifact of a practice effect. First, the effect would have to be much larger than it is typically estimated to be. Empirical studies that give the same subject the same test a few days, weeks, or perhaps months apart usually find a gain of around one third of an SD (Jensen, 1998, pp. 314-315). If, for example, fluid intelligence really declines by 0.24 SDs between ages 25 and 45 (as Salthouse’s quasi-longitudinal study indicated) rather than increasing by about half a standard deviation (as is observed in longitudinal studies), the practice effect on fluid IQ would have to equal around three fourths of a standard deviation, more than twice its typical estimated size. Note also that the 1/3 SD gain comes from studies where tests are given in short succession, whereas in longitudinal designs they are typically given decades apart; the only meta-analysis I am aware of found a strong negative relationship between the time between tests and the magnitude of the practice effect (Hausknecht et al., 2007). Furthermore, initial IQ is not related to the rate of decline (e.g., Deary et al., 1998), while practice effects tend to be larger for those with higher scores at first testing (Jensen, 1980, p. 590).
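The back-of-the-envelope comparison in this paragraph, using the figures already given:

```python
# Quasi-longitudinal estimate of the true change in fluid ability from
# age 25 to 45, versus what longitudinal studies actually observe.
true_change = -0.24       # SDs (Salthouse, 2019)
observed_change = 0.50    # SDs, approximate longitudinal gain

# The practice effect would have to account for the entire discrepancy.
implied_practice = observed_change - true_change
typical_practice = 1 / 3  # SDs, short-interval retest gain (Jensen, 1998)

print(round(implied_practice, 2))                     # 0.74
print(round(implied_practice / typical_practice, 1))  # 2.2
```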
Conclusion
While longitudinal studies are obviously not perfect, they are much better than the alternatives, namely cross-sectional and quasi-longitudinal methods. Whatever slight practice effects there may be, the pattern longitudinal studies show is that both fluid and crystalized intelligence increase until age 50 or 60 and only begin to decline quickly after age 70, such that the average 70-year-old is likely significantly more intelligent than the average 20-year-old; indeed, based on the two studies with the longest intervals between initial testing and follow-up, those around the age of 80 are about one standard deviation more intelligent than they were at age 11.
It is also worthwhile to summarize the conclusions from the inferior methods. Studies based on test norms adjusted for the Flynn effect show a decline of approximately 0.67-0.8 SDs in fluid IQ and an increase of about 0.33-0.5 SDs in crystalized IQ between ages ~20 and ~70. Quasi-longitudinal data suggest essentially no decline in crystalized intelligence between ages 25 and 85 (at most, about 0.25 SDs), but a decline of approximately one SD in fluid intelligence. And, finally, cross-sectional studies suggest no decline in crystalized IQ but a decline of about 2-3 SDs in fluid IQ between one’s 20s and 70s. In the table below, I summarize the results from each method and assign each one a Rainbow Six Siege rank. Longitudinal studies are, by far, the best. Studies using cross-sectional data from test norms adjusted for the Flynn effect and quasi-longitudinal studies are equally good, as they are essentially the same method, with the former minimizing sampling bias and using much better tests (the SPM and WAIS), and the latter adjusting for cohort effects more accurately. The worst method, by far, is the pure cross-sectional study, which should never be the basis of any conclusion about aging and cognitive ability.
References
Bradway, K. & Thompson, C. (1962). Intelligence at Adulthood: A Twenty-Five Year Follow-Up. Journal of Educational Psychology, 53(1), 1-14.
Corley, J., Conte, F., Harris, S., Taylor, A., Redmond, P., Russ, C., Deary, I., & Cox, S. (2022). Predictors of Longitudinal Cognitive Ageing from Age 70 to 82 Including APOE e4 Status, Early-Life and Lifestyle Factors: The Lothian Birth Cohort 1936. Molecular Psychiatry, 28, 1256-1271.
Dam, F. & Raven, J. (2008). Does the “Flynn Effect” Invalidate the Interpretations Placed on Most of the Data Previously Believed to Show a Decline in Intellectual Abilities with Age?, in J. Raven & J. Raven (Eds.) Uses and Abuses of Intelligence: Studies Advancing Spearman and Raven’s Quest for Non-Arbitrary Metrics (pp. 258-287). Royal Fireworks Press.
Deary, I., McLennan, W., & Starr, J. (1998). Is Age Kinder to the Initially More Able?: Differential Ageing of a Verbal Ability in the Healthy Old People in Edinburgh Study. Intelligence, 26(4), 357-375.
Deary, I., Whalley, L., Lemmon, H., Crawford, J., & Starr, J. (2000). The Stability of Individual Differences in Mental Ability from Childhood to Old Age: Follow-up of the 1932 Scottish Mental Survey. Intelligence, 28(1), 49-55.
Deary, I., Whiteman, M., Starr, J., Whalley, L., & Fox, H. (2004). The Impact of Childhood Intelligence on Later Life: Following Up the Scottish Mental Surveys of 1932 and 1947. Journal of Personality and Social Psychology, 86(1), 130-147.
Dickinson, M. & Hiscock, M. (2010). Age-related IQ decline is reduced markedly after adjustment for the Flynn effect. Journal of Clinical and Experimental Neuropsychology, 32(8), 865-870.
Hartshorne, J. & Germine, L. (2015). When Does Cognitive Ability Peak? The Asynchronous Rise and Fall of Different Cognitive Abilities Across the Life Span. Psychological Science, 26(4), 433-443.
Hausknecht, J., Halpert, J., Di Paolo, N., & Gerrard, M. (2007). Retesting in Selection: A Meta-Analysis of Coaching and Practice Effects for Tests of Cognitive Ability. Journal of Applied Psychology, 92(2), 373-385.
Hiscock, M. (2007). The Flynn effect and its relevance to neuropsychology. Journal of Clinical and Experimental Neuropsychology, 29(5), 514-529.
Hu, Meng. (2025). Spearman's g Explains Black-White but not Sex Differences in Cognitive Abilities in the Project Talent. OpenPsych, https://doi.org/10.26775/op.2025.07.18.
Jensen, A. (1980). Bias in Mental Testing. The Free Press.
Jensen, A. (1998). The g Factor: The Science of Mental Ability. Praeger Publishers.
Keynes, J. (1936/64). The General Theory of Employment, Interest, and Money. Harper Business.
Larsen, L., Hartmann, P., & Nyborg, H. (2008). The Stability of General Intelligence from Early Adulthood to Middle-Age. Intelligence, 36, 29-34.
Mortensen, E. & Kleven, M. (1993). A WAIS Longitudinal Study of Cognitive Development During the Life Span from Ages 50 to 70. Developmental Neuropsychology, 9(2), 115-130.
McArdle, J., Ferrer-Caja, E., Hamagami, F., & Woodcock, R. (2002). Comparative longitudinal structural analyses of the growth and decline of multiple intellectual abilities over the life span. Developmental Psychology, 38(1), 115-142.
Nijenhuis, J., Vianen, A., & Flier, H. (2007). Score Gains on g-Loaded Tests: No g. Intelligence, 35(3), 283-300.
Nijenhuis, J. & Flier, H. (2013). Is the Flynn Effect on g?: A Meta Analysis. Intelligence, 41(6), 802-807.
Owens, W. (1953). Age and Mental Abilities: A Longitudinal Study. Genetic Psychology Monographs, 48, 3-54.
Owens, W. (1966). Age and Mental Abilities: A Second Adult Follow-Up. Journal of Educational Psychology, 57(6), 311-325.
Pietschnig, J. & Voracek, M. (2015). One Century of Global IQ Gains: A Formal Meta-Analysis of the Flynn Effect (1909-2013). Perspectives on Psychological Science, 10(3), 282-306.
Raven, J. (2008). The Raven Progressive Matrices Tests: Their Theoretical Basis and Measurement Model, in J. Raven and J. Raven (Eds.) Uses and Abuses of Intelligence: Studies Advancing Spearman and Raven’s Quest for Non-Arbitrary Metrics (pp. 17-68). Royal Fireworks Press.
Salthouse, T. (2010). Major Issues in Cognitive Aging. Oxford University Press.
Salthouse, T. (2019). Trajectories of normal cognitive aging. Psychology and Aging, 34(1), 17-24.
Schwartzman, A., Pushkar, D., Andres, D., Arbuckle, T., & Chaikelson, J. (1987). Stability of Intelligence: A 40-Year Follow Up. Canadian Journal of Psychology, 41(2), 244-256.
Tucker-Drob, E., Fuente, J., Köhncke, Y., Brandmaier, A., Nyberg, L., & Lindenberger, U. (2022). A strong dependency between changes in fluid and crystallized abilities in human cognitive aging. Science Advances, 8(5), eabj2422.
Wicherts, J., Dolan, C., Hessen, D., Oosterveld, P., Baal, G., Boomsma, D., & Span, M. (2004). Are intelligence tests measurement invariant over time? Investigating the nature of the Flynn effect. Intelligence, 32(5), 509-537.
Wicherts, J. (2007). Group Differences in Intelligence Test Performance (doctoral dissertation).
Woodley, M., Nijenhuis, J., Must, O., & Must, A. (2014). Controlling for increased guessing enhances the independence of the Flynn effect from g: The return of the Brand effect. Intelligence, 43, 27-34.
It is difficult to find information on the standard deviation of the Standard Progressive Matrices test, as scores are typically presented as percentiles rather than z scores. However, the few standardization studies that report the SD among their summary statistics typically converge on a range of 7 to 10 points.
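As an illustration of how an SD can be backed out of percentile norms when only percentiles are published, the sketch below converts a percentile to a z score and solves for the SD. The score, mean, and percentile used here are made-up values for demonstration, not figures from any actual SPM standardization:

```python
# Hypothetical sketch: if a norms table says a raw score of 44 falls at the
# 75th percentile and the mean raw score is 39, the implied SD can be
# recovered by converting the percentile to a z score (assuming scores are
# roughly normally distributed).
from statistics import NormalDist

def sd_from_percentile(score, mean, percentile):
    """Estimate the SD implied by one (score, percentile) pair."""
    z = NormalDist().inv_cdf(percentile)  # percentile -> z score
    return (score - mean) / z

# Made-up norm-table values; the 75th percentile corresponds to z ~= 0.67:
print(round(sd_from_percentile(44, 39, 0.75), 1))  # ~7.4
```

Repeating this across several (score, percentile) pairs and averaging gives a more stable estimate than any single pair.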
See endnote 1 above for the fact that the SD for the SPM is not generally known.
Whenever possible, the SDs from the initial waves were used.
This is because different tests are normed in different years, so their scale scores are not strictly comparable due to the Flynn effect; they are also typically normed on slightly different populations and differ in content. Alternate forms of the same test, which are very similar to each other, are sometimes produced, but most tests released in the last half century lack them.
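One crude way to put scores from tests normed in different years on a common footing is to deduct the inflation accrued since each test's norming year. The sketch below assumes a Flynn effect of roughly 0.3 IQ points per year; both that rate and the choice of reference year are illustrative assumptions, not a standard correction:

```python
# Assumed average Flynn-effect rate (~3 IQ points per decade); illustrative only.
FLYNN_POINTS_PER_YEAR = 0.3

def adjust_to_reference_norms(score, norm_year, reference_year):
    """Re-express a score against newer norms by removing norm-year inflation."""
    return score - FLYNN_POINTS_PER_YEAR * (reference_year - norm_year)

# A score of 100 on a test normed in 1980, re-expressed against 2000 norms:
print(adjust_to_reference_norms(100, 1980, 2000))  # 94.0
```

This only addresses the norm-year problem; differences in norming populations and test content remain.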
Reasoning was indexed with the Raven’s Progressive Matrices, Letter Sets (essentially the Raven’s Matrices format, except that the patterns are made with letters instead of shapes), and the Shipley Abstraction test (somewhat like the Similarities subtest of the WAIS). Vocabulary was measured using the WAIS-III Vocabulary subtest, the Woodcock-Johnson Picture Vocabulary subtest, and a vocabulary test developed by Salthouse. These two composites had alphas of 0.85 and 0.91, respectively (the former is tied for second highest and the latter is the highest of the four measures).
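The alphas above come from Salthouse's data; for readers unfamiliar with the statistic, here is a minimal sketch of how Cronbach's alpha is computed from a subjects-by-subtests score matrix, with simulated data standing in for the real subtest scores:

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_subjects, n_subtests) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of subtest variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of composite score
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Simulated data: three subtests sharing a common factor plus noise.
rng = np.random.default_rng(0)
g = rng.normal(size=500)
scores = np.column_stack([g + rng.normal(scale=0.6, size=500) for _ in range(3)])
alpha = cronbach_alpha(scores)
print(round(alpha, 2))  # high, since the subtests share a strong common factor
```

Alpha rises with both the number of subtests and the strength of their intercorrelations, which is why three-subtest composites like these can reach the high 0.8s.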
In simple terms, measurement invariance means that the instrument used to estimate a trait that cannot be directly observed works in the same way in both groups being compared. In this case, the instruments are IQ tests, and the trait of interest is the latent trait g, the general factor of intelligence. The most common test of measurement invariance checks whether the subtests show the same correlations with general intelligence in both groups (that is, whether a given subtest has the same correlation with g in each group). If the test is working in the same way in both groups, the pattern of subtest correlations with g should be the same. The extent to which this pattern differs between groups is typically assessed indirectly, using confirmatory factor analysis: one first checks how well the data fit a certain hierarchical model, then checks how much the fit of that model deteriorates when the subtests' correlations with g (their factor loadings) are constrained so that they cannot vary between the groups being compared. If the fit is significantly worse, the factor loadings differ between groups, and the test is therefore, to some extent, not measuring g in the same way in both. This is only one type of measurement invariance test, and it is the only one I will describe because it is the one used by Schaie. As far as I know, the best introduction to measurement invariance testing is Jelte Wicherts’ thesis, Group Differences in Intelligence Test Performance (2007). Meng Hu (who was kind enough to review some portions of this post) has also written a relatively beginner-friendly explanation in his study of race and sex differences in Project Talent data, which is available on his Substack and on OpenPsych (Hu, 2025). Also see Arthur Jensen’s Bias in Mental Testing (1980) for a description of many methods of assessing measurement invariance that do not involve factor analysis.
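The intuition behind the loading-comparison step can be sketched with a toy simulation (not a full confirmatory factor analysis): two groups are generated from the same one-factor model, and each subtest's correlation with a crude proxy for g (the total score) is compared across groups. All names and parameter values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_group(n, loadings):
    """Generate subtest scores from a one-factor model with the given loadings."""
    g = rng.normal(size=(n, 1))                       # latent general factor
    noise = rng.normal(size=(n, len(loadings)))       # subtest-specific noise
    return g * loadings + noise * np.sqrt(1 - np.array(loadings) ** 2)

loadings = [0.8, 0.6, 0.7, 0.5]        # same true loadings in BOTH groups
group_a = simulate_group(2000, loadings)
group_b = simulate_group(2000, loadings)

def subtest_g_correlations(scores):
    """Correlation of each subtest with the total score (a crude g proxy)."""
    total = scores.sum(axis=1)
    return [np.corrcoef(scores[:, j], total)[0, 1] for j in range(scores.shape[1])]

print(np.round(subtest_g_correlations(group_a), 2))
print(np.round(subtest_g_correlations(group_b), 2))
# The two patterns should be nearly identical, consistent with invariance;
# if the groups were generated with different loadings, they would diverge.
```

A real invariance test would fit the factor model directly and compare the fit of constrained versus unconstrained models, but the underlying question is the same: do the subtests relate to g identically in both groups?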
It should be emphasized that this equation, while true, is not practically useful. Its purpose is only to demonstrate as clearly as possible that the factor loading would be partly a function of the magnitude of the practice effect.
This correlation is close to 1 after adjusting for the fact that increased guessing has reduced the observed g loadings, since guessing is more common and more beneficial on harder questions, which are themselves more g-loaded (Woodley et al., 2014). However, only the observed correlation matters for measurement invariance, unless subtest scores are for some reason adjusted for guessing before the confirmatory factor analysis is conducted.