Standardized language tests: That score might not mean what you think it means

If grad school feels like a long time ago, come grab a refresher on what those test scores actually mean for your clients. 

October 20, 2020

This review was updated from the original version in October 2021. 


I know that you know that standardized language tests sometimes get it wrong. We’ve all had a kid we were really concerned about who kept scoring in the normal range, or a kid who really seemed okay but randomly bombed a subtest or scored just below a cutoff.
So now what? Is the test wrong? Are you wrong? Where do we go from here?
With this Ask TISLP, we’ll show you the evidence to help you know:

  • how to choose the best tests for kids on your caseload,
  • how to interpret standardized language test results, and
  • how to make realistic recommendations.

By knowing when standardized tests aren’t cutting it (and why), you’ll be able to do things like, say, explain to an administrator or a parent why the percentile ranks you’re reporting might not align with your recommendation for service eligibility. Because while we don’t want to be anti-standardized test, we do want to be pro-cautious interpretation of standardized test results.


First, why are published tests sometimes subpar?


It’s hard to believe that an expensive language test published by a reputable company might not do a good job of telling us about a child’s language skills. While there’s no one reason for this, here are a few possibilities:

  • We need more from standardized tests than they can reasonably give us. What we need in the time we have for completing an assessment is a standard score, a case for service eligibility, and a plan for treatment. No test can give us all of that in the time we’re able to spend administering and scoring it.
  • Language is a big, complicated thing. I struggle to explain to parents what language is, and I am a speech–language pathologist. And measuring language is arguably even harder. Our attempts to easily measure it in a small amount of time with quick-to-score tasks are just not going to work every time.
  • We keep paying for flawed tests. It’s not our fault, but publishers often can’t justify spending a ton of money fixing these issues, because that would take a massive amount of extra research and work (we’re talking millions here), and even then, the tests still wouldn’t be perfect. Especially if test quality isn’t driving sales, there’s not much incentive to address it.


But really… what tests should I buy?


You can’t answer this question without asking the follow-up question: for what? What exactly do you want to measure? Your answer will determine what aspects of a test are most important for your purposes. If you want to use the test to support a diagnosis—which is really the best use of a standardized test—pay particular attention to its diagnostic accuracy (see below).


Let’s back up and talk psychometric properties (we’ve discussed this before here). While no test is perfect, some do have better properties than others, so it’s worth doing the legwork up front to get a good one. A couple of throwback articles, one by McCauley and Swisher (1984) and one by Plante and Vance (1994), are good sources for research-based guidelines for the selection and evaluation of tests. To save some time, we made you a handy checklist that includes a simplified list of the key psychometric properties to look at and can serve as a quick and easy reference for you when you’re checking out a particular test. 


There are a few ways you can find out these stats before you buy, using published reviews (see this systematic review, this chart, Table 1 in this article, and reviews from the Leaders Project). Worst case scenario, you can always return the test (Elena Plante does!), but doing some research can save you that hassle. The bad news is that your only source for the psychometric properties of some assessments is the actual testing manual, which is not peer-reviewed. This is a good start, but ideally, we would want evidence from multiple, independent sources to verify reliability and validity. This is just one more reason why we should always use multiple methods of assessment for each of our clients.


Psychometric deep-dive: Diagnostic accuracy

Diagnostic accuracy, or how well a test identifies which children have a language disorder, is usually measured in sensitivity and specificity. We’ve talked about the importance of sensitivity and specificity before, in this review of Spaulding et al. (2006). This paper also shows you how to approximate the values using group differences, if sensitivity and specificity aren’t available in the manual.
Pause for a personal story. When I was sitting in my master’s program class on assessment, sensitivity and specificity just didn’t click for me. I remember thinking, any published test is going to have good sensitivity and specificity, so I don’t need to understand this. Wrong, *sigh*. I didn’t understand it until I had to teach it to master’s students as a PhD student. That took me hours of preparing, and the feedback I got from the master’s students was that they still didn’t understand sensitivity and specificity, so the circle of confusion is now complete. 
Why is diagnostic accuracy so hard to understand? I don’t know. The numbers aren’t complicated—it’s addition and division and percentages, and we all know how to do that. But they can feel weird and theoretical and hard to grasp. If you want to see the calculation, I really like this video (it’s a medical example). Also know that many physicians don’t understand test accuracy either, which is comforting until you realize it’s terrifying.
Anyway, the take-home is this: The lower a test’s sensitivity, the more likely it is to tell us kids are fine when they really have a language disorder (!!); the lower a test’s specificity, the more likely it is to tell us that kids have a language disorder when they really don’t (also !!). Adequate sensitivity/specificity is usually defined as higher than .80 or 80%, but even at that good-enough 80% threshold, 20% of kids with and without language disorders are getting put into the wrong groups—and not all our tests even meet those criteria! 
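If seeing the arithmetic helps, here is a minimal Python sketch of how sensitivity and specificity fall out of a simple two-by-two count. The counts are made up for illustration (they are not from any real test manual), and the function name is just a label chosen for this example. The check treats "adequate" as at or above .80, which is one reasonable reading of the guideline.

```python
def sensitivity_specificity(true_pos, false_neg, true_neg, false_pos):
    """Return (sensitivity, specificity) as proportions between 0 and 1."""
    # Sensitivity: of the kids who really have a language disorder,
    # what share did the test flag?
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: of the kids with typical language,
    # what share did the test clear?
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical validation sample (invented numbers):
# 50 kids with a language disorder, 50 kids with typical language.
sens, spec = sensitivity_specificity(true_pos=42, false_neg=8,
                                     true_neg=45, false_pos=5)

THRESHOLD = 0.80  # assumed reading of the commonly cited "adequate" level
print(f"Sensitivity: {sens:.0%} (adequate: {sens >= THRESHOLD})")
print(f"Specificity: {spec:.0%} (adequate: {spec >= THRESHOLD})")
```

In this made-up sample, 8 of the 50 kids with a language disorder were missed and 5 of the 50 typical kids were flagged, which is exactly what the two numbers summarize.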
While 80% sensitivity and specificity is the goal, we sometimes have valid reasons to use a test with lower diagnostic accuracy. A test might be the best or only option for a particular client, or we might accept lower accuracy for a really efficient screening measure. In these cases, we still want to see good sensitivity, even if specificity is a little lower. A test with low sensitivity is more likely to let kids who need our services fall through the cracks, which none of us want.
Also, one caution for understanding stats on diagnostic accuracy: percent accuracy is not a good measure. Take this example: we know that around 10% of kindergarten-age children have a language disorder. Say I made a new language test and gave it to 100 kindergarteners—10 with language disorders and 90 with typical language. My test tells me that all 100 kids have typical language skills. In other words, my test is useless. My test is also 90% accurate (eek!). To evaluate my test, we have to look at the all-important sensitivity—which is 0% (!!) because the test never correctly flagged kids with language disorders—and the specificity, which is 100% because it always ruled out language disorders in typical kids. Percent accuracy can hide a crummy test behind a misleading number.
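The example above can be checked with a few lines of Python. This is only a sketch of the scenario as described (100 kindergarteners, 10 with a language disorder, and a test that calls everyone typical); the variable names are invented for the illustration.

```python
# The scenario from the text: 100 kindergarteners, 10 with a language
# disorder, and a "test" that calls every child typical.
has_disorder = [True] * 10 + [False] * 90
test_flags_disorder = [False] * 100  # the useless test never flags anyone

pairs = list(zip(has_disorder, test_flags_disorder))
true_pos = sum(truth and flag for truth, flag in pairs)
false_neg = sum(truth and not flag for truth, flag in pairs)
true_neg = sum(not truth and not flag for truth, flag in pairs)
false_pos = sum(not truth and flag for truth, flag in pairs)

accuracy = (true_pos + true_neg) / len(pairs)
sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)

print(f"Accuracy:    {accuracy:.0%}")     # 90%
print(f"Sensitivity: {sensitivity:.0%}")  # 0%
print(f"Specificity: {specificity:.0%}")  # 100%
```

The 90% accuracy comes entirely from the 90 typical kids, which is why it says nothing about whether the test can find a language disorder.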
Another important thing to know is that sensitivity and specificity will vary at different cut-points (again, the video demonstrates this really nicely). Test manuals should give the sensitivity and specificity at different standard score cutoffs. For example, on the CELF-5, the cutoff that maximizes both sensitivity and specificity is 1.3 standard deviations below the mean (AKA -1.3 SD, or a standard score of 80). Using a standard score cutoff of -1 SD or -2 SD or even -1.5 SD means we’re not getting the highest diagnostic accuracy for that test.
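To make the cut-point idea concrete, here is a hedged Python sketch. It assumes the usual standard-score scale (mean 100, SD 15) and uses two small lists of invented scores, so the specific percentages mean nothing for any real test. The point is only the pattern: as the cutoff moves, sensitivity and specificity trade off against each other.

```python
MEAN, SD = 100, 15  # assumed standard-score scale


def sd_to_standard_score(sd_units):
    """Convert a cutoff in SD units (e.g. -1.3) to a standard score."""
    return MEAN + sd_units * SD


# Hypothetical scores, invented for illustration (not from any manual).
disorder_scores = [62, 68, 71, 74, 76, 78, 79, 80, 84, 90]
typical_scores = [76, 82, 83, 84, 93, 97, 101, 104, 110, 118]


def accuracy_at_cutoff(cutoff):
    """Flag a child when their score is at or below the cutoff."""
    sensitivity = sum(s <= cutoff for s in disorder_scores) / len(disorder_scores)
    specificity = sum(s > cutoff for s in typical_scores) / len(typical_scores)
    return sensitivity, specificity


for sd_units in (-1.0, -1.3, -1.5, -2.0):
    cutoff = sd_to_standard_score(sd_units)
    sens, spec = accuracy_at_cutoff(cutoff)
    print(f"{sd_units:+.1f} SD (score {cutoff:.1f}): "
          f"sensitivity {sens:.0%}, specificity {spec:.0%}")
```

Note that -1.3 SD works out to 80.5 on this scale, which is about the standard score of 80 mentioned above. In this toy data, a stricter cutoff like -2 SD catches very few of the kids with a disorder, while a looser one flags more typical kids. Which cutoff balances the two best depends on the data behind each test.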
If you’re wondering how agencies or workplaces can recommend a single cutoff to be applied across multiple tests, yeah, that’s really not best practice. The cutoff depends on the test. Granted, workplaces differ on requirements for eligibility (e.g., evidence of need vs. evidence of significant educational impact) and those requirements might differ from the test’s criteria for identifying a disorder. Still, it’s better to determine whether a child meets local criteria by looking at the functional impact of their language skills, not by arbitrarily picking a test score that’s lower or higher. For the record, the same thing applies to using test scores to determine service minutes: test scores don’t necessarily tell us about the severity of a child’s language impairment, so stick with functional impact.

Psychometric deep-dive: Reliability and validity

Reliability is how consistent a test is in terms of things like items across the test, scores between different examiners, or results for the same child on different days. Reliability also includes the standard error of measurement, which is how the confidence interval around a score is calculated. For a test with lots of built-in error, the confidence interval will span a wide range of scores, which should knock down our confidence that we’re reporting the One True Score (when we’re really just reporting a snapshot from a particular day).
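For a rough sense of how reliability feeds into the confidence interval, here is a small Python sketch. It assumes the textbook formula SEM = SD × √(1 − reliability), a standard-score SD of 15, and a normal-approximation 95% interval (z = 1.96). The reliability values and the score of 82 are hypothetical. Real manuals typically report the SEM or the confidence intervals directly, and some calculate them a bit differently, so treat their numbers as the ones to use.

```python
import math


def confidence_interval(score, reliability, sd=15.0, z=1.96):
    """Approximate confidence interval around an observed standard score.

    Uses the textbook formula SEM = SD * sqrt(1 - reliability);
    z = 1.96 gives a 95% interval under a normal approximation.
    """
    sem = sd * math.sqrt(1.0 - reliability)
    return score - z * sem, score + z * sem


# Hypothetical child with an observed score of 82 on two made-up tests
# that differ only in their reliability.
for reliability in (0.95, 0.80):
    low, high = confidence_interval(82, reliability)
    print(f"reliability {reliability:.2f}: 95% CI {low:.1f} to {high:.1f}")
```

With the lower reliability, the interval stretches from the high 60s to the mid 90s, so the same observed score is consistent with anything from clearly below average to solidly typical.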
Validity is how well the test measures what it’s supposed to measure. Validity can be measured by comparing results to other tests, looking at whether the tasks make sense for the skills they’re supposed to assess, or considering other research or expert opinion on the type of tasks used.
Because there are so many types of reliability and validity, these can be a little harder to evaluate, but check out this study for a recent overview of the reliability and validity of language tests (and Table 3 for the criteria they used to evaluate them).

Psychometric deep-dive: Bias

We all know this, but it’s easy to forget. Standardized tests are standardized, so every child gets the same thing, whether it’s appropriate for them or not. And norm-referenced means that we’re comparing our client to a sample of children who may or may not be similar to them. While we should always think critically about whether such comparisons are appropriate, we know that test norms are especially problematic basically any time you vary from an average white middle-class monolingual US-born-and-raised kid profile. For example, children who speak nonmainstream dialects are likely to score lower on language tests, but using scoring modifications isn’t necessarily a good solution. Socioeconomic status also potentially affects test results, for children from both lower- and higher-income families. And test norms are particularly tricky for multilingual children because their experience and skills in each language can vary. For these groups, relying on test scores alone is especially dangerous.

Thinking beyond standardized assessments

For most of us, standardized tests are always going to be a huge part of our language assessments, and that makes total sense—we just need to pick good tests and interpret their results carefully. But standardized tests might not make sense for some kids. And for all kids, standardized tests do a poor job of measuring functional impact and pointing us toward goals. To be fair, though, as imperfect as they are, standardized tests do have evidence behind them! So when we’re using non-standardized assessments to support a diagnosis, we have to be careful to use something that also has good evidence supporting it. Below are some recommendations for evidence-based, non-standardized assessment of child language.

Parent/stakeholder interviews

This is an important step that’s easy to let slide. Talking to parents, teachers, or the client is essential to learning how communication looks outside the therapy room. A few interview resources: (1) The Leaders Project has a good list of questions geared toward multilingual families, (2) this article provides an assessment framework for school-age kids that’s built on observation and student and stakeholder interviews, (3) this study used a (free!) teacher questionnaire for screening kindergartners who speak nonmainstream dialects, and (4) these studies used a parent questionnaire (also free!) to help identify language impairment in multilingual children.

Dynamic assessment

Dynamic assessment can tap a child’s ability to respond to instruction, not just what they know right now. It also might give us an idea of what kind of support will help a child succeed (hello, classroom accommodations). While dynamic assessment is great across the board, it’s especially useful for children who have less English experience or who are less familiar with a typical testing setting because the supportive teaching helps level the playing field. The teaching part of dynamic assessment gives us a chance to support children’s success (or to identify areas where they’ll need more support than their peers).
What’s hard about dynamic assessment is that we often don’t have a lot of guidance for interpreting the results. Past advice has been “make up a dynamic task and see how they do!” That’s fine if you’re a researcher who studies dynamic assessment, but for clinicians who use that specific task with that age child maybe a few times a year, tops, it’s a little harder to interpret the results confidently. Fortunately, a lot of great research on dynamic assessment has come out recently, including tasks for assessing vocabulary, AAC and syntax, and morphology. A few dynamic assessment tasks also have standard procedures and norms to go with them, which gives us more guidance when using their results for diagnosis (see the PEARL for early language and literacy screening and the CUBED and DYMOND for narrative skills).

Language sample analysis (LSA)

LSA is an excellent way to capture children’s language skills. LSA is also incredibly time-consuming, and a lot of the time we can come away from it with nothing useful. When I was a school SLP, I did LSA with a very small chunk of my caseload, and that was only because: (1) I had undergrad research experience where I got comfortable with using LSA software, (2) I had lunch with a mentor and she guilted me into it, and (3) I took my work laptop home and did LSA on nights and weekends because I had poor work-life balance. So my example probably doesn’t help. I don’t know what the solution here is. But barring big, structural changes that free up our time, here are ideas to help make LSA better, because if you take the time to do it, you want to get the best information possible from all that work:

  • Move beyond conversation: Past the preschool years, conversation doesn’t necessarily tax a child’s language system enough. Think of it as a stress test—we want to see where the breakdown is when they’re at the limit of what they can do. Narrative is great for younger kids, and expository and persuasive tasks are good for older kids (see Figure 1 in this article for a handy chart). Another bonus is that we might not even have to transcribe these samples to score their overall quality with a rubric (like in this narrative study). Also, goals for narrative, persuasive, and expository quality are easy to tie to the curriculum in schools.
  • Don’t bother counting morphemes: Mean length of utterance (MLU) is a classic, and it’s fine for comparing young children to their peers. But MLU is a dud when it comes to setting goals and guiding therapy. Once we know that a child is combining words, MLU is really too general to tell us what specifically to work on. I prefer percent grammatical utterances (PGU). PGU involves only a binary yes/no decision whether each utterance is grammatical, and it has decent diagnostic accuracy. While PGU itself won’t tell us what to work on, it will give a list of all of the utterances with errors, which is a good place to start. (Note that MLU and PGU norms aren’t necessarily appropriate for children who speak nonmainstream dialects, though.)
  • Sample, transcribe, and analyze strategically: In an ideal world, we would collect language samples in multiple contexts, transcribe them, get a written sample, and complete detailed coding and analysis of all of it, all while tiny woodland creatures wrote our progress notes. Obviously, that’s not happening. Instead, we can look at anything we already know about a child—from observation, other testing, parent/teacher report, previous progress, etc.—and choose sample contexts and analyses based on what we already suspect is a problem.
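
Since PGU boils down to one yes/no call per utterance, the math is easy to sketch in Python. The utterances and the True/False codes below are invented for illustration, and the function name is just a label for this example. Published PGU procedures have their own rules about which utterances count and what makes one grammatical, so this only shows the final tally, not the coding decisions.

```python
# Each utterance is hand-coded as grammatical (True) or not (False).
# These utterances and codes are invented for illustration.
coded_sample = [
    ("the dog is running", True),
    ("him go to the store", False),
    ("I want the big one", True),
    ("she falled down", False),
    ("we went to the park yesterday", True),
]


def percent_grammatical_utterances(sample):
    """Return PGU (0-100) and the list of utterances coded ungrammatical."""
    if not sample:
        raise ValueError("sample must contain at least one utterance")
    errors = [text for text, grammatical in sample if not grammatical]
    pgu = 100 * (len(sample) - len(errors)) / len(sample)
    return pgu, errors


pgu, errors = percent_grammatical_utterances(coded_sample)
print(f"PGU: {pgu:.0f}%")
for utterance in errors:
    print("  review:", utterance)
```

The list of utterances to review is the useful byproduct here: it is the starting point for figuring out which specific forms to target.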


Final thoughts


Don’t go out and burn your test kits. (If you knew me, you’d know that I’d insist you recycle them, but don’t do that either.) Also, don’t test-shame your peers; without knowing the reason why they’re choosing a certain test for a certain child (or no test), it’s nearly impossible to judge.
But do make sure your clients get a fair, comprehensive language assessment, and recognize that standardized tests might not be giving us that every time. Adding better options like dynamic assessment and language samples at least some of the time is a huge step in the right direction.

Melissa Brydon, PhD, CCC-SLP, contributed to this review and created the downloadable resource on psychometric properties.



Mollee Sultani, ABD, CCC-SLP

Mollee Sultani is a writer for The Informed SLP. She is a doctoral student at the University of Kansas with clinical experience working in elementary schools. Her research goals include helping speech–language pathologists quickly and accurately assess children’s language skills.

