The Informed SLP: Speech Language Pathology Research

Standardized language tests: That score might not mean what you think it means

10/20/2020


 

​by Mollee Sultani, MA, CCC-SLP
TISLP Staff
I know that you know that standardized language tests sometimes get it wrong. We’ve all had a kid we were really concerned about who kept scoring in the normal range, or a kid who really seemed okay but randomly bombed a subtest or scored just below a cutoff.
 
So now what? Is the test wrong? Are you wrong? Where do we go from there?
 
With this blog post, we’ll show you the evidence to help you know:
  • how to choose the best tests for kids on your caseload,
  • how to interpret standardized language test results, and
  • how to make realistic recommendations.
 
By knowing when standardized tests aren’t cutting it (and why), you’ll be able to do things like, say, explain to an administrator or a parent why the percentile rank you’re reporting might not align with your recommendation for service eligibility. Because while we don’t want to be anti-standardized test, we do want to be pro-cautious interpretation of standardized test results.

First, why are published tests sometimes subpar?

It’s hard to believe that an expensive language test published by a reputable company might not do a good job of telling us about a child’s language skills. While there’s not one reason for this, here are a few possibilities:

  • We need more from standardized tests than they can reasonably give us. What we need in the time we have to complete an assessment is a standard score, a case for service eligibility, and a plan for treatment. No test can give us all of that in the time we’re able to spend administering and scoring it.
  • Language is a big, complicated thing. I struggle to explain to parents what language is, and I am a speech–language pathologist. And measuring language is arguably even harder. Our attempts to easily measure it in a small amount of time with quick-to-score tasks are just not going to work every time.
  • We keep paying for flawed tests. It’s not our fault, but publishers often can’t justify spending a ton of money fixing these issues, because that would take a massive amount of extra research and work (we’re talking millions, here), and even then the tests still wouldn’t be perfect. Especially if test quality isn’t driving sales, there’s not much incentive to address it.
​

How to evaluate standardized language tests

​Diagnostic accuracy
 
Diagnostic accuracy, or how well a test identifies which children have a language disorder, is usually measured in sensitivity and specificity.
 
(But first, a personal story. When I was sitting in my master’s program class on assessment, sensitivity and specificity just didn’t click for me. I remember thinking, any published test is going to have good sensitivity and specificity, so I don’t need to understand this. Wrong, *sigh*. I didn’t understand it until I had to teach it to master’s students as a PhD student. That took me hours of preparing, and the feedback I got from the master’s students was that they still didn’t understand sensitivity and specificity, so the circle of confusion is now complete.)
 
Why is diagnostic accuracy so hard to understand? I don’t know. The numbers aren’t complicated—it’s addition and division and percentages, and we all know how to do that. But they’re weird and theoretical and hard to grasp. If you want to see the calculation, I really like this video (it’s a medical example). Also know that many physicians don’t understand test accuracy either, which is comforting until you realize it’s terrifying.
 
Anyway, the take-home is this: The lower a test’s sensitivity, the more likely it is to tell us kids are fine when they really have a language disorder (!!); the lower a test’s specificity, the more likely it is to tell us that kids have a language disorder when they really don’t (also !!). Adequate sensitivity/specificity is usually defined as higher than .80 or 80%, but even at that good-enough 80% threshold, 20% of kids with language disorders and 20% of kids without them are getting put into the wrong group (and not all of our tests even meet that criterion!).
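To make that 80% threshold concrete, here is a minimal Python sketch of how many kids land in the wrong group. The group sizes (10 kids with a language disorder, 90 without) are illustrative numbers picked for easy math, and the function name is made up for this example.

```python
def expected_misclassified(n_disorder, n_typical, sensitivity, specificity):
    """Return (missed kids, falsely flagged kids) for a test.

    missed         = kids WITH a disorder that the test calls typical
    falsely_flagged = kids WITHOUT a disorder that the test calls disordered
    """
    missed = n_disorder * (1 - sensitivity)
    falsely_flagged = n_typical * (1 - specificity)
    return missed, falsely_flagged


# Illustrative group sizes (an assumption, not data from any test manual).
missed, falsely_flagged = expected_misclassified(
    n_disorder=10, n_typical=90, sensitivity=0.80, specificity=0.80
)
print(f"Missed: {missed:.0f} of 10 kids with a language disorder")
print(f"Falsely flagged: {falsely_flagged:.0f} of 90 kids with typical language")
```

With these made-up numbers, a test that just meets the 80% bar misses 2 of the 10 kids who need help and flags 18 of the 90 kids who don't.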
 
While 80% sensitivity and specificity is the goal, we sometimes have valid reasons to use a test with lower diagnostic accuracy. A test might be the best or only option for a particular client, or we might accept lower accuracy for a really efficient screening measure. In these cases we still want to see good sensitivity, even if specificity is a little lower. A test with low sensitivity is more likely to let kids who need our services fall through the cracks, which none of us want.
 
Also, one caution for understanding stats on diagnostic accuracy: percent accuracy is not a good measure. Take this example: we know that around 10% of kindergarten-age children have a language disorder. Say I made a new language test and gave it to 100 kindergarteners, 10 with language disorders and 90 with typical language. My test tells me that all 100 kids have typical language skills. In other words, my test is useless. My test is also 90% accurate (eek!). To evaluate my test, we have to look at the all-important sensitivity—which is 0% (!!) because the test never correctly flagged kids with language disorders—and the specificity—which is 100% because it always ruled out language disorders in typical kids. Percent accuracy can hide a crummy test behind a misleading number.
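If it helps to see the arithmetic, the made-up kindergarten test above can be sketched in a few lines of Python. The function and variable names are just for illustration; the counts come straight from the example (10 kids with language disorders, 90 with typical language, and a test that calls everyone typical).

```python
def diagnostic_stats(true_pos, false_neg, true_neg, false_pos):
    """Compute sensitivity, specificity, and percent accuracy from counts.

    true_pos:  kids with a disorder the test correctly flagged
    false_neg: kids with a disorder the test missed
    true_neg:  typical kids the test correctly ruled out
    false_pos: typical kids the test wrongly flagged
    """
    total = true_pos + false_neg + true_neg + false_pos
    sensitivity = true_pos / (true_pos + false_neg)
    specificity = true_neg / (true_neg + false_pos)
    accuracy = (true_pos + true_neg) / total
    return sensitivity, specificity, accuracy


# The useless test: all 100 kids are labeled "typical".
sensitivity, specificity, accuracy = diagnostic_stats(
    true_pos=0, false_neg=10, true_neg=90, false_pos=0
)
print(f"Sensitivity: {sensitivity:.0%}")  # 0%
print(f"Specificity: {specificity:.0%}")  # 100%
print(f"Accuracy:    {accuracy:.0%}")     # 90%
```

The 90% accuracy comes entirely from the 90 typical kids, which is why it says nothing about whether the test can find the kids who need services.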
 
Another important thing to know is that sensitivity and specificity will vary at different cut-points (again, the video demonstrates this really nicely). Test manuals should give the sensitivity and specificity at different standard score cutoffs. For example, on the CELF-5, the cutoff that maximizes both sensitivity and specificity is 1.3 standard deviations below the mean (AKA -1.3 SD, or a standard score of 80). Using a standard score cutoff of -1 SD or -2 SD or even -1.5 SD means we’re not getting the highest diagnostic accuracy for that test.
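For anyone who wants to translate between SD cutoffs and standard scores, here is a small Python sketch. It assumes the usual standard score scale (mean of 100, SD of 15) and a normal distribution of scores for the percentile estimate; check the manual for the scale your test actually uses. The helper names are made up for this example.

```python
from statistics import NormalDist

# Assumed scale: mean 100, SD 15 (verify against the test manual).
MEAN = 100
SD = 15


def cutoff_to_standard_score(sd_units):
    """Convert a cutoff in SD units (e.g., -1.3) to a standard score."""
    return MEAN + sd_units * SD


def cutoff_to_percentile(sd_units):
    """Approximate percentile rank at a cutoff, assuming normal scores."""
    return NormalDist().cdf(sd_units) * 100


for cutoff in (-1.0, -1.3, -1.5, -2.0):
    score = cutoff_to_standard_score(cutoff)
    percentile = cutoff_to_percentile(cutoff)
    print(f"{cutoff:+.1f} SD -> standard score {score:.1f}, "
          f"about percentile {percentile:.1f}")
```

On this scale, -1.3 SD works out to 80.5, which is why it gets reported as a standard score of about 80.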
 
If you’re wondering how agencies or workplaces can recommend a single cutoff to be applied across multiple tests, yeah, technically that’s not best practice. The cutoff depends on the test. Granted, workplaces differ on requirements for eligibility (e.g., evidence of need vs. evidence of significant educational impact) and those requirements might differ from the test’s criteria for identifying a disorder. Still, it’s better to determine whether a child meets local criteria by looking at the functional impact of their language skills, not by arbitrarily picking a test score that’s lower or higher. For the record, the same thing applies to using test scores to determine service minutes: test scores don’t necessarily tell us about the severity of a child’s language impairment, so stick with functional impact.
 
Reliability and validity

 
Reliability is how consistent a test is in terms of things like items across the test, scores between different examiners, or results for the same child on different days. Reliability also includes the standard error of measurement, which is how the confidence interval around a score is calculated. For a test with lots of built-in error, the confidence interval will span a wide range of scores, which should knock down our confidence that we’re reporting the One True Score (when we’re really just reporting a snapshot from a particular day).
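Here is a rough Python sketch of how a confidence interval falls out of the standard error of measurement (SEM). It uses the classical formula SEM = SD × √(1 − reliability); test manuals usually report SEM and confidence intervals directly and may calculate them differently, so treat this as an illustration only. The score of 82 and the reliability of .90 are hypothetical numbers, not values from any real test.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(score, sem, z=1.96):
    """Interval around an observed score; z=1.96 gives roughly 95%."""
    margin = z * sem
    return score - margin, score + margin


# Hypothetical values for illustration only.
sem = standard_error_of_measurement(sd=15, reliability=0.90)
low, high = confidence_interval(score=82, sem=sem)
print(f"SEM: {sem:.1f}")
print(f"95% confidence interval: {low:.1f} to {high:.1f}")
```

With these made-up values, a score of 82 comes with an interval of roughly 73 to 91, so the same child could plausibly land on either side of a cutoff of 80.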
 
Validity is how well the test measures what it’s supposed to measure. Validity can be measured by comparing results to other tests, looking at whether the tasks make sense for the skills they’re supposed to assess, or considering other research or expert opinion on the type of tasks used.
 
Because there are so many types of reliability and validity, these can be a little harder to evaluate, but check out this study for a recent overview of reliability and validity of language tests (and Table 3 for the criteria they used to evaluate them).
 
Bias
 
We all know this, but it’s easy to forget. Standardized tests are standardized, so every child gets the same thing, whether or not it’s appropriate for them. And norm-referenced means that we’re comparing our client to a sample of children who may or may not be similar to them. While we should always think critically about whether such comparisons are appropriate, we know that test norms are especially problematic basically any time you vary from an average white middle-class monolingual US-born and raised kid profile (which is not just frustrating, but despicable; more blog posts on that coming soon). For example, children who speak nonmainstream dialects are likely to score lower on language tests, but using scoring modifications isn’t necessarily a good solution. Socioeconomic status also potentially affects test results, for children from both lower- and higher-income families. And test norms are particularly tricky for multilingual children because their experience and skills in each language can vary. For these groups, relying on test scores alone is especially dangerous.

So which tests should I buy?

First, you can’t answer this question without asking the follow-up question: for what? What, exactly, do you want to measure?
 
Then, while no test is perfect, some do have better properties than others, so it’s worth doing the legwork up front to get a good one. You can find these stats before you buy by checking published reviews (see this systematic review, this chart, Table 1 in this article, and reviews from the Leaders Project). Worst-case scenario, you can always return the test (Elena Plante does!), but doing some research can save you that hassle. If you want to use the test to support a diagnosis (which is really the best use of a standardized test), pay particular attention to its diagnostic accuracy.
 
Recommendations for non-standardized language assessment
 
For most of us, standardized tests are still going to be a huge part of our language assessments, and that makes total sense—we just need to pick good tests and interpret their results carefully. But standardized tests might not make sense for some kids. And for all kids, standardized tests do a poor job of measuring functional impact and pointing us toward goals. To be fair, though, as imperfect as they are, standardized tests do have evidence behind them! So when we’re using non-standardized assessments to support a diagnosis, we have to be careful to use something that also has good evidence supporting it. Below are some recommendations for evidence-based, non-standardized assessment of child language.
 
Parent/stakeholder interviews

This is an important step that’s easy to let slide. Talking to parents, teachers, or the client is essential to learning how communication looks outside the therapy room. A few interview resources: (1) The Leaders Project has a good list of questions geared toward multilingual families, (2) this article provides an assessment framework for school-age kids that’s built on observation and student and stakeholder interviews, (3) this study used a (free!) teacher questionnaire for screening kindergartners who speak nonmainstream dialects, and (4) these studies used a parent questionnaire (also free!) to help identify language impairment in multilingual children.
 
Dynamic assessment

Dynamic assessment can tap a child’s ability to respond to instruction, not just what they know right now. It also might give us an idea of what kind of support will help a child succeed (hello, classroom accommodations). While dynamic assessment is great across the board, it’s especially useful for children who have less English experience or who are less familiar with a typical testing setting because the supportive teaching helps level the playing field. The teaching part of dynamic assessment gives us a chance to support children’s success (or to identify areas where they’ll need more support than their peers).
 
What’s hard about dynamic assessment is that we often don’t have a lot of guidance for interpreting the results. Past advice has been “make up a dynamic task and see how they do!” That’s fine if you’re a researcher who studies dynamic assessment, but for clinicians who use that specific task with that age child maybe a few times a year, tops, it’s a little harder to interpret the results confidently. Fortunately, a lot of great research on dynamic assessment has come out recently, including tasks for assessing vocabulary, AAC and syntax, and morphology. A few dynamic assessment tasks also have standard procedures and norms to go with them, which gives us more guidance when using their results for diagnosis (see the PEARL for early language and literacy screening and the CUBED and DYMOND for narrative skills). 
 
Language sample analysis (LSA)
 
LSA is an excellent way to capture children’s language skills. LSA is also incredibly time-consuming, and a lot of the time we can come away from it with nothing useful. When I was a school SLP, I did LSA with a very small chunk of my caseload, and that was only because: (1) I had undergrad research experience where I got comfortable with using LSA software, (2) I had lunch with a mentor and she guilted me into it, and (3) I took my work laptop home and did LSA on nights and weekends because I had poor work-life balance. So my example probably doesn’t help. I don’t know what the solution here is. But barring big, structural changes that free up our time, here are some ideas to help make LSA better, because if you take the time to do it, you want to get the best information possible from all that work:

  • Move beyond conversation: Past the preschool years, conversation doesn’t necessarily tax a child’s language system enough. Think of it like a stress test—we want to see where the breakdown is when they’re at the limit of what they can do. Narrative is great for younger kids, and expository and persuasive tasks are good for older kids (see Figure 1 in this article for a handy chart). Another bonus is that we might not even have to transcribe these samples to score their overall quality with a rubric (like in this narrative study). Also, goals for narrative, persuasive, and expository quality are easy to tie to the curriculum in schools.
  • Don’t bother counting morphemes: Mean length of utterance (MLU) is a classic, and it’s fine for comparing young children to their peers. But MLU is a dud when it comes to setting goals and guiding therapy. Once we know that a child is combining words, MLU is really too general to tell us what specifically to work on. I prefer percent grammatical utterances (PGU). PGU involves only a binary yes/no decision whether each utterance is grammatical, and it has decent diagnostic accuracy. While PGU itself won’t tell us what to work on, it will give a list of all of the utterances with errors, which is a good place to start. (Note that MLU and PGU norms aren’t necessarily appropriate for children who speak nonmainstream dialects, though.)
  • Sample, transcribe, and analyze strategically: In an ideal world, we would collect language samples in multiple contexts, transcribe them, get a written sample, and complete detailed coding and analysis of all of it, all while tiny woodland creatures wrote our progress notes. Obviously, that’s not happening. Instead, we can look at anything we already know about a child—from observation, other testing, parent/teacher report, previous progress, etc.—and choose sample contexts and analyses based on what we already suspect is a problem.
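
The percent grammatical utterances (PGU) idea from the list above can be sketched in a few lines of Python. This is only an illustration: the utterances are invented, the yes/no grammaticality judgments are the clinician's call (and need to account for the child's dialect), and the function name is made up for this example.

```python
def percent_grammatical_utterances(coded_utterances):
    """Return (PGU, list of utterances judged ungrammatical).

    coded_utterances: list of (utterance_text, is_grammatical) pairs,
    where is_grammatical is the clinician's yes/no judgment.
    """
    total = len(coded_utterances)
    if total == 0:
        raise ValueError("Need at least one utterance to compute PGU.")
    with_errors = [text for text, ok in coded_utterances if not ok]
    pgu = 100 * (total - len(with_errors)) / total
    return pgu, with_errors


# Invented sample for illustration only.
sample = [
    ("the dog is running", True),
    ("him goed to the store", False),
    ("I want a cookie", True),
    ("the boy are jumping", False),
    ("we played outside", True),
]

pgu, with_errors = percent_grammatical_utterances(sample)
print(f"PGU: {pgu:.0f}%")  # 60%
print("Utterances to look at for goal ideas:")
for text in with_errors:
    print(f"  - {text}")
```

The PGU number is what gets compared to norms, and the list of flagged utterances is the starting point for figuring out what to work on.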
​

Final thoughts

Don’t go out and burn your test kits. (If you knew me you’d know that I’d insist you recycle them, but don’t do that either.) Also, don’t test-shame your peers, because without knowing why they’re choosing a certain test for a certain child (or no test), it’s nearly impossible to judge.
 
But do make sure your clients get a fair, comprehensive language assessment, and recognize that standardized tests might not be giving us that every time. Adding better options like dynamic assessment and language samples at least some of the time is a huge step in the right direction.
 
The hope is that you can take at least one thing from this blog and use it to do a language assessment that you feel great about.
​

2 Comments
Kelly Dwyer
11/6/2020 12:56:27 pm

Thanks so much for this article and the great references!!

Audrey McKinnon
12/7/2020 10:33:39 pm

So much great information, and now at my fingertips. Thank you!

© COPYRIGHT The Informed SLP ® 2015. ALL RIGHTS RESERVED.