by Mollee Sultani, MA, CCC-SLP
I know that you know that standardized language tests sometimes get it wrong. We’ve all had a kid we were really concerned about who kept scoring in the normal range, or a kid who really seemed okay but randomly bombed a subtest or scored just below a cutoff.
So now what? Is the test wrong? Are you wrong? Where do we go from there?
In this blog post, we’ll show you the evidence to help you know:
By knowing when standardized tests aren’t cutting it (and why), you’ll be able to do things like, say, explain to an administrator or a parent why the percentile rank you’re reporting might not align with your recommendation for service eligibility. Because while we don’t want to be anti-standardized test, we do want to be pro-cautious interpretation of standardized test results.
First, why are published tests sometimes subpar?
It’s hard to believe that an expensive language test published by a reputable company might not do a good job of telling us about a child’s language skills. While there’s not one reason for this, here are a few possibilities:
How to evaluate standardized language tests
Diagnostic accuracy, or how well a test identifies which children have a language disorder, is usually measured in sensitivity and specificity.
(But first, a personal story. When I was sitting in my master’s program class on assessment, sensitivity and specificity just didn’t click for me. I remember thinking, any published test is going to have good sensitivity and specificity, so I don’t need to understand this. Wrong, *sigh*. I didn’t understand it until I had to teach it to master’s students as a PhD student. That took me hours of preparing, and the feedback I got from the master’s students was that they still didn’t understand sensitivity and specificity, so the circle of confusion is now complete.)
Why is diagnostic accuracy so hard to understand? I don’t know. The numbers aren’t complicated—it’s addition and division and percentages, and we all know how to do that. But they’re weird and theoretical and hard to grasp. If you want to see the calculation, I really like this video (it’s a medical example). Also know that many physicians don’t understand test accuracy either, which is comforting until you realize it’s terrifying.
Anyway, the take-home is this: The lower a test’s sensitivity, the more likely it is to tell us kids are fine when they really have a language disorder (!!); the lower a test’s specificity, the more likely it is to tell us that kids have a language disorder when they really don’t (also !!). Adequate sensitivity/specificity is usually defined as higher than .80 or 80%, but even at that good-enough 80% threshold, 20% of kids with language disorders and 20% of kids without them are getting put into the wrong group (and not all of our tests even meet that criterion!).
While 80% sensitivity and specificity is the goal, we sometimes have valid reasons to use a test with lower diagnostic accuracy. A test might be the best or only option for a particular client, or we might accept lower accuracy for a really efficient screening measure. In these cases we still want to see good sensitivity, even if specificity is a little lower. A test with low sensitivity is more likely to let kids who need our services fall through the cracks, which none of us want.
Also, one caution for understanding stats on diagnostic accuracy: percent accuracy is not a good measure. Take this example: we know that around 10% of kindergarten-age children have a language disorder. Say I made a new language test and gave it to 100 kindergarteners, 10 with language disorders and 90 with typical language. My test tells me that all 100 kids have typical language skills. In other words, my test is useless. My test is also 90% accurate (eek!). To evaluate my test, we have to look at the all-important sensitivity—which is 0% (!!) because the test never correctly flagged kids with language disorders—and the specificity—which is 100% because it always ruled out language disorders in typical kids. Percent accuracy can hide a crummy test behind a misleading number.
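If it helps to see the arithmetic, here is a minimal Python sketch of the made-up test from the example above. The function name `diagnostic_stats` and its argument names are just illustrative choices, not anything from a test manual; the counts are the ones from the example (10 kids with a language disorder, 90 typical, and a test that calls everyone typical).

```python
def diagnostic_stats(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from classification counts.

    tp: kids with a disorder the test correctly flagged
    fn: kids with a disorder the test missed
    tn: typical kids the test correctly cleared
    fp: typical kids the test wrongly flagged
    """
    sensitivity = tp / (tp + fn)   # share of kids WITH a disorder who were flagged
    specificity = tn / (tn + fp)   # share of typical kids who were cleared
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # share of all kids classified correctly
    return sensitivity, specificity, accuracy


# The useless test from the example: all 100 kindergarteners are called "typical".
sens, spec, acc = diagnostic_stats(tp=0, fn=10, tn=90, fp=0)
print(f"sensitivity={sens:.0%}  specificity={spec:.0%}  accuracy={acc:.0%}")
```

The accuracy figure looks respectable only because typical kids vastly outnumber kids with language disorders, so being right about the big group swamps being wrong about the small one.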
Another important thing to know is that sensitivity and specificity will vary at different cut-points (again, the video demonstrates this really nicely). Test manuals should give the sensitivity and specificity at different standard score cutoffs. For example, on the CELF-5, the cutoff that maximizes both sensitivity and specificity is 1.3 standard deviations below the mean (AKA -1.3 SD, or a standard score of 80). Using a standard score cutoff of -1 SD or -2 SD or even -1.5 SD means we’re not getting the highest diagnostic accuracy for that test.
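To make the cut-point idea concrete, here is a rough Python sketch. It assumes the common standard-score scale (mean 100, SD 15), and the two lists of scores are invented purely for illustration; they are not data from the CELF-5 or any real test. The helper names are hypothetical too. A real manual's table is built from its own validation sample, so treat this only as a picture of the trade-off.

```python
MEAN, SD = 100, 15  # assumed standard-score scale; check your test's manual


def cutoff_to_standard_score(z):
    """Convert a cutoff in SD units (e.g. -1.5) to a standard score."""
    return MEAN + z * SD


def sens_spec_at_cutoff(disordered_scores, typical_scores, cutoff):
    """A child is flagged when their score is at or below the cutoff."""
    flagged = sum(s <= cutoff for s in disordered_scores)
    cleared = sum(s > cutoff for s in typical_scores)
    return flagged / len(disordered_scores), cleared / len(typical_scores)


# INVENTED scores, for illustration only.
disordered = [62, 68, 71, 74, 76, 78, 79, 82, 84, 88]
typical = [76, 81, 83, 86, 89, 92, 95, 98, 101, 104, 107, 110]

for z in (-2.0, -1.5, -1.3, -1.0):
    cutoff = cutoff_to_standard_score(z)
    sens, spec = sens_spec_at_cutoff(disordered, typical, cutoff)
    print(f"{z:+.1f} SD (score {cutoff:5.1f}): "
          f"sensitivity={sens:.0%}, specificity={spec:.0%}")
```

In this toy data, a stricter (lower) cutoff misses more kids with language disorders, while a more lenient (higher) cutoff flags more typical kids. The best balance point depends on the test, which is why one blanket cutoff rarely fits every test.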
If you’re wondering how agencies or workplaces can recommend a single cutoff to be applied across multiple tests, yeah, technically that’s not best practice. The cutoff depends on the test. Granted, workplaces differ on requirements for eligibility (e.g., evidence of need vs. evidence of significant educational impact) and those requirements might differ from the test’s criteria for identifying a disorder. Still, it’s better to determine whether a child meets local criteria by looking at the functional impact of their language skills, not by arbitrarily picking a test score that’s lower or higher. For the record, the same thing applies to using test scores to determine service minutes: test scores don’t necessarily tell us about the severity of a child’s language impairment, so stick with functional impact.
Reliability and validity
Reliability is how consistent a test is in terms of things like items across the test, scores between different examiners, or results for the same child on different days. Reliability also includes the standard error of measurement, which is how the confidence interval around a score is calculated. For a test with lots of built-in error, the confidence interval will span a wide range of scores, which should knock down our confidence that we’re reporting the One True Score (when we’re really just reporting a snapshot from a particular day).
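For the curious, here is a small Python sketch of how a confidence interval can be built from the standard error of measurement. It uses the textbook formula SEM = SD × √(1 − reliability) and a normal-curve multiplier (1.96 for roughly 95%). The reliability of .90 and the score of 82 are hypothetical values chosen for illustration, and a given manual may report SEM directly or compute its intervals a bit differently.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Textbook SEM: the test's SD scaled by how unreliable the test is."""
    return sd * math.sqrt(1 - reliability)


def confidence_interval(score, sem, z=1.96):
    """Symmetric interval around an observed score (z=1.96 is roughly 95%)."""
    margin = z * sem
    return score - margin, score + margin


# Hypothetical values: SD of 15, reliability of .90, observed standard score of 82.
sem = standard_error_of_measurement(sd=15, reliability=0.90)
low, high = confidence_interval(score=82, sem=sem)
print(f"SEM = {sem:.1f}; 95% CI = {low:.1f} to {high:.1f}")
```

Rerunning this with a lower reliability widens the interval, which is the numeric version of "knock down our confidence."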
Validity is how well the test measures what it’s supposed to measure. Validity can be measured by comparing results to other tests, looking at whether the tasks make sense for the skills they’re supposed to assess, or considering other research or expert opinion on the type of tasks used.
Because there are so many types of reliability and validity, these can be a little harder to evaluate, but check out this study for a recent overview of reliability and validity of language tests (and Table 3 for the criteria they used to evaluate them).
We all know this, but it’s easy to forget. Standardized tests are standardized, so every child gets the same thing, whether or not it’s appropriate for them. And norm-referenced means that we’re comparing our client to a sample of children who may or may not be similar to them. While we should always think critically about whether such comparisons are appropriate, we know that test norms are especially problematic basically any time you vary from an average white middle-class monolingual US-born and raised kid profile (which is not just frustrating, but despicable; more blog posts on that coming soon). For example, children who speak nonmainstream dialects are likely to score lower on language tests, but using scoring modifications isn’t necessarily a good solution. Socioeconomic status also potentially affects test results for children from both lower- and higher-income families. And test norms are particularly tricky for multilingual children because their experience and skills in each language can vary. For these groups, relying on test scores alone is especially dangerous.
So which tests should I buy?
First, you can’t answer this question without asking the follow-up question: for what? What, exactly, do you want to measure?
Then, while no test is perfect, some do have better properties than others, so it’s worth doing the legwork up front to get a good one. There are a few ways you can find out these stats before you buy, using published reviews (see this systematic review, this chart, Table 1 in this article, and reviews from the Leaders Project). Worst case scenario, you can always return the test (Elena Plante does!) but doing some research can save you that hassle. If you want to use the test to support a diagnosis (which is really the best use of a standardized test), pay particular attention to its diagnostic accuracy.
Recommendations for non-standardized language assessment
For most of us, standardized tests are still going to be a huge part of our language assessments, and that makes total sense—we just need to pick good tests and interpret their results carefully. But standardized tests might not make sense for some kids. And for all kids, standardized tests do a poor job of measuring functional impact and pointing us toward goals. To be fair, though, as imperfect as they are, standardized tests do have evidence behind them! So when we’re using non-standardized assessments to support a diagnosis, we have to be careful to use something that also has good evidence supporting it. Below are some recommendations for evidence-based, non-standardized assessment of child language.
This is an important step that’s easy to let slide. Talking to parents, teachers, or the client is essential to learning how communication looks outside the therapy room. A few interview resources: (1) The Leaders Project has a good list of questions geared toward multilingual families, (2) this article provides an assessment framework for school-age kids that’s built on observation and student and stakeholder interviews, (3) this study used a (free!) teacher questionnaire for screening kindergartners who speak nonmainstream dialects, and (4) these studies used a parent questionnaire (also free!) to help identify language impairment in multilingual children.
Dynamic assessment can tap a child’s ability to respond to instruction, not just what they know right now. It also might give us an idea of what kind of support will help a child succeed (hello, classroom accommodations). While dynamic assessment is great across the board, it’s especially useful for children who have less English experience or who are less familiar with a typical testing setting because the supportive teaching helps level the playing field. The teaching part of dynamic assessment gives us a chance to support children’s success (or to identify areas where they’ll need more support than their peers).
What’s hard about dynamic assessment is that we often don’t have a lot of guidance for interpreting the results. Past advice has been “make up a dynamic task and see how they do!” That’s fine if you’re a researcher who studies dynamic assessment, but for clinicians who use that specific task with that age child maybe a few times a year, tops, it’s a little harder to interpret the results confidently. Fortunately, a lot of great research on dynamic assessment has come out recently, including tasks for assessing vocabulary, AAC and syntax, and morphology. A few dynamic assessment tasks also have standard procedures and norms to go with them, which gives us more guidance when using their results for diagnosis (see the PEARL for early language and literacy screening and the CUBED and DYMOND for narrative skills).
Language sample analysis (LSA)
LSA is an excellent way to capture children’s language skills. LSA is also incredibly time-consuming, and a lot of the time we can come away from it with nothing useful. When I was a school SLP, I did LSA with a very small chunk of my caseload, and that was only because: (1) I had undergrad research experience where I got comfortable with using LSA software, (2) I had lunch with a mentor and she guilted me into it, and (3) I took my work laptop home and did LSA on nights and weekends because I had poor work-life balance. So my example probably doesn’t help. I don’t know what the solution here is. But barring big, structural changes that free up our time, here are ideas to help make LSA better, because if you take the time to do it, you want to get the best information possible from all that work:
Don’t go out and burn your test kits. (If you knew me you’d know that I’d insist you recycle them, but don’t do that either.) Also, don’t test-shame your peers: without knowing why they’re choosing a certain test for a certain child (or no test), it’s nearly impossible to judge.
But do make sure your clients get a fair, comprehensive language assessment, and recognize that standardized tests might not be giving us that every time. Adding better options like dynamic assessment and language samples at least some of the time is a huge step in the right direction.
The hope is that you can take at least one thing from this blog and use it to do a language assessment that you feel great about.