Evaluation of the revised Nipissing District Developmental Screening (NDDS) tool for use in general population samples of infants and children

Background There is widespread interest in identification of developmental delay in the first six years of life. This requires, however, a reliable and valid measure for screening. In Ontario, the 18-month enhanced well-baby visit includes province-wide administration of a parent-reported survey, the Nipissing District Developmental Screening (NDDS) tool, to facilitate early identification of delay. Yet, at present the psychometric properties of the NDDS are largely unknown. Method 812 children and their families were recruited from the community. Parents (most often mothers) completed the NDDS. A sub-sample (n = 111) of parents completed the NDDS again within a two-week period to assess test-retest reliability. For children 3 or younger, the criterion measure was the Bayley Scales of Infant Development, 3rd edition; for older children, a battery of other measures was used. All criterion measures were administered by trained assessors. Mild and severe delays were identified based on both published cut-points and on the distribution of raw scores. Sensitivity, specificity, positive and negative predictive values were calculated to assess agreement between tests. Results Test-retest reliability was modest (Spearman’s rho = .62, p < .001). Regardless of the age of the child, the definition of delay (mild versus severe), or the cut-point used on the NDDS, sensitivities (from 29 to 68 %) and specificities (from 58 to 88 %) were poor to moderate. Conclusion The modest test-retest results, coupled with the generally poor observed agreement with criterion measures, suggest the NDDS should not be used on its own for identification of developmental delay in community or population-based settings.


Background
The first six years of life are the crucial period of human development, and there is broad consensus that investment in optimizing health and development in this period will result in significant individual, social and economic benefits [1]. Results from developmental neuroscience suggest that both prevention and treatment efforts need to occur as early in this period as possible, as treatment later in life may be less effective in preventing poor outcomes [2,3].
Developmental delay is one target for early identification and intervention. While the prevalence of global delay in children under 6 is between 1 and 3 % [4], 12 to 16 % of children show meaningful delay in one or more cognitive, motor, language, and socioemotional areas [5][6][7]. Such delays are associated with increased risk of future physical and mental health problems and with poor functional and educational outcomes later in life [8,9].
Early intervention requires early identification. The detection rate of developmental delay in clinical settings, however, is well below the estimated prevalence [10]. Systematic screening provides a possible solution, but requires measures that are cost-effective, easily administered, reliable, and valid. These requirements are exacting, given the complexities of measuring development in early childhood [11]. While early screening and surveillance is recommended by many professional organizations [5,10], and has been implemented in many countries, there is no consensus on the instruments to be used.
The Nipissing District Developmental Screening tool (NDDS) is increasingly used for this purpose in Canada [12,13] and the United States (e.g., Early Head Start Program: http://www.nemcsa.org/headstart/ECDHS_A.aspx). The NDDS was first developed in 1993, and its content and design were revised in 2011. It comprises 13 age group-specific parent-completed checklists of developmental milestones for children between 1 month and 6 years of age. In Ontario, the NDDS is one of the recommended measures to be used during the recently implemented enhanced 18-month well-baby visit [14,15], a population-wide, comprehensive developmental assessment and parenting education session connected to the 18-month immunization visit. The Ontario government has also paid to provide free access to the NDDS to all parents.
Despite its increasing use, the psychometric properties of the NDDS are largely unknown; we could locate only three reports, two of them unpublished, and all limited by small samples [16][17][18]. Only Currie et al. [16] evaluated the current version of the NDDS, and this was a pilot study of 31 children, only 4 of whom met criteria for mild developmental delay. The psychometric properties of the NDDS have not therefore been assessed with an adequate sample.

Sample
We recruited participants through community organizations that provide services to families in Hamilton, Ontario and surrounding areas and that serve sociodemographically diverse populations. Organizations included Ontario Early Years Centres and Parent and Family Literacy Centres. Staff of some organizations shared information about the study with their clients, and some referred families directly. We also used recruitment posters and notices on web sites, and operated a booth at the Hamilton Baby and Toddler Expo, which is well-attended by families from Hamilton and surrounding areas. Families were recruited between May 2010 and October 2011. Parents were eligible if they could speak and read English, and were the child's primary caregiver and legal guardian. We aimed to recruit 50 children for each of the NDDS's 10 age bands up to 36 months (group A; n = 500) and 100 in each of the remaining 3 age bands (4 to 6 years of age; group B; n = 300), for a total of 800 children across all 13 age bands. Child age was adjusted for prematurity if the child was under 2 years and born 4 weeks or more prematurely.

Study design
We randomly selected 111 (14 %) participants to complete the NDDS a second time after an interval of 2 weeks, and 55 (7 %) to complete a qualitative interview. Criterion measures were administered by research assistants, all of whom had an undergraduate or Master's degree (e.g., psychology, health sciences). RAs received a minimum of 8 h of pre-test administration training and at least 10 h of supervised test administration experience prior to being able to conduct independent assessments. Assessment reports were monitored continuously for quality assurance throughout the study. We received ethical approval from the McMaster University Research Ethics Board, and all parents provided informed, written consent.

Parent-completed measures

Nipissing District Developmental Screen-2011
The NDDS-2011 asks parents to indicate whether they have observed their child performing various motor, cognitive or language tasks. There are separate checklists for each of 13 age groups. The checklist for infants under 1 month old includes 4 items, while others include between 12 and 22 items. Milestones not yet observed by the caregiver are counted to produce a score. Current recommendations are for a health professional to follow up with any scores of 1 or higher. Before the 2011 revision, a cut-point of 2 or higher was used [12,17]. As the proportion of children identified at the 1+ threshold may be too large for some situations, we also explored the performance of the NDDS at the 2+ cut-point.
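The scoring rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool's official scoring procedure; the function names and the example checklist are hypothetical.

```python
from typing import Sequence


def ndds_score(observed: Sequence[bool]) -> int:
    """Count the milestones the caregiver has NOT yet observed."""
    return sum(1 for seen in observed if not seen)


def ndds_flag(observed: Sequence[bool], cut_point: int = 1) -> bool:
    """Return True when the score meets or exceeds the chosen cut-point.

    cut_point=1 is the current recommendation described in the text;
    cut_point=2 is the pre-2011 threshold that is also explored.
    """
    return ndds_score(observed) >= cut_point


# Hypothetical 14-item checklist with a single milestone not yet observed.
responses = [True] * 13 + [False]
score = ndds_score(responses)
flag_1plus = ndds_flag(responses, cut_point=1)  # flagged at 1+
flag_2plus = ndds_flag(responses, cut_point=2)  # not flagged at 2+
```

Because the score is a simple count, moving from the 1+ to the 2+ cut-point can only reduce the number of children flagged.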

Criterion measures
As there is no single gold standard for assessing development in children, we designed a protocol using widely-used instruments with demonstrated reliability and validity. Given the broad age range covered by the NDDS, it was not possible to use the same criterion measure for all children. For children 3 years and under (Group A), we used the Bayley Scales of Infant Development, 3rd Edition (BSID-III) [19]. The BSID-III produces a set of raw and normal scores for each of five domains: cognition, receptive communication, expressive communication, fine motor, and gross motor. We identified as "mildly delayed" those children who scored below the "borderline" cut-point in one or more domains, and as "severely delayed" those with at least one score below the "extremely low" cut-point according to the manual [19].
For children aged 4 to 6 (Group B), we selected three separate measures assessing development in motor coordination, cognition, and language: the Movement Assessment Battery for Children, 2nd Edition (M-ABC) [20]; the Kaufman Brief Intelligence Test, 2nd Edition (KBIT-2) [23]; and the Pre-school Language Scale, 4th edition (PLS-4) [21,22], respectively. The M-ABC [20], PLS-4 [21], and KBIT-2 [23] have all shown good agreement with clinical evaluation and with other instruments. Children were identified as having "mild" or "severe" delay by using the 15th and 5th percentile cut-points on each instrument. The M-ABC does not provide a 15th percentile cut-point; instead, the 16th percentile is recommended [20]. The KBIT-2 produces a standard score with a mean of 100 and an SD of 15. We therefore used cut-points of 84.5 and 75, which correspond to the 15th and 5th percentiles.
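The percentile cut-points for a standard score with a mean of 100 and an SD of 15 can be checked with the inverse normal distribution. The sketch below assumes scores are normally distributed on that scale; the function and variable names are illustrative only.

```python
from statistics import NormalDist

# Standard-score scale stated in the text: mean 100, SD 15.
kbit_scale = NormalDist(mu=100, sigma=15)


def percentile_cut_point(percentile: float, scale: NormalDist = kbit_scale) -> float:
    """Standard score below which `percentile` percent of a normal population falls."""
    return scale.inv_cdf(percentile / 100)


mild_cut = percentile_cut_point(15)   # about 84.5
severe_cut = percentile_cut_point(5)  # about 75.3
```

Under this assumption the 15th percentile falls at roughly 84.5 and the 5th at roughly 75.3, in line with the cut-points of 84.5 and 75 used here.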
On the BSID-III, the published "borderline" cut-points produced a prevalence of 27 % in children under 1 and of only 5 % in those aged 2 or 3. It is unlikely that this reflects genuine variation within our sample, as we drew on the same sources to recruit all participants. Concerns over published BSID-III norms have also been raised previously [24]. We therefore produced a second set of classifications (i.e., cut-points to classify mild and severe delay) based on the distributions of raw scores. We repeated this process for the PLS-4, as the norms for this instrument identified only a single "case". The KBIT-2 and M-ABC produced prevalences that were plausible, based on the literature, and that did not vary markedly with child age.
To produce distribution-based indicators of caseness, we used quantile regression, with the scale score as the outcome and fractional polynomial transformations of age as the independent variables. These models yield equations that can be solved at any child age to calculate a cut-point at the designated quantile. For the BSID-III, we fit two models for the raw score of each subscale: one corresponding to the "borderline" (−1.33 SDs; 9.2nd percentile) and one to the "extremely low" (−2 SDs; 2.275th percentile) cut-point. For the PLS-4, to be consistent with other measures used for older children, we estimated cut-points at the 5th and 15th percentiles. For this analysis, we used the xmfp Stata program by Royston [25].
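The general idea of an age-dependent quantile cut-point can be illustrated with a small pure-Python sketch. It is not the Stata routine used in the study: it fits only a single fractional polynomial term (one power of age), selects the power by the lowest pinball loss, and solves the quantile regression by brute force over pairs of data points. All names and the simulated data are hypothetical.

```python
import math
import random
from itertools import combinations

# Conventional fractional-polynomial power set; power 0 denotes log(age).
FP_POWERS = (-2, -1, -0.5, 0, 0.5, 1, 2, 3)


def fp_transform(age, power):
    """Single fractional-polynomial term for a positive age."""
    return math.log(age) if power == 0 else age ** power


def pinball_loss(residuals, q):
    """Check (pinball) loss that quantile regression minimises."""
    return sum(q * r if r >= 0 else (q - 1) * r for r in residuals)


def fit_quantile_line(x, y, q):
    """Quantile regression for y = a + b*x by exhaustive search.

    An optimal line passes through two data points, so every pair is tried.
    Adequate for a small illustration; real analyses would use an LP solver.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical pair cannot define the line
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = pinball_loss([yk - (a + b * xk) for xk, yk in zip(x, y)], q)
        if best is None or loss < best[0]:
            best = (loss, a, b)
    return best


def fit_fp1_quantile(ages, scores, q):
    """Return (power, intercept, slope) for the best single-term model."""
    fits = []
    for power in FP_POWERS:
        x = [fp_transform(age, power) for age in ages]
        loss, a, b = fit_quantile_line(x, scores, q)
        fits.append((loss, power, a, b))
    _, power, a, b = min(fits)
    return power, a, b


def cut_point(age, model):
    """Solve the fitted equation at a given age."""
    power, a, b = model
    return a + b * fp_transform(age, power)


# Simulated raw scores for 60 children aged 1 to 36 months (not study data).
rng = random.Random(0)
ages = [rng.uniform(1, 36) for _ in range(60)]
scores = [8 * math.sqrt(age) + rng.gauss(0, 3) for age in ages]

borderline_model = fit_fp1_quantile(ages, scores, q=0.092)
cut_at_18_months = cut_point(18, borderline_model)
```

Children whose raw score falls below `cut_point(age, model)` at their own age would be classified as cases, so the proportion flagged stays close to the chosen quantile at every age.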

Statistical analysis
We measured test-retest reliability by calculating Spearman correlations for total scores and kappa statistics for agreement using scores of 1 and 2 as cut-points.
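Both reliability statistics are straightforward to compute. The following sketch implements Spearman's rho (as the Pearson correlation of tie-averaged ranks) and Cohen's kappa for a binary flag; the paired scores are invented for illustration and are not study data.

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(a, b):
    """Cohen's kappa for two binary classifications of the same children.

    Undefined (division by zero) if chance agreement equals 1.
    """
    n = len(a)
    observed = sum(1 for u, v in zip(a, b) if u == v) / n
    pa, pb = sum(a) / n, sum(b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    return (observed - expected) / (1 - expected)


# Hypothetical NDDS total scores at test and retest.
test_scores = [0, 1, 2, 0, 3, 1, 0, 2, 0, 1]
retest_scores = [0, 0, 2, 1, 3, 1, 0, 4, 0, 2]

rho = spearman_rho(test_scores, retest_scores)
kappa_1plus = cohen_kappa([s >= 1 for s in test_scores], [s >= 1 for s in retest_scores])
kappa_2plus = cohen_kappa([s >= 2 for s in test_scores], [s >= 2 for s in retest_scores])
```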
We compared the NDDS with the criterion measures by calculating sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV), along with exact binomial 95 % confidence intervals. We used Stata 13 for all analyses [26].
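As a worked illustration of these quantities, the sketch below derives the four accuracy measures from a 2 × 2 table and attaches exact (Clopper-Pearson) binomial confidence intervals, found here by bisection on the binomial tail probabilities. The cell counts are hypothetical, and the helper names are not taken from any particular package.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); suitable for n up to roughly 1000."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _bisect(f, target):
    """Find p in [0, 1] with f(p) = target, for f increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def exact_ci(k, n, level=0.95):
    """Clopper-Pearson interval for k successes out of n trials."""
    alpha = 1 - level
    # Lower bound solves P(X >= k | p) = alpha / 2.
    lower = 0.0 if k == 0 else _bisect(lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound solves P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2.
    upper = 1.0 if k == n else _bisect(lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


def diagnostic_accuracy(tp, fp, fn, tn):
    """Return {measure: (estimate, ci_lower, ci_upper)} from a 2 x 2 table."""
    cells = {
        "sensitivity": (tp, tp + fn),  # screen-positive among true cases
        "specificity": (tn, tn + fp),  # screen-negative among non-cases
        "ppv": (tp, tp + fp),          # true cases among screen-positives
        "npv": (tn, tn + fn),          # non-cases among screen-negatives
    }
    return {name: (k / n, *exact_ci(k, n)) for name, (k, n) in cells.items()}


# Hypothetical counts, not study data.
results = diagnostic_accuracy(tp=30, fp=50, fn=20, tn=100)
```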

Results
We received initial referrals for 1012 parent-child pairs and have final data for 812: 594 children aged 1 month to 36 months (Group A) and 218 children aged 4 to 6 years (Group B). This represents an 80.2 % response rate from the total sample of referrals, and an 83.8 % response from eligible families. Figure 1 shows the stages of recruitment, participant exclusions, and consent rate. Parent demographics are shown in Table 1. In 98 % of cases, the NDDS was completed by the child's biological mother, and the 812 child-parent pairings were drawn from 572 families. The number of children in each NDDS age band varied from 41 to 98.

Criterion validity
We fit models to identify distribution-based cut-points for the BSID-III and PLS-4. In both cases, these resulted in higher prevalences than those derived using the published norms, and in prevalences that did not vary substantially with child age. Results of this analysis are illustrated in Fig. 2, which shows 'borderline' cases on the expressive communication subscale of the BSID-III according to the published cut-points (crosses) and according to our distribution-based model (all those below the regression line). Similar results were obtained for the other BSID-III subscales and for the PLS-4.

Group A (children 1 month to 3 years of age)
103 of 594 children (17.3 %) scored in the "borderline" range in one or more BSID-III domains. At the recommended 1+ cut-point (i.e., one or more "no" responses on the NDDS), the sensitivity of the NDDS was 59 % and the specificity 67 %. 17 children (2.9 %) scored in the "extremely low" range in at least one domain, and the sensitivity and specificity in this case were 65 % and 63 %, respectively (see Table 2).
Using distribution-based cut-points produced generally poorer agreement. 175 children (29 %) were below the "borderline" cut-point in at least one domain. For this outcome, the sensitivity of the NDDS at the 1+ cut-point was 50 % and the specificity 68 %. 45 children (7.6 %) were below at least one "extremely low" cut-point. The sensitivity and specificity in this case were 60 % and 64 %, respectively (see Table 2).

Group B (children 4 to 6 years of age)
Seven children (3.2 %) had incomplete or invalid results on one or more instruments, and were excluded from the analysis. Of the remaining 211 children, 40 (19 %) met norms-based criteria for mild delay. At the 1+ cutpoint, the NDDS had a sensitivity of 68 % and a specificity of 63 %. For the adjusted outcome, there were 57 cases (27 %). Sensitivity was 60 % and specificity 63 %.
Twelve children (5.7 %) met norms-based criteria for severe delay. The sensitivity of the NDDS was 67 % and the specificity 58 %. Using the adjusted measure produced a prevalence of 8.1 % (17 of 211), a sensitivity of 65 %, and a specificity of 59 % at the 1+ cut-point on the NDDS (see Table 3).
For severe delay, all PPVs were under 20 %, implying a low probability that a child with a positive screen will meet reference criteria. In keeping with the higher prevalence, PPVs for mild delay were higher, but still under 50 %. Using the alternative 2+ cut-point raised specificities to 81 %-84 %, but reduced sensitivities to 33 %-50 %.
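The link between prevalence and PPV follows from Bayes' rule. As a rough check, the sketch below plugs the rounded Group B norms-based figures reported above into that formula; because the inputs are rounded, the results are approximations and need not match the tabulated values exactly.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV = P(case | positive screen), by Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)


# Severe delay, norms-based: 12 of 211 cases, sensitivity 67 %, specificity 58 %.
ppv_severe = positive_predictive_value(0.67, 0.58, 12 / 211)  # roughly 0.09

# Mild delay, norms-based: 40 of 211 cases, sensitivity 68 %, specificity 63 %.
ppv_mild = positive_predictive_value(0.68, 0.63, 40 / 211)    # roughly 0.30
```

With similar sensitivity and specificity, the PPV rises from about 9 % to about 30 % purely because mild delay is more common than severe delay in this sample.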

Discussion
For screening purposes, it is generally recommended that sensitivity exceed 80 % and specificity 90 % [27]. Given the challenges of screening for developmental delay, lower thresholds (sensitivity of 70 %, specificity of 80 %) have been suggested in this context [28,29]. The NDDS, however, did not meet either set of criteria. On this basis, we cannot recommend that the NDDS be used on its own for identification of developmental delay in community or population-based settings. Our results are generally consistent with those of Dahinten and Ford [17], who reported 69 % specificity at the −2 SD cut-point on the BSID-II (sensitivity was 100 %, but only 3 cases were identified). Nagy et al. [18] reported much better accuracy (sensitivity 83 %, specificity 95 %), but the criterion measure used in that study was also a parent-reported instrument. Currie et al. reported sensitivity and specificity at the 1+ NDDS threshold to be 75 % and 78 %, respectively, and at the two-flag rule, 75 % and 96 %, respectively [16]. As noted previously, however, the sample size for that study was very small (n = 31), with only 4 children identified with delay. Moreover, the sample was drawn from a high-risk clinical referral group.
The test-retest reliability of the NDDS was also only modest. The retest took place after the clinical assessment, however, and parents of infants and young children (Group A) were often directly involved in the administration of the BSID-III (especially parents of children under 18 months). Parents' answers on the NDDS retest could therefore have been influenced by what they observed during testing. Especially in young children, it is also conceivable that new behaviours might be observed in a two-week period. It is possible to test whether the latter factor influenced change in parental reporting on the NDDS between test and retest by comparing the proportion of scores that increased (the number of flags indicating delay increased across administrations) with the proportion that decreased (indicating improvement in development). We found no clear differences in the direction of NDDS changes, however.
As our results illustrate, the validation of measures of developmental delay is difficult, owing to many limitations and challenges in the field. For example, there are numerous possible sources of disagreement beyond faults in the measure being evaluated. While we chose validated, widely-used instruments, there are no definitive, gold standard measures for the identification of 'developmental delay'. In the case of the NDDS, however, other concerns are evident. First, a reading of items suggests that the domains covered vary across the 13 age bands, resulting in implicit weighting of different domains. The variation in the number of items is another possible issue; endorsement of one item out of 14 on one age band may represent a different threshold than the same score on a version with 22 items. Finally, the NDDS age bands are very wide. The same items and thresholds are used for all 3-year-old children, for example, but substantial development can occur over this year.
Our results have important implications for policy and practice. The NDDS is currently used in a variety of settings to facilitate the identification of developmental delay. Evidence, however, does not support its use as the sole screening measure in any setting. Recommendations for Ontario's 18-month enhanced well-baby visit [13][14][15] are to use the NDDS as part of a more comprehensive assessment involving use of other tools (e.g., Rourke Well Baby Record; [30]), and this may be more appropriate. The instrument's systematic examination of milestones could help initiate discussions with parents and suggest areas for investigation. Given its poor agreement with reference measures, however, we suggest that caution is warranted. If the NDDS is used, it should probably be completed with the assistance of a trained administrator, and its usefulness should be monitored. This might be done, for example, by using administrative data to examine predictive validity.

Limitations
We evaluated the NDDS in a convenience sample drawn from a single geographical area, and our participating parents were somewhat better-educated than the national average. Although the NDDS consists of 13 separate sets of items, our sample was not large enough for us to evaluate the validity of individual versions. There are also no consensus gold standards for the identification of developmental delay, and the limited age range covered by our primary reference (the BSID-III) obliged us to use different instruments for older children. Given these limitations, independent replication of these results would be valuable.

Conclusions
The modest test-retest reliability and generally poor agreement with criterion measures lead us to conclude that the NDDS should not be used on its own for the purposes of screening in children aged 1 month to 6 years. At the same time, it is important to consider that reference instruments are themselves imperfect. Development is continuous and complex, and, except for clear cases of severe delay, it may be very difficult to construct an instrument relying solely on parental report that will accurately identify children who would benefit from an intervention. Longitudinal data, which make it possible to compare a screen with later health and development, may offer the best prospects in this regard.