Reliability and validity of the Caregiver Reported Early Development Instruments (CREDI) in impoverished regions of China

Background There is a great need in low- and middle-income countries for sound qualitative and monitoring tools for assessing early childhood development outcomes. Although there are many instruments to measure the developmental status of infants and toddlers, their use in large-scale studies is still limited because of high costs in both time and money. The Caregiver Reported Early Development Instruments (CREDI), however, were designed to serve as a population-level measure of early childhood development for children from birth to age three, and have been used in 17 low- and middle-income countries. This study aimed to examine the reliability and validity of the CREDI in China, which are still unknown. Methods The CREDI and the ASQ-3 were administered to a sample of 946 children aged 5–36 months from urban and rural communities, of whom 248 were also administered the Bayley-III. Results The internal consistency of the CREDI was high. The results also indicated that the concurrent validity of the CREDI with the Bayley-III scale was high in general. Ordinary least squares regression showed that the CREDI is highly consistent with previous widely used instruments with respect to some key predictors (such as home stimulation) of early childhood development level. Conclusions The results of the current study indicate that the CREDI may be considered an appropriate instrument to measure early childhood development status on a large scale in impoverished regions of China.


Background
It is widely known that the emotional, social, and cognitive skills that emerge in early childhood are important prerequisites for success in school, employment, and potential income in the later stages of an individual's life [11,12,29]. The period from birth to 3 years is also the stage when development is most rapid and children at this stage begin reaching basic development milestones. Therefore, early childhood is very sensitive to environmental effects and is also a period suitable for interventions that alleviate effects of external risk factors [6,17,34-36].
Early childhood development (henceforth, "ECD") has been recognized by governments and Non-Governmental Organizations as a window of opportunity to improve the level of individual development and the social and economic well-being of society as a whole [6]. Under this context, continuous monitoring of ECD outcomes using culturally and developmentally appropriate instruments can provide useful information for developing more effective intervention strategies [8]. Moreover, population-level measurement of ECD is necessary to improve ECD outcomes and reduce developmental inequality through national, regional, and global policies [25].
Although there has been significant progress in supporting and monitoring ECD and developing instruments assessing ECD, there are few effective and reliable instruments available to assess children's early development status at a large scale in different cultural environments [25]. As Kelly et al. [16] pointed out, children have different ways and times to acquire motor, cognitive, and language skills in different settings. Differences in preschool children's cognitive and social emotional skills at the national level were found to be related to the country's socioeconomic and nutritional status [24]. The assessment of ECD in different regions of the world helps us understand commonalities and differences in, and factors contributing to ECD, thus providing useful information for developing more effective intervention strategies.
In summary, population-level measurement and evaluation of ECD is a key issue that needs to be solved in the current era. Unlike individual assessment instruments, population-level measurement tools need to be simple and inexpensive to implement and must allow cross-cultural comparisons [25]. In this context, a new instrument, the Caregiver Reported Early Development Instruments (CREDI), was developed. The CREDI was designed as a caregiver-reported, cross-culturally comparable, population-level measure of ECD for children under 3 years [25,26]. The goal of the CREDI is to provide low-cost, large-scale data to facilitate policy interventions and resource allocation, while tracking global progress in alleviating ECD-related disparities around the world [27]. The reliability and validity of the CREDI have been studied and its applicability to the evaluation of ECD levels in low- and middle-income countries has been confirmed [1,25,27], but there is still a lack of research about its application in China, and there are still no studies on the application of the CREDI in its longer format (henceforth, "long form"). Based on this, this paper introduces and analyzes the application of the CREDI long form in China, with a special focus on its reliability and validity, using survey data from poverty-stricken areas in China.

Chinese context
In low- and middle-income countries, about 249 million children under the age of five are at risk of poor development, of whom 17.43 million (about 8%) are in China, ranking second in the world [21]. Studies show that concern for ECD in poverty-stricken areas of China is particularly acute [23,45,46,50]. About half of children in poor rural China are at risk for cognitive delays; 52% of children are at risk for language delays, and the risk of delay increases over time [48].
The Chinese government has made many efforts to promote appropriate ECD. At the Central Economic Work Conference in December 2018, "increasing investment in preschool education, early childhood development and vocational education in rural poverty areas" were listed as key tasks. In May 2019, the General Office of the State Council officially issued the "Guiding Opinions on Promoting the Development of Infant and Child Care Services under 3 Years of Age," and clearly established policies and regulations, standards and norms, and service supply systems to promote the development of childcare services. Childcare services will be implemented in various forms to gradually meet the people's needs. The Chinese government's policies reflect its determination to promote ECD and related public service systems. In this context, data collection and evaluation of population-level ECD is particularly urgent. The application of a population-level assessment instrument for ECD has also become an important element in guiding how the Chinese government can effectively implement childcare services policies.

Existing measures of ECD
Several international instruments have been developed to comprehensively measure ECD, such as the Griffiths Mental Development Scales, the Denver Developmental Screening Test, and the Bayley Scales of Infant and Toddler Development, which are direct assessment tools usually administered by clinically trained personnel [37] for screening and diagnosis of children with developmental disabilities or delay. Among these measures, the Bayley Scales of Infant and Toddler Development is more widely used in China. Although these individual screening tools have the advantages of being detailed, standardized, and practical in obtaining information on a child's developmental status, they are limited in providing population-level measurement of ECD because the costs of copyright purchases, adaptation, administration time, and training of the administrators are often relatively high, making them unsuitable for large-scale use [9]. Moreover, these assessment tools directly engage with the child, which may result in measurement errors caused by external factors, such as the temperament of the child (e.g., some children might be too shy because of unfamiliarity with the testing environment and the tester) or the ability of children to understand the verbal instructions given.
Therefore, some indirect assessments, which are reported by the caregiver, such as the Early Development Index [15], PRIDI [43], IDELA [30], and the Early Childhood Index [41], may be more suitable for capturing the population's ECD at this age. Moreover, collecting data in this way is scalable. However, these instruments are still limited because of their small number of measurement items or their age limits (they are mostly concerned with older children rather than children aged 0-3 years). The Ages and Stages Questionnaire, third edition (ASQ-3) and the Caregiver Reported Early Development Instruments (CREDI), by contrast, were developed for assessing the ECD status of infants and toddlers. Compared to direct assessment tools, an instrument reported by caregivers requires less training and testing time, and such an instrument is less likely to be biased by children being unfamiliar with clinical assessments, potential behavior changes with strangers, or not understanding verbal instructions [9,37]. Previous studies about the reliability and validity of the ASQ-3 and the CREDI are introduced in detail below.
The ASQ-3 is a caregiver-reported measure that asks caregivers to report different aspects of their children's behavior to assess development [40]. Although its items have good to acceptable internal consistency [39], the findings of previous studies on the validity of the ASQ-3 have varied significantly. For sensitivity (the rate at which a screening instrument correctly identifies a developmental delay) and specificity (the rate at which a screening instrument correctly identifies children who perform within the normal range), respectively, the following values have been reported: 75 and 86% [38]; 66 and 84% [14]; 82 and 78% [19]; 87.50 and 84.48% [45,46]. Previous studies have also assessed the effectiveness of the ASQ-3 as a screening tool by comparing it with the Bayley-III and have found weak or moderate consistency between the two instruments [42,49]. Some studies show that the ASQ-3 is better able to assess children's development when children are older rather than when they are younger [31,32,40,49]. The inconsistent results may be related to differences in reference measurement, age of the sample, cultural environments, and item explanations. Furthermore, measurement errors caused by the reference instrument itself may also be an explanation [42]. In spite of the discrepancies, a study by Wei et al. [45,46] pointed out that the ASQ-3 has good reliability and validity in mainland China and can be used for developmental screening and monitoring of eligible children there. Therefore, we also chose the ASQ-3 as a comparison scale.
The CREDI differs from the ASQ-3 as a population-level assessment tool because it is not used to screen and diagnose an individual's specific developmental problems or developmental delays, but provides caregivers with feedback on the child's developmental status or tracks subtle changes in individual levels through intervention. This instrument is simpler and less burdensome to administer, and parental involvement in testing may also help parents gain important knowledge about their child's development and understand their child's performance at that age. It is an open resource that can be downloaded freely from the website https://sites.sph.harvard.edu/credi/. It can serve to provide conceptually rich, developmentally informed, population-level data on global progress in alleviating ECD-related inequities and meeting target 4.2 of the UN Sustainable Development Goals [27].
The CREDI has been piloted and applied in many low- and middle-income countries around the world, and its reliability and validity have also been studied and analyzed. By analyzing 2481 caregivers of children aged 18-36 months in Tanzania, McCoy et al. [27] evaluated the acceptability, test-retest reliability, internal consistency, and discriminant validity of the newly developed CREDI items, subscales, and total scale. The results showed that the CREDI and its motor, cognitive, and social-emotional subscales had sufficient acceptability and internal consistency. The study also found positive evidence for the validity of the CREDI by showing adequate criterion validity with the Bayley-III motor, cognitive, and communication subscales, and found that the CREDI can accurately distinguish differences in children's ages, nutritional status, disability status, and home stimulation activities. In addition to providing positive evidence of validity, the CREDI has been found to be a more acceptable tool in low-income environments because it is easily understood and quickly implemented: trained field staff with the equivalent of a secondary education took only about 20 min on average to complete the test. Moreover, it was found that this kind of caregiver-reported instrument helps reduce errors resulting from noncompliance caused by factors such as unfamiliarity with the test environment, fear of unfamiliar adults, and children's illnesses. However, coverage of the study in Tanzania was insufficient because it only included 18- to 36-month-old children, resulting in a lack of data regarding children younger than 18 months. The authors also suggest that before the CREDI is fully disseminated, more research in multilingual and multicultural environments and in younger age groups needs to be conducted.
Another study about the application of the CREDI was conducted by McCoy et al. [25] with 8022 participants from 17 low- and middle-income countries. The results showed that the CREDI short form is an effective, reliable, and acceptable population-level measure of ECD. Feedback from qualitative interviews with caregivers and field team members showed that participants had a good understanding of the CREDI and that it was easy to implement. Internal consistency was also sufficient. The results also showed that the CREDI score differs among social demographic subgroups. Criterion validity was also found to be sufficient, based on the correlations between the CREDI and alternative ECD "gold standard" instruments. This study demonstrated that the CREDI short form is effective, reliable, and acceptable in measuring population-level ECD status in different cultural environments. Based on this, the authors suggest that the CREDI can be used as a useful tool to monitor ECD status in low-resource, low-cultural settings and in large-scale household surveys, while recommending that the CREDI and other indicators be used together.
Using data from 1265 caregivers of infants and toddlers aged 0-35 months in Brazil, Altafim et al. [1] conducted a study to assess the acceptability, test-retest reliability, internal consistency, and discriminant and concurrent validity of the CREDI short form. The results of qualitative interviews showed that overall acceptance of the scale was high. Internal consistency was very high in the six age groups, with a coefficient greater than .8, but there were fewer participants in the 0-5 and 18-23 month age groups, and therefore further research is required. Multivariate analysis of structural validity showed that some of the significant variations in CREDI scores could be explained by the child's gender and family characteristics, such as women's education levels, socioeconomic status, and stimulating activities of the family. Regarding concurrent validity, the CREDI score was significantly correlated with the PRIDI score, with a correlation coefficient of .46. In summary, the results of the study in Brazil show that the CREDI short form has high validity, reliability, and acceptability, which suggests that it can be used for assessment of ECD status on a large scale in Brazil.
These studies conducted in-depth research on the application of the CREDI in low- and middle-income environments and mainly focused on the CREDI short form. Nevertheless, the reliability and validity of the CREDI long form in China have not yet been studied. The Chinese government's recent commitment to childcare services policy urgently requires appropriate instruments to assess ECD status at the population level. Based on the previous review, the Bayley-III, although the gold standard, is not suitable for large-scale use because of its high administration costs and demanding administration requirements. Alternatively, although the ASQ-3 is suitable for large-scale use as a caregiver-reported tool, there are still discrepancies in the data it provides. As an alternative measurement tool to the ASQ-3, the recently developed CREDI tool for the assessment of population-level ECD status has been widely studied in 17 low- and middle-income countries and recommended. However, before it is used in China, analysis of its reliability and validity is especially necessary. Therefore, the goal of this study was to evaluate the reliability and validity of the CREDI long form as a measure of ECD status at a population level in rural China. To do so, we administered the CREDI long form and the ASQ-3 to caregivers in the total sample and administered the Bayley-III to a subsample of the children. We then compared the outcomes of the CREDI test to those of the Bayley-III and the ASQ-3.

Data collection
The data used in this study were collected from a sample of 995 children aged 5-36 months from urban and rural communities in July 2018. The data came from a randomized controlled trial that involved implementing an intervention with children and their caregivers. In China, however, the traditional custom is that children from 0 to 5 months of age rarely go outside. Therefore, children in this age group were not the targeted group of the intervention and thus are not in our study sample. The sample area is representative of one nationally designated poor county in the Qinba mountain region of China. Each toddler's primary caregiver was administered a detailed survey on parental and household characteristics, including each toddler's age, gender, gestational age, presence of any siblings, whether the mother is the primary caregiver or not, maternal education, and household economic status.
All infants/toddlers were administered two different scales to measure developmental outcomes: the CREDI long form and the ASQ-3. Considering this was the first application of the CREDI in China, the authors' research team engaged a professional translation company to conduct accurate translation and back-translation of the items and related materials of the CREDI. The translation was welcomed by the CREDI team. The translated items were also sent to specialists in the field of child development for consultation, and we conducted a pilot study in rural China before this survey to check whether the translation of the questions was clear and suitable for rural caregivers. Additionally, out of the 995 sample infants/toddlers, 258 were tested with the Bayley-III scales for their levels of cognitive, language, or motor development by enumeration teams.
It should be noted that the final sample for analysis is smaller than the full sample because of missing data, which come from two sources. First, there is a very small proportion of missing values in key measures: we imposed strict quality control during data collection to avoid missing data, and there were no missing data at the item level of the CREDI and Bayley-III and less than 2% missing values at the item level of the ASQ-3. Second, less than 0.05% of child and family characteristics data are missing in our sample. Because the amount of missing data is small and its effect on the analysis is negligible, we excluded these observations from all analyses.

Ethics
All study protocols were approved by institutional review boards (IRBs), both at Stanford University (No. 46564) and the West China School of Medicine, Sichuan University, China (No. K2018074). Caregivers provided written consent for their own participation and the participation of their children after a field worker read the consent form out loud and answered any questions. All study staff were trained and monitored in IRB-approved procedures for identifying participant needs.

The Bayley scales of infant and toddler development, third edition (Bayley-III)
As one of the most widely used scales for measuring the developmental status of infants and toddlers aged between 0 and 42 months, the Bayley-III is considered the gold standard in the field. The Bayley-III includes 326 items divided into five domains: cognition, receptive communication, expressive communication, fine motor, and gross motor. As each item is administered, the examiner records the child's response and stops when the child fails five consecutive items. The child receives 1 point if he or she meets the scoring criteria for an item [3] and 0 otherwise. Then the sum of scaled scores for a given composite is calculated for each child in the normative sample from the United States. The conversion from scaled scores (mean of 10, standard deviation of 3) to composite score equivalents (mean of 100, standard deviation of 15) is linear [4]. The Chinese version used in the current study has been widely used in research on early childhood development in China [23,48-50]. The Chinese version was translated and back-translated by a professional team.
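The linear conversion mentioned above can be illustrated with a short sketch. It assumes only the two metrics stated in the text (scaled scores with a mean of 10 and SD of 3, composite score equivalents with a mean of 100 and SD of 15). The function name is hypothetical, and actual scoring should rely on the published Bayley-III norm tables rather than on this formula.

```python
def scaled_to_composite_equivalent(scaled_score: float) -> float:
    """Map a scaled score (mean 10, SD 3) onto the composite metric (mean 100, SD 15)."""
    z = (scaled_score - 10.0) / 3.0  # standardize on the scaled-score metric
    return 100.0 + 15.0 * z          # re-express on the composite metric


# Illustrative values only: one SD below the mean, the mean, one SD above.
for s in (7, 10, 13):
    print(s, scaled_to_composite_equivalent(s))
```

Under these assumptions, a scaled score one standard deviation above the mean maps to a composite equivalent one standard deviation above its mean.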

The ages and stages questionnaire, third edition (ASQ-3)
The ASQ-3 is another widely used instrument to measure the developmental status of infants and toddlers aged between 1 and 66 months. The ASQ-3 includes 21 questionnaires, and each questionnaire consists of 30 simple, straightforward questions about five domains of childhood development: problem-solving, communication, gross motor, fine motor, and personal-social. The answer to each question is selected from three possible responses: "yes," "sometimes," or "not yet." Caregivers should select "yes" if the child shows a specific behavior, "sometimes" if the specific behavior is occasional or new, or "not yet" if the question refers to a behavior the child has not yet shown. The total score of each domain is the sum of the scores of the six questions in that domain. The development status of the child is determined by comparing the total score of each of the five domains with the empirically derived threshold for that domain (the mean minus 2 standard deviations). The Chinese version used in the study was validated by Bian et al. [5] and Wei et al. [45,46].
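The domain scoring and threshold comparison described above can be sketched as follows. The point values (10, 5, and 0) are an assumption that is consistent with six questions per domain and a 0 to 60 domain range; the function names, the example responses, and the reference mean and SD are hypothetical and are not taken from the ASQ-3 norms. Flagging scores strictly below the threshold is also a choice made for this sketch.

```python
# Point values are an assumption consistent with six items and a 0-60 domain range.
POINTS = {"yes": 10, "sometimes": 5, "not yet": 0}


def domain_score(responses):
    """Sum the points for the six responses in one ASQ-3 domain."""
    if len(responses) != 6:
        raise ValueError("each domain is scored from six questions")
    return sum(POINTS[r] for r in responses)


def below_cutoff(score, ref_mean, ref_sd):
    """Flag a domain score below the threshold of mean minus 2 standard deviations."""
    return score < ref_mean - 2 * ref_sd


# Hypothetical responses and reference values, for illustration only.
answers = ["yes", "yes", "sometimes", "yes", "not yet", "sometimes"]
score = domain_score(answers)
print(score, below_cutoff(score, ref_mean=50.0, ref_sd=8.0))
```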

Caregiver reported early childhood development instruments (CREDI)
The CREDI are newly developed scales for measuring ECD status at the population level. They aim to provide an accurate and easy-to-administer assessment of ECD for children between 0 and 35 months that functions across a wide variety of cultural, linguistic, and socioeconomic contexts [1,25]. The CREDI is answered by the child's primary caregiver, who responds to each item with "yes" or "no" (caregivers who are unsure of their response may also respond "don't know"). The CREDI team also provides the credi package for the software program R to guide users in scoring the CREDI long form [26].
As part of a larger project, both a short and a long form of the CREDI were developed from the same broad item set. The long form produces a score for each of four domains, namely cognitive, language, motor, and social-emotional development. The goal of the long form is to provide detailed information to researchers interested in measuring specific developmental domains. The long form consists of a total of 108 items; the starting point is determined by the child's age, and the ending point is determined by a five-item error/uncertainty rule. In contrast, the short form produces a total score for the child's overall developmental status and contains 20 items selected to characterize children's development within predefined six-month age bands. For research and evaluation projects, the long form provides domain-specific details about child development, capturing differences in specific skills that can inform the design of interventions targeted at improving child development. In the current study, therefore, the long form was used and evaluated.
In sum, all of the tests can capture the children's developmental status for each domain. The age range covered in each test is different. The age period of the CREDI is the shortest, from 0 to 35 months; the age period of the Bayley-III is moderate, from 0 to 42 months; the age period of the ASQ-3 is the longest, from 1 to 66 months. The CREDI can cover younger children, while the ASQ-3 can cover older children compared to the CREDI and Bayley. In terms of administration, compared to the Bayley-III, both the CREDI and the ASQ-3 are shorter in duration and easier to administer. Actually, during our survey, the CREDI was very simple and clear enough to be answered by a caregiver with minimal formal education. The cost of administration was also lower than the Bayley-III.

Statistical analysis strategy
First, the descriptive characteristics of the sample were displayed to show the basic information about the corresponding sample and the ECD status measured by the different instruments. Second, reliability was assessed through item internal consistency, with Cronbach's α coefficients used to interpret the internal consistency of the three instruments. Next, internally standardized correlations among these measures were calculated to test concurrent validity. To examine heterogeneity, the sample was divided into age cohorts of 5-11 months, 12-17 months, 18-23 months, 24-29 months, and 30-35 months when calculating the correlations. The sample was also divided according to the type of caregiver and the household wealth status, and the correlations were calculated for each group. Finally, an ordinary least squares (OLS) regression was conducted to check the relationship between a set of variables shown to be related to child development and the scores on the three instruments. This analysis tested whether the three measures have consistent predictive factors. It is one way to determine the similarity of the three measures in terms of how they identify the developmental status of the children. All statistical analyses were performed using Stata 14.2 statistical software.
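The final step of this strategy, regressing each instrument's score on the same predictors and comparing the coefficients, can be sketched as follows. This is a minimal illustration and not the study's Stata code: the function name, the predictors, and the toy data are all hypothetical, and the coefficients are obtained by solving the normal equations directly rather than with a statistics package.

```python
def ols_coefficients(rows, y):
    """Return OLS coefficients (intercept first) by solving the normal equations."""
    design = [[1.0] + [float(v) for v in row] for row in rows]  # add intercept
    k = len(design[0])
    # Build the augmented matrix [X'X | X'y].
    aug = [
        [sum(r[i] * r[j] for r in design) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(design, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("predictors are collinear")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


# Hypothetical toy data: columns are home stimulation (0/1) and a wealth factor.
predictors = [(0, -1.0), (1, 0.5), (0, 0.2), (1, 1.1), (1, -0.3), (0, 0.8)]
scores = {
    "instrument_a": [-0.9, 0.6, -0.2, 0.9, 0.1, 0.0],
    "instrument_b": [-0.7, 0.4, -0.1, 1.0, 0.3, 0.1],
}
for name, y in scores.items():
    print(name, [round(b, 2) for b in ols_coefficients(predictors, y)])
```

If the instruments identify developmental status in a similar way, the same predictors would be expected to show coefficients of the same sign and comparable significance across the regressions.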

Results
First, the descriptive characteristics of the sample are displayed in Table 1. As shown in Table 1, 946 toddlers were included in our final analysis. Among these toddlers, only 248 were administered the Bayley-III. The distributions of child and family characteristics in the total sample and the Bayley sample were mostly consistent. Generally, there was a slightly higher proportion of female toddlers; slightly over half had siblings; around 5% of the sample were born prematurely; the mother was identified as the primary caregiver for about 70% of the toddlers; and the educational attainment of mothers was low overall, with around half having a junior high school education or below. The household wealth status was moderate among all samples, and the wealth status of the Bayley sample was relatively better than that of the total sample. Second, the ECD results are shown in Table 2. The mean scores (SD) from the CREDI indicated that the overall developmental status of our sample was only moderate. In terms of the Bayley-III scores, the Bayley-III has not yet been administered to a healthy reference population in China. As such, we rely on reference populations from other widely accepted research, which reveals that, for a healthy population, the mean score (SD) is expected to be 105 (9.6) for the cognitive scale [20,33], 109 (12.3) for the language score [33], and 107 (14) for the motor score [7,20]. According to the above standards, the developmental status of our sample was slightly below average. With respect to the ASQ-3, the mean scores of each domain were a little lower than the reference mean scores shown in the ASQ-3 user guide. In sum, the results obtained from the three tests were generally consistent, with slightly better results from the CREDI.
Third, the internal consistency of the CREDI, Bayley-III, and ASQ-3 is shown in Table 3. Both the CREDI and Bayley-III have large Cronbach's α coefficients, which means the internal consistency of the two scales was high. For the CREDI, the Cronbach's α coefficients of each subscale ranged from .92 to .97. When the internal consistency was examined by age group, the Cronbach's α coefficients of each subscale decreased, but remained relatively high. For age 6-11 months, the Cronbach's α coefficients of each subscale ranged from .81 to .87; for age 12-17 months, from .83 to .91; for age 18-23 months, from .74 to .93; for age 24-29 months, from .66 to .91; for age 30-35 months, from .60 to .89. Overall, the Cronbach's α coefficients of the cognitive, motor, and social-emotional subscales decreased with age, although they increased before 12 months. For the language subscale, the Cronbach's α coefficients decreased with age, although they increased before 24 months. It should also be noted that the CREDI has unacceptably low internal consistency reliability (Cronbach's α coefficients below .7) in some cases, such as the motor and social-emotional subscales at 24-35 months.
For the Bayley-III, the Cronbach's α coefficients of each subscale ranged from .97 to .98. The Cronbach's α coefficients of each subscale decreased after the sample was divided by age group. Despite this, the internal consistency reliability indicated by the Cronbach coefficients was still very high. For the subscales of cognitive and fine motor skills, the internal consistency reliability increased with age between 6 and 23 months, while the internal consistency reliability decreased with age after 23 months. For the subscales of receptive communication and expressive communication, the internal consistency reliability increased with age between 6 and 17 months, while it decreased with age after 17 months. For the subscale of gross motor, the internal consistency reliability decreased with age across the five age groups.
In contrast, the ASQ-3 had relatively low internal consistency reliability. The Cronbach's α coefficients of each subscale ranged from .41 to .70. Among the five subscales, the internal consistency reliability of the gross motor subscale was the highest, with a Cronbach's α coefficient of .70; the internal consistency reliability of the personal-social subscale was the lowest, with a Cronbach's α coefficient of .41. When the sample was divided by age group, the Cronbach's α coefficients varied irregularly across age groups.
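For readers less familiar with the statistic reported above, Cronbach's α for a subscale with k items is k/(k-1) multiplied by one minus the ratio of the summed item variances to the variance of the total score. The sketch below implements this standard formula; the function name and the small response matrix are hypothetical and are not the study's data, and the handling of items skipped under start and stop rules is not addressed.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(item_scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical yes/no (1/0) responses from five caregivers on four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 2))
```

Values of .7 or higher are conventionally treated as acceptable, which is the benchmark used in the text.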
Note: In China, junior high school refers to chu zhong (初中; 初级中学; literally low-level middle school) from grades 7 to 9; senior high school refers to gao zhong (高中; 高级中学; literally high-level middle school) from grades 10 to 12. The household wealth factor was constructed to indicate the wealth status of the sample. The factor was based on the following nine variables: annual agricultural and non-agricultural income in 2017, and seven dummy variables of being registered as a poor household, having a flush toilet, water heater, PC, internet access, air conditioning, and a truck/car at home. It was generated using the iterated principal-factor method to calculate factor loadings and derive a factor score for each household. Questions about home stimulation activities in our survey asked whether, in the past 3 days, the primary caregiver had engaged their children in any of the following activities: reading books or looking at picture-books, telling stories, singing songs, taking the child outside the home to play, playing with the child with toys, or spending time with the child naming things, counting, or drawing. Following the clarification of this variable in the Multiple Indicator Survey by UNICEF [10], it is treated as a binary variable here.
Note: The CREDI Z-scores were constructed by comparing the raw score in each domain to the average raw score in the CREDI reference population of a particular age. A z-score of 0 thus means that the child has exactly the same score on that particular domain as the average same-age child in the CREDI reference population. A score of "-1" means that the child's raw score is 1 standard deviation below the same-age average of the reference population (more details about the reference population can be found in the CREDI data management & scoring manual). The range of Bayley-III composite scores is 55 to 145 for Cognitive, 47 to 153 for Language, and 46 to 154 for Motor. The raw score range of the ASQ-3 for each domain is from 0 to 60.
Subsequently, the correlations between the CREDI and Bayley-III scores, the ASQ-3 and Bayley-III scores, and the CREDI and the ASQ-3 scores for each of their domains by age group were calculated. P-values of the correlations were calculated by bootstrapping methods, with 1000 replications. As shown in Table 4, the results indicated that the concurrent validity of the CREDI with the Bayley-III scale was high in general. That is, the CREDI cognitive, language, and motor subscales had strong correlations with the corresponding Bayley-III subscales. The correlation coefficients ranged from .84 to .90, among which the correlation between the CREDI motor subscale and the Bayley-III gross motor subscale was the largest. In contrast, although the correlation coefficients between the ASQ-3 communication subscale and the Bayley-III expressive communication and receptive communication subscales, and between the ASQ-3 gross motor subscale and the Bayley-III gross motor subscale, were significant at moderate levels, the concurrent validity of the ASQ-3 with the Bayley-III scale was relatively lower in general. With respect to the concurrent validity of the ASQ-3 with the CREDI, the results showed that only the correlation coefficients between the ASQ-3 communication subscale and the CREDI language subscale, as well as between the ASQ-3 gross motor subscale and the CREDI motor subscale, were significant at moderate levels, and the correlations in other domains were extremely weak. The heterogeneous analysis of the CREDI was also conducted, as shown in Tables 5, 6 and 7.
The correlations among the Bayley-III, the CREDI, and the ASQ-3 were calculated by age group, primary caregiver, and wealth status. Table 5 shows that the correlations between the CREDI and Bayley-III varied across age groups. In general, the correlation between the CREDI and the Bayley-III was strong before 18 months, relatively weak at 18 to 23 months, and moderate after 24 months. When the correlations between the ASQ-3 and Bayley-III were examined by age group, the correlation in the communication domain was generally stronger at 12-29 months than in other age periods. With respect to the correlations between the CREDI and the ASQ-3 by age group, within 5-11 months only the correlation between the CREDI language subscale and the ASQ-3 communication subscale was significant and at a moderate level. After 12 months, the correlations between each domain of the CREDI and the ASQ-3 were significant and moderate.
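The text reports that p-values for these correlations were obtained by bootstrapping with 1000 replications, but does not spell out the procedure. One common variant is sketched below under stated assumptions: child-level pairs are resampled with replacement, the standard deviation of the resampled correlations is taken as the standard error, and a two-sided normal-approximation p-value is derived from it. The function name `bootstrap_corr` and the simulated scores are hypothetical and are not taken from the study.

```python
import math

import numpy as np


def bootstrap_corr(x, y, n_boot=1000, seed=0):
    """Pearson r with a bootstrap standard error and a normal-approximation p-value."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(x)
    r_obs = np.corrcoef(x, y)[0, 1]

    reps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample children with replacement
        reps[b] = np.corrcoef(x[idx], y[idx])[0, 1]

    se = reps.std(ddof=1)                       # bootstrap standard error of r
    z = r_obs / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p under a normal approximation
    return r_obs, se, p_value


# Hypothetical standardized scores for 200 children on two instruments.
rng = np.random.default_rng(42)
score_a = rng.normal(size=200)
score_b = score_a + rng.normal(scale=0.5, size=200)

r, se, p = bootstrap_corr(score_a, score_b)
print(f"r = {r:.2f}, bootstrap SE = {se:.3f}, p = {p:.3g}")
```

To mirror the by-age-group analysis, the same function could be applied to the subset of children in each age band; small subgroups will produce wider bootstrap standard errors.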
When the correlations among the three tests were examined by caregiver type and household wealth status, it was found that, in general, regardless of whether the primary caregiver was the mother or the grandmother, and regardless of whether the household was poor or wealthy, the correlations between the CREDI and Bayley-III were large and statistically significant. The correlations between the ASQ-3 and Bayley-III were significant and moderate only in the domains of communication and gross motor, and the correlations between the CREDI and ASQ-3 were significant but relatively small. This is shown in Tables 6 and 7.

To complete the analysis, the OLS regression results were reviewed to check whether the three instruments have consistent predictors. All the scores from the Bayley-III, CREDI, and ASQ-3 were internally standardized before OLS regression.
As shown in the Appendix Table, children from homes with higher stimulation obtained higher Bayley cognitive scores, and older children obtained higher Bayley cognitive scores. When the same factors were used to predict children's CREDI cognitive scores, some consistencies with the Bayley-III results were evident. That is, the higher the home stimulation, the higher the CREDI cognitive scores, and the CREDI cognitive scores increased with the child's age. However, unlike the Bayley cognitive score, the CREDI cognitive score was positively related to household wealth status, with a very small effect (indicated by the small coefficient). The child's gender was negatively related to the CREDI cognitive score, that is, girls had higher CREDI scores than boys. Additionally, when the same factors were used to predict children's ASQ-3 scores, home stimulation was positively related to ASQ-3 "Problem Solving", as with the CREDI and Bayley-III. Consistent with the CREDI but inconsistent with the Bayley-III, household wealth status was positively related to the ASQ-3 score. Unlike the other two tests, type of primary caregiver was significantly related to the ASQ-3: when the primary caregiver was the mother, the ASQ-3 "Problem Solving" score was higher.

When the same factors were used to predict children's language scores, children with higher home stimulation obtained higher Bayley "Receptive Communication" and "Expressive Communication" scores; older children obtained higher Bayley scores; and girls' Bayley scores were higher than boys'. When the same factors were used to predict children's CREDI language scores, the results were consistent with the Bayley-III. However, unlike the Bayley-III, household wealth status was positively related to CREDI language scores, whereas the association between household wealth status and Bayley-III "Receptive Communication" and "Expressive Communication" was not significant.
When the same factors were used to predict children's ASQ-3 communication scores, the results differed slightly. As with the CREDI and Bayley-III, home stimulation was positively related to ASQ-3 "Communication", and girls obtained higher scores than boys. Consistent with the CREDI but inconsistent with the Bayley-III, household wealth status was positively related to the ASQ-3 score. Unlike the other two tests, the relationship between age and the ASQ-3 communication score varied with age group: compared with children aged 5-11 months, only children aged 18 months and above scored higher on ASQ-3 "Communication".
When the same factors were used to predict children's motor scores, the three instruments agreed on some predictors and disagreed on others. Specifically, motor development measured by all three instruments was positively related to children's age. With respect to home stimulation, the association with Bayley motor scores was not significant, but home stimulation was positively related to CREDI motor and ASQ-3 motor scores. The child's gender was significantly related to CREDI motor scores, but not to Bayley motor or ASQ-3 motor scores. Whether the child was premature was significantly related to the Bayley fine motor results, but not to the CREDI motor or ASQ-3 motor results.
In terms of predicting children's social-emotional development, only the ASQ-3 "Personal-Social" and CREDI "Social-Emotional" scores were assessed, because our study did not include a Bayley-III "Social-Emotional" measure. The results showed that both the ASQ-3 and CREDI scores were positively related to home stimulation, and girls obtained higher social-emotional scores than boys.
Note: Since raw scores increase with age, we eliminated the age effect by internally standardizing raw scores within age (month) groups. This was done by computing age-adjusted z-scores using age-conditional means and standard deviations estimated by nonparametric regression. Compared with parametric procedures, this nonparametric standardization method is less sensitive to outliers and to small sample sizes within age categories, and it yields normally distributed standardized scores with mean zero across the age range (in months) [2,22]. Standard errors (SE) were computed using the bootstrap method. Bootstrapping allows estimation of the sampling distribution of almost any statistic using random sampling methods. Bootstrap is asymptotically more accurate than the standard intervals obtained using sample variance and assumptions of normality. RC: Receptive Communication; EC: Expressive Communication.

Above all, there was high consistency in predicting ECD status among the three tests in some key predictors, such as home stimulation. That is, Bayley scores (except Bayley motor scores), CREDI scores, and ASQ-3 scores were all positively related to home stimulation. There were also some consistent results for individual and family characteristics. For example, the scores of the three tests indicating language and social-emotional development were higher for girls than for boys. Nevertheless, there were inconsistent results in predicting ECD status for some individual and family characteristics. For example, children's gender was not significantly related to Bayley-III cognitive and motor scores, but was closely related to CREDI scores. Household wealth status was not significantly related to Bayley-III scores, but was positively related to the CREDI and ASQ-3 scores indicating cognitive and language development. With respect to caregiver type, the Bayley-III and CREDI results suggested no significant association.
However, the results for ASQ-3 scores indicated that caregiver type was related to the ASQ-3 "Problem Solving" scores. In general, the predictors of ECD level were relatively consistent across the different measurements. It can be concluded that, as a caregiver-reported, population-level measurement of children's development, the CREDI is highly consistent with previous widely used instruments in some key predictors of ECD level (such as home stimulation). Moreover, the CREDI is highly consistent with the other indirect assessment, the ASQ-3, in some individual and family characteristics (such as the children's gender and household wealth status).
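The predictor comparison above rests on standardizing each outcome and regressing it on the same set of covariates. The sketch below illustrates that workflow on simulated data and is not the authors' specification: it assumes whole-sample z-scores as the "internal standardization", uses only three hypothetical covariates (`stimulation`, `female`, `age_months`), and fits OLS by ordinary least squares with NumPy.

```python
import numpy as np


def zscore(values):
    """Standardize to mean 0 and SD 1 within the sample."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std(ddof=1)


def ols(y, predictors):
    """OLS coefficients; the first element is the intercept."""
    design = np.column_stack([np.ones(len(y)), predictors])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    return beta


# Hypothetical sample of 600 children (all values simulated).
rng = np.random.default_rng(1)
n = 600
stimulation = rng.integers(0, 2, size=n)   # 1 = higher home stimulation
female = rng.integers(0, 2, size=n)        # 1 = girl
age_months = rng.integers(5, 36, size=n)   # ages 5 to 35 months
raw_score = 0.5 * age_months + 2.0 * stimulation + rng.normal(scale=2.0, size=n)

outcome = zscore(raw_score)
beta = ols(outcome, np.column_stack([stimulation, female, age_months]))
for name, coef in zip(["intercept", "stimulation", "female", "age_months"], beta):
    print(f"{name:>12}: {coef: .3f}")
```

Fitting the same design matrix to each instrument's standardized score makes the coefficients directly comparable in sign and rough magnitude, which is the kind of comparison summarized in this section.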

Discussion
From the above results, it can be concluded that the CREDI has advantages over the Bayley-III in administration time, difficulty, and cost, and advantages over another indirect measurement, the ASQ-3, in internal consistency reliability and validity.
First, according to the results shown above, the Bayley scales have very good internal consistency, whereas the ASQ-3 has unacceptably poor internal consistency reliability. In contrast, the Cronbach's α coefficients of each CREDI subscale were large; although they declined when the sample was divided by age group, the internal consistency reliability of the CREDI remained good in general. However, it should be noted that the CREDI had unacceptably low internal consistency reliability in the motor and social-emotional subscales at 24-35 months. Second, the concurrent validity analysis conducted using the Bayley-III as the criterion indicated generally high concurrent validity of the CREDI. In contrast, the concurrent validity of the ASQ-3 with the Bayley-III scale was low, and the concurrent validity of the two indirect assessments, the CREDI and the ASQ-3, was also low. Third, the heterogeneity analysis generally showed that the correlation between the CREDI and the Bayley-III was strong before 18 months, relatively weak at 18-23 months, and moderate after 24 months. In contrast, the correlation in the communication domain between the ASQ-3 and Bayley-III was stronger at 12-29 months than in other age periods. In terms of the correlation between the CREDI and ASQ-3 by age group, within 5-11 months only the correlation between the CREDI language subscale and the ASQ-3 communication subscale was significant and at a moderate level. After 12 months, the correlations between each domain of the CREDI and the ASQ-3 were significant and moderate. In addition, the heterogeneity analysis showed no large differences in the correlations between the CREDI and Bayley-III across caregiver types and household wealth statuses. Finally, the OLS analysis showed that the CREDI was highly consistent with previous widely used instruments in some key predictors of ECD level (such as home stimulation).
Furthermore, the CREDI was also highly consistent with the other indirect assessment, the ASQ-3, in some individual and family characteristics (such as the children's gender and household wealth status).

Compared with previous studies, the current study examined the reliability and validity of the CREDI long form in China, which had not been assessed before. Moreover, the age coverage under 3 years is extensive: every age group is included except 0-5 months. Consistent with previous research, the results of the current study suggest that the CREDI can be used as a useful tool to monitor ECD status at large scale in impoverished regions of China. The multivariate regression results are consistent with previous studies that emphasize the importance of home stimulation activities and family economic status.

[Table 6: Correlations among the CREDI, Bayley-III, and the ASQ-3 by caregiver]

At present, the development of childcare services for children under 3 years old in China is lagging behind. Systematic and effective childcare policies and services have not yet been formed, and the absence of supportive systems and the shortage of social services are prominent [47]. Especially after the implementation of the universal two-child policy, the establishment of a childcare policy system has attracted unprecedented attention. Information about ECD status at the population level is the basis on which the Chinese government can implement childcare policies and services and develop more effective intervention strategies. The current study has implications for the use of the CREDI long form to monitor ECD outcomes in impoverished regions of China.
In addition, this study indicates that the public services and support provided by society cannot completely replace the function of the family, and that improving family members' parenting practices (reflected by home stimulation activities) is conducive to improving the early development of children in poor rural areas.

Despite its merits, the current study has several limitations. First, there are limitations on using the motor and social-emotional subscales with children aged 24-35 months; the possible reasons need to be explored further. Second, there was an issue with the collection of concurrent gold-standard measures of child development with which to determine the CREDI's concurrent validity. Concurrent validity with direct observation, that is, using the Bayley-III, was tested for only just over two hundred children. Because of the small sample, which was not representative, the conclusions of the current study cannot be generalized to the whole of Shaanxi Province or China. The focus on a single geographic context for the sample also limits the generalizability of these results. In addition, measurement invariance was not assessed at this stage. Future studies should include samples from geographically, linguistically, developmentally, and culturally diverse contexts in China. Third, inter-rater reliability was not assessed for the study coordinators who administered the CREDI. Although the two local study coordinators who administered the items of the screening tool by verbal interview were fluent in Mandarin, possible communication issues or varying levels of comprehension should be considered, particularly given the wide range in educational backgrounds of the caregivers. The lack of test-retest reliability was also a limitation of the study and should be addressed in future studies.
Finally, the lack of a "gold standard" metric against which to compare the social-emotional items limits our understanding of their concurrent validity in the current study. In addition, children aged 0-5 months were not included in the study, so the reliability and validity of the scale could not be verified or analyzed for that age group.

Conclusion
Providing high-quality ECD services in low- and middle-income countries would require joint efforts from all sectors, effective management, adequate funding, an ample workforce, community and parental collaboration, reliable data systems, continuous monitoring, and evaluation and improvement cycles [28]. It can be concluded from the current study that the CREDI is a feasible low-cost instrument for use in large-scale data collection for early developmental intervention. In China, owing to the long-term urban-rural dual economic system and the economic development gap, significant urban-rural differences in early childhood development have been found [18]. As a feasible population-level measurement of ECD, the CREDI long form can help improve ECD outcomes and reduce developmental inequality in China by informing national and regional policies and resource allocation. However, it should be noted that the CREDI long form still lacks the ability to provide information about individual children. It may also not be sensitive enough to detect smaller changes attributable to intervention. The CREDI team also pointed out that, in spite of the value of the CREDI long form in intervention evaluation, it should be paired with a more detailed and domain-focused measure whenever possible (please refer to the CREDI User Guide). In addition, given that the CREDI is a caregiver-reported scale, using a direct assessment (such as the Bayley-III) to triangulate measurement is useful for addressing the potential weaknesses of one approach versus another. Therefore, much more work needs to be done so that the instrument can be used effectively for population-level monitoring and research purposes. The use of the CREDI long form to assess intervention effects in China should also be evaluated in future studies.