The Early Development Instrument: an evaluation of its five domains using Rasch analysis

Background Early childhood development is a multifaceted construct encompassing physical, social, emotional and intellectual competencies. The Early Development Instrument (EDI) is a population-level measure of five domains of early childhood development on which extensive psychometric testing has been conducted using traditional methods. This study builds on previous psychometric analysis by providing the first large-scale Rasch analysis of the EDI. The aim of the study was to perform a definitive analysis of the psychometric properties of the EDI domains within the Rasch paradigm. Methods Data from a large EDI study conducted in a major Irish urban centre were used for the analysis. The unidimensional Rasch model was used to examine whether the EDI scales met the measurement requirement of invariance, allowing responses to be summated across items. Differential item functioning for gender was also analysed. Results Data were available for 1344 children. All scales apart from the Physical Health and Well-Being scale reliably discriminated between children of different levels of ability. However, all the scales also had some misfitting items and problems with measuring higher levels of ability. Differential item functioning for gender was particularly evident in the emotional maturity scale with almost one-third of items (9 out of 30) on this scale biased in favour of girls. Conclusion The study points to a number of areas where the EDI could be improved.


Background
Early childhood development is a key indicator of future health and well-being [1]. It is a multifaceted construct encompassing physical, social, emotional and intellectual competencies. In the early years, child development is synonymous with child health, which can be defined as the extent to which children realise their full developmental potential [2].
From a population health perspective early childhood development is both an indicator of child health outcomes and a predictor of future health problems [3]. When compared to adult health it is also very susceptible to environmental influences. It is a dynamic process which changes rapidly over time, particularly between gestation and six years of age. As a result, measurement of early childhood development has to be age-specific and multi-dimensional [4].
The majority of measures of early childhood development have been designed by psychologists or educationalists and are clinically-based diagnostic tools, with the intention of determining whether an individual child has a disability or underlying condition [5]. A potentially greater burden of risk lies with the substantially larger number of children with less pronounced developmental delay [6]. In this context, a population-level approach which can measure the developmental health of children across the spectrum is required.
The Early Development Instrument (EDI) is a population-level measure designed at the Offord Centre for Child Studies, McMaster University, Hamilton, Ontario to measure the extent to which children have attained the physical, social, emotional and cognitive maturity necessary to engage in school activities [7]. The EDI is a community or population level measure, not an individual screening or diagnostic tool. The EDI follows a population model for health improvement: small modifications of risk for large numbers are more effective at producing change than large modifications for small numbers [8]. It can be retrospective, focusing on early childhood development outcomes; or predictive, informing school and child-health programmes [7]. It is based on a broad conceptualisation of school readiness which goes beyond language and cognitive ability to include the extent to which the child has gained the developmental maturity (physically, socially and emotionally, as well as cognitively) to engage in and benefit from school activities [9]. Children who score in the lowest 10 % of the study population in one or more of the five domains of the EDI are classed as 'vulnerable'. The 10 % cut-off has been recommended because it is usually higher than clinical cut-off points and should therefore include children who may be more difficult to diagnose [10].
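The vulnerability rule described above can be sketched in a few lines. This is a minimal illustration only: the nearest-rank percentile, the treatment of scores equal to the cut-off as vulnerable, and the function names are assumptions made for the example and are not taken from the EDI scoring guide.

```python
import math

def percentile_cutoff(scores, fraction=0.10):
    """Nearest-rank cut-off: the score found at the given fraction of the sorted sample."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(fraction * len(ordered)))  # 1-based rank
    return ordered[rank - 1]

def classify_vulnerable(children, fraction=0.10):
    """children: list of dicts mapping domain name -> domain score.

    Returns one boolean per child: True if the child is at or below the
    cut-off in one or more domains (assumed tie handling).
    """
    domains = list(children[0].keys())
    cutoffs = {d: percentile_cutoff([c[d] for c in children], fraction) for d in domains}
    return [any(c[d] <= cutoffs[d] for d in domains) for c in children]
```

Because the cut-off is relative to the study population, the same absolute score may fall on either side of it in different samples.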
The EDI is an internationally recognised measure of early childhood development at school entry age [11]. It has been used in 24 countries worldwide. In Australia, where it was administered as the Australian Early Development Index (AEDI) until 2014 when it became the Australian Early Development Census (AEDC), total population coverage has been achieved. Near-total population coverage has been reached in Canada. Its utility in informing regional and national policy on early childhood care and education and in tracking changes in child development outcomes over time is well recognised [12].
Extensive psychometric testing has been completed on the EDI in Canada and Australia [7]. It has high internal consistency with Cronbach's alpha coefficients of between 0.84 and 0.96 for the five domains [9]. In the current Cork study the EDI was shown to have similar internal consistency with Cronbach's alpha coefficients of between 0.8 and 0.96 [11]. In Australia, the AEDI was implemented alongside the Longitudinal Study of Australian Children (LSAC) in a subset of the population allowing for correlation with other teacher and parental administered instruments. Results showed strong correlations between the AEDI and other teacher-rated measures. However, correlations with parent-rated measures were weak [13]. Factor analysis was conducted on data from Canada, Australia, Jamaica and Washington State with items loading on to the correct factors across all countries [14]. In a further study of 26,005 children in British Columbia, confirmatory factor analysis was used to demonstrate the unidimensionality of each domain [15]. In examining the predictive validity of the EDI to fourth grade, D'Anguilli et al. [16] found that children who were vulnerable (i.e. in the lowest 10 % of the population in one or more domains of the EDI) in the first year of education were two to four times more likely to score below expectations in Grade 4. There was a linear increase in the risk of scoring below expectations with vulnerability in additional domains. Two studies examined the performance of the EDI across diverse populations and concluded that the EDI was fair and unbiased across gender, language and aboriginal status [6,17].
There is also some evidence questioning the validity of the EDI. Although correlations between the EDI language and cognitive development domain and the Peabody Picture Vocabulary Test (PPVT) were similar across four countries, the results showed that low scores in this domain did not indicate a high probability that a child would have a language problem [14]. A further study, conducted in Canada, comparing the EDI with four directly administered tests of school readiness found significant correlations at the level of the overall instrument but not at the domain level [18].
All the psychometric tests outlined above were conducted using traditional psychometric methods based upon Classical Test Theory (CTT). Only two studies have been conducted using more modern psychometric techniques. In 2004 a Rasch analysis of the EDI was conducted prior to its adaptation for use in Australia as the AEDI. That analysis showed the EDI had generally adequate scale properties within the Rasch paradigm but had disordered thresholds on all items with five response options [19]. The EDI was subsequently adjusted to include only items with two or three response options; this was the version used in the Irish study. A subsequent Rasch analysis of the new scales was conducted in a small sample of 116 children in Sweden [20]. This study took the approach of removing misfitting items, after which all scales except physical health and well-being functioned well. However, the sample was too small to support a definitive analysis and the study should be considered exploratory [21].
This study builds on previous psychometric analysis by providing the first large-scale Rasch analysis of the current version of the EDI. Data from a large study conducted in a major Irish urban centre were used for the analysis [11].

Methods
A cross-sectional study of child development was carried out with children in their first year of formal education in 42 of the 47 primary schools in Cork City and a further five schools in an adjoining rural area in 2011. The five city schools which declined to take part in the study were representative of a cross-section of schools in the study area (one boys' school, one girls' school, one large mixed, middle income school, one designated disadvantaged school and one Irish-speaking school) and their omission would not have affected the representativeness of the demographic composition of the study.
All eligible children in the participating schools were invited to be included in the study. Eligibility criteria were: being in the latter half of the first year of formal education (i.e. having completed a minimum of 4 to 5 months of education), being known by the teacher for more than 1 month and not having left the school.
Strengthening the Reporting of Observational studies in Epidemiology (STROBE) guidelines were adhered to in developing the study and a STROBE checklist compiled.

Data collection
The EDI is a teacher-completed questionnaire based on five months' observation of the children from the date when they start school. In the current study it was administered in the latter half of the first year of formal education. The teachers in this study were given a short period of training on the administration of the EDI and were each issued with an EDI guide book. Children were not present when the questionnaire was completed and no individual identifiers were recorded. Each child was assigned a unique identifier which was used on the questionnaire.

Ethical considerations
Passive consent was used in line with previous EDI studies in Canada. A total of seven parents opted not to participate. Ethical approval was granted by the Clinical Research Ethics Committee of the Cork Teaching Hospitals, by whom the opt-out consent mechanism was reviewed and approved.
The physical health and well-being scale has 13 items. Seven items have two response options, scored 0 and 1, and six items have three response options, scored 0, 1 and 2. The social competence scale has 26 items, the emotional maturity scale has 30 items and the communication and general knowledge scale has 8 items. All items on these three scales have three response options, scored 0, 1 and 2. The language and cognitive development scale has 26 items, all of which have two response options, scored 0 and 1. Lower scores on all items for all scales represent lower levels of the latent trait being measured.

Analysis
The Rasch model
The Rasch model takes its name from the Danish mathematician Georg Rasch and refers to a group of statistical techniques used as a mathematical approach to assessing measurement scales [22]. The model assumes that the probability of a person responding in a certain way to an item on a psychometric scale is a logistic function of the difference between that person's ability and the individual item's difficulty [23].
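For a dichotomous item (scored 0 or 1), the logistic relationship described above can be written as a short function. This is an illustrative sketch; the function and parameter names are chosen for the example, with ability and difficulty both expressed in logits.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a positive response (score 1) under the dichotomous Rasch model.

    Both arguments are on the logit scale; only their difference matters.
    """
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))
```

When ability equals difficulty the probability is 0.5, and it rises towards 1 as ability exceeds difficulty.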
Rasch theory is based on the assumption that some items are harder and require more of the underlying trait than others and that some people have more of the latent trait than others and therefore have a greater probability of responding positively to the more difficult items. Furthermore, items conform to a Guttman structure whereby they are ordered in terms of difficulty on a continuum. In other words, if a child has a certain level of developmental ability it is assumed that they ought to score positively on all items whose difficulty is below that level [24].
A key underlying demand of the Rasch model is invariance [25]. This means that the relative location of any two persons on the scale is independent of the items used and, conversely, the relative location of any two items on the continuum is independent of the persons on whom they are measured. The item and person locations are estimated separately but on the same scale. The separation of items and persons is a key advantage of Rasch modelling over CTT as it allows for generalisation across samples and items. Rasch modelling also provides a range of unique tools for testing the extent to which items and persons produce data that fit the Rasch model [25].
The EDI was not designed for use at the individual level but is used to detect change at the level of the school or the community. However, regardless of the purpose to which a tool is put it has to adhere to scientific measurement properties. The EDI can therefore benefit from Rasch analysis in that the extent to which each of the five scales meets the basic measurement properties outlined above can be examined. In particular, invariance, consistency of the interval levels and the hierarchy of competencies can be determined.

Data analysis
The data were analysed with the unidimensional Rasch model using RUMM2030 software [26]. The Rasch model was used to examine whether the EDI scales met the measurement requirements of invariance, allowing responses to be summated across items. In order to allow different numbers of categories and different threshold values across items the unconstrained (partial credit) Rasch model was applied.
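As an illustration of the partial credit parameterisation, the sketch below computes the category probabilities for a single item with its own set of thresholds. It follows the standard textbook form of the model and is not a reproduction of RUMM2030's estimation routine; the function name and argument layout are assumptions made for the example.

```python
import math

def pcm_category_probabilities(ability, thresholds):
    """Category probabilities for one item under the partial credit model.

    thresholds: list of m threshold locations (logits) for an item scored 0..m.
    Each item may have its own number of thresholds and its own values.
    """
    # Numerator exponent for score x is the sum of (ability - threshold_k), k = 1..x;
    # score 0 has an exponent of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (ability - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]
```

With a single threshold the function reduces to the dichotomous Rasch model, so the two-option and three-option EDI items can be handled by the same expression.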
Three aspects of the EDI were analysed: scale to sample targeting; overall scale fit to the Rasch model; and the extent to which individual items satisfied Rasch criteria.

Scale to sample targeting
Person-item threshold distributions were examined to explore the relationship between the difficulty level of the items in each scale and the ability levels of those taking the test. These histograms, using the convention of Rasch analysis, are always centred at zero logits for the item location scale. Perfect targeting requires the item and person location means to both be zero.

Overall scale fit to the Rasch model
A number of tests were used to examine the extent to which each scale conformed to the Rasch model. Standardised mean and standard deviation (SD) values for item and person fit residuals are a way of representing the fit of both item and person data to the Rasch model. A mean value of zero with a SD of 1.0 would represent perfect fit (values less than 1.4 are considered acceptable for the SD). A further test examines the extent to which the hierarchical order of difficulty for items varies across class intervals of the measurement continuum. This is examined using a Chi-square statistic. A statistically significant Chi-square value (having performed a Bonferroni adjustment at the 0.05 probability level) indicates a problematic interaction between items and the latent trait being measured. A final test, known as the Person Separation Index (PSI) examines the extent to which the scale reliably discriminates between persons of different ability. The PSI can be produced with or without extreme values so that the extent of floor and ceiling effects on reliability can be examined. For scales which are intended to be used at the group level, a minimum PSI value of 0.7 is recommended.
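A minimal sketch of the PSI is given below, using the commonly cited form in which the error variance of the person estimates is subtracted from their observed variance. The formula is not stated in this paper and the exact computation in RUMM2030 may differ, so the function should be read as an assumption-laden illustration.

```python
from statistics import mean, variance

def person_separation_index(locations, standard_errors):
    """Share of the observed variance in person locations that is not measurement error.

    locations: estimated person locations (logits).
    standard_errors: standard error of each person's estimate, in the same order.
    """
    observed_variance = variance(locations)
    error_variance = mean([se ** 2 for se in standard_errors])
    return (observed_variance - error_variance) / observed_variance

def adequate_for_group_use(psi, minimum=0.7):
    """Apply the group-level threshold of 0.7 quoted in the text."""
    return psi >= minimum
```

Running the function once on all persons and once with extreme scorers removed gives the two versions of the PSI referred to above.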

Analysis of individual items
Threshold ordering
One of the requirements of the Rasch model is 'category ordering'. This means that the hierarchical order of response options for particular items should accord with the latent variable in question. In other words, persons with higher levels of overall ability on a particular trait should be more likely than persons with lower ability to endorse item response options that are meant to capture higher levels of ability.
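In practice this requirement is checked by confirming that the estimated thresholds of an item increase from one response category to the next. The helper below is a hypothetical illustration of that check and is not part of any Rasch software.

```python
def thresholds_are_ordered(thresholds):
    """Return True if each threshold is located above the one before it."""
    return all(lower < upper for lower, upper in zip(thresholds, thresholds[1:]))
```

For a three-option item, thresholds of -1.0 and 0.5 are ordered, whereas 0.8 followed by -0.2 would be reported as disordered. An item with two response options has a single threshold and is ordered by definition.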

Item location
The location indicates the place on the continuum of difficulty where each item is located. Location is measured on the logit scale and lower scores represent lower levels of difficulty. The fit residuals provide an estimate of the extent to which the variance associated with each item is in accord with the Rasch model. The residuals shown are standardised and values between −2.5 and +2.5 demonstrate adequate fit. A test of item-trait interaction is also available. As with the test of overall scale fit, the Chi-square test is used to analyse whether items perform consistently across the continuum of difficulty. The test is Bonferroni adjusted at the 0.05 level and statistically significant values indicate problematic item-trait interaction.

Local response dependency
The Rasch model demands that responses to items on the same scale must be independent, that is, not conditional upon each other. For example, an item about spelling ability would be dependent on an item measuring ability to read, implying that one of the items is redundant. Response dependency can be detected by examining the residual correlation between items after extraction of the Rasch model. Inter-item correlations greater than 0.4 are a strong signal for local response dependency.
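The screening step can be sketched as follows, assuming the standardised residuals for each item have already been exported from the Rasch software. The data layout (a dictionary of per-item residual lists in the same person order) and the function names are assumptions made for the example, and items with constant residuals are not handled.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def dependent_pairs(residuals, threshold=0.4):
    """Flag item pairs whose residual correlation exceeds the threshold.

    residuals: dict mapping item label -> list of standardised residuals,
    one value per person, in the same person order for every item.
    """
    flagged = []
    for item_a, item_b in combinations(residuals, 2):
        r = pearson(residuals[item_a], residuals[item_b])
        if r > threshold:
            flagged.append((item_a, item_b, r))
    return flagged
```

Each flagged pair then needs to be judged on content, since a high residual correlation indicates possible redundancy but does not say which item should be retained.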

Differential item functioning
One of the advantages of Rasch modelling is the possibility of detecting Differential Item Functioning (DIF). DIF occurs when different groups respond differently to an item despite having the same levels of the overall trait being measured. For example, if boys were to consistently score higher than girls on a particular item in an intelligence test, despite there being no gender differences in overall intelligence as measured by the scale, then DIF would be present in that item.
Every item was examined for DIF between male and female children in the sample. DIF was explored in RUMM through an analysis of variance (ANOVA) of the standardized response residuals for each item between genders. A Bonferroni adjusted p-value was then used to determine statistical significance. Item characteristic curves were examined to determine the direction of bias introduced in items where significant DIF was detected.
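The Bonferroni adjustment divides the nominal significance level by the number of tests performed. The adjusted thresholds reported in the results (0.001282 for the 13-item physical health and well-being scale and 0.000641 for the 26-item scales) are numerically consistent with 0.05 divided by three times the number of items. The factor of three is an inference from those figures and is not stated in the text.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni adjustment."""
    return alpha / n_tests

# Assumption: three tests per item, inferred from the reported thresholds.
TESTS_PER_ITEM = 3

physical = bonferroni_threshold(0.05, TESTS_PER_ITEM * 13)    # 13-item scale
twenty_six = bonferroni_threshold(0.05, TESTS_PER_ITEM * 26)  # 26-item scales
print(round(physical, 6), round(twenty_six, 6))
```

An item is then treated as showing significant DIF only if its p value falls below the adjusted threshold for its scale.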

Descriptive statistics
Data were available for 1344 children. Descriptive statistics for each scale are shown in Table 1. The mean and standard deviation (SD) for each scale are provided only for subjects with complete data on that scale (i.e. there has been no imputation). There was a strong positive skew on all five scales. There was also a marked ceiling effect on some scales with large numbers of children achieving the maximum possible score. This was most apparent for the communication skills and general knowledge scale where 34 % of children with complete items achieved the maximum score. The ceiling effect was least apparent for the emotional maturity scale (6 % of children with complete items achieved the maximum score).

Scale to sample targeting
For some scales the person-item histograms demonstrate a poor match between the difficulty levels of the items and the ability levels of those taking the test. In Fig. 1, the mean person location is 2.7 (SD = 1.5) for the physical health and well-being scale. The difficulty range for item locations (−1.63 to 1.23) is inconsistent with the ability range observed in the sample (−1.78 to 4.39). This implies that there is higher ability in the sample than the difficulty levels measured by the items on the physical health and well-being scale and suggests that additional items at the higher levels of difficulty are required.
The social competence scale also demonstrates a mismatch between persons and items. The mean person location on the logit scale is 2.7 (SD = 2.0) and the difficulty range for item locations (−1.50 to 1.26) is inconsistent with the ability range observed in the sample (−3.72 to 5.47). This suggests a need for additional items at both the lower and higher ranges of difficulty.
In Fig. 2, the emotional maturity scale demonstrates a better match between sample and items. The highest levels of ability are still not addressed by the item set but this covers a smaller group of children. The mean person location is 1.6 on the logit scale (SD = 1.5) and the difficulty range for item locations (−1.27 to 1.99) is a better match with the ability range observed in the sample (−2.52 to 5.27).
Items on the language and cognitive development scale cover a very wide range of difficulty. The mean person location on the logit scale is 3.3 (SD = 2.1) and the difficulty range for item locations (−3.86 to 4.86) is a good match with the ability range observed in the sample (−4.99 to 5.86) but is still not enough to cover the highest levels of ability in the sample.
There is a poor match between persons and items on the communication and general knowledge scale. The mean person location on the logit scale is 1.9 (SD = 2.5) and the difficulty range for item locations (−1.11 to 1.03) is a poor match with the ability range observed in the sample (−4.46 to 4.39).
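The ranges reported above can be compared directly to show how much of the observed ability range lies outside the item difficulty range on each scale. The sketch below uses only the figures quoted in this section; it compares ranges and does not reproduce the person-item histograms, so it says nothing about how many children fall in the uncovered regions.

```python
# (lowest, highest) locations in logits, as reported in the text.
SCALES = {
    "physical health and well-being": {"items": (-1.63, 1.23), "persons": (-1.78, 4.39)},
    "social competence": {"items": (-1.50, 1.26), "persons": (-3.72, 5.47)},
    "emotional maturity": {"items": (-1.27, 1.99), "persons": (-2.52, 5.27)},
    "language and cognitive development": {"items": (-3.86, 4.86), "persons": (-4.99, 5.86)},
    "communication and general knowledge": {"items": (-1.11, 1.03), "persons": (-4.46, 4.39)},
}

def uncovered_range(items, persons):
    """Logits of observed ability lying below and above the item difficulty range."""
    below = max(0.0, items[0] - persons[0])
    above = max(0.0, persons[1] - items[1])
    return round(below, 2), round(above, 2)

for name, ranges in SCALES.items():
    below, above = uncovered_range(ranges["items"], ranges["persons"])
    print(f"{name}: {below} logits below, {above} logits above the item range")
```

On every scale the observed ability range extends above the most difficult item, which is in line with the ceiling effects noted in the descriptive statistics.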

Overall fit to the Rasch model
Table 2 displays summary Rasch model statistics for the five scales. These give an overall analysis of the extent to which the EDI successfully measures the sample according to the Rasch model paradigm.
All five EDI scales demonstrate problematic fit to the Rasch model. For all scales, item residual standard deviations are larger than 1.4 and there is evidence of statistically significant item-trait interaction, signalling some room for improvement in the content of each scale. On the other hand, all scales apart from physical health and well-being demonstrate an ability to reliably discriminate between persons of different ability as measured by the PSI.
In a separate analysis it is possible to identify the number of persons within the sample who fit the Rasch model. This gives a sense of the extent to which each scale has adequately measured the sample. The physical health and well-being scale performed very poorly on this metric with 452 persons (33.6 %) providing extreme standardised person-fit residuals (defined as outside the +/−2.

Local response dependency
Only one instance of local response dependency was observed for the physical health and well-being scale, between item 8 ('proficiency with pen') and item 9 ('manipulate objects'). The items are very close conceptually and have an intuitive causal relationship.
Four instances of local response dependency were observed for the social competence scale. These were items 1 and 2 ('overall social/emotional development' and 'get along with peers'), items 3 and 4 ('plays and works with others' and 'plays with various children'), items 9 and 10 ('respect for adults' and 'respect for children') and items 14 and 15 ('completes work on time' and 'works independently').
Twenty-three item-pairs demonstrated local response dependency on the emotional maturity scale, which suggests a problem with many item relationships.
There was only one instance of local response dependency in the language and cognitive development scale. This was between item 2 ('interested in books') and item 3 ('interested in reading'). The items are very close conceptually and have an intuitive causal relationship.
There were no instances of local response dependency on the communication skills and general knowledge scale.

Differential item functioning
DIF for gender is evident for two items on the physical health and well-being scale. Item 3 ('late'; F = 18.03) and item 9 ('manipulates objects'; F = 12.28) displayed significant DIF by gender (Bonferroni adjusted p values <0.001282). Analysis of the item characteristic curves revealed that at equivalent levels of physical health and well-being boys were more likely than expected to be rated positively on item 3 (i.e. to not be late), whereas girls were more likely than expected to be rated positively on item 9 (i.e. to be able to manipulate objects).
DIF for gender on the social competence scale is outlined in Fig. 3. Item 4 ('play with various children'; F = 13.65), item 7 ('self-control'; F = 14.17) and item 18 ('curious about world'; F = 16.24) displayed significant DIF by gender (Bonferroni adjusted p values <0.000641). At equivalent levels of social competence boys were more likely than expected to be rated as able to play with various children, girls were more likely than expected to be rated as having self-control, and boys were more likely than expected to be rated as being curious about the world.
On the emotional maturity scale, most of the item bias favoured girls. At equivalent levels of emotional maturity, girls were more likely than boys to be rated as likely to help someone hurt, comfort a crying child, avoid physical fights, not kick/bite/hit, not be restless, not fidget, be obedient, not be impulsive, and to be able to settle. On two items (likely to pick up objects and likely to not be shy) the direction of bias favoured boys.
DIF for gender was evident for only one item on the language and cognitive scale. Item 23 ('recognise 1-10'; F = 13.50) showed significant DIF by gender (Bonferroni adjusted p value <0.000641). At equivalent levels of language and cognitive development boys were more likely than expected to be rated as able to recognise numbers between 1 and 10. No significant DIF by gender was present for any item on the communication skills and general knowledge scale.

Summary of findings in relation to each scale
The findings in relation to each scale can be summarised as follows:

Physical health and well being (13 items)
The scale did not discriminate well between children of differing ability and showed evidence of item-trait interaction. In total 33.6 % of children showed extreme person fit residuals. There was a mismatch between ability and item difficulty with additional items needed at the upper end of the scale. One item showed disordered thresholds. Seven items had extreme fit residuals and seven showed item-trait interaction. One local response dependency between items was observed. Two items displayed DIF by gender with one showing item bias favouring girls and the other favouring boys.

Social competence (26 items)
The social competence scale reliably discriminated between children of different abilities. However, there was evidence of item-trait interaction at the scale level and 17.9 % of children showed extreme fit residuals. There were similar levels of person-item mismatch to the physical health and well-being scale. Fourteen items had extreme fit residuals and ten showed item-trait interaction. Four instances of local response dependency between items were observed. Three items displayed DIF by gender with two showing item bias favouring boys and one favouring girls.

Emotional maturity (30 items)
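The emotional maturity scale reliably discriminated between children of differing abilities but showed item-trait interaction at the scale level. Twenty-three instances of local response dependency between items were observed. Eleven items displayed DIF by gender with nine showing item bias favouring girls and two favouring boys.

Language and cognitive development (26 items)
Some items showed item-trait interaction. One instance of local response dependency between items was observed and one item displayed DIF by gender with the bias favouring boys.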

Communication skills and general knowledge (8 items)
The communication skills and general knowledge scale discriminated between children of differing ability, but did show item-trait interaction. The percentage of children with extreme fit residuals was 34.5 %. The ceiling effect, which was apparent across all scales, was most marked for this domain. Six items demonstrated extreme fit residuals and six showed item-trait interaction. There was no instance of local response dependency between items and no DIF by gender.

Discussion
This paper used Rasch analysis to explore the psychometric properties of the five domains of the EDI in a sample of 1344 children in Ireland. The aim of the study was to determine the psychometric properties of the EDI within the Rasch paradigm. Every scale demonstrated some elements which are of concern. However, the Rasch criteria are very demanding and they have to be taken as a whole (Pallant and Tennant [22]). No one criterion is disqualifying.
All scales had an inadequate number of items for measuring ability at the higher levels with a marked ceiling effect. Similar patterns were observed in the Australian and Swedish Rasch analyses of the EDI [19,20]. In the Australian study, Andrich and Styles [19] took the view that, as the instrument was developed for the explicit purpose of identifying children at risk (at the lower end of the spectrum), it was not necessary to discriminate between children who were performing above this level. However, the ceiling effects observed in this study create three important problems that persist regardless of the focus of the instrument. First, it has implications for the use of an arbitrary cut-off point of 10 %. If the domain in question has a large ceiling effect it implies that children with high absolute scores may be classified as relatively 'at risk'. In other words, the standard for what constitutes 'at risk' becomes higher and there is the danger that children who would be considered within the normal spectrum of development on other measures are classified as at risk on the EDI. The EDI would eventually become synonymous with overdiagnosis in such a scenario. One way to address this problem is to use Rasch modelling to set the cut points so that they take account of both the distribution of scores and the hierarchy of competencies. Second, the ceiling effect is problematic for studies that aim to use the EDI to compare populations as it will lead to an underestimate of the difference between geographical areas with high and low levels of developmental deprivation. Third, the EDI is used extensively to measure changes over time resulting from early childhood interventions. It is essential, therefore, that the full range of possible improvements at the domain level can be detected. The concept of healthy child development, which underpins the EDI, needs to be fully articulated at all levels of ability. Hobart et al. [27] outline the need for a bottom-up approach to instrument development which would begin with a construct theory onto which items would be mapped using both qualitative and quantitative methods. This approach could serve well as a detailed evaluation of the EDI.
The DIF for gender, which is particularly evident in the emotional maturity scale, also needs attention. For the most part, DIF for gender is not unexpected and can achieve a balance between items that favour girls and boys. However, in this instance, almost one-third of the items on this scale (9 out of 30) are biased in favour of girls, meaning that despite having the same overall levels of emotional maturity as boys, girls score better than expected on these items. Gendered differences in emotional and social expression are evident from an early age [28] and need to be addressed in the context of the measurement of early childhood development.
The nine items which were biased in favour of girls were primarily associated with pro-social behaviours and inattentive behaviours. Girls were more likely than boys (at the same level of emotional maturity) to be rated as likely to help someone hurt, comfort a crying child, avoid physical fights, not kick/bite/hit, not be restless, not fidget, be obedient, not be impulsive, and to be able to settle. This may be indicative that there are certain areas where boys and girls express their emotional immaturity in different ways and that the EDI is picking this up. However, it may reflect gender pre-conceptions among teachers. Further qualitative research is needed to explore this.
The emotional maturity scale requires attention, particularly at the level of the individual items. It is the longest scale consisting of 30 items. In addition to DIF, 23 pairs of local response dependency were observed. Item 5 (comforts a crying child), item 3 (helps someone hurt), item 4 (helps other children) and item 8 (helps sick children) all interact with each other. Moreover, items 3 and 5 showed gender DIF favouring girls. All of these items are indicators of helping behaviour. Another group of items which show a marked degree of response dependency are item 15 (restless), item 16 (distractible), item 17 (fidgets), item 20 (impulsive) and item 22 (can't settle). Again, items 15, 17, 20 and 22 showed DIF favouring girls. These are two instances where the instrument may benefit from qualitative work with teachers and others in the field of education with a view to item reduction.
In order to improve the EDI scales, a range of options needs to be considered. First, qualitative work to explore how various items are rated would be useful. This would deepen our understanding of issues such as the causes of the high level of DIF displayed by the emotional maturity domain. Second, problematic items could be deleted to determine whether the EDI scales can be made to fit the Rasch model better. This quantitative approach should only be performed in conjunction with qualitative research, however, as it is just as important to understand the source of misfit as it is to eliminate it. Third, qualitative and quantitative research to produce additional items to fill obvious gaps would be particularly useful for the higher levels of ability on all the scales.
The findings highlight the value of Rasch analysis in the psychometric evaluation of rating scales. The EDI had demonstrated sound psychometric properties when evaluated using traditional psychometric tests. However, traditional methods are concerned with total scores on scales. As a result, poorly functioning individual items can remain undetected [25]. This study has allowed a detailed examination of the items which make up the five scales of the EDI.

Limitations
The Rasch analysis outlined above is the first step in a process of refining the EDI for use in the Irish context. It did not involve any adjustment to the instrument. Further qualitative and quantitative research will be required to test the impact of removing or adding items to the scales.
The authors approached the implementation of the EDI in Ireland from a population-health perspective, motivated by the need for an instrument which could identify populations or communities of children at risk, thereby informing policy and services supporting early childhood development. In this context it was essential that we examine the psychometric properties of the EDI. We have identified a number of areas of concern but will not make adjustments to the instrument without detailed consultation with specialists in early education, and particularly with Professor Janus of the Offord Centre, who developed the instrument and has been involved with its international adaptation. This level of work was beyond the scope of this study.

Conclusion
The study points to a number of problems with the EDI which should be addressed in further research. If the EDI is to be implemented at a national level in Ireland, it would benefit from further refinement which could in turn inform the international implementation of the EDI.
Abbreviations
AEDC: Australian early development census; AEDI: Australian early development index; ANOVA: analysis of variance; CSGK: communication skills and general knowledge; CTT: classical test theory; DIF: differential item functioning; EDI: early development instrument; EM: emotional maturity; LCD: language and cognitive development; LSAC: longitudinal study of Australian children; PHWB: physical health and well-being; PPVT: Peabody picture vocabulary test; PSI: person separation index; SC: social competence; SD: standard deviation; STROBE: strengthening the reporting of observational studies in epidemiology.

Competing interests
The authors do not have any competing interests, financial or otherwise, with regard to this manuscript.

Authors' contributions
MC is the primary author. She conducted the research, worked on the data analysis and produced the manuscript. JB conducted the Rasch analysis and worked on the preparation of the data tables and the manuscript. AS advised on the research design and methods of analysis. He also contributed to the manuscript. IP was the instigator of the research. He oversaw the process and contributed to the design, methodology and production of the manuscript and data tables. All authors have read and approved the final manuscript.