Comparing very low birth weight versus very low gestation cohort methods for outcome analysis of high risk preterm infants

Background Compared to very low gestational age (<32 weeks, VLGA) cohorts, very low birth weight (<1500 g; VLBW) cohorts are more prone to selection bias toward small-for-gestational age (SGA) infants, which may affect the validity of data for benchmarking purposes. Method Data from all VLGA or VLBW infants admitted to the three networks between 2008 and 2011 were used. Two-thirds of each network cohort was randomly selected to develop prediction models for mortality and composite adverse outcome (CAO: mortality or cerebral injuries, chronic lung disease, severe retinopathy or necrotizing enterocolitis) and the remaining third was used for internal validation. Areas under the ROC curves (AUC) of the models were compared. Results The VLBW cohort (24,335 infants) had more than twice the proportion of SGA infants (20.4% vs. 9.3%) of the VLGA cohort (29,180 infants) and had a higher rate of CAO (36.5% vs. 32.6%). The two models had equal prediction power for mortality and CAO (AUC 0.83), and similarly for all other cross-cohort validations (AUC 0.81–0.85). Neither model performed well for the extremes of birth weight for gestation (<1500 g and ≥32 weeks, AUC 0.50–0.65; ≥1500 g and <32 weeks, AUC 0.60–0.62). Conclusion There was no difference in prediction power for adverse outcome between cohorting by VLGA or by VLBW, despite substantial bias toward SGA infants in the VLBW cohort. Either cohorting practice is suitable for international benchmarking. Electronic supplementary material The online version of this article (doi:10.1186/s12887-017-0921-x) contains supplementary material, which is available to authorized users.


Background
Very premature and very low birth weight (VLBW) infants are at high risk of mortality and morbidities. Effective outcome prediction and benchmarking, for parental counseling, quality improvement and informing the wider community, have their foundation in the outcome statistics of infant cohorts [1]. There are two established methods of cohorting high-risk infants, by birth weight (for example, VLBW, <1500 g) or by gestational age (for example, very low gestational age [VLGA], <32 weeks), with the relative advantages of each yet to be determined. There has been increasing acceptance of gestational age (GA) based cohorting in recent literature [2][3][4][5], following studies such as that by Arnold et al. in 1991 [6] and Blair et al. in 1996 [6, 7], which raised concerns that VLBW cohorts may be inherently biased.
Birth weight (BW) is dependent on two separate influences: GA at birth and fetal growth rate [8]. It follows that a VLBW cohort may contain infants at any point along a spectrum from very preterm and sized appropriately for their GA (AGA) to small for gestational age (SGA). There is an inherent selection bias toward SGA infants in VLBW cohorts, which becomes more pronounced at higher gestations, as the birth weights of AGA infants become greater than 1500 g [6,7,9]. This disproportionate SGA percentage is exemplified in published studies that have used VLBW cohorts, wherein 19% to 40% of infants were SGA [10][11][12][13][14]. A skewing of risk toward poorer outcome would be expected even in multivariate analyses because high-risk SGA infants lack an equivalent AGA control for adjustment within the cohort. In comparison, GA is independent of BW and fetal growth rate [7], and hence fetal growth and BW for GA show a normal distribution in VLGA cohorts [6]. The SGA proportion in a VLGA cohort will, by definition, remain close to 10%; in published VLGA cohorts it ranged from 9.2% to 12% [15,16].
Meaningful international examination of neonatal outcomes is currently limited by the variations in reporting between nations [17], as direct comparison of neonatal outcomes through benchmarking requires prior standardization of the infant cohorting method used for data collection and reporting. The World Health Organization changed its standard cohorting practice to GA-based in 1961 [18], but some studies and analyses persist in the use of BW criteria [1][19][20][21].
The overall aim of this study was to evaluate and compare the predictive power of prediction models developed using VLGA and VLBW-based cohorts. It was hypothesized that predictive power of the VLGA-based models would be significantly better than that of the VLBW-based models across all networks because it would reduce the selection bias introduced by the disproportionately high number of SGA infants.

Methods
De-identified clinical data were obtained from the Australian and New Zealand Neonatal Network (ANZNN), Canadian Neonatal Network (CNN) and Swedish Neonatal Quality Register (SNQ) for all infants born either at <32 weeks gestational age or with birth weights <1500 g, who were admitted to participating NICUs between January 2008 and December 2011. Networks were selected because of their intrinsic similarity, with comparable demographics and healthcare systems. All three networks register admitted infants for data collection if they are either <32 weeks or <1500 g. Infants were excluded if they were moribund (died within the first day of admission without being offered mechanical ventilation or intensive care) or had major congenital anomalies.
The parameters for data collection in each of the network databases were compared. Definitions of outcomes and variables to be analyzed were standardized by consensus a priori. National preterm BW percentiles were examined for each network and found to be very similar in Australia [22] and Canada [23]; however, in-utero growth charts were used in the SNQ [24], and therefore the Swedish percentiles were not comparable. For this reason, Canadian BW percentile charts were applied to all infants to define SGA and BW z score.
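The step of deriving a BW z score and an SGA flag from a reference chart can be sketched as follows. This is a minimal illustration only: it assumes that BW for GA is normally distributed within each reference stratum and that SGA means a BW below the 10th percentile. The reference mean and standard deviation shown are hypothetical placeholders and are not taken from the Canadian charts; the function names are likewise illustrative.

```python
from statistics import NormalDist

# z value of the 10th percentile of a standard normal distribution (about -1.28).
# Assumption: SGA is defined as BW below the 10th percentile for GA.
SGA_Z_CUTOFF = NormalDist().inv_cdf(0.10)


def bw_z_score(bw_g, ref_mean_g, ref_sd_g):
    """Standardize a birth weight against a reference mean and SD (grams)."""
    return (bw_g - ref_mean_g) / ref_sd_g


def is_sga(bw_g, ref_mean_g, ref_sd_g):
    """True if the birth weight falls below the 10th percentile of the reference."""
    return bw_z_score(bw_g, ref_mean_g, ref_sd_g) < SGA_Z_CUTOFF


# Hypothetical reference values for a single GA/sex stratum (illustrative only).
REF_MEAN_G, REF_SD_G = 1200.0, 200.0

if __name__ == "__main__":
    for bw in (900, 1000, 1300):
        z = bw_z_score(bw, REF_MEAN_G, REF_SD_G)
        print(bw, round(z, 2), is_sga(bw, REF_MEAN_G, REF_SD_G))
```

Applying one chart to all three networks, as described above, corresponds to using the same reference lookup for every infant regardless of originating network.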
For the study period, ANZNN data comprised all 29 tertiary hospitals in Australia/New Zealand; CNN comprised 28 of 30 tertiary hospitals in Canada; and SNQ all 25 hospitals with neonatal units in 6 of the 7 health care regions of Sweden. Study data were available through the iNeo (International Network for Evaluating Outcomes in Neonates) project housed at Mother-Infant Care Research Center, Mount Sinai Hospital, University of Toronto, Canada.
The primary outcome studied was in-hospital mortality. The secondary outcome was composite adverse outcome (CAO), defined as in-hospital mortality or a pre-discharge diagnosis of any of the following major neonatal morbidities: chronic lung disease (CLD); serious neurological injuries (SNI), including intraventricular hemorrhage grade III or IV [25] or periventricular leukomalacia; severe retinopathy of prematurity (ROP) of stage 3 or more [26]; and radiologically or pathologically proven necrotizing enterocolitis (NEC) [27]. Consensus outcome definitions are provided in Additional file 1: Table S1. Nosocomial infection (NI) was not included in CAO but was included in descriptive analyses, as the rate of NI may be used as a marker of patient safety and healthcare effectiveness and outcomes, and hence has relevance for comparison between international cohorts [28].
Data from all networks were amalgamated and formed into two overlapping cohorts of infants less than 32 weeks (VLGA) and/or less than 1500 g (VLBW). Originating network was added as a covariate for subsequent analysis. From each of the two overlapping cohorts, two-thirds of infants (balanced for network) were randomly selected, using a split sample method [29], to form the derivation samples for development of two prediction models. The remaining one-third of infants from each cohort formed the internal validation samples, for assessment of predictive power on independent samples. Prediction models were developed for mortality and CAO by multivariable logistic regression with a backward selection procedure using an exclusion criterion of 0.05, according to methodology validated in previous population studies [30,31]. The interaction of BW z-score and GA was also included as a covariate in multivariable analysis to adjust for the varying confounding effects of growth status and maturity.
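The split-sample step and the interaction covariate can be illustrated with a short sketch. It assumes each infant is a dictionary with hypothetical keys network, ga_weeks and bw_z; these names, the seed and the helper functions are illustrative and do not reproduce the SAS or R code used in the study. The logistic regression fit and variable selection are not shown.

```python
import random
from collections import defaultdict


def split_sample(records, key="network", frac=2 / 3, seed=0):
    """Randomly split records into derivation and validation samples,
    drawing the same fraction from every network (balanced split)."""
    rng = random.Random(seed)
    by_network = defaultdict(list)
    for rec in records:
        by_network[rec[key]].append(rec)

    derivation, validation = [], []
    for network in sorted(by_network):
        group = by_network[network][:]
        rng.shuffle(group)
        n_dev = round(len(group) * frac)
        derivation.extend(group[:n_dev])
        validation.extend(group[n_dev:])
    return derivation, validation


def add_interaction(rec):
    """Return a copy of the record with the GA x BW z-score product term added."""
    return {**rec, "ga_x_bwz": rec["ga_weeks"] * rec["bw_z"]}


if __name__ == "__main__":
    # Toy records; the values carry no clinical meaning.
    toy = [{"network": n, "ga_weeks": 28.0, "bw_z": -0.5}
           for n in ["ANZNN"] * 15 + ["CNN"] * 9 + ["SNQ"] * 6]
    dev, val = split_sample(toy)
    print(len(dev), len(val))
    print(add_interaction(toy[0])["ga_x_bwz"])
```

Splitting within each network, rather than across the pooled data, keeps the network mix of the derivation and validation samples close to that of the full cohort.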
Prediction power of each model was assessed on four validation samples: the VLBW and VLGA validation samples, and two mutually exclusive "extreme" subsets, namely infants <1500 g but ≥32 weeks and infants <32 weeks but ≥1500 g. Prediction power was assessed using the area under the Receiver Operating Characteristic (ROC) curve (AUC) [32][33][34]. An AUC of >0.80 is generally accepted as excellent prediction [35]. The AUC of each prediction was compared. Goodness-of-fit was determined by use of the Hosmer-Lemeshow test [13] to test for systematic over- or underestimation of outcomes by the model [36]. Data management and analyses were performed using SAS 9.3 [37] and R 2.10.15 [38]. A two-sided significance level of 0.05 was used without adjustment for multiple comparisons.
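The two performance measures named above can be sketched in plain Python. This is an illustrative implementation under stated assumptions, not the SAS or R routines used in the study: the AUC is computed as the proportion of case/non-case pairs in which the case receives the higher predicted risk (ties counted as one half), and the Hosmer-Lemeshow statistic groups infants into equal-sized bins of predicted risk. The function names and the default of ten groups are assumptions.

```python
def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison of cases and non-cases.
    Quadratic in sample size; adequate for a sketch, not for large cohorts."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def hosmer_lemeshow(y_true, y_prob, groups=10):
    """Hosmer-Lemeshow chi-square statistic over bins of sorted predicted risk.
    Compare against a chi-square distribution with (groups - 2) degrees of freedom."""
    pairs = sorted(zip(y_prob, y_true))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        observed = sum(y for _, y in chunk)
        expected = sum(p for p, _ in chunk)
        variance = expected * (1 - expected / size)
        if variance > 0:
            stat += (observed - expected) ** 2 / variance
    return stat


if __name__ == "__main__":
    print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A large Hosmer-Lemeshow statistic indicates that observed and predicted event counts diverge within risk bins, which is the systematic over- or underestimation the test is meant to detect.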
ANZNN data collection, access and use of de-identified data for audit and research were approved by all relevant institutional research ethics committees of each NICU hospital (see list of hospitals in Acknowledgement) in Australia, and by the New Zealand Multiregional Ethics Committee for all the New Zealand hospitals listed. For the CNN and SNQ, de-identified data collection was approved at each site by either an institutional ethics board or quality improvement committee of the hospitals listed. All participating networks have obtained ethics/regulatory approval or the equivalent from their local granting agencies to allow for de-identified data to be collated. De-identified ANZNN, CNN and SNQ data were amalgamated at the iNeo collaboration centre where analysis occurred. Approval for this project was obtained from the South Eastern Sydney Local Health District Human Research Ethics Committee and approval for data transfer was obtained from all three networks' executive committees. The ethics committees waived the requirement for consent. Data from all networks were amalgamated and used for this study.
The Coordinating Centre has been granted Research Ethics Board approval for the development, compilation, and hosting of the dataset, and all 3 networks have signed data transfer agreements with the Coordinating Centre. Privacy and confidentiality of patient and unit-related data will be of prime importance to the iNeo collaboration, and data collection, handling, and transfer will be performed in accordance with the Canadian Privacy Commissioner's guidelines, the Personal Information Protection and Electronic Documents Act, and any other local rules and regulations. No data identifiable at the patient level will be collected or transmitted, and only aggregate data will be reported. For all stages of the project, participating units will be assigned a code by their own network prior to data transfer into the iNeo dataset so that units remain anonymous within the iNeo collaborative. Following data analysis, findings will be disseminated within networks by their own network coordination team and not by the iNeo central team.
Following completion of the study in 2017, the data will be kept at the iNeo Coordinating Centre for a further 2 years before being returned to the originating networks unless otherwise agreed by the member networks.

Results
The derivation of the study cohort is detailed in Fig. 1. The final study population contained 31,940 infants; 14,954 infants from the ANZNN (46.8%), 13,297 (41.6%) from the CNN and 3689 (11.5%) from the SNQ. The VLBW cohort was made up of 24,335 infants (76.2% of study population) and the VLGA cohort of 29,180 infants (91.4% of study population).
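The reported percentages and the size of the overlap between the two cohorts can be checked with simple arithmetic. The sketch below uses only the counts quoted in the paragraph above and assumes that every infant in the study population belongs to at least one of the two cohorts, as the inclusion criteria imply. Because the exclusive counts are derived from the totals, they may differ by a few infants from subset counts reported elsewhere in the paper.

```python
# Counts as reported in the Results text.
TOTAL = 31_940
BY_NETWORK = {"ANZNN": 14_954, "CNN": 13_297, "SNQ": 3_689}
VLBW = 24_335
VLGA = 29_180


def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Inclusion-exclusion: infants meeting both criteria (<1500 g and <32 weeks).
both = VLBW + VLGA - TOTAL
# Infants meeting only one criterion.
vlbw_only = TOTAL - VLGA   # <1500 g but >=32 weeks
vlga_only = TOTAL - VLBW   # <32 weeks but >=1500 g

if __name__ == "__main__":
    for name, count in BY_NETWORK.items():
        print(name, pct(count, TOTAL))
    print("VLBW", pct(VLBW, TOTAL), "VLGA", pct(VLGA, TOTAL))
    print("both", both, "VLBW only", vlbw_only, "VLGA only", vlga_only)
```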
The majority of infant characteristics were similar between the VLGA and VLBW cohorts (see Table 1), with the expected exception of SGA percentage, which was more than double in the VLBW cohort (20.4%) compared to the VLGA cohort (9.3%). For the VLGA cohort, mean GA was marginally higher for ANZNN infants (28.5 weeks) than for CNN or SNQ infants (28.3 weeks). Antenatal steroid use was significantly lower in the CNN than ANZNN and SNQ. Significantly fewer infants required exogenous surfactant in the ANZNN than CNN or SNQ. Similar disparities between networks were seen in the VLBW cohort.

VLGA and VLBW prediction models for mortality and composite adverse outcome
Mortality prediction models developed using the VLBW and VLGA derivation cohorts are shown in Table 2. AUC was similar for the VLBW-based (0.830) and VLGA-based (0.828) models, and both models had equally good discriminatory power. The CAO prediction models (Table 3) included similar variables to the mortality prediction models. Both the VLGA and VLBW-based models showed equal discrimination, with an AUC of 0.83.

Application of prediction models to validation samples
When applied to the validation samples, the predictive power of both the VLBW-based and VLGA-based models remained excellent (AUC 0.81-0.85) for prediction of mortality and CAO. Cross-comparison showed equivalent performance between the VLBW and VLGA-based models for application to both the VLBW and VLGA developmental samples (Additional file 1: Table 3). A statistically significant difference (p < 0.05) was found between the VLGA and VLBW-based CAO prediction models for the VLBW validation sample. Predictive power of the VLBW and VLGA-based models remained excellent across networks and exhibited a narrow range in AUC of 0.81 to 0.86. This demonstrated the applicability of the developed models to the three included networks. Because of the large sample sizes, differences between the VLGA and VLBW models reached statistical significance for some comparisons despite being very small.

Application of prediction models to extreme subsets
There were two mutually exclusive subsets of the cohorts: 2759 infants in the VLBW cohort whose gestation was 32 weeks or above and 7603 infants in the VLGA cohort whose birth weight was 1500 g or more. Predictive power decreased when the models were applied to these two extreme subsets of the VLBW and VLGA cohorts (AUC 0.50-0.62) (Table 4). Neither model demonstrated consistently better performance for the prediction of mortality or CAO in either of the extreme subsets. In the extreme VLBW and ≥32-week subset, most infants (2273/2759, 82.4%) were SGA; conversely, in the extreme VLGA and ≥1500 g subset, a smaller proportion (713/7603, 9.3%) were SGA and a considerable number of infants (1300/7603, 17.1%) were large for gestational age (LGA). Both extreme subsets had consistently lower crude morbidity and composite adverse outcome rates than those of the total cohorts (Additional file 1: Table 4).

Discussion
This study is the first to systematically assess the comparative predictive power of VLBW and VLGA cohorting methods. Belief in the superiority of gestation-based cohorts has grown amongst many investigators [2,3][39][40][41] in response to the suggestion that VLBW cohorts are limited by their innate confounding of growth status and maturity [6-8, 17, 42], but this belief has not been formally validated. In this retrospective population study of 31,940 neonates from Australia/New Zealand, Canada and Sweden, we identified that outcome prediction models derived from VLGA and VLBW cohorts perform equally well for prediction of in-hospital mortality and CAO in these high-risk preterm infants.
Table notes: Multiple logistic regression models were applied to obtain the final predictive models using a stepwise variable selection procedure with inclusion and exclusion criteria of 0.05; # = excluded by the variable selection procedure. BW z score: birth weight z score; GA x BW z score: interaction between GA and birth weight z score; Estimate: estimated coefficient of the covariate; SE: standard error.
The expected skewing of risk toward poor outcome in VLBW cohorts was confirmed by the higher rate of CAO in this group (36.5%) compared to the VLGA group (32.6%). The VLBW study cohort had higher rates of NI (18.1% vs. 16.0%) and CLD (21.7% vs. 19.2%) than the VLGA cohort across all networks [48], consistent with previous studies showing that SGA infants have a higher risk of CLD [49][50][51][52][53] and NI [50,52,54] compared to AGA infants of the same GA. Previous studies have also suggested higher mortality [49][50][51] and NEC [52] rates in SGA infants, yet have been inconclusive as to whether SGA groups have excess risk of severe ROP and SNI [48,50,52,54]. The current international study has the largest sample size of any research examining these morbidities and thus has the statistical power to detect small differences in outcome. The smaller than hypothesised outcome difference found between the VLGA and VLBW groups is likely related to improvement of SGA outcomes associated with advances in contemporary clinical practice. The protective effect (negative coefficient) found for vertex presentation for both VLBW and VLGA cohorts suggests that other presentations, such as breech or transverse, are associated with a less favourable outcome.
No clinically significant difference in predictive performance was found between the VLGA and VLBW models in this study. The higher SGA percentage within the VLBW cohort did not affect the discrimination power of the VLBW model, suggesting adequate control within the model for the confounding effect present. We propose two explanations for the rejection of our hypothesis. First, in previous VLBW cohort publications, many infants may not have had accurate prenatal gestation assessments, primarily due to substantial limitations in accessibility to early dating ultrasound. In comparison, GA assessment in the three networks of this contemporary study was robust, as all three networks have national healthcare access with nearly universal ultrasound examinations of pregnancies. The accurate GA data in both the VLGA and VLBW cohorts improved the accuracy of the models in this study, compared to expectations from previously published VLBW cohort data. Second, the research methodology of this study allowed for inclusion of non-linear relationships, such as the GA and BW z-score interaction. In the VLBW models, this adjusted for growth status and maturity through a balanced shift in the coefficients for BW z-score and GA, together with a negative coefficient for their interaction that reflects the protective confounding effect of growth status and maturity. The non-inclusion of these covariates in the VLGA models likely reflects that similar adjustment for SGA infants was not needed, as expected given the consistent SGA proportion of about 10%.
Consequently, the large sample sizes of this study combined with sophisticated modelling allowed development of models able to effectively control for confounding and bias, leading to the null findings.
Comparison of the models' usefulness for prediction in the two 'extreme' subsets of VLGA-not-VLBW and VLBW-not-VLGA tests the scope of application. It was found that the power of all models fell when applied to the <1500 g BW, ≥32-week GA infants, who would almost all be moderately or severely SGA. Predictive power also dropped for both the VLGA and VLBW models when used for CAO prediction in the BW ≥1500 g and GA <32-week subset, but remained excellent for mortality prediction. This clearly confirms that both mortality prediction models perform well for an extreme cohort containing no SGA infants [23]. The finding that the CAO prediction model did not perform as well as expected could indicate increased vulnerability of large for GA infants to morbidities. The findings suggest that separate prediction models may need to be developed for infants in the extreme subsets of established cohorts where there is a high proportion of SGA or LGA infants, as standard statistical modelling derived from either VLGA or VLBW cohorts may not be appropriate for use.
The strengths of this study are its large sample size of 31,940 infants and the population-based nature of the data [55]. Relative to the size of the samples there were very few missing or incorrect data, attesting to the high quality of the originating databases. The international collaboration allowed validation of study findings across three neonatal networks, and was made more effective by choosing networks with similar databases. Additionally, Canada, Sweden and Australia/New Zealand have high coverage with early dating ultrasound and thus accurate GA data, in contrast to other studies that have combined last menstrual period and ultrasound dating, thus applying GA estimations that differ by up to 3 weeks [8,56]. Through the examination of both mortality and CAO, this study will be useful as survival at lower GA becomes possible and prediction of survival without major morbidities becomes increasingly vital.
This study was limited to the analysis of variables collected uniformly across all network databases for the complete study period, but the similarity and quality of the network databases included curtailed the effect of this limitation. The observational, retrospective design meant that no causal mechanisms can be inferred.
The conclusion that VLBW cohorts perform as well as VLGA cohorts for prediction of mortality and morbidity will have ramifications at the international and population levels. Comparison of population outcome may now be considered valid regardless of the cohorting method used to obtain data, provided both GA and BW z scores are included in analytic models. This represents a major advancement in international benchmarking. This study also provides evidence to justify the continued use of BW-based cohorting in some nations provided accurate GA data are included. A further corollary of this study is clarification of the literature on VLGA and VLBW neonates. The findings elucidate both the external validity of research based on one cohort for application to the other, and the appropriateness of comparing data or conclusions based on disparately cohorted groups.
Further investigation is warranted into whether the findings of this study can be extrapolated to countries with poorer access to antenatal care, in particular early dating ultrasound, and hence less accurate GA estimation. Moreover, further studies should compare predictive power for longer-term outcomes such as neurodevelopment, where differing SGA proportions would be expected to have greater effect.

Conclusion
Outcomes of high-risk neonates are commonly reported either by gestational age or by birth weight. Compared to gestation-based cohorts, birth weight cohorts are more prone to selection bias toward small-for-gestational age infants, who are at high risk of adverse outcomes. However, this study found cohorts based on VLBW or VLGA were equally effective when used to generate prediction models for mortality and morbidity across the three national neonatal networks of Australia/New Zealand, Canada and Sweden. Both models had excellent predictive power when applied to VLGA and VLBW groups, illustrating that either model is appropriate for use, provided the GA and BW parameters included in the modeling have been collected well. Neither model performed well at the extremes of BW for GA, particularly where the subset contained a high proportion of SGA or LGA infants. The findings of this study may facilitate comparisons for international benchmarking and subsequent quality improvement, and provide support for continued adherence to BW-based cohorting in appropriately designed population studies.

Funding
This study and the Australian and New Zealand Neonatal Network were supported by the ANZNN members' contribution and in part by a grant from Leslie Steven Grant for Newborn Care, Sydney Children's Hospital Foundation, Australia, and Royal Hospital for Women Foundation, New South Wales, Australia. The Canadian Neonatal Network is supported by funding from the Canadian Institutes of Health Research and from the Ministry of Health and Long-Term Care, Ontario, Canada for infrastructure support to the Maternal-Infant Care Research Centre, which is the coordinating center for the network. Additional funding was provided by individual participating hospitals. The Swedish Neonatal Quality Register is supported by the Swedish Association of Local Authorities and Regions, with additional funding by the participating hospitals. The iNeo collaboration at the Mother-Infant Care Research Centre, Toronto, which conducted the data collation and analyses for this study, is funded by the Canadian Institutes of Health Research.

Availability of data and materials
The datasets generated during the current study are not publicly available. Supplying individual raw data outside the network research framework is in conflict with the applicable legislation of all three participating countries. The Research Ethics Committee approval requires all electronic research data, including de-identified patient data, to be stored in designated password protected computers within the facility of the researchers.

Authors' contributions
LIMKS conducted the literature review, participated in review of data analysis and interpretation of results and prepared the final manuscript. KL conceptualised the study and design, contributed to the interpretation of results and critically reviewed the final manuscript. PSS contributed to the conception and study design, coordinated transfer of Canadian Neonatal Network data, contributed to review and interpretation of results and critically reviewed the final manuscript. GS contributed to the conception and study design, co-ordinated transfer of Swedish Neonatal Quality Register data, contributed to the review and interpretation of results and critically reviewed the final manuscript. XYY designed the statistical analytic approach, carried out the statistical modelling and analysis and critically reviewed the final manuscript. YAW reviewed statistical modelling and analysis and critically reviewed the manuscript. SSWC was involved in transfer and data checking of ANZNN data and contributed to the analytic approach. SKL contributed to the study conception and was involved in review of statistical analyses. BAD contributed to the conception of the study, collaborated on study design, and critically reviewed the final manuscript. SH contributed to the conception of the study and critically reviewed the final manuscript. All authors read and approved the final version of the manuscript.