Identification of improvement areas
In the field of neonatology there are no gold standards in the sense of the “best available test or benchmark under reasonable conditions”. It is therefore difficult to define good quality. Instead, one has to rely on comparisons between units, which are prone to bias because not all units work under the same conditions: some have a higher risk of mortality or morbidity than others because of the nature of the collective they treat. We therefore believe that a comparison should not classify a unit with so crude a label as good or bad quality. Instead, we propose a concept in which units performing worse in areas where others excel can learn from the latter and improve their quality without losing face. It helps that the detection tool is sensitive enough to show that every unit has areas to improve and that Switzerland is small enough for all participants to know each other well. We have thus adopted two important aspects of the Vermont Oxford Network’s innovative NICQ system, in which a small number of units respectfully help each other by objectively communicating their results and holding themselves accountable [11].
Areas in which at least one Swiss unit differs significantly from the combined Swiss total, and which thus display potential for improvement, lie in the rates of caesarean section, prenatal steroids, mortality, early onset sepsis, late onset sepsis, growth and measured UapH. Berger et al. (2012) previously reported that factors other than baseline population demographics or differences in the interpretation of national recommendations (for children born at the limit of viability) influence the survival rates of extremely preterm infants in the individual units, which also suggests the presence of areas for improvement [18].
Differences in PDA, PPHN, mechanical ventilation, CPAP, CPAP without mechanical ventilation, and surfactant usage, on the other hand, are suspected to reflect clinical treatment strategy, differing diagnostics or geographical location and are therefore of limited use for quality assessment. Yet they can be important when investigating the most likely cause of a quality problem in another measurement.
Quality of data set
The variables used for the benchmarking and quality indicator calculations were chosen because of their capacity to describe clinically important and/or modifiable processes and outcomes [11]. Many of them also appear in the Baby-MONITOR, a composite indicator for quality recently published by Profit et al. (2011) that both an expert panel and practicing clinicians agreed upon as having high face validity [19, 20]: prenatal steroids, late onset sepsis, oxygen at 36 weeks postmenstrual age, growth velocity and in-hospital mortality. However, in order to cover all of the measures Profit et al. included in the Baby-MONITOR, the network would need to add timely ROP exam, pneumothorax, human milk feeding at discharge and hypothermia on admission. Incidentally, all except the last are routinely collected by the network and will therefore be included in the near future.
Of the 20 criteria required for quality indicators according to QUALIFY [7], the network applies 14 as instructed and 3 in a modified version. Reliability would best be tested using a test-retest or an inter-rater procedure. This is, however, not possible because of the limited funding available. Instead, we established an algorithm designed to flag data that are then selected for a partial test-retest procedure. The other modifications were necessary because of the relatively small collective size in Switzerland, where some of the units only have ca. 30 cases per year. The ability for statistical discrimination in QUALIFY requires thresholds at which an outcome switches from good to poor quality, so that the minimum number of patients a participating unit needs for a secure statistical statement can be calculated. Such thresholds are not yet available in neonatology. Since the small collective size in Switzerland cannot be modified and the network does not intend to define good or bad quality, but rather to identify possible areas of improvement, we instead optimize statistical reliability by pooling years and improve the chances of finding relevant results by offering the same data for consecutive pooled years in three different collectives (very preterm, very low birth weight and extremely preterm). This way, large and potentially relevant outcomes are sometimes discussed even if they have not yet reached statistical significance. Finally, risk-adjustment has been simplified to reflect only the units’ individual distribution into gestational age groups. Any additional risk-adjustment would stratify the small collectives into even smaller groups that no longer permit meaningful statistics.
The network omits 3 of the QUALIFY criteria. The calculation of sensitivity and of specificity requires gold standards, which have not yet been established for the variables observed in this study. Comprehensibility and interpretability for patients and the interested public have been omitted, as we believe that interpreting the network’s quality cycle requires too much expertise for it to be distributed to the general public.
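To make the combination of year pooling and gestational-age-based risk-adjustment concrete, the following Python sketch shows one common way such an adjustment can be computed (indirect standardization). The text above does not state which formula the network uses, so the method, the function names, the gestational age groups and all numbers are assumptions chosen purely for illustration.

```python
def pool_years(yearly_counts):
    """Sum per-group case counts over several years (hypothetical helper)."""
    pooled = {}
    for counts in yearly_counts:
        for group, n in counts.items():
            pooled[group] = pooled.get(group, 0) + n
    return pooled


def expected_events(unit_cases, collective_events, collective_cases):
    """Events a unit would have if each of its gestational age groups
    experienced the collective's group-specific event rate."""
    return sum(
        n * collective_events[group] / collective_cases[group]
        for group, n in unit_cases.items()
    )


# Illustrative data only: gestational age groups in weeks.
collective_cases = {"<26": 100, "26-27": 200, "28-31": 700}
collective_events = {"<26": 40, "26-27": 30, "28-31": 35}

# Two years of cases from one small unit, pooled before the comparison.
unit_cases = pool_years([
    {"<26": 5, "26-27": 10, "28-31": 15},
    {"<26": 5, "26-27": 10, "28-31": 25},
])
unit_observed = 12  # hypothetical observed events in the pooled years

unit_expected = expected_events(unit_cases, collective_events, collective_cases)
ratio = unit_observed / unit_expected  # above 1: more events than expected
print(unit_cases, unit_expected, round(ratio, 2))
```

Because the expected value depends on each unit's own gestational age mix, every unit is compared with a differently weighted version of the collective. Under this assumed method, that is the reason adjusted values of two units cannot be compared directly with each other.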
Serviceability for quality improvement
Ellsbury et al. (2010) [10] maintain that despite the complexity of the NICU environment, significant improvements can be accomplished by the use of basic QI methodology. The network can provide several aspects of the methodology postulated by Ellsbury et al.: the tools to identify a clinically important and modifiable outcome, the setting for establishing a goal for improvement, and the structure for securing the long-term establishment of the change by continuous data collection and review. Hulscher et al. (2013) particularly point out the need for the latter, as they observed that chances of long-term success were higher if teams remained intact and continued to gather data [21].
The remaining aspects of the methodology required according to Ellsbury et al. (2010), however, need to be provided by the units directly: a team that finds the “vital few” causes of the problem according to the Pareto principle (as opposed to the “trivial many” causes) and that is motivated to implement the change, preferably a system change as opposed to tinkering [10].
Starting from a different vantage point, Lloyd (2010) describes the milestones required for reaching quality improvement [22]. The network observes these milestones: the aim of the network’s quality cycle is clearly specified, it follows a concrete concept, the items and how they are measured are well defined, a well-developed data collection plan exists and the data are analysed both statically and analytically. However, we prefer Plsek’s p-charts over the run or Shewhart charts proposed by Lloyd, both because the latter’s complexity makes them difficult to generate automatically and because the limited population size in some of the participating units would limit their explanatory power. It is noteworthy, however, that Lloyd’s sequence for improvement parallels our proposed quality cycle (Figure 3) if rotated anti-clockwise by 90 degrees. His “act-plan-do-study” translates into our “guideline (plan)-perform (do)-falsify (study)-reform (act)”, which in turn is listed in Ellsbury et al. (2010) as “plan-do-study-act” and is said to be a simple feedback cycle with a long history of successful use in improvement activities in industry and many other fields [10, 11].
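For readers unfamiliar with p-charts, the sketch below computes the conventional centre line and three-sigma control limits for a proportion when the number of cases differs between periods. It is a minimal illustration under the textbook formula, not a description of the network's actual implementation; the function name and the example counts are hypothetical.

```python
import math


def p_chart_limits(events, cases):
    """Return the centre line and per-period (lower, upper) 3-sigma limits.

    events[i] and cases[i] are the event count and case count of period i.
    """
    p_bar = sum(events) / sum(cases)  # overall proportion = centre line
    limits = []
    for n in cases:
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lower = max(0.0, p_bar - 3 * sigma)  # a proportion cannot be < 0
        upper = min(1.0, p_bar + 3 * sigma)  # ... or > 1
        limits.append((lower, upper))
    return p_bar, limits


# Hypothetical pooled periods of a small unit.
events = [3, 6, 3]
cases = [30, 40, 30]
centre, limits = p_chart_limits(events, cases)
for e, n, (lo, hi) in zip(events, cases, limits):
    flag = "outside" if not lo <= e / n <= hi else "inside"
    print(f"p={e / n:.3f} limits=({lo:.3f}, {hi:.3f}) {flag}")
```

With only 30 to 40 cases per period the limits are very wide, which illustrates why small units need pooled years before a deviation becomes detectable.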
Falsification
Kelle et al. (2010) maintain that constant doubt is a basic tenet of evidence based medicine and conclude that this doubt can be used to detect typical misperceptions and erroneous conclusions [23]. Swiss Level III neonatology units apply evidence based guidelines and, in single-centre or multicentre studies, also develop such guidelines using randomized controlled trials [24]. On the premise that neither scientific research nor clinical performance is immune to human error, in particular when working with fully established and proven evidence based guidelines, we propose that constant long-term monitoring of key clinical measurements will help to distinguish useful guidelines from ineffective ones, because it allows the effect of the guidelines on everyday clinical practice to be observed from a vantage point not previously available. In other words, we believe that constant doubt is a prerequisite for evidence based medicine and that its application should therefore be continuously tested (or falsified) in order to ensure that the knowledge gained by statistical interpretation of probabilities really is a reflection of the true nature of the problem for which the evidence based solution was found. The network’s tool, however, cannot be seen as a final answer to this dilemma, merely as a step in the direction of accepting the constant doubt.
This is another reason why we maintain that the network’s tool is not designed to classify good or bad quality but to detect possible errors by performing constant falsification. Obviously, it is open to improvement by extending the range of observed measurements and further refining its methodology.
Limitations
Statistical discrimination requires large numbers or large differences. Swiss neonatology units offer neither: The units have approximately 30 to 160 cases per year and comparable quality standards. That is why we need to pool years and deviate somewhat from the recommendations made by QUALIFY.
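The trade-off between group size and detectable difference can be illustrated with a standard normal-approximation sample size formula for comparing two proportions. The sketch below is a simplification and not the network's method: it assumes two equally sized groups, a two-sided 5% significance level and 80% power, and the event rates are hypothetical.

```python
import math
from statistics import NormalDist


def cases_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate cases needed per group to detect p1 vs p2
    (two-sided test, normal approximation, equal group sizes)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)


def years_to_pool(required_cases, cases_per_year):
    """Whole years of data needed to reach the required case count."""
    return math.ceil(required_cases / cases_per_year)


# Hypothetical example: a 10% versus a 20% event rate.
needed = cases_per_group(0.10, 0.20)
print(needed, years_to_pool(needed, 30), years_to_pool(needed, 160))
```

Under these assumptions, even a doubling of an event rate from 10% to 20% requires roughly 200 cases per group, which a unit with about 30 cases per year reaches only after several pooled years.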
The choice of items measured by the network has so far depended on variables that are routinely collected for research. New items can be added, but the network has limited itself to changing the data collection only every five years in order not to jeopardize the quality of the data collected for the existing items. Because of data pooling and the need to gather twice the amount of data to perform our reliability exam, a new item becomes productively integrated into the routine only after 4 years. The waiting time can, however, be shortened if need be by replacing the reliability test with a test-retest method. Also, preliminary data can be observed at the level of the individual unit from the beginning of data collection, albeit with limited explanatory power. Nevertheless, due to its complexity, the network’s tool is not very flexible.
As the risk-adjustment is different for each unit, the units’ values cannot be directly compared to each other in the QI chart. We deem this irrelevant, however, as we are interested in each unit’s performance versus the collective and not in direct competition between units.
The Swiss Neonatal Quality Cycle is still in its beginning phase. The effects listed at the end of the results section stem from preliminary meetings held during the development of the quality cycle. We are currently monitoring the measures undertaken to improve quality in order to be able to report concretely on observable effects over time. But even if we can report a significant change attributed to the quality cycle, we will not be able to prove empirically that the observed change is in fact caused by the quality cycle, as for instance recommended by Schouten et al. (2008 and 2013) [21, 25]. In essence, we can never rule out that other simultaneous changes (such as new medication or evidence based measures) are in fact responsible. The setup does not fulfil the criteria met by controlled trials and is not intended to do so.