The influence of gestational age in the psychometric testing of the Bernese Pain Scale for Neonates

Background Assessing pain in neonates is challenging because full-term and preterm neonates of different gestational ages (GAs) have widely varied reactions to pain. We validated the Bernese Pain Scale for Neonates (BPSN) by testing its use among a large sample of neonates that represented all GAs. Methods In this prospective multisite validation study, we assessed 154 neonates between 24 2/7 and 41 4/7 weeks GA, based on the results of 1–5 capillary heel sticks in their first 14 days of life. From each heel stick, we produced three video sequences: baseline; heel stick; and, recovery. Five blinded nurses rated neonates’ pain responses according to the BPSN. The underlying factor structure of the BPSN, interrater reliability, concurrent validity with the Premature Infant Pain Profile-Revised (PIPP-R), construct validity, sensitivity and specificity, and the relationship between behavioural and physiological indicators were explored. We considered GA and gender as individual contextual factors. Results The factor analyses resulted in a model where the following behaviours best fit the data: crying; facial expression; and, posture. Pain scores for these behavioural items increased on average more than 1 point during the heel stick phases compared to the baseline and recovery phases (p < 0.001). Among physiological items, heart rate was more sensitive to pain than oxygen saturation. Heart rate averaged 0.646 points higher during the heel stick than the recovery phases (p < 0.001). GA increased along with pain scores: for every additional week of gestation, the average increase of behavioural pain score was 0.063 points (SE = 0.01, t = 5.49); average heart rate increased 0.042 points (SE = 0.01, t = 6.15). Sensitivity and specificity analyses indicated that the cut-off should increase with GA. Modified BPSN showed good concurrent validity with the PIPP-R (r = 0.600–0.758, p < 0.001). Correlations between the modified behavioural subscale and the item heart rate were low (r = 0.102–0.379). Conclusions The modified BPSN that includes facial expression, crying, posture, and heart rate is a reliable and valid tool for assessing acute pain in full-term and preterm neonates, but our results suggest that adding different cut-off points for different GA-groups will improve the BPSN’s clinical usefulness. Trial registration The study was retrospectively registered in the database of Clinical Trial gov. Study ID-number: NCT 02749461. Registration date: 12 April 2016.


Background
Acute painful status in preverbal infants is assessed and interpreted by observing measurable behavioural and physiological indicators. An infant who undergoes an invasive procedure may react to pain that is not caused solely by the painful stimulus [1,2]. Incorporating individual contextual factors, like gestational age (GA) and gender, into pain assessment tools might make them more accurate [3,4]. The physiological and behavioural dimensions of pain in neonates are measured by several multidimensional pain assessment tools developed over the last three decades [4][5][6], but experts agree that behavioural, physiological and cortical measures of pain do not converge to reliably depict and assess the phenomenon of pain in such a vulnerable population [7,8]. Discrepancies and low-to-moderate associations between behavioural (e.g., facial expression) and physiological (e.g., changes in heart rate) indicators of pain [9][10][11][12] have sparked ongoing debate about the appropriate dimensionality of pain scales [7]. Infants may also display nonspecific physiological and behavioural pain indicators during stressful experiences that are not painful, which makes it more challenging to accurately assess pain in neonates [13,14].
Many pain assessment tools are used in neonatal intensive care unit (NICU) settings. Most add behavioural and physiological indicators to a summary score that is then measured against a cut-off that separates pain from no pain [4]. Rigorous psychometric testing has been applied only to a few [15] (e.g., the Premature Infant Pain Profile [16]). Most were validated for a specific GA in tests that assessed acute pain in full-term and healthy preterm infants with higher GA [4]. However, neurodevelopment and the associated ability to react to painful stimulus varies greatly among early and late preterm infants and full-term neonates: neonates with lower GA express less behavioural pain than more mature neonates [17][18][19][20][21][22]. In neurologically impaired and very ill neonates, and in neonates on medications (e.g., sedatives), pain may be faintly expressed, or not at all [13,23].
The Bernese Pain Scale for Neonates (BPSN) is a multidimensional pain assessment tool that includes seven subjective items (sleeping, crying, consolation, skin colour, facial expression, posture, and breathing) and two physiological items (changes in heart rate and oxygen saturation) [24]. The BPSN has been used by clinicians since 2001; 46% of Swiss NICUs rely on this tool to assess pain in neonates [25]. The results of the first validation study in the year 2004 suggested that the BPSN is a valid and reliable scale for assessing acute pain in full-term and preterm neonates with different GAs [24]. However, clinical experts have said the tool is less useful for assessing pain in extremely preterm neonates who, for example, always score very low. This feedback and the increasing scientific evidence which indicates that neonates' pain reaction is influenced by individual contextual factors [1] have motivated us to re-evaluate the tool with sophisticated psychometric tests to assess its accuracy across all GAs.
This study is the first part of a comprehensive BPSN validation and extension study, designed to develop a modified version of the BPSN that includes relevant individual contextual factors in pain assessment. In this first part, we evaluated the BPSN with psychometric tests. The second part of the study will explore the influence of individual contextual factors (e.g., medication, or number of previous painful experiences) on variability in pain reactions across repeated measurement points.
We used psychometric tests to determine the applicability of the BPSN across neonates who ranged from 24 to 42 weeks of GA. We evaluated interrater reliability, the underlying factor structure of the BPSN, and the internal consistency of the scale. We also assessed concurrent validity with the Premature Infant Pain Profile-Revised (PIPP-R; [26]), construct validity, specificity and sensitivity, and determined the relationship between behavioural and physiological indicators of pain. GA groups and gender were considered as individual contextual factors.
Based on the results of the first validation study of the BPSN [24], we hypothesized that the BPSN is a valid and reliable tool for assessing pain in preterm and full-term neonates. Due to feedback from clinical experts concerning difficulties in pain assessment in extremely preterm neonates and the increasing scientific evidence that indicates neonates' pain reaction is influenced by individual contextual factors [1], we assumed that we will find a difference in pain reaction depending especially on neonates' GA. Furthermore, we hypothesized only a low-to-moderate association between behavioural and physiological indicators of pain.

Sample and settings
This was a prospective multisite validation study with repeated measurement design. It was conducted in three university hospital NICUs in Switzerland (Basel, Bern and Zurich). The study was approved by the Ethics Committee Bern, the Ethics Committee northwest/central Switzerland, and the Ethics Committee Zurich. Recruitment and data collection were ongoing, from January 1 to December 31, 2016. Data collection was extended in Bern until January 31, 2017, because we needed to recruit more extremely premature neonates. We included premature neonates born between 24 0/7 and 36 6/7 weeks of gestation, if they were expected to undergo 2-5 routine capillary heel sticks in their first 14 days of life. We included full-term neonates born between 37 0/7 and 42 0/7 weeks of gestation, if they were expected to have at least two routine capillary heel sticks during their first 14 days of life. We needed parental permission to include preterm and full-term neonates. We excluded neonates if they had had a high-grade intraventricular haemorrhage (grades III and IV), if they had a severe life-threatening malformation or suffered from any condition that caused partial or total loss of sensitivity, if they had an arterial cord pH < 7.15 at birth, if they had surgery for any reason, or if they had a congenital malformation that affected brain circulation and/or cardiovascular system.

Recruitment and data collection procedures
Neonates were recruited by consecutive sampling and then stratified according to GA at birth [27]. Trained study assistants in each study centre identified potentially eligible neonates and informed their parents of the aim and purpose of the study. After parents granted written informed consent, trained study assistants videotaped neonates (using a HC-V757 high-definition camcorder manufactured by Panasonic, Osaka, Japan) during their next 1-5 routine capillary heel sticks. For each heel stick, we produced three video sequences: baseline, heel stick, and recovery phases. Each video sequence began by focusing on the face of the neonate for at least 1 minute to allow adequate assessment of facial activity and cry. Thereafter, the infant's body was recorded for at least 1 minute. Bedside nurses were asked not to handle the neonates before the baseline phase was recorded, to avoid additional distress that could change the measurement. During the heel stick procedure, the neonates were lying in their incubator (or crib) and the position of the infants was unchanged for the video recording. The baseline phase was recorded 2 to 3 min before the beginning of the heel stick procedure. Afterwards, the bedside nurse warmed the neonate's heel and gave the infant a dose of 24% oral sucrose (0.2 ml/kg bodyweight) to relieve pain [28]. When the nurse disinfected the neonate's heel, the recording of the heel stick phase began. First, the neonate's face was recorded, until the nurse finished the heel stick procedure, which lasted at least a minute. Then the infant's body was recorded for at least one more minute. The recovery phase began immediately after the heel stick phase was recorded. During each phase of the heel stick procedure, our study assistants recorded the infant's highest heart rate and lowest oxygen saturation measurement from the infant's monitors, which tracked this data continuously.
Each video sequence was checked for quality and digitally elaborated by trained study assistants in Final Cut Pro X [29] video editing software. We removed any information that could have revealed the heel stick phase to the raters to ensure continued blindness. The video sequences were uploaded onto a web-based rating tool developed for our study. Uploaded sequences were randomized by sequence number, phase, and presentation order. Five nurses who were working in a NICU and were experienced in using the BPSN (Mean = 8.3 years of experience, SD = 6.1, Range = 3.5-15 years) retrieved the video sequences from the web-based platform and independently rated the behavioural pain expression of the neonates using the BPSN and the PIPP-R. The nurses were trained to use and score the PIPP-R.

Measures
Pain reaction was measured with the BPSN [24] and the PIPP-R [26]. Each of the nine items of the BPSN is rated on a 4-point Likert scale (0, 1, 2, and 3), and then the scores are summed. On the BPSN total score, which includes seven subjective items (i.e., sleeping, crying, consolation, skin colour, facial expression, posture, and breathing), and two physiological items (i.e., changes in heart rate and oxygen saturation), the scores of 11 or more points indicate pain (BPSN total scores range from 0 to 27). In a first validation study in the year 2004 [24], the BPSN showed good construct validity among neonates with GAs between 27 and 41 weeks (n = 12); BPSN scores were significantly higher during painful (M = 15.96, SD = 5.7) compared to non-painful (M = 2.32, SD = 1.6, p < 0.001) situations. Furthermore, the correlations between the BPSN and the Visual Analog Scale (VAS; r = 0.855, p < 0.0001) and the PIPP (r = 0.907, p < 0.0001) were high, as well as the interrater (r = 0.86-0.97) and intrarater reliability (r = 0.98-0.99) of the BPSN [24]. In our study, five independent blinded raters watched the videos to rate the seven subjective items. Both physiological indicators were captured from the neonate's monitoring records during video recordings. Because the raw data on heart rate, oxygen saturation and breathing rate in the baseline phase was used to calculate differences during the heel stick and recovery phases, we set the baseline scores of these items to zero, and retrospectively converted the raw data between baseline, heel stick, and recovery phase into BPSN scores that ranged between 0 and 3.
The PIPP-R is a well validated pain assessment tool for use with premature and full-term neonates, widely used in North America in clinics and for research [16,26,30,31]. The PIPP-R includes three behavioural indicators (brow bulge, eye squeeze, and naso-labial furrow) and two physiological indicators (heart rate and oxygen saturation). Each indicator is rated on a 4-point Likert scale (0, 1, 2, and 3). The PIPP-R accounts for GA and baseline behavioural state as contextual factors. Neonates with younger GAs and neonates in quiet sleep state score the highest, but they are only factored in if the infant's behavioural and physiological sub score is ≥1 [26]. Zero points indicate no pain or perhaps no response to pain, 1-6 points indicate low pain, 7-12 points indicate moderate pain, and ≥ 13 severe pain. Total PIPP-R scores range from 0 to 21 for neonates with GA < 28 weeks in a quiet and sleep baseline behavioural state, and from 0 to 15 for full-term neonates in an active and awake baseline behavioural state [26]. The PIPP-R shows beginning construct validity [30]; PIPP-R scores were significantly higher during painful (M = 6.7, SD = 3.0) compared to non-painful (M = 4.8, SD = 2.9; p < 0.001) procedures among full-term and preterm neonates with GAs as young as 26 weeks of gestation (n = 202). In addition, the PIPP-R showed good interrater reliability between nurses and pain experts (R 2 = 0.87-0.92; p < 0.001), and nurses reported that the PIPP-R is a feasible and appropriate pain assessment tool [30]. In our study, both physiological indicators were captured from the neonate's monitoring records and converted into PIPP-R scale values like the physiological indicators of the BPSN. The behavioural indicators and behavioural state were rated from the videos by the same five independent raters. We calculated interrater reliability of the three behavioural items with a two-way random-effects, absolute agreement, single measure model that ranged from 0.750 to 0.842 (Mdn = 0.803) in the heel stick phases of the five measurement points.
We retrieved individual contextual factors retrospectively from patient charts [27] and will publish a separate paper describing their influence on the variability of pain reaction across repeated measurement points.

Sample size and power
Our target sample size of 150 neonates was based on an a priori power analysis of the hypothesized association between the BPSN and GAs at baseline. That analysis was based on data from a previous study (n = 71; [32]) and a descriptive-explorative analysis (n = 23); it assumed a Type I error probability of 5%, a power of 80%, and at least three documented baseline heel sticks per study infant.

Data analysis
Factor analyses explored the structure of the BPSN and measurement invariance. Psychometric tests examined interrater reliability, internal consistency, construct validity, concurrent validity with the PIPP-R [30], association between behavioural and physiological items, and sensitivity and specificity. Because the sample was heterogeneous, we also conducted analyses for different GA-groups. We used the statistics programs SPSS [33] and R [34] for all analyses. Space restriction limit us to reporting mainly our results from the heel stick phases. In this comprehensive validation study, we did multiple testing of outcome data arising from individual neonates. Correction of p-values with Bonferroni adjustment [35] would not have rendered findings non-significant. Therefore, all p-values are presented uncorrected for multiple testing unless otherwise specified. A p-value < 0.05 was considered statistically significant.

Preliminary analyses
Exploratory analyses described the data and looked for anomalies that could reduce the validity of the data analysis. We used descriptive and frequency statistics to describe sample characteristics and each rater's pain scores.

Missing values
We analysed the ratings of the 1′817 video sequences for the volume and pattern of missing data, since single items of the BPSN and the PIPP-R could be rated "non-evaluable". Because it is impossible to compute BPSN and PIPP-R sum scores when an item was not rated, we used multiple imputation [36] and the R-package partykit [37] to derive those scores by replacing the values of non-rated items with random substitutes generated from conditional inference regression trees [38]. We generated five data sets, so there were five variants on the BPSN and PIPP-R sum scores.

Interrater reliability
Intraclass correlation coefficients (ICCs) and their 95% confidence intervals were calculated to determine interrater reliability of the seven subjective BPSN-items [39,40]. Since pain reaction of a neonate is rated by a single nurse in the clinical setting, and pain level scores were central to our outcome, we assessed interrater reliability with a two-way random-effects, absolute agreement, single measure model [41]. ICC coefficients were also calculated with a two-way random-effects, absolute agreement, average measure model, to generate more information about the reliability of the mean ratings provided by the five raters [40]. Each phase of the five measurement points was analysed separately, resulting in 120 ICC coefficients (8 rating scores * 3 phases * 5 measurement points) per model.

Measurement construct
Multiple group longitudinal confirmatory factor analysis [42] was used to evaluate the extent to which individual items correlated with the unobservable pain construct, the predictive performance of the construct, and whether factor loadings were invariant across time and raters. The R-package lavaan [43] was used for this analysis. Full maximum likelihood estimates were based on the assumption that data were missing at random. Figures 1 and 2 show the structures of our confirmatory factor analysis (CFA) models for the subjective and physiological subscales. For item selection, we used only data from the heel stick phases of the five measurement points. Measurement invariance tests were based on data from all phases (baseline, heel stick, and recovery) and all measurement points (t1-t5).

Model specification
The longitudinal structure of the data was accounted for by implementing covariances between factors (Fig. 3, structure of the subjective subscale). The covariance structure of factors for the physiological subscale or additional phases or measurement points was implemented as shown.
For the subjective subscale, we stacked the data records of raters, and used the rater as a grouping variable. This specification of this model made it impossible to model covariances between values of the same child measured by different raters. We chose this specification because it did allow us to test invariance of model parameters within and across raters.

Analytical procedure
We selected items to improve the fit of the CFA model. At estimation, to remove inconsistent items, we restricted loadings of a given item to a common value across raters and measurement points. For both subscales, we estimated several model configurations with at least two items, resulting, for the subjective subscale with 7 items, in 120 models. For the physiological subscale, we used only one model since it included only two items. Selecting the final model was a three-step process. First, we excluded several models with loadings < 0.3 and also excluded models with root mean square errors of approximation (RMSEA) > 0.06, Comparative Fit Indices (CFI; [44]) < 0.95 and Tucker-Lewis Indices < 0.95 (TLI; [45]). The minimal loading size of 0.3 was inspired by Brown [46], and the combinations of cut-offs for the RMSEA, CFI and TLI were inspired by Hu and Bentler [47,48]. Second, we chose from the remaining models those with the highest number of parameters because we wanted to keep as many appropriate items as possible. Third, we planned to select the model with the highest CFI if Step 2 left us with more than one candidate, but this step turned out to be unnecessary. We found no suitable factor model for the physiological subscale and therefore, we used regression analysis to pick the item most sensitive to pain.
We continued factor analysis by examining measurement invariance across time points within-raters and overall measurement invariance. Only loading (weak) invariance was considered, because other parameters like intercepts and variances could be expected to vary over time and phases. Measurement invariance was examined with Satorra and Bentler's likelihood ratio test [49] and tests based on the RMSEA, CFI and TLI that used Cheung and Rensvold's critical values [50].

Reliability and validity of the modified BPSN
The results of our factor analyses showed that only the behavioural items crying, facial expression, and posture had consistently high factor loadings over time. The physiological items heart rate and oxygen saturation did not load on a common factor and did not correlate with each other. Further analyses showed that the item heart rate was more sensitive to pain than oxygen saturation. We thus decided to exclude the items sleeping, consolation, skin colour, breathing, and oxygen saturation from the BPSN. In following examinations, we used a modified version of the BPSN that included facial expression, crying, and posture, as a behavioural subscale, and heart rate as an additional physiological indicator. Because the results of the measurement invariance analyses showed that the measurement construct measured with the modified behavioural subscale works differently for different raters, we accounted for differences between the raters by either including the raters in the model, or by

Internal consistency and corrected item-total correlation
We evaluated the internal consistency of the modified version of the behavioural subscale that included items facial expression, crying and posture by calculating Cronbach's α. We calculated corrected item-total correlations to analyse correlations between single items and the behavioural subscale. In addition, we calculated the resulting Cronbach's Alpha when an individual item is removed from the scale (Cronbach's Alpha if Item Deleted) [51]. Data from each rater were analysed separately, resulting in 75 analyses (5 raters * 3 phases * 5 measurement points), and then we used cocron [52], a web interface, to statistically compare the Cronbach's Alpha coefficients calculated for each rater.

Correlations between behavioural and physiological indicators of pain
Pearson product-moment correlation coefficients were calculated to establish the association between the modified behavioural subscale of the BPSN and heart rate. Data from each rater were analysed separately, resulting in 50 analyses (5 raters * 2 phases * 5 measurement points). Afterwards, for each phase we examined at each measurement point whether the correlation coefficients calculated for the five raters were statistically different, using the χ 2 -statistics of Steiger [53].

Construct validity
We compared the level of pain scores between the three phases (baseline, heel stick and recovery) to determine construct validity of the BPSN. We analysed the modified behavioural subscale and heart rate in a linear mixed effect analysis that used the R-package lme4 [54]. Linear mixed effect analysis allowed us to control variance created by multiple measurement points per subject [55]. The three phases, five measurement points, GA at time of birth, and gender were fixed effects in the model. Neonates and raters were random intercepts. Likelihood Ratio Tests tested the effect of the three phases on the level of pain scores [55].

Concurrent validity
Pearson product-moment correlation coefficients were calculated to establish concurrent validity between the modified total scores of the BPSN (facial expression, crying, posture, heart rate) and the PIPP-R. Separate analysis were performed for the data of each rater, resulting in 75 analyses (5 raters * 3 phases * 5 measurement points), and afterwards, we examined for each phase at each measurement point if the correlation coefficients calculated for the five raters were not statistically different, again using the χ 2 -test of Steiger [53].

Specificity and sensitivity analysis
A Receiver-Operating Characteristic (ROC) curve analysis was used to evaluate the ability of the modified BPSN total score to detect pain in neonates and to determine the cut-off value that maximized both sensitivity and specificity [56]. The PIPP-R was the reference value that allowed us to determine sensitivity and specificity; PIPP-R values of ≤6 characterized neonates as experiencing no or low pain; values ≥7 characterized neonates as experiencing moderate to severe pain. We tested whether the area under the curve (AUC) was greater than 0.5 and calculated sensitivity and specificity of the BPSN by using the cut-off values the ROC curve suggested. We performed this analysis separately for the heel stick phases of the five measurement points and the five raters, resulting in 25 ROC curves analysis (5 raters * 5 measurement points), and we averaged the values calculated for each rater.

Secondary analyses by GA-groups
Infants that ranged from 24 2/7 to 42 5/7 GA at time of birth were included in the primary analyses. Because the sample was heterogenous, we reanalysed the data separately for four GA-groups [57]: extremely preterm neonates (24 0/7-27 6/7 weeks GA); very preterm neonates (28 0/7-31 6/7 weeks GA); moderate to late preterm neonates (32 0/7-36 6/7 weeks GA); and, full-term neonates (37 0/7-42 6/7 weeks GA). Analyses remained the same with exception of the factor and linear mixed model analyses. We could not reanalyse the factor analysis for different GA-groups separately because the sub-samples were too small. In the linear mixed model analyses, GA was already considered as a fixed effect. We did not use Bonferroni adjustment in this subgroup analyses because we exploratively analysed if there were any obvious differences between the four GA-groups.

Missing data and sample characteristics
We enrolled a total of 162 neonates in the study; 8 were excluded from data analysis because video sequences were missing or of poor quality. Figure 4 illustrates the flow of recruitment and data collection. For the five raters, ≤ 1.0% data was missing for the BPSN items sleeping, crying, consolation, skin colour and posture; for facial expression, 0.1 to 4.0% (Mdn = 0.8%) data was missing, and for breathing, 0.3 to 8.7% (Mdn = 1.9%) was missing. For the PIPP-R, 0.5 to 3.3% (Mdn = 1.0%) of data was missing for brow bulge, 0.4 to 3.6% (Mdn = 0.7%) for eye squeeze, 0.6 to 28.3% (Mdn = 4.3%) for naso-labial furrow, and 0.1 to 0.9% (Mdn = 0.4%) for behavioural state. Less than 1% of data was missing for the physiological items heart rate and oxygen saturation.
Mean GA at birth of the total sample was 30.85 (SD = 4.5) weeks and ranged from 24.29 to 41.57. Demographic and medical characteristics of the sample are summarized in Table 1.

Results of descriptive and preliminary analysis
Means of the BPSN total-scale, subjective subscale, and items are summarized in Table 2. Physiological items are Fig. 4 Flow diagram of the recruitment and data collection process not included in this table because they were captured from the neonates' monitoring records during video recordings and the raw data was retrospectively converted into BPSN scores between 0 and 3. The mean scores for heart rate ranged from 0.47 to 0.76 (Mdn = 0.72) during the five heel stick phases, and from 0.03 to 0.11 (Mdn = 0.09) during the five recovery phases. The mean scores for oxygen saturation ranged from 0.77 to 1.25 (Mdn = 0.86) during the five heel stick phases, and from 0.51 to 0.71 (Mdn = 0.61) during the five recovery phases.

Interrater reliability
We derived the results of our interrater reliability analyses by calculating two-way random-effects, absolute agreement models. The results are summarized in Table 3. We again excluded heart rate and oxygen saturation. Interrater agreement for the items crying, consolation, facial expression, and posture tended to decrease across the five measurement points.

Item selection
First, we used all items and heel stick phases of the five measurement points to estimate the multiple group confirmatory factor models for the subjective and physiological subscale. No parameter restrictions were applied, so that loadings could vary across measurement points and raters. To compare the loadings of all items, we restricted factor variance to 1. Figure 5 shows the estimated factor loadings of the model for the subjective subscale and Fig. 6 for the physiological subscale. For the subjective subscale, loadings for breathing (range = − 0.167-0.110) and skin colour (range = − 0.034-0.293) are low, while loadings for sleeping vary widely between raters (range = 0.096-0.982). Loadings of the remaining items, consolation, crying, facial expression, and posture, seem consistent, but they tend to decrease over time. Rater D's loadings often conflict with other raters and vary over time.
For the physiological subscale, two loadings exceed by far a value of 1, indicating poor fit between model and data. Additional analyses showed no association between heart rate and oxygen saturation. Pearson product-moment correlations between heart rate and oxygen saturation ranged from r = − 0.028 to 0.106 (Mdn = 0.017; p > 0.05) during the heel stick phases of the five measurement points. Large loadings are probably numerical artefacts and should not be over-interpreted. Because the physiological items did not load on a common factor or correlate with each other, we discarded all but one of the physiological items based on their sensitivity to pain. We analysed the sensitivity to pain of heart rate and oxygen saturation by calculating linear mixed effect models (see next section). We selected items of the subjective subscale by estimating several configural models with at least two items. In contrast to the model presented in Fig. 5, we restricted factor loadings of a given item to a common value across time points and raters. We excluded models with factor loadings < 0.  Note. N = number of neonates included in the analysis. This number varies because of differences in the amount of missing data between the raters at each measurement point and differences in the number of neonates included at each point of measurement This improves the CFI and the TLI indices from about 0.8 to 0.95.

Physiological items' sensitivity to pain
Because the factor analysis indicated that the physiological items heart rate and oxygen saturation do not fit the data well, we next examined these items for their sensitivity to pain. We calculated linear mixed models that included the variables phases, measurement points, GA at time of birth, and gender as fixed effects, and neonates as random intercept. We used Likelihood Ratio Tests to compare a model without the heel stick and recovery phases to a model that included the phases. There was a significant effect of phase on heart rate (χ 2 (5) = 172.91, p < 0.001). Heart rate scores during the recovery phases were, on average, 0.646 point lower than scores during the heel stick phases (SE = 0.09, t-value = − 7.383). Phase also significantly affected oxygen saturation (χ 2 (5) = 33.658, p < 0.001).
Oxygen saturation scores were, on average, 0.258 points lower during the recovery phases than during the heel stick phases (SE = 0.12, t-value = − 2.136). We

Measurement invariance
Measurement invariance was examined only for the subjective subscale, since the physiological subscale contained one item. In this analysis, we re-estimated the final model that included crying, facial expression and posture. We used different parameter restrictions: (Free) = all parameters are free; (WRLInv) = within-rater loadings invariance was assumed by restricting loadings of items across time but not across raters; (OLInv) = overall loadings invariance was assumed by restricting loadings across time and across raters. We already applied the OLInv assumption to select items. We next asked if the restricted models fit the data as well as the unrestricted models, and whether factor loadings are (partially) invariant. We performed the same analysis but used only data from the heel stick phase of the five measurement points. Then we used data from all phases and measurement points. Table 5 shows differences between fit indices of the unrestricted and restricted models, including the likelihood ratio test. At a 5% significance level, the zero hypothesis of equal fit or loadings invariance is not rejected for within-rater invariance when we used only data from the heel stick phases, but it was otherwise rejected, most sharply for overall loading invariance (OLInv). Differences between the fit indices RMSEA, CFI and TLI yield different test results. Using the 1% level rejection areas [50] for the RMSEA, measurement invariance is rejected when the difference is > 0.013, for the CFI, it is rejected when it is < − 0.0085, and, for the TLI, when it is < − 0.0078. Accordingly, within-rater loadings invariance (WRLInv) is never rejected, but overall measurement invariance (OLInv) is always rejected with CFI and TLI, and never with RMSEA.
The tests strongly suggest that the pain measurement construct under consideration works differently for different raters. For within-rater invariance, invariance is not rejected during the heel stick phases; for all data, it is rejected by the χ 2 -test but not by RMSEA, CFI and TLI. We may assume approximate invariance, while keeping in mind the results.

Reliability and validity of the modified BPSN
Our factor analysis and analysis of the physiological items' sensitivity to pain led us to adopt a modified version of the BPSN for our next analyses. The modified BPSN includes a behavioural subscale (facial expression, crying, and posture) and adds heart rate as a pain indicator.

Cronbach's alpha and corrected item-Total correlation
Cronbach's Alpha, corrected item-total correlation coefficients and the resulting Alpha when an individual item is removed from the scale (Alpha if Item Deleted) for the modified behavioural subscale are summarized in Table 6. During the heel stick phases of the five measurement points, Cronbach's Alpha coefficients of the five raters

Correlations between behavioural and physiological indicators of pain
We examined the associations between behavioural and physiological indicators of pain with the modified behavioural subscale of the BPSN including the items crying, facial expression, and posture, and the physiological item heart rate. See Table 7 for the correlation coefficients of these analyses. At measurement point 3, the correlation coefficients differed significantly between the five raters (p = 0.008), while the correlation coefficients were approximately the same during the other measurement points (p > 0.05). When we considered a Bonferroni adjusted p-value (p < 0.05/10), none of the correlation coefficients would differ significantly between the five raters.

Construct validity
To determine construct validity of the BPSN, we compared levels of pain scores of the modified behavioural subscale between the three phases. The residual variance of this analysis was σ 2 = 1.708 (SD = 1.307); variances of the random effects were σ 2 = 0.354 (SD = 0.595) for neonates and σ 2 = 0.391 (SD = 0.625) for raters. Phases significantly affected the level of behavioural pain scores (χ 2 (10) = 864.18, p < 0.001). Behavioural pain scores in the heel stick phases averaged 1.04 higher than pain scores in the baseline phases, and 1.13 higher than pain scores in the recovery phases. More results are summarized in Table 8. The same analysis was performed for the item heart rate ( Table 8). The residual variance of this analysis was σ 2 = 0.588 (SD = 0.767) and variance of the random effect neonates was σ 2 = 0.037 (SD = 0.191). GA at time of birth significantly affected behavioural pain scores (SE = 0.01, t = 5.488) and heart rate (SE = 0.01, t = 6.145). Gender had no effect on behavioural pain scores (SE = 0.10, t = − 0.170) or on heart rate (SE = 0.05, t = 0.051).

Concurrent validity
We examined the concurrent validity between the modified total score of the BPSN and the PIPP-R. See Table 9 for the correlation coefficients of these analyses. The correlation coefficients of the five raters were the same in about half of the cases. They differed significantly at measurement point 1 (p = 0.010) and measurement point 4 (p = 0.045). With a Bonferroni adjusted p-value (p < 0.05/15), none of the correlation coefficients differed significantly between the five raters.

Sensitivity and specificity
The results of the ROC analyses to examine sensitivity and specificity of the modified BPSN total score (including crying, facial expression, posture, and heart rate) are shown in Table 10. During the heel stick phases of the five measurement points, a cut-off of 1.5 points fits best to reach a sensitivity of approximately 80% and a specificity of similar accuracy.

Results of the psychometric testing of the BPSN separated by GA-groups
Interrater reliability ICCs coefficients of the four different GA-groups are summarized in Table 11. Interrater reliability of the items facial expression, posture and consolation tended to improve as GA increases.

Internal consistency of the modified behavioural BPSN subscale
Cronbach's Alpha calculated separately for the four GA-groups, are summarized in Table 12. Most Cronbach's Alpha coefficients were in the range of acceptable to excellent [58] during the heel stick phases of the five measurement points.

Correlations between behavioural and physiological indicators of pain
During the heel stick phases of the five measurement points and among the five raters, correlations between the modified behavioural subscale of the BPSN and the item heart rate ranged from r =

Sensitivity and specificity
The results of the ROC analyses to examine sensitivity and specificity of the modified BPSN total scale separately for each GA-group are provided in Table 13. We found cut-off points needed to increase along with GA to reach about 80% sensitivity and similarly high specificity.

Discussion
After rigorous statistical testing, we significantly reduced the number of items in the original BPSN, leaving only three behavioural items: facial expression, crying, and posture. We included only one physiological item, heart rate, in the new version. Psychometric properties of these four items indicate convincing validity across all GA groups, but GA should be considered in pain assessment because different GA-groups require different cut-off points.

Factor structure and reliability of the BPSN
The factor analysis showed that a model that includes the items crying, facial expression, and posture fits the data best. In fact, facial expression, crying, and body movement are widely studied indicators for pain assessment in neonates and are considered the most sensitive behavioural indicators of pain [4,59,60]. Facial expression is considered the most reliable and sensitive indicator for pain assessment in both preterm and full-term neonates [4]. Facial expressions extremely preterm neonates are likely to show include brow bulge, eye squeeze, nasolabial furrow, and vertical mouth stretch [20]. The BPSN more generally assesses facial expression, which aids in assessing preterm infants who wear CPAP masks and tapes to fix tubes to the skin, which can make it difficult to assess specific components of expression, like nasolabial furrow. The PIPP-R item nasolabial furrow was the least frequently rated item in our study, often because it was obscured by CPAP masks or tapes.
Crying is a common pain response in neonates and is included in several pain scales (e.g., [27,[61][62][63]), but some have questioned crying as an indicator of pain because it cannot be assessed in some neonates [21,59]. Mechanical ventilation, inhibiting drugs, severe illness, and other reasons may limit the ability to cry. Although crying is not specific to pain [59], it may be the first indication a caregiver has that an infant is in pain [64]. Preterm neonates with immature facial muscles are less able to communicate their pain through facial expressions, so crying can alert their caregivers [17].
Several pain assessment tools include one or more items that assess body movements (e.g., [9,61,65,66]. Holsti, Grunau, Oberlander and Whitfield [67] analysed behavioural pain reaction of early preterm neonates with the Newborn Individualized Development Care and Assessment Program (NIDCAP). They found that neonates flexed and extended their arms and legs, put their hands     Note. Median = Median of the Pearson product-moment correlation calculated for each rater separately; * Bonferroni adjusted p-value < 0.001 on their faces, fisted, and finger splayed more often during the heel stick procedure. Morison et al. [68] found neonates with lower GA at birth made more specific body movements but had less facial expression at 32 weeks post-conceptional age, which suggests assessing body movements could provide useful supplementary information about preterm neonates. The BPSN more generally assesses body movement by evaluating a neonate's posture on a 4-point Likert-scale, ranging from relaxed body to permanent tension. Our results suggest that posture is a sensitive indicator for assessing pain across GA-groups. We found that heart rate and oxygen saturation did not load on a common physiological factor or correlate with each other. Because heart rate was more sensitive to pain and more strongly associated with the three behavioural indicators of pain, we included heart rate in the new version of the BPSN. The results of our analyses confirm previous findings that correlations between behavioural and physiological indicators of pain were low [69][70][71], behavioural indicators were more sensitive to pain than physiological indicators [69,72], and heart rate was more sensitive to pain than oxygen saturation [71].
Though factor loadings of crying, facial expression, and posture did not vary within raters during the heel stick phases, they did vary between raters. This result suggests that different raters assess pain differently, an assumption further supported by the results of our interrater reliability analysis. There was good to excellent interrater agreement on crying, but agreement on facial expression and posture ranged from poor to good [73], depending on the measurement point and the model to calculate ICCs. The differences in interrater reliability could be explained by differences in the way raters defined the items. Crying may be a more objective and reliable item than facial expression or posture because it considers duration. Improving the guidelines and training for applying the BPSN may improve interrater agreement.
The first validation study of the BPSN [24] used Cronbach's Alpha reliability coefficient to calculate interrater reliability, and found interrater reliability of the subjective subscale of the BPSN (r = 0.77-0.97) was high. Cronbach's Alpha determines if the ratings of two or more persons are consistent, but it does not measure absolute agreement [74]. Since the cut-off differentiates between a painful and non-painful state, agreement between nurses and other caregivers about an infant's level of pain is crucial. We thus decided to use the more stringent absolute agreement model to calculate interrater reliability.
Interrater agreement and factor loadings of the items crying, facial expression, consolation, and posture tended to decrease over time. Cronbach's Alpha and corrected item-total correlations of the items crying, facial expression, and posture tended to decrease too. This accords with the results of another study that showed high within-subject variability among preterm neonates' pain reaction across repeated measurement points [75]. Interrater reliability was high during the heel sticks 1-3 and decreased during heel sticks 4-5. These findings cannot be explained by rater fatigue, because the video sequences were analysed in random order. The variability in pain reactions might be explained by the Note. The PIPP-R was the reference value, with a cut-off point of 6.5 that discriminated between no/low pain (≤ 6 points) and moderate to high pain (≥ 7 points); AUC = Area under the curve; [95% CI] = 95% confidence intervals of the AUC; the results were originally computed separately for each rater and aggregated assuming normal distribution of the parameters; bold-set font = cut-offs with sensitivity and specificity nearest 80% Table 9 Pearson product-moment correlation coefficients of the correlations between the total scores of the modified Bernese Pain Scale for Neonates and the PIPP-R  Note. Correlation coefficients were calculated for the heel stick phases of the five measurement points (t1-t5); Median = Median of the Pearson product-moment correlation coefficients that were calculated separately for each rater; **p < 0.01; * p < 0.05 influence of individual contextual factors and needs to be investigated [1,2,20,21].

Validity of the modified BPSN
The modified BPSN that includes crying, facial expression, posture, and heart rate showed good construct validity and concurrent validity with the PIPP-R. Pain scores on the behavioural subscale averaged more than one point higher during the heel stick than during the baseline and recovery phases. Pain scores on heart rate averaged 0.65 points higher during the heel stick phase than during the recovery phase. Neonates' GA at time of birth influenced their pain scores. With every additional week of GA, pain scores on the behavioural subscale (crying, facial expression, posture) increased about 0.063 points. If we apply this result on our study sample with a wide range of GAs (24 2/7-42 5/7 weeks of GA), behavioural pain reaction of the neonate with Heart rate of the neonate with the highest GA was also about 0.76 points higher than heart rate of the neonate with the lowest GA. Like other studies that analysed the relationship between gender and pain reaction in neonates (e.g., [76][77][78]), we found gender had no effect on the level of pain scores.

Sensitivity and specificity of the modified BPSN
The results of the sensitivity and specificity analyses suggest that a cut-off of 1.5 points (total overall score = 12 points) would discriminate between no to low pain and moderate to high pain (measured with the PIPP-R). For the original BPSN scale, the cut-off was much higher, at 10.5 points (total overall score = 27 points). We found that the mean of the BPSN total scale that included nine Note. The PIPP-R was the reference value, with a cut-off point of 6.5 that discriminated between no/low pain (≤ 6 points) and moderate to high pain (≥ 7 points); AUC = Area under the curve; [95% CI] = 95% confidence intervals of the AUC; the results were originally computed separately for each rater and aggregated assuming normal distribution of the parameter; Range = heel stick phases of measurement points t1-t5; bold-set font = cut-offs with sensitivity and specificity nearest 80% items varied widely and depended on the rater, but it did not reach the cut-off value of 11 points during the heel stick phases of the five measurement points. The preliminary dose of oral sucrose administered to neonates before each heel stick may have lowered pain scores in our study [28]. In the first validation study of the BPSN, neonates received no pain relieving intervention before the heel stick, and BPSN total scores increased significantly during the heel stick, averaging 15.96 points (SD = 5.7) [24]. The relief provided by sucrose should be factored into the decision about a new cut-off value for the modified BPSN.

Comparison of different GA-groups
Neonates with younger GA at birth had lower pain scores than more mature infants. The results of the separate sensitivity and specificity analyses for the four GA-groups indicated as GA increases, so should the cut-off of the BPSN that discriminates between no to low pain and moderate to high pain (measured with the PIPP-R). To reach a sensitivity and specificity of approximately 80%, extremely preterm neonates require a cut-off value of 0.5 points, very preterm neonates require 1.5 points, moderate to late preterm neonates require 2.5 points, and full-term neonates require 3.5 points. Our ROC analysis showed that the modified BPSN was least able, but still moderately good [41], to discriminate between neonates who experience no or low pain and neonates who experience moderate to high pain in the group of extremely preterm neonates and increases with increasing GA. Extremely preterm neonates' pain expression may be less apparent because their immature nervous system and facial muscles prevent them from expressing a robust pain reaction [20,21,60,68]. Understanding the difficulty this poses for accurate pain assessment in extremely preterm neonates could be helpful when establishing cut-off values for the BPSN. Based on our study results, we recommend differentiating between GA-groups and establishing cut-off values based on GA. The PIPP-R already includes GA in pain assessment; the younger the GA, the more points PIPP-R adds to the pain score [26]. The other analyses we conducted separately for the four GA-groups showed that concurrent validity of the modified BPSN total score with the PIPP-R was highest for full-term neonates (r = 0.814-0.834) and lowest, but still good, for extremely preterm neonates (r = 0.631-0.710). Interrater agreement on facial expression and posture tended to improve as GA increased.

Limitations
This study is limited, first, by our decision to rate neonates' pain expression from video sequences. Characteristics of the videos may have affected the reliability of the ratings (e.g., poor lighting conditions, quality of the raters' screen, position of the neonate, several assistants for video recording). Second, different nurses performed the heel sticks, and their individual characteristics may have influenced neonates' pain reaction. Third, particularly during the baseline and recovery phases, where the scores of the items were low, floor effects may have influenced our study results. For example, we considered a variety of extensions of the model specification in our factor analysis but discarded them because of convergence problems likely related to floor effects, when upper categories were almost or completely left empty. Treating the rating scores as numeric did not resolve floor effect problems, or rather the opposite [79], but allowed to obtain results. Floor effects may also have lowered interrater agreement, especially during the baseline and recovery phases. Fourth, our later hypothesis testing may be compromised by measurement error caused by low interrater agreement [40]. We compensated for this possible problem by either including the raters in the model, or by conducting separate analyses for each rater and then pooling the results. Fifth, pain reaction was measured during the heel stick, so our results cannot be generalized to other acute painful procedures or more persistent or chronic pain. The BPSN is used for routine pain assessment in NICUs and should therefore be sensitive to repeated and more prolonged and chronic pain, so future validation studies should assess and compare the level of pain scores during different painful situations.

Conclusions
The modified version of the BPSN that includes facial expression, crying, posture, and heart rate is a promising tool for assessing acute pain in full-term and preterm neonates across gestational ages, but our results suggest that adding different cut-off points for different GA-groups will improve the BPSN's clinical usefulness. professional support for video-data elaboration. We thank Dr. Kali Tal for her editing work. Finally, we thank the research assistants, without whose help and support the data collection in this study would not have been feasible.

Funding
This research was funded by the Swiss National Science Foundation (SNF 320030_159573).

Availability of data and materials
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
Authors' contributions EC conceived of and designed the study and was the primary investigator. DB, SS, MN, and BS co-authored the study proposal and supported EC in designing the study. DB, SS, and MN provided access to the research field. The doctoral student KS was responsible for all tasks of the data collection process and data entry, management and analysis. She was also responsible for reporting and disseminating the outcomes in peer-reviewed journals and conferences. LS recruited at the University Hospital NICU in Bern and supported the doctoral student in the data collection process. RB supported the doctoral student in the data analyses. All authors read and approved the final manuscript.

Ethics approval and consent to participate
The study was approved by the Ethics Committee Bern (2015-238), the Ethics Committee northwest/central Switzerland EKNZ (2015-385) and the Ethics Committee Zurich (2015-563). Written informed consent was obtained from parents according to the protocol approved by the ethics committees. We did not expose infants to additional painful situations. No heel sticks were performed solely for research purposes. We upheld the current standard of care in pain prevention by giving oral sucrose to all infants before the heel stick procedure.

Consent for publication
Not applicable.