Reproducibility and inter-observer agreement of Greulich-Pyle protocol to estimate skeletal age among female adolescent soccer players

Background Skeletal age (SA) is considered the best method of assessing biological maturation. The aim of this study was to determine intra-observer (reproducibility) and inter-observer agreement of SA values obtained via the Greulich-Pyle (GP) method. In addition, the variation in SAs calculated by alternative GP protocols was examined. Methods The sample was composed of 100 Portuguese female soccer players aged 12.0–16.7 years. SAs were determined using the GP method by two observers (OB1: experience < 100 exams using GP; OB2: experience > 2000 exams using several methods). The radiographs were examined using alternative GP protocols: (wholeGP) the plate was matched to the atlas as an overall approach; (30-boneGP) bone-by-bone inspection of 30 bones; (GPpmb) bone-by-bone inspection of the pre-mature bones only. For the 30-boneGP and GPpmb approaches, SA was calculated via the mean (M) and the median (Md). Results Reproducibility ranged 82–100% and 88–100% for OB1 and OB2, respectively. Inter-observer agreement (100 participants multiplied by 30 bones) was 92.1%. For specific bones, agreement rates less than 90% were found for scaphoid (81%), medial phalange V (83%), trapezium (84%) and metacarpal V (87%). Differences in wholeGP SAs obtained by the two observers were moderate (Cohen's d = 0.79). Mean differences between observers when using bone-by-bone SAs were trivial (30-boneGP: Cohen's d < 0.05; GPpmb: Cohen's d < 0.10). The impact of using the mean or the median was negligible, particularly when analyses did not include bones scored as mature. Conclusion The GP appeared to be a reasonably reproducible method to assess SA and inter-observer agreement was acceptable. There is evidence to support a recommendation of only scoring pre-mature bones during later adolescence. Further research is required to examine whether these findings are consistent in younger girls and in boys.
Supplementary information Supplementary information accompanies this paper at 10.1186/s12887-020-02383-4.


Background
The study of growth status, biological maturation, and physical performance is central to sports sciences, human biology, and pediatrics. Growth refers to changes in body size and has implications for proportionality, shape, and composition [1]. Maturation is more difficult to define and refers to the progress toward the adult (mature) state. This occurs in all tissues and organs, with different timing and at different rates, affecting functions and metabolism. Skeletal age (SA) refers to the degree of skeletal maturation and can be examined via a standardized radiograph (usually of the left hand and wrist). Although several indicators of biological maturation are available (e.g. secondary sex characteristics, age at peak height velocity, predicted percentage of mature stature), SA is frequently considered the gold standard indicator of biological maturity, partly because it can be applied from fetal life through childhood and the second decade of life [1].
Assessment of biological maturation is common in youth sports. It has been recognized that maturity status impacts performance [2][3][4][5], injury [6][7][8], and selection [9]. Many studies have assessed the SA of adolescents participating in team sports (e.g. soccer, hockey) and found, in general, that players were early maturing, taller, heavier and stronger [10,11]. Malina et al. [12] suggested that one reason for this observation is that, among male adolescent soccer players, late maturers tended to be systematically excluded during the years of maximal growth, while those classified as early or average maturing were selected and/or promoted by coaches and club administrators. This became even more evident as the players got older and specialized in one sport. Although the same rationale could be applied to female adolescent athletes, the evidence is scarce.
The Greulich-Pyle (GP) method is often used to obtain an SA estimate and involves comparing each individual bone against pictorial standards [13]. Across the published literature, however, there are various versions of the GP protocol and often a lack of detail regarding the methods used to estimate SA. For example, one study [14] stated "skeletal maturation was evaluated by the determination of bone age (BA) according to the GP method" (page 626) and did not detail the GP procedures, specifically whether the radiograph was examined as a whole (wholeGP) or bone by bone. A more recent study [15] described the biological maturity variation in body size, functional capacities, and sport-specific technical skills of 60 male Brazilian adolescent soccer players. The following was stated in the methodology: "the left handwrist radiographs were obtained in a specialized laboratory and the GP method was adopted to estimate SA" (page 465). The original authors prescribed that, after identifying the standard which most closely resembles the film being assessed, one should proceed to make a more detailed comparison of the individual bones [13]. In practice, however, GP SAs often seem to be derived solely from the SA of the standard plate that the film of a young person most closely matches, thus ignoring variation among bones. Research is required to ascertain the error associated with different methods of examination to inform future studies.
The present study aimed to examine intra-observer reproducibility and inter-observer agreement under alternative GP protocols. Firstly, estimates obtained using an overall (wholeGP) or a bone-by-bone approach were compared. Within the bone-by-bone inspection, estimates derived from all bones (30-boneGP) or solely from the pre-mature bones (GPpmb) were also compared. Finally, intra-individual mean differences were tested after calculating the individual SA estimate as either the mean or the median of the bone-specific SAs. It was hypothesized that agreement rates would differ between the wholeGP and the bone-by-bone approaches.

Ethics and procedures
This cross-sectional, descriptive study was approved by the Ethics Committee for Sports Sciences in the University of Coimbra (Reference CE/FCDEF-UC/00122014). All data were collected in the Porto Sports Medicine Center as part of the medical exams for registration in the Portuguese Soccer Federation (Law 204/2006; act 11/2012). An institutional agreement was signed between the University of Coimbra and the Portuguese Institute of Sports [IPDJ/FCDEF.UC/2017-01]. Parents or legal guardians were informed about the aims, testing protocols and risks, and provided informed consent. Participants were also informed about the nature of the research and that they were allowed to withdraw from the study at any time.
A standardized radiograph of the left hand-wrist was obtained by an experienced technician. SA was assessed using the GP method, which is often called the atlas method [13]. It is an inspectional protocol that was developed from a study of children from high socioeconomic families in Cleveland (Ohio, USA). The method involves matching the radiograph of an observed participant to the closest plate from the collection of illustrations (photographs) representing a sequence of biological (skeletal) maturation. The estimate of SA refers to the chronological age (CA) of the child from the Brush Foundation Study whose plate was classified as the closest to the one under examination. Thus, if the radiograph of a 13-year-old female soccer player matches the standard plate of the atlas obtained from an 11-year-old girl, the SA of the participant is 11 years. Each film was rated twice by an observer (OB1) who had completed a 45-h post-graduation course including the anatomy of the hand and wrist, the biological basis of skeletal tissue, and the sequence of changes for each of the 30 bones assessed by the GP method. In parallel, ratings were also completed by a trained examiner (OB2) who had conducted over 2000 assessments in the previous 3 years using the GP method in addition to other protocols (e.g. Tanner-Whitehouse; Fels). Examiners did not know the CA of the participants prior to applying the GP method.

Participants
The sample comprised 100 Portuguese female soccer players aged 12.0-16.7 years. To be included in the study, participants were required to have played competitive soccer for at least 2 years in a club affiliated to the Portuguese Soccer Federation. Exclusion criteria were: (i) ≥17 years of age; (ii) any traumatic bone injury in the left hand and wrist that causes radiopaque lines or areas; (iii) previous/current exposure to medicines (e.g. steroids, growth hormone) that affect growth.

Determination of SA
Each radiograph film was evaluated using three alternative protocols: inspection of the whole plate, with the closest atlas photograph retained as the SA of the participant (wholeGP); individual examination of 30 bones, all of which were included in the calculation of the SA (30-boneGP); individual examination of 30 bones, with the SA of the participant calculated from the pre-mature bones only (GPpmb). For 30-boneGP and GPpmb, SA was calculated using the mean (M) or, alternatively, the median (Md). Consequently, the following scores were produced: 30-boneGP-M, 30-boneGP-Md, GPpmb-M, GPpmb-Md.
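The four bone-by-bone scores can be illustrated with a minimal sketch. It is not the authors' software: the function name `skeletal_age_scores`, the input layout (one SA value and one mature/pre-mature flag per bone), the example values, and the choice to return `None` when a participant has no pre-mature bones are all assumptions made for illustration.

```python
from statistics import mean, median


def skeletal_age_scores(bone_sas, mature_flags):
    """Return the four bone-by-bone SA scores for one participant.

    bone_sas     -- bone-specific skeletal ages in years (hypothetical layout)
    mature_flags -- True where the bone was scored as mature
    """
    if len(bone_sas) != len(mature_flags):
        raise ValueError("one maturity flag is needed per bone")

    # GPpmb uses only the bones that were NOT scored as mature.
    pre_mature = [sa for sa, is_mature in zip(bone_sas, mature_flags)
                  if not is_mature]

    return {
        "30-boneGP-M": mean(bone_sas),
        "30-boneGP-Md": median(bone_sas),
        # Assumption: no GPpmb score is produced if every bone is mature.
        "GPpmb-M": mean(pre_mature) if pre_mature else None,
        "GPpmb-Md": median(pre_mature) if pre_mature else None,
    }


# Hypothetical participant: 20 mature bones (given 18.0 years here purely for
# illustration) and 10 pre-mature bones rated at 15.0 or 16.0 years.
example_sas = [18.0] * 20 + [15.0] * 5 + [16.0] * 5
example_flags = [True] * 20 + [False] * 10
example_scores = skeletal_age_scores(example_sas, example_flags)
```

With these hypothetical inputs, the mean and the median of all 30 bones differ because the many mature bones pull the median toward the upper end, whereas the two GPpmb scores coincide.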

Analyses
Intra-observer reproducibility rates for each of the 30 bones were reported for each participant. Afterwards, inter-observer agreement was calculated for each bone and analyzed by maturity status (that is, agreement between the two observers was examined separately for pre-mature and mature bones). The error (OB1 minus OB2) was calculated for each individual bone. Mean differences between the SAs rated by OB1 and OB2 were calculated, and the magnitudes of the standardized differences (d) were interpreted as follows [16]: < 0.20 (trivial), 0.20 to 0.59 (small), 0.60 to 1.19 (moderate), 1.20 to 1.99 (large), 2.0 to 3.9 (very large), and > 4.0 (extremely large). SAs produced by OB1 and OB2 were plotted using scatter diagrams. Overlapping variance among examiners was determined using Pearson correlation coefficients, which were interpreted as follows [16]: trivial (r < 0.1), small (0.1 < r < 0.3), moderate (0.3 < r < 0.5), large (0.5 < r < 0.7), very large (0.7 < r < 0.9), and nearly perfect (r > 0.9). Intra-class correlation coefficients were calculated to examine the variance between measurements for each observer. All analyses were performed using SPSS version 20 (SPSS, Inc. IBM Company; NY, USA) and GraphPad Prism (version 5 for Windows, GraphPad Software, San Diego, California, USA, www.graphpad.com).
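The agreement rate and the magnitude bands for standardized mean differences can be sketched as follows. This is an illustration only: the analyses reported here were run in SPSS and GraphPad Prism, the function names are hypothetical, and the pooled-standard-deviation form of Cohen's d is an assumption, since the text does not state which variant was computed (other forms exist for paired ratings of the same films). The top band (above 4.0) is labelled "extremely large" in this sketch, following the usual wording of that scale.

```python
from math import sqrt
from statistics import mean, stdev


def agreement_rate(ratings_a, ratings_b):
    """Percentage of paired ratings that are identical."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("two non-empty rating lists of equal length needed")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return 100.0 * matches / len(ratings_a)


def cohens_d(x, y):
    """Standardized mean difference using the pooled SD (assumed variant)."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * stdev(x) ** 2 + (ny - 1) * stdev(y) ** 2) / (nx + ny - 2)
    return (mean(x) - mean(y)) / sqrt(pooled_var)


def classify_d(d):
    """Map |d| onto the magnitude bands listed in the Analyses section."""
    bands = [(0.20, "trivial"), (0.60, "small"), (1.20, "moderate"),
             (2.0, "large"), (4.0, "very large")]
    for upper_bound, label in bands:
        if abs(d) < upper_bound:
            return label
    return "extremely large"  # top band, above 4.0
```

For instance, `classify_d(0.79)` falls in the moderate band and `classify_d(0.05)` in the trivial band, matching the labels attached to those values in the Results.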

Results
Tables 1 and 2 summarize the intra-observer agreement rates for OB1 and OB2, respectively. For the less experienced observer (OB1), agreement rates were < 90% for five bones (radius: 82%; ulna: 86%; metacarpal V: 89%; proximal phalange II: 89%; proximal phalange IV: 89%). Of the discrepancies, 66.8% were positive (time-moment 1 minus time-moment 2), and 87.5% of the total number of errors were −1 or +1 plates of the atlas (two consecutive stages; see Table 1). An identical analysis was performed for the more experienced observer (OB2), and the results are summarized in Table 2. When the 30 bones were individually scored, the intra-observer agreement rate was < 90% only for proximal phalange I (88%), and 99% of the errors emerged from variation among two consecutive stages, with a trend for lower SA values in the second examination (discrepancies −1 stage: 61.1%; discrepancies +1 stage: 23.3%). In other words, increasing practice, particularly in OB1, tended to produce slightly lower SA scores.
Inter-observer agreement rates are summarized in Table 3 for each bone and by maturity status (pre-mature or mature). The two observers had 100% agreement when scoring mature bones. The number of participants who had pre-mature bones was greater for the radius (n = 70), ulna (n = 52) and distal phalanges I-V (n = 65). The agreement rates between the two observers were lower for the pre-mature carpals (capitate, hamate, triquetral, lunate, scaphoid, trapezium, trapezoid). Excluding the pisiform and adductor sesamoid, disagreement between observers for the 28 bones occurred in 7.9% of the observations (221 occasions). Table 4 shows that 80.3% of disagreements were −1 or +1 stage, 17.1% were −2 or +2 stages, and 2.5% were −3 or +3 stages. In general, when SAs were not identical, the less experienced observer (OB1) tended to score higher SAs (69.2%). For the total sample, mean SAs obtained from OB1 and OB2 attained similar values, with mean differences classified as trivial, except for triquetral (d = 0.450; small), scaphoid (d = 0.390; small), metacarpal (d = 0.364; small) and medial phalange IV (d = 0.201; small).
Mean SAs calculated by the two observers are presented in Table 5. The mean SAs derived by the overall inspection (wholeGP) were 16.83 ± 1.30 years and 15.38 ± 1.22 years for OB1 and OB2, respectively (d = 0.79; moderate mean difference). The bone-by-bone approaches attenuated differences between observers. When using the median, differences between observers were not significant and were considered trivial (d = 0.10 for the pre-mature bones only; d = 0.05 for all 30 bones). Finally, independent of the observer, 30-boneGP using the median resulted in higher SAs compared to using the mean: +0.36 years for OB1; +0.38 years for OB2. When analyses used pre-mature bones only, mean and median values did not differ (Fig. 1d and e). The correlation coefficient between observers ranged from 0.841 (95% CI: 0.690 to 0.922) to 1.00, with the narrowest overlapping variance occurring when the exams followed the bone-by-bone approach and calculations included both pre-mature and mature bones using the mean (30-boneGP-M).
Table 1 Intra-observer error (observer 1) on SA estimates among female adolescent soccer players (n = 100)

Discussion
Although the Greulich-Pyle method has often been used to estimate SA from hand-wrist radiographs, little attention has been given to the impact of adopting different methodological approaches. The current study examined the reproducibility and agreement between two observers who assessed SAs of 100 female adolescent soccer players using the GP protocol. Disagreement between observers mostly occurred on carpal bones. Intra-observer agreement rates were acceptable, although the reproducibility was slightly lower for the less experienced observer. When differences existed, lower SAs were more likely to be derived in the second time-measurement. Finally, comparison between observers noted that the more experienced observer tended to produce slightly lower SA scores. A previous study [17] of North-American children aged 4-15 years noted a lack of intra-observer agreement among carpal bones. It is plausible that agreement rates were associated with age, particularly before round bones (i.e. carpals) reach the mature state [18]. Carpals are more complex to rate compared to long bones [19], whose examination concentrates on the centers of ossification and on epiphyseal-diaphyseal fusion. The examination of the carpal bones includes the inspection of the shape and of radiopaque lines or zones, which may help explain the poorer inter-observer agreement rates in the present study. Although mean differences between examiners in bone-specific SAs tended to be trivial or small, the number of disagreements (> 10%) was particularly apparent in the scaphoid and trapezium, metacarpals IV-V, and proximal phalange V. In general, the less experienced observer overestimated the ratings when compared to the experienced observer.
Table 3 Inter-observer agreement rates on the SAs according to skeletal classification as not mature or mature and for the total number of examinations among female adolescent soccer players (n = 100)
Thus, less experienced examiners may need to adopt a conservative decision (i.e. when unsure, match the radiograph to the younger of the two standard plates in the atlas). The literature includes considerable discussion about the exclusion of the carpal bones when a hand-wrist radiograph is assessed to obtain an estimate of SA [18][19][20]. The atlas method involves assigning an SA to each of the 30 bones of the hand-wrist [13]. The literature is not consistent regarding the appropriate utilization of the protocol. Consequently, GP SA is often assessed on the basis of an overall approach (that is, matching a film while ignoring potential variation among bones) [20,21]. The current study examined alternative approaches, such as including all 30 bones or only the pre-mature bones. The exclusion of mature bones is common in the literature [22,23]. For example, Todd [24] recommended retention of the most advanced bones when calculating SA. In the present study, within each observer, the SA using the mean did not fluctuate according to whether or not the mature bones were considered. Among Australian females, differences between the GP SA mean from all bones and the GP SA mean excluding the carpals were 0.02 and 0.05 for the 12- and 13-year-old groups, respectively [22]. In the present study, the inter-individual variance was substantially reduced when the calculations were based on the mean and included mature bones (standard deviation was 0.36-0.37 years, depending on the observer). The largest standard deviation was seen in the overall (wholeGP) approach (standard deviation: 1.22-1.30 years).
Fig. 1 Scatterplot of the SA estimated by Observer 1 (y-axis) and Observer 2 (x-axis) in the whole inspection (a), a bone-by-bone approach using the mean (b) and, alternatively, the median (c) to calculate individual SA from all bone-specific SAs; and uniquely considering SAs of the pre-mature bones (d using the mean; e using the median)
Recently, the median has been recommended as an alternative to the mean [1] to obtain the final SA from bone-specific SAs: "The SA of the standard plate is the assigned SA of the bone in question. The process is repeated for all bones that are visible in the hand-wrist x-ray and the child's SA is the median of the skeletal ages of each individually rated bone" (pages 279-280).
There are a few limitations to note. The present study only included adolescent females, many of whom had mature carpals. Future studies need to consider younger samples and males. However, it should be noted that sport tends to focus on the middle and later adolescent years [2,3,6,9], particularly in team sports such as soccer, where competitive and organized participation tends to start after 11 years. Future research should focus on whether the impact of observer and methodological approach differs by maturity status (i.e. early, average or late maturing).

Conclusions
In summary, the GP method showed acceptable reproducibility and agreement between observers, suggesting that 45 h of training (rating 100 radiographs) is adequate. A less experienced observer should be encouraged to select the younger of two standards when the decision is not obvious. The bone-by-bone approach has greater inter-observer agreement than the overall approach. Observers should organize their readings bone by bone across films, instead of completing all the scores for one participant at a time. Finally, the estimate of individual SA for each youth participant should use either the mean or the median of the pre-mature bones.