We provide a tutorial on the construction and evaluation of Markov decision processes (MDPs), which are powerful analytical tools used for sequential decision making under uncertainty that have been widely used in many industrial and manufacturing applications but are underutilized in medical decision making (MDM). We demonstrate the use of an MDP to solve a sequential clinical treatment problem under uncertainty. Markov decision processes generalize standard Markov models in that a decision process is embedded in the model and multiple decisions are made over time. Furthermore, they have significant advantages over standard decision analysis. We compare MDPs to standard Markov-based simulation models by solving the problem of the optimal timing of living-donor liver transplantation using both methods. Both models result in the same optimal transplantation policy and the same total life expectancies for the same patient and living donor. The computation time for solving the MDP model is significantly smaller than that for solving the Markov model. We briefly describe the growing literature of MDPs applied to medical decisions.
Breast cancer risk estimation with artificial neural networks revisited
Discrimination and calibration
Turgay Ayer MS,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
Oguzhan Alagoz PhD,
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
Jagpreet Chhatwal PhD,
- Health Economic Statistics, Merck Research Laboratories, North Wales, Pennsylvania
Jude W. Shavlik PhD,
- Department of Computer Science, University of Wisconsin, Madison, Wisconsin
Charles E. Kahn Jr MD, MS,
- Department of Radiology, Medical College of Wisconsin, Milwaukee, Wisconsin
Elizabeth S. Burnside MD, MPH, MS (corresponding author)
- Industrial and Systems Engineering Department, University of Wisconsin, Madison, Wisconsin
- Department of Radiology, University of Wisconsin, Madison, Wisconsin
- Department of Biostatistics and Medical Informatics, University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin
- Department of Radiology, University of Wisconsin Medical School, E3 of 311, 600 Highland Avenue, Madison, WI 53792-3252
Discriminating malignant breast lesions from benign ones and accurately predicting the risk of breast cancer for individual patients are crucial to successful clinical decisions. In the past, several artificial neural network (ANN) models have been developed for breast cancer-risk prediction. All studies have reported discrimination performance, but none has assessed calibration, which is an equally important measure for accurate risk prediction. In this study, the authors evaluated whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
Our dataset consisted of 62,219 consecutively collected mammography findings matched with the Wisconsin State Cancer Reporting System. The authors built a 3-layer feedforward ANN with 1000 hidden-layer nodes. The authors trained and tested their ANN by using 10-fold cross-validation to predict the risk of breast cancer. The authors used the area under the receiver-operating characteristic curve (AUC), sensitivity, and specificity to evaluate the discriminative performance of the radiologists and their ANN. The authors assessed the accuracy of risk prediction (ie, calibration) of their ANN by using the Hosmer-Lemeshow (H-L) goodness-of-fit test.
Their ANN demonstrated superior discrimination (AUC, 0.965) compared with the radiologists (AUC, 0.939; P < .001). The authors' ANN was also well calibrated as shown by an H-L goodness of fit P-value of .13.
The authors' ANN can effectively discriminate malignant abnormalities from benign ones and accurately predict the risk of breast cancer for individual abnormalities. Cancer 2010. © 2010 American Cancer Society.
Successful breast cancer diagnosis requires systematic image analysis, characterization, and integration of many clinical and mammographic variables.1 An ideal diagnostic system would discriminate between benign and malignant findings perfectly. Unfortunately, perfect discrimination has not been achieved, so radiologists must make decisions based on their best judgment of breast cancer risk amid substantial uncertainty. When there are numerous interacting predictive variables, ad hoc decision strategies based on experience and memory may lead to errors2 and variability in practice.3, 4 That is why there is intense interest in developing tools that can calculate an accurate probability of breast cancer to aid in making decisions.5-7
Discrimination and calibration are the 2 main components of accuracy in a risk-assessment model.8, 9 Discrimination is the ability to distinguish benign abnormalities from malignant ones. Although assessing discrimination with area under the receiver-operating characteristic (ROC) curve (AUC) is a popular method in the medical community, it may not be optimal in assessing risk prediction models that stratify individuals into risk categories.10 In this setting, calibration is also an important tool for accurate risk assessment of individual patients. Calibration measures how well the probabilities generated by the risk prediction model agree with the observed probabilities in the actual population of interest.11 There is a trade-off between discrimination and calibration, and a model typically cannot be perfect in both.10 In general, risk-prediction models need good discrimination when their aim is to separate malignant findings from benign ones, and good calibration when their aim is to stratify individuals into higher or lower risk categories to aid in decisions and communication.11
Computer models have the potential to help radiologists increase the accuracy of mammography examinations in both detection12-15 and diagnosis.16-20 Existing computer models in the domain of breast-cancer diagnosis can be classified under 3 broad categories: prognostic, computer-aided detection (CAD), and computer-aided diagnostic (CADx) models. Prognostic models, such as the Gail model,21-24 use retrospective risk factors such as a woman's age, her personal and family histories of breast cancer, and clinical information to predict breast cancer risk during a time interval in the future for treatment or risk-reduction decisions.24 These models provide guidance for clinical trial eligibility, tailored disease surveillance, and chemoprevention strategies.25 Because risk stratification is of primary interest in prognostic models, the performance of these models is assessed principally by calibration measures.11 Detection or CAD models12-15, 26-28 are developed to assist radiologists in identifying possible abnormalities in radiologic images, leaving the interpretation of the abnormality to the radiologist.29 Because discrimination is most important, and calibration is less critical in detection, the performance of CAD models is typically evaluated in terms of ROC curves.11 Diagnostic or CADx models30-39 characterize findings from mammograms (eg, size, contrast, shape) identified either by a radiologist or a CAD model29 to help radiologists classify lesions as benign or malignant by providing objective information, such as the risk of breast cancer.40 CADx models are similar to prognostic models in 1 way: they estimate the risk of breast malignancy to help physicians and patients improve decisions.29 On the other hand, CADx models differ from prognostic models in the sense that their risk estimation is based on mammography findings and at a single time point (ie, at the time of mammography) to aid in further imaging or intervention decisions.
Both discrimination and calibration are important features of a CADx model. High discrimination is needed because helping radiologists to distinguish malignant findings from benign ones is the primary purpose of CADx models.11 In addition, good calibration is needed to stratify risk and communicate the risk with patients as in the example of prognostic models.11
However, existing CADx studies that use ANNs to assess the risk of breast cancer have ignored calibration and focused only on discrimination ability.31, 36, 38, 39 Most of these studies have good discrimination but may be very poorly calibrated.41 For example, 4 such models report that no cancers would be missed if the threshold to defer biopsy was set to 10%-20%.31, 35, 37, 42 By suggesting a threshold in this range to defer biopsy, these models not only substantially exceed the accepted biopsy threshold in clinical practice of 2%,43 but they also indicate a systematic overestimation of malignancy risk. This discrepancy is likely attributable to suboptimal calibration.
In addition, existing studies have several potential limitations that make them impractical for clinical implementation. First, the training datasets used for building ANNs in these previous studies have been too small (104-1288 lesions)31, 35, 36, 38, 39 to yield reliable models. Second, the majority of these studies developed models by using only findings that underwent biopsy30, 31, 35-37, 39 or were referred to a surgeon38 and excluded all other findings from their analysis, which may lead to biased models.
Our research team has developed 2 CADx models that use the same dataset to discriminate malignant mammography findings from benign ones.33, 34 This study differs from our previous research in 2 ways. First, this study uses a different modeling technique (an ANN) than our previous research, which used logistic regression and a Bayesian network. Second, this study considers calibration, whereas our previous research, like many other CADx models, evaluated only discrimination.
The purpose of our study is to evaluate whether an ANN trained on a large prospectively collected dataset of consecutive mammography findings can discriminate between benign and malignant disease and accurately predict the probability of breast cancer for individual patients.
MATERIALS AND METHODS
The institutional review board exempted this Health Insurance Portability and Accountability Act (HIPAA)-compliant, retrospective study from requiring informed consent. The data used in this study have been presented in our previous studies33, 34 and are described again here for the convenience of the reader.
All of the screening and diagnostic mammograms performed at the Froedtert and Medical College of Wisconsin Breast Care Center between April 5, 1999 and February 9, 2004 were included in our dataset for retrospective evaluation. We consolidated our database in the National Mammography Database (NMD) format, a data format based on the standardized Breast Imaging Reporting and Data System (BI-RADS) lexicon developed by the American College of Radiology (ACR) for standardized monitoring and tracking of patients.44, 45 The study comprised 48,744 mammograms belonging to 18,269 patients (Table 1).
|Characteristic||Malignant, No. (%)||Benign, No. (%)||Total, No. (%)|
|No. of mammograms||477 (1)||48,267 (99)||48,744 (100)|
|Age groups, y|
|<45||66 (13.84)||9529 (19.74)||9595 (19.68)|
|45-49||49 (10.27)||7524 (15.59)||7573 (15.54)|
|50-54||56 (11.74)||7335 (15.2)||7391 (15.16)|
|55-59||71 (14.88)||6016 (12.46)||6087 (12.49)|
|60-64||59 (12.37)||4779 (9.9)||4838 (9.93)|
|≥65||176 (36.9)||13,084 (27.11)||13,260 (27.20)|
|Breast density|
|Predominantly fatty||61 (12.79)||7226 (14.97)||7287 (14.95)|
|Scattered fibroglandular||201 (42.14)||19,624 (40.66)||19,825 (40.67)|
|Heterogeneously dense||174 (36.48)||17,032 (35.29)||17,206 (35.30)|
|Extremely dense tissue||41 (8.6)||4385 (9.08)||4426 (9.08)|
|BI-RADS category|
|1||0 (0)||21,094 (43.7)||21,094 (43.28)|
|2||13 (2.73)||10,048 (20.82)||10,061 (20.64)|
|3||32 (6.71)||8520 (17.65)||8552 (17.54)|
|0||130 (27.25)||8148 (16.88)||8278 (16.98)|
|4||137 (28.72)||364 (0.75)||501 (1.03)|
|5||165 (34.59)||93 (0.19)||258 (0.53)|
Each mammogram was prospectively interpreted by 1 of 8 radiologists. Four of these radiologists were general radiologists, 2 of them were fellowship trained in breast imaging, and the other 2 had extensive experience in breast imaging. These radiologists had between 1-35 years of experience interpreting mammography. Each radiologist reviewed 6994 mammograms on average (median, 2924; range, 49-22,219) in our dataset.
Each mammographic finding, if any, was recorded as a unique entry in our database. In the case of a negative mammogram, a single entry showing only demographic data (age, personal history, prior surgery, and hormone replacement therapy) and the BI-RADS assessment category was entered. If an image had more than 1 reported finding with only 1 of them being cancer, we considered the other findings as false positives. Throughout the current article, the term “finding” will be used to denote the single record for normal mammograms or each record denoting an abnormality on a mammogram. Both radiologists (for mammography findings) and technologists (for demographic data) used the PenRad (Minnetonka, Minn) mammography reporting/tracking data system, which records clinical data in a structured format (ie, point-and-click entry of information populates the clinical report and the database simultaneously). We included in our ANN model all of the demographic risk factors and BI-RADS descriptors that were routinely collected in the practice and predictive of breast cancer (Table 2). We obtained the reading radiologist's information by merging the PenRad data with the radiology information system at the Medical College of Wisconsin. We could not assign 504 findings to a radiologist during our matching protocol. We elected to keep these unassigned findings in our dataset to maintain its consecutive nature.
|Age groups, y||<45, 45-50, 51-54, 55-60, 61-64, ≥65|
|Hormone therapy||None, <5 y, >5 y|
|Personal history of BCA||No, yes|
|Family history of BCA||None, minor (nonfirst-degree family members), major (1 or more first-degree family members)|
|Breast density||Predominantly fatty, scattered fibroglandular, heterogeneously dense, extremely dense|
|Mass shape||Oval, round, lobular, irregular, not present|
|Mass stability||Decreasing, stable, increasing, not present|
|Mass margins||Circumscribed, ill-defined, microlobulated, spiculated, not present|
|Mass density||Fat, low, equal, high, not present|
|Mass size||None, small (<3 cm), large (≥3 cm)|
|Lymph node||Present, not present|
|Asymmetric density||Present, not present|
|Skin thickening||Present, not present|
|Tubular density||Present, not present|
|Skin retraction||Present, not present|
|Nipple retraction||Present, not present|
|Skin thickening||Present, not present|
|Trabecular thickening||Present, not present|
|Skin lesion||Present, not present|
|Axillary adenopathy||Present, not present|
|Architectural distortion||Present, not present|
|Prior history of surgery||No, yes|
|Postoperative change||No, yes|
|Popcorn||Present, not present|
|Milk||Present, not present|
|Rodlike||Present, not present|
|Eggshell||Present, not present|
|Dystrophic||Present, not present|
|Lucent||Present, not present|
|Dermal||Present, not present|
|Round||Scattered, regional, clustered, segmental, linear ductal|
|Punctate||Scattered, regional, clustered, segmental, linear ductal|
|Amorphous||Scattered, regional, clustered, segmental, linear ductal|
|Pleomorphic||Scattered, regional, clustered, segmental, linear ductal|
|Fine Linear||Scattered, regional, clustered, segmental, linear ductal|
|BI-RADS category||0, 1, 2, 3, 4, 5|
We analyzed discrimination and calibration accuracy at the finding level because this is the level at which recall and biopsy decisions are made in clinical practice. We believe this is the level at which computer-assisted models will help radiologists improve performance. However, because conventional analysis of mammographic data is at the mammogram level (where findings from a single study are combined), we also calculated the cancer detection rate, the early stage cancer detection rate, and the abnormal interpretation rate at the mammogram level for comparison. We specify whether analyses in this study are based on mammograms or findings.
Data obtained from the Wisconsin Cancer Reporting System (WCRS), a statewide cancer registry, were used as our reference standard. The WCRS has been collecting information from hospitals, clinics, and physicians since 1978. The WCRS records demographic information, tumor characteristics (eg, date of diagnosis, primary site, stage of disease), and treatment information for all newly diagnosed breast cancers in the state. Under data exchange agreements, out-of-state cancer registries also provide reports on Wisconsin residents diagnosed in their states. Findings that had matching registry reports of ductal carcinoma in situ or any invasive carcinoma within 12 months of a mammogram date were considered positive. Findings shown to be benign by biopsy or without a registry match within the same time period were considered negative.
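The labeling rule described above can be illustrated with a minimal Python sketch. It is not the matching protocol used in the study: the function name `is_positive`, the 365-day approximation of "12 months", and the assumption that the window runs forward from the mammogram date are all illustrative choices.

```python
from datetime import date, timedelta

# Assumption: "within 12 months" is approximated as 365 days after the mammogram.
MATCH_WINDOW = timedelta(days=365)

def is_positive(mammogram_date, registry_diagnosis_dates):
    """Return True if any registry diagnosis falls inside the window."""
    return any(
        timedelta(0) <= dx - mammogram_date <= MATCH_WINDOW
        for dx in registry_diagnosis_dates
    )

# Hypothetical dates, for illustration only.
print(is_positive(date(2001, 3, 1), [date(2001, 9, 15)]))  # matched within the window
print(is_positive(date(2001, 3, 1), []))                   # no registry match
```

A finding with no registry match, or with a match outside the window, is labeled negative under this sketch.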
We built a 3-layer, feed-forward neural network by using Matlab 7.4 (Matlab, The Mathworks, Natick, Mass) with a backpropagation learning algorithm46 to estimate the likelihood of malignancy. The layers included an input layer of 36 discrete variables (mammographic descriptors, demographic factors, and BI-RADS final assessment categories as entered by the radiologists; Table 2), a hidden layer with 1000 hidden nodes, and an output layer with a single node generating the probability of malignancy for each finding. We designed our ANN to have a large number of hidden nodes because ANNs with a large number of hidden nodes generalize better than networks with a small number of hidden nodes when trained with backpropagation and “early stopping”47-49 (see Discussion).
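To make the layer sizes concrete, the following Python sketch runs one forward pass through a network with 36 inputs, 1000 hidden nodes, and a single output. It is only an illustration of the architecture: the sigmoid activations, the random weight initialization, and the helper names `init_layer` and `forward` are assumptions, since the text does not specify them, and the weights here are untrained.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Small random weights plus a zero bias for each output node."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(x, hidden, output):
    """Input layer -> hidden layer -> single output node."""
    w_h, b_h = hidden
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w_h, b_h)]
    w_o, b_o = output
    return sigmoid(sum(w * hi for w, hi in zip(w_o[0], h)) + b_o[0])

rng = random.Random(0)
N_INPUTS, N_HIDDEN = 36, 1000
hidden = init_layer(N_INPUTS, N_HIDDEN, rng)
output = init_layer(N_HIDDEN, 1, rng)

x = [0.0] * N_INPUTS   # hypothetical encoded finding
x[0] = 1.0
p = forward(x, hidden, output)
print(0.0 < p < 1.0)   # the output node yields a value usable as a probability
```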
To train and test our ANN, we used a standard machine-learning method called 10-fold cross-validation, which ensures that a test sample is never used for training. In our 10-fold cross-validation, the data were divided into 10 subsets that were approximately equal in size. In the first iteration, 9 of these subsets were combined and used for training. The remaining 10th subset was used for testing the performance of our ANN on unseen cases. We repeated this process for 10 iterations until all subsets were used once for testing. In addition to 10-fold cross-validation, to assess the robustness of our ANN, we performed the following supplementary analyses: 1) we trained our ANN on the first half of the dataset and tested on the second half, and 2) we trained our ANN on the second half of the dataset and tested on the first half.
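The partitioning step of 10-fold cross-validation can be sketched in Python as follows. This is a generic illustration rather than the study's code: the shuffle, the seed, and the function name `k_fold_indices` are assumptions.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; each item is tested exactly once."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)          # assumption: random assignment to folds
    folds = [idx[i::k] for i in range(k)]     # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test

splits = list(k_fold_indices(62219, k=10))
print(len(splits))
```

Because the folds are disjoint and together cover every index, no finding appears in both the training and the test set of the same iteration.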
We used an “early stopping” (ES) procedure to prevent our ANN from overfitting and to keep it generalizable to future cases.50, 51 Generalizability is the ability of a model to demonstrate similar predictive performance on data not used for training but consisting of unseen cases from the same population. A model lacks generalizability when overfitting occurs, a phenomenon whereby the model “memorizes” the cases in the training data but fails to generalize to new data. When overfitting occurs, ANNs obtain spuriously good performance by learning anomalous patterns unique to the training set but generate high error, and therefore low accuracy, when presented with unseen data.52 We performed ES by using a validation (tuning) set, in addition to a training and a testing set, to calculate the network error during training and to stop training early if necessary to prevent overfitting.50-52
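One common form of early stopping can be sketched in Python as below. The stopping criterion used in the study is not described in the text, so the patience-based rule, the function name `early_stopping_epoch`, and the validation errors shown are all hypothetical.

```python
def early_stopping_epoch(validation_errors, patience=5):
    """Return the index of the epoch whose weights would be kept.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_epoch, best_error, waited = 0, float("inf"), 0
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error, waited = epoch, err, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Hypothetical validation-set errors, one per training epoch.
errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.20, 0.21, 0.16]
print(early_stopping_epoch(errors, patience=3))  # 3
```

With a patience of 3 the loop stops after three non-improving epochs and keeps epoch 3; a larger patience would let training continue to the later improvement at epoch 7.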
We evaluated the discriminative ability of our ANN against radiologists at an aggregate level and at an individual-radiologist level. We plotted the receiver-operating characteristic (ROC) curve for our ANN by using the probabilities generated for all findings by means of our 10-fold cross-validation technique. We constructed the ROC curves for all radiologists individually and in aggregate by using BI-RADS assessment categories assigned by the radiologists to each finding. We ordered BI-RADS assessment categories by increasing likelihood of malignancy (1<2<3<0<4<5) for this purpose. We measured area under the curve (AUC), sensitivity, and specificity to assess the discriminative ability of our ANN and the radiologists (in aggregate and individually). We used a 2-tailed DeLong method53 to measure and compare AUCs because it accounts for correlation between the ROC curves obtained from the same data.
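As an illustration of how an AUC can be obtained from ordinal BI-RADS categories, the Python sketch below maps each category to its rank in the ordering 1<2<3<0<4<5 and computes the AUC as the probability that a malignant finding outranks a benign one, counting ties as one half. The pairwise formulation, the names `BIRADS_RANK` and `auc_from_scores`, and the toy data are assumptions; the DeLong comparison used in the study is not reproduced here.

```python
# Rank of each BI-RADS category in order of increasing suspicion (1<2<3<0<4<5).
BIRADS_RANK = {1: 0, 2: 1, 3: 2, 0: 3, 4: 4, 5: 5}

def auc_from_scores(scores, labels):
    """AUC as P(score_malignant > score_benign) + 0.5 * P(tie)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical findings: BI-RADS category and outcome (1 = malignant).
birads = [5, 4, 0, 3, 1, 2, 0, 1]
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [BIRADS_RANK[b] for b in birads]
print(auc_from_scores(scores, labels))
```

The pairwise loop is quadratic in the number of findings, which is fine for a toy example; a rank-based computation would be preferable on a dataset of this study's size.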
We calculated the sensitivity and specificity of our ANN and the radiologists at recommended levels of performance: sensitivity at a specificity of 90% and specificity at a sensitivity of 85%, as they represent the minimal performance thresholds for screening mammography.54 When calculating the sensitivity and specificity of the radiologists, we considered BI-RADS 0, 4, and 5 positive, whereas BI-RADS 1, 2, and 3 were designated negative.45 We used a 1-tailed McNemar test to compare sensitivity and specificity between the radiologists and our ANN.55 The McNemar test accounts for correlation between the sensitivity and specificity ratios and is not defined when the ratios are equal or when 1 of the ratios is 0 or 1. We used the Wilson method to generate confidence intervals for sensitivity and specificity.56 We considered P < .05 to be the level of statistical significance.
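The Wilson score interval for a proportion such as sensitivity can be sketched in Python as follows. This is the standard formula without a continuity correction; whether the study applied a correction is not stated, so the result need not agree with the published intervals to the last digit. The function name `wilson_interval` is illustrative, and the counts come from the radiologists' row of Table 5 (419 true positives among 510 malignant findings).

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (no continuity correction)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

low, high = wilson_interval(419, 510)
print(round(100 * low, 1), round(100 * high, 1))
```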
We assessed the calibration of our ANN by calculating the Hosmer-Lemeshow (H-L) goodness-of-fit statistic57 and plotting a calibration curve. The H-L statistic compares the observed and predicted risk within risk categories. A lower H-L statistic and a higher P value (P > .05) indicate better calibration. For the H-L statistic, the predicted risks of findings were rank-ordered and divided into 10 groups, based on their predicted probability. Within each predicted risk group, the number of predicted malignancies was accumulated against the number of observed malignancies. The H-L statistic was calculated from this 2 × 10 contingency table. The H-L statistic was then compared with the chi-square distribution, with degrees of freedom equal to 8. We also plotted a calibration curve to visually compare calibration of our ANN to the perfect calibration in predicting breast malignancy risk. In a calibration curve, a line at a 45° angle (line of identity) indicates perfect calibration. Data points to the right of the perfect calibration line represent overestimation of the risk, and those to the left of the line represent underestimation.58 Although a calibration curve does not provide a quantitative measure of reliability for probability predictions, it provides a graphical representation of the degree to which predicted probability of malignancy by our ANN corresponds to actual prevalence.58, 59 The calibration curve shows the ability of the model to enable prediction of probabilities across all ranges of risk.
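The H-L computation described above can be sketched in Python as follows. This is an illustration, not the study's code: the function names are hypothetical, groups are formed as equal-size slices of the rank-ordered predictions, and the P value uses a closed-form chi-square tail that is valid only for an even number of degrees of freedom (8 here). The sketch assumes no group has a mean predicted risk of exactly 0 or 1.

```python
import math

def hosmer_lemeshow(probs, outcomes, groups=10):
    """H-L statistic over equal-size groups of rank-ordered predicted risks."""
    pairs = sorted(zip(probs, outcomes), key=lambda t: t[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        size = len(chunk)
        expected = sum(p for p, _ in chunk)   # predicted malignancies in the group
        observed = sum(y for _, y in chunk)   # observed malignancies in the group
        mean_p = expected / size
        stat += (observed - expected) ** 2 / (size * mean_p * (1.0 - mean_p))
    return stat

def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

# An H-L statistic of 12.46 with 8 degrees of freedom gives P of about .13.
print(round(chi2_sf_even_df(12.46, 8), 2))  # 0.13
```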
RESULTS
After matching to the cancer registry, our final matched dataset contained a total of 62,219 findings (510 [0.8%] malignant and 61,709 [99.2%] benign) in 18,269 patients (17,924 women and 345 men). The mean age of the female patients was 56.5 years (range, 17.7-99.1; SD, 12.7). Women were, on average, 2 years younger than men, whose mean age was 58.5 years (range, 18.6-88.5; SD, 15.7).
Our analysis at the mammogram level showed that 14% of the mammographic abnormalities occurred in predominantly fatty tissue, 41% in scattered fibroglandular tissue, 36% in heterogeneously dense tissue, and 9% in extremely dense tissue (Table 1). At the finding level, the cancers included 246 masses, 121 microcalcifications, 27 asymmetries, 18 architectural distortions, 86 combinations of findings, and 12 other.
Cancer registry match revealed a detection rate of 8.9 cancers per 1000 mammograms for the radiologists at the mammogram level (432 cancers for 48,744 mammograms—33 patients had more than 1 cancer resulting in 510 total cancers). The abnormal interpretation rate (considering BI-RADS 0, 4, and 5 abnormal) was 18.5% (9037 of 48,744 mammograms). Of all the 432 cancers, 390 had staging information from the cancer registry, and 42 did not. Of the detected cancers with staging information, only 26.7% (104 of 390) had lymph node metastasis, and 71% (277 of 390) were early stage (ie, stage 0 or 1).
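The rates quoted above follow directly from the reported counts, as this short Python check shows. The variable names are illustrative; the counts are taken from the paragraph above.

```python
cancers, mammograms = 432, 48_744
abnormal = 9_037
staged, node_positive, early_stage = 390, 104, 277

detection_per_1000 = 1000 * cancers / mammograms
abnormal_rate = 100 * abnormal / mammograms

print(round(detection_per_1000, 1))              # 8.9 cancers per 1000 mammograms
print(round(abnormal_rate, 1))                   # 18.5 (% abnormal interpretations)
print(round(100 * node_positive / staged, 1))    # 26.7 (% with lymph node metastasis)
print(round(100 * early_stage / staged))         # 71 (% early stage)
```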
Following training and testing using 10-fold cross-validation, the AUC of our ANN, 0.965, was significantly higher than that of the radiologists in aggregate, 0.939 (P < .001), at the finding level, which implied that our ANN performed better than the radiologists alone in discriminating between benign and malignant findings. The ROC curve of our ANN (aggregate level) dominated the combined ROC curve of all radiologists at all cutoff thresholds (Fig. 1). This trend was preserved when the ANN was trained on the first half of the dataset and tested on the second half (ANN AUC, 0.949; radiologists AUC, 0.926; P < .001) or when trained on the second half of the dataset and tested on the first half (ANN AUC, 0.966; radiologists AUC, 0.951; P < .001). At the individual-radiologist level, 4 of 8 comparisons were not statistically significant (Table 3). Of the 4 significant differences, our ANN outperformed the radiologists in all except a single, low-volume reader (Radiologist 8, Table 3).
At a specificity of 90%, the sensitivity of our ANN was significantly better (90.7% vs 82.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN identified 44 more cancers when compared with the radiologists at this level of specificity (Table 5, part A). At a fixed sensitivity of 85%, the specificity of our ANN was also significantly better (94.5% vs 88.2%; P < .001) than that of the radiologists (in aggregate; Table 4). Our ANN decreased the number of false positives by 3941 when compared with the radiologists' performance at this level of sensitivity (Table 5, part B). In terms of specificity, all statistically significant comparisons revealed the ANN to be superior with the exception of 1 low-volume reader (Radiologist 8 in Table 4). In terms of sensitivity, all statistically significant comparisons revealed the ANN to be superior; however, 1 low-volume reading radiologist demonstrated the opposite trend (Radiologist 1 in Table 4).
|Radiologist||No. of benign findings||No. of malignant findings||Radiologist sensitivity, % (95% CI)||ANN sensitivity, % (95% CI)||P||Radiologist specificity, % (95% CI)||ANN specificity, % (95% CI)||P|
|1||3312||77||93.5 (84.8, 97.6)||88.4 (78.4, 94.1)||.0625||94.4 (93.6, 95.2)||96.9 (96.4, 97.5)||<.001|
|3||18953||180||78.3 (71.4, 83.9)||90.0 (84.5, 93.8)||<.001||85.0 (84.4, 85.5)||95.0 (94.7, 95.3)||<.001|
|4||26690||171||82.4 (75.7, 87.6)||93.0 (87.8, 96.1)||<.001||85.6 (85.1, 86.0)||96.4 (96.1, 96.5)||<.001|
|6||6796||36||83.3 (66.5, 93.0)||86.1 (69.7, 94.7)||.999||88.4 (87.6, 89.1)||94.5 (93.9, 95.0)||<.001|
|7||3637||29||75.8 (56.0, 88.9)||72.5 (52.5, 86.5)||.999||79.9 (78.6, 81.2)||86.2 (85.0, 87.2)||<.001|
|8||1695||9||77.7 (40.1, 96.0)||66.7 (30.9, 90.9)||.999||86.7 (85.0, 88.3)||80.7 (78.7, 82.5)||<.001|
|Unassigned||497||7||100.0 (56.1, 100.0)||100.0 (56.1, 100.0)||ND||98.3 (96.7, 99.2)||99.6 (98.4, 99.9)||.015|
|Total||61709||510||82.2 (78.5, 85.3)||90.7 (87.8, 93.0)||<.001||88.2 (87.9, 88.5)||94.5 (94.3, 94.6)||<.001|
|A.||Performance at 90% Specificity|
|True Positive||False Negative|
|Radiologists||419 (400-435)||91 (75-110)|
|ANN||463 (449-475)||47 (36-62)|
|B.||Performance at 85% Sensitivity|
|False Positive||True Negative|
|Radiologists||7282 (7126-7441)||54,427 (54,268-54,583)|
|ANN||3341 (3232-3454)||58,368 (58,256-58,477)|
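The headline differences reported in the text can be recovered from the Table 5 point estimates, as in the Python check below. The part B counts are treated here as false positives and true negatives, which is what the text ("decreased the number of false positives by 3941") and the row sums (61,709 benign findings) imply. The dictionary names are illustrative.

```python
# Point estimates as reported in Table 5 (confidence intervals omitted).
radiologists = {"tp": 419, "fn": 91, "fp": 7282, "tn": 54427}
ann_at_90_spec = {"tp": 463, "fn": 47}        # part A
ann_at_85_sens = {"fp": 3341, "tn": 58368}    # part B

extra_cancers = ann_at_90_spec["tp"] - radiologists["tp"]
fewer_false_positives = radiologists["fp"] - ann_at_85_sens["fp"]
print(extra_cancers, fewer_false_positives)  # 44 3941
```

Each part A row sums to the 510 malignant findings and each part B row to the 61,709 benign findings, consistent with the dataset totals.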
The H-L statistic for our ANN was 12.46 (P = .13, df = 8). The calibration of the predicted probabilities is shown graphically in Figure 2. Although the calibration curve of our ANN does not perfectly match the line of identity (the line at a 45° angle), the deviation is visually minimal.
DISCUSSION
We have demonstrated that our ANN can accurately estimate the risk of breast cancer by using a dataset that contains demographic data and prospectively collected mammographic findings. To our knowledge, this study uses 1 of the largest datasets of mammography findings to develop a CADx model. Our results demonstrate that ANNs may have the potential to aid radiologists in discriminating between benign and malignant breast diseases. When we compare discriminative accuracy by using AUC, sensitivity, and specificity, our ANN performs significantly better than all radiologists in aggregate. Although the difference between the AUCs of the radiologists and our ANN may appear to be small (0.026), this difference is both statistically (P < .001) and clinically significant because our ANN identified 44 more cancers and decreased the number of false positives by 3941 when compared with the radiologists at the specified sensitivity and specificity values. Note that these results would be similar for any other specified sensitivity and specificity values because the ROC curve of our ANN outperforms that of the radiologists at all threshold levels. On the other hand, the reason for obtaining a numerically small difference between the AUCs relates to the disproportionate number of benign findings (61,709) compared to malignant findings (510) in our dataset resulting in very high specificity at baseline and little room for improvement in this parameter.
Among statistically significant comparisons, our ANN demonstrates superior AUC, sensitivity, and specificity versus all but 1 radiologist, including the 2 highest-volume readers. Therefore, similar to other ANN models presented in the literature, our ANN has the potential to aid radiologists in classifying (discriminating) findings on mammograms by predicting the risk of malignancy. When compared with the previous CADx models developed by our research team (a logistic regression and a Bayesian network), the discrimination performance of our ANN was slightly higher (ANN AUC, 0.965; logistic regression AUC, 0.963; Bayesian network AUC, 0.960). On the other hand, no statistically significant difference was found between the ANN and the logistic regression (P = .57), or the ANN and the Bayesian network (P = .13).
However, our model is unique in several ways. In contrast to prior ANN models, which used a relatively small selected population of suspicious findings undergoing tissue sampling with biopsy as the reference standard,30, 31, 35-37, 39 we use a large consecutive dataset of mammography findings with tumor registry outcomes as the reference standard to train our ANN. Furthermore, contrary to previously developed CADx models in breast cancer-risk prediction, we expand the evaluation of CADx models beyond discrimination by measuring the accuracy of the estimated probabilities themselves by using calibration metrics.
Although discrimination or accurate classification is of primary interest for CADx models,11, 60 calibration is also crucial, especially when clinical decisions are being made for individual patients.11, 61 Individual decisions are made under uncertainty and, therefore, are aided more effectively by accurate risk estimates. Because there is a trade-off between discrimination and calibration,10 the selection of the primary performance measure should be based on the intended purpose of the model.11 In this study, similar to previous CADx models, we designed our ANN primarily to optimize discrimination ability. However, contrary to previous CADx studies, we also measured calibration as a secondary objective. We showed that our ANN is well calibrated, as demonstrated by the low value of the H-L statistic, the corresponding high P value, and the favorable calibration curve; thus, our ANN can accurately estimate the risk of malignancy for individual patients. The ability of our ANN to assign accurate numeric probabilities is an important complement to its ability to discriminate between ultimate outcomes.61
We posit that the good calibration of our ANN is attributable to both the characteristics of our training set and attributes of our model. For example, the consecutive nature of our dataset of mammography findings and the use of a tumor registry match as a reference standard, which reflects a real-world population, may lead to accurate calibration. In addition, the use of a large number of hidden nodes in concert with training with a validation set to prevent overfitting may have enhanced calibration. In future work, we plan to analyze which parameters most profoundly influence calibration.
Previous CADx models for breast cancer risk estimation have ignored calibration and have typically been developed and evaluated on the basis of their discrimination ability.31-39 Although calibration has not been formally assessed in previous CADx models, there is some evidence that these models are not well calibrated.31, 35, 42 Poor calibration may indicate that these models are not optimized for individual cases, ie, the predicted breast cancer risk for a single patient may be incorrect.
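The following minimal sketch illustrates why a model evaluated only on discrimination can still give misleading individual risks. The AUC depends only on how cases are ranked, so any strictly increasing distortion of the predicted probabilities leaves it unchanged while altering the probabilities themselves. The function names `auc` and `mean_risk` and the toy data are assumptions for illustration, not part of our study.

```python
def auc(y_true, y_score):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for s, y in zip(y_score, y_true) if y == 1]
    neg = [s for s, y in zip(y_score, y_true) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    # A positive case ranked above a negative case counts 1; a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_risk(y_prob):
    """Average predicted risk, to compare with the observed event rate."""
    return sum(y_prob) / len(y_prob)


if __name__ == "__main__":
    outcomes = [0, 0, 1, 1]                 # toy data only
    risks = [0.1, 0.4, 0.35, 0.8]
    distorted = [r ** 2 for r in risks]     # strictly increasing distortion
    print(auc(outcomes, risks), auc(outcomes, distorted))
    print(mean_risk(risks), mean_risk(distorted), sum(outcomes) / len(outcomes))
```

In this toy case both sets of scores rank the cases identically and therefore share the same AUC, but the distorted scores understate the average risk relative to the observed event rate.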
From a clinical standpoint, our ANN may be valuable because it provides an accurate post-test probability for malignancy. This post-test probability may be useful for communication among the radiologist, patient, and referring physician, which, in turn, may encourage shared decision making.5-7 Each individual patient has a unique risk tolerance and comorbidities, and these factors should be considered when making decisions involving mammographic abnormalities. Risk assessments based on individual characteristics may also help promote the concept of personalized care in the diagnosis of breast cancer. Furthermore, our ANN is designed to increase the effectiveness of mammography by aiding radiologists and not by acting as a substitute. Our ANN quantifies the risk of breast cancer by using mammographic features assessed by the radiologist, so the ANN's performance depends largely on the radiologist's accurate observations and overall assessment (BI-RADS category).
Our ANN has the potential to be used as a decision-support tool, although it may face challenges similar to those that have, in the past, prevented the implementation of effective decision-support algorithms in clinical practice. To be used in the clinic, a decision-support tool must be seamlessly integrated into the clinical workflow, which can be challenging. We believe that, in the case of mammography, a decision-support tool would be most useful if directly linked to the structured reporting software that radiologists use in daily practice, which would enable immediate feedback. On the other hand, the good performance of our ANN may not be preserved after integration into clinical practice. Before clinical integration, it is important to consider the ways our ANN could fail, owing to both inherent theoretical limitations and errors that may occur during the process of integration.62 In fact, numerous computer-aided diagnostic models that have performed well in evaluation studies have not made an impact on clinical practice.63-68 Furthermore, our ANN would need to perform optimally to gain the trust of clinicians and influence clinical practice. Unfortunately, the parameters of ANNs do not carry any real-life interpretation, and clinicians have trouble trusting decision-support algorithms that represent a “black box” without explanation capabilities. Although there is rule extraction software that converts a trained ANN to a more human-understandable representation,69-71 integrating these various software programs with the ANN requires extra effort. Therefore, we recognize that substantial challenges remain in the implementation of ANNs for decision support at the point of care, and we emphasize the importance of these issues for future research and implementation.
There are 3 important implementation considerations. First, determining the number of effective hidden nodes in an ANN is crucial and may significantly affect its output performance. Unfortunately, there is no general rule to determine the effective number of hidden nodes that maximizes the network performance when presented with an unseen dataset (generalizability).47 Although conventional wisdom holds that neural networks with excess hidden nodes generalize poorly,48 several recent studies in the machine-learning literature have shown that ANNs with excess capacity (ie, with a large number of hidden nodes) generalize better than small networks (ie, networks with a small number of hidden nodes) when trained with backpropagation and early stopping.47-49 Therefore, we built an ANN with excess capacity and did not optimize the number of hidden nodes. Also, note that if we had optimized the number of hidden nodes to maximize the AUC, as other researchers have, we would have achieved an even higher AUC than described here.
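The early-stopping idea mentioned above can be sketched generically as follows. This is a minimal illustration under stated assumptions, not our training code: `train_with_early_stopping`, `train_one_epoch`, `validation_loss`, `max_epochs`, and `patience` are hypothetical names, and the model is treated as an opaque object that the caller knows how to update and score.

```python
import copy


def train_with_early_stopping(model, train_one_epoch, validation_loss,
                              max_epochs=1000, patience=20):
    """Train until the validation loss stops improving.

    train_one_epoch(model) updates the model in place (eg, one
    backpropagation pass over the training set); validation_loss(model)
    returns the error on held-out validation cases. Returns the best
    snapshot, the epoch at which it occurred, and its validation loss.
    """
    best_loss = float("inf")
    best_epoch = 0
    best_model = copy.deepcopy(model)
    for epoch in range(1, max_epochs + 1):
        train_one_epoch(model)
        loss = validation_loss(model)
        if loss < best_loss:
            # Validation error improved: keep a snapshot of these weights.
            best_loss, best_epoch = loss, epoch
            best_model = copy.deepcopy(model)
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: stop before overfitting.
            break
    return best_model, best_epoch, best_loss
```

Because the returned snapshot is the one with the lowest validation error, a network with many hidden nodes is stopped before it can use its excess capacity to fit noise in the training set.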
Second, selection of the primary performance measure is also crucial when building an ANN model. In our study, we built our ANN principally to maximize discrimination accuracy because discrimination is of primary interest for accurate diagnosis.11, 60 On the other hand, ANNs could also be trained to maximize calibration when the primary purpose is to stratify individuals into higher or lower risk categories of clinical importance. However, it should be noted that for a direct maximization of calibration, the probabilities estimated by the ANN should be compared with the true underlying probabilities,72