`High-performance medicine: the convergence of
`human and artificial intelligence
`Eric J. Topol 
`The use of artificial intelligence, and the deep-learning subtype in particular, has been enabled by the use of labeled big data, along
`with markedly enhanced computing power and cloud storage, across all sectors. In medicine, this is beginning to have an impact
`at three levels: for clinicians, predominantly via rapid, accurate image interpretation; for health systems, by improving workflow
`and the potential for reducing medical errors; and for patients, by enabling them to process their own data to promote health.
`The current limitations, including bias, privacy and security, and lack of transparency, along with the future directions of these
`applications will be discussed in this article. Over time, marked improvements in accuracy, productivity, and workflow will likely
`be actualized, but whether that will be used to improve the patient–doctor relationship or facilitate its erosion remains to be seen.
`Medicine is at the crossroads of two major trends. The first
`is a failed business model, with increasing expenditures
`and jobs allocated to healthcare, but with deteriorating key
`outcomes, including reduced life expectancy and high infant, child-
`hood, and maternal mortality in the United States1,2. This exem-
`plifies a paradox that is not at all confined to American medicine:
`investment of more human capital with worse human health out-
`comes. The second is the generation of data in massive quantities,
`from sources such as high-resolution medical imaging, biosensors
`with continuous output of physiologic metrics, genome sequenc-
`ing, and electronic medical records. The limits on analysis of such
`data by humans alone have clearly been exceeded, necessitating
`an increased reliance on machines. Accordingly, at the same time
`that there is more dependence than ever on humans to provide
`healthcare, algorithms are desperately needed to help. Yet the inte-
`gration of human and artificial intelligence (AI) for medicine has
`barely begun.
`Looking deeper, there are notable, longstanding deficiencies in
`healthcare that are responsible for its path of diminishing returns.
`These include a large number of serious diagnostic errors, mis-
`takes in treatment, an enormous waste of resources, inefficiencies
`in workflow, inequities, and inadequate time between patients and
`clinicians3,4. Eager for improvement, leaders in healthcare and com-
`puter scientists have asserted that AI might have a role in address-
`ing all of these problems. That might eventually be the case, but
`researchers are at the starting gate in the use of neural networks to
`ameliorate the ills of the practice of medicine. In this Review, I have
`gathered much of the existing base of evidence for the use of AI in
`medicine, laying out the opportunities and pitfalls.
`Artificial intelligence for clinicians
`Almost every type of clinician, ranging from specialty doctor to
`paramedic, will be using AI technology, and in particular deep
`learning, in the future. This largely involves pattern recognition
`using deep neural networks (DNNs) (Box 1) that can help interpret
`medical scans, pathology slides, skin lesions, retinal images, electro-
`cardiograms, endoscopy, faces, and vital signs. The neural net inter-
`pretation is typically compared with physicians’ assessments using a
`plot of true-positive versus false-positive rates, known as a receiver
`operating characteristic (ROC), for which the area under the curve
`(AUC) is used to express the level of accuracy (Box 1).
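The ROC and AUC metrics described above can be illustrated with a minimal sketch. The labels and scores below are hypothetical values chosen only for illustration and do not come from any study cited in this Review; the function names `roc_points` and `auc` are likewise assumptions made for this example.

```python
def roc_points(labels, scores):
    """Return (false-positive rate, true-positive rate) pairs, one per score threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    # Sweep the decision threshold from the highest score downward.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    true_pos = false_pos = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Cases with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                true_pos += 1
            else:
                false_pos += 1
            i += 1
        points.append((false_pos / negatives, true_pos / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical example: 1 = disease present, 0 = disease absent.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.10]
example_auc = auc(roc_points(labels, scores))
print(f"AUC = {example_auc:.2f}")
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of positive and negative cases.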
`Radiology. One field that has attracted particular attention for
`application of AI is radiology5. Chest X-rays are the most common
`type of medical scan, with more than 2 billion performed worldwide
`per year. In one study, the accuracy of one algorithm, based on a
`121-layer convolutional neural network, in detecting pneumonia in
`over 112,000 labeled frontal chest X-ray images was compared with
`that of four radiologists, and the conclusion was that the algorithm
`outperformed the radiologists. However, the algorithm’s AUC of
`0.76, although somewhat better than that for two previously tested
`DNN algorithms for chest X-ray interpretation5, is far from optimal.
`In addition, the test used in this study is not necessarily comparable
`with the daily tasks of a radiologist, who will diagnose much more
`than pneumonia in any given scan. To further validate the conclu-
`sions of this study, a comparison with results from more than four
`radiologists should be made. A team at Google used an algorithm
`that analyzed the same image set as in the previously discussed
`study to make 14 different diagnoses, resulting in AUC scores that
`ranged from 0.63 for pneumonia to 0.87 for heart enlargement or
`a collapsed lung6. More recently, in another related study, it was
`shown that a DNN that is currently in use in hospitals in India for
`interpretation of four different chest X-ray key findings was at least
`as accurate as four radiologists7. For the narrower task of detecting
`cancerous pulmonary nodules on a chest X-ray, a DNN that retro-
`spectively assessed scans from over 34,000 patients achieved a level
`of accuracy exceeding 17 of 18 radiologists8. It can be difficult for
`emergency room doctors to accurately diagnose wrist fractures,
`but a DNN led to marked improvement, increasing sensitivity from
`81% to 92% and reducing misinterpretation by 47% (ref. 9).
`Similarly, DNNs have been applied across a wide variety of
`medical scans, including bone films for fractures and estimation of
`aging10–12, classification of tuberculosis13, and vertebral compression
`fractures14; computed tomography (CT) scans for lung nodules15,
`liver masses16, pancreatic cancer17, and coronary calcium score18;
`brain scans for evidence of hemorrhage19, head trauma20, and acute
`referrals21; magnetic resonance imaging22; echocardiograms23,24;
`and mammographies25,26. A unique imaging-recognition study
`focusing on the breadth of acute neurologic events, such as stroke
`or head trauma, was carried out on over 37,000 head CT 3-D scans,
`which the algorithm analyzed for 13 different anatomical find-
`ings versus gold-standard labels (annotated by expert radiologists)
`and achieved an AUC of 0.73 (ref. 27). A simulated prospective,
`double-blind, randomized control trial was conducted with real
`cases from the dataset and showed that the deep-learning algorithm
`could interpret scans 150 times faster than radiologists (1.2 versus
`177 seconds). But the conclusion that the algorithm’s diagnostic
`accuracy in screening acute neurologic scans was poorer than human
`performance was sobering and indicates that there is much more
`work to do.
`Box 1 | Deep learning
`While the roots of AI date back over 80 years from concepts
`laid out by Alan Turing204,205 and Warren McCulloch and Walter
`Pitts206, it was not until 2012 that the subtype of deep learning
`was widely accepted as a viable form of AI207. A deep learning
`neural network consists of digitized inputs, such as an image
`or speech, which proceed through multiple layers of connected
`‘neurons’ that progressively detect features, and ultimately
`provides an output. By analyzing 1.2 million carefully annotated
`images from over 15 million in the ImageNet database, a DNN
`achieved, for that point in time, an unprecedented low error
`rate for automated image classification. That report, along with
`Google Brain’s 10 million images from YouTube videos to
`accurately detect cats, laid the groundwork for future progress.
`Within 5 years, in specific large data-labeled test sets, deep-learning
`algorithms for image recognition surpassed the human accuracy
`rate208,209, and, in parallel, suprahuman performance was
`demonstrated for speech recognition.
`The basic DNN architecture is like a club sandwich turned on
`its side, with an input layer, a number of hidden layers ranging
`from 5 to 1,000, each responding to different features of the
`image (like shape or edges), and an output layer. The layers are
`‘neurons,’ comprising a neural network, even though there is
`little support of the notion that these artificial neurons function
`similarly to human neurons. A key differentiating feature of deep
`learning compared with other subtypes of AI is its autodidactic
`quality; the neural network is not designed by humans, but rather
`the number of layers (Fig. 1) is determined by the data itself.
`Image and speech recognition have primarily used supervised
`learning, with training from known patterns and labeled input
`data, commonly referred to as ground truths. Learning from
`unknown patterns without labeled input data (unsupervised
`learning) has very rarely been applied to date. There are many
`types of DNNs and learning, including convolutional, recurrent,
`generative adversarial, transfer, reinforcement, and representation
`(for review see refs. 210,211). Deep-learning algorithms
`have been the backbone of computer performance that exceeds
`human ability in multiple games, including the Atari video
`game Breakout, the classic game of Go, and Texas Hold’em
`poker. DNNs are largely responsible for the exceptional progress
`in autonomous cars, which is viewed by most as the pinnacle
`technological achievement of AI to date. Notably, except in
`the cases of games and self-driving cars, a major limitation to
`interpretation of claims reporting suprahuman performance of
`these algorithms is that analytics are performed on previously
`generated data in silico, not prospectively in real-world clinical
`conditions. Furthermore, the lack of large datasets of carefully
`annotated images has been limiting across various disciplines in
`medicine. Ironically, to compensate for this deficiency, generative
`adversarial networks have been used to synthetically produce
`large image datasets at high resolution, including mammograms,
`skin lesions, echocardiograms, and brain and retina scans, that
`could be used to help train DNNs212–216.
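The layered architecture described in Box 1 (an input layer, hidden layers, and an output layer) can be sketched as a forward pass through a tiny fully connected network. This is an illustration under stated assumptions only: the layer sizes, the random untrained weights, and the ReLU and softmax functions are choices made for this example and are not taken from the Review or from any cited study.

```python
import math
import random


def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def softmax(values):
    """Convert raw output scores into probabilities that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One layer of 'neurons': a weighted sum of the inputs plus a bias per neuron."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def init_layer(n_in, n_out, rng):
    """Random, untrained weights (assumption: real networks learn these from labeled data)."""
    weights = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def forward(pixels, layers):
    """Pass the input through each hidden layer, then the output layer."""
    activation = pixels
    for weights, biases in layers[:-1]:
        activation = relu(dense(activation, weights, biases))
    weights, biases = layers[-1]
    return softmax(dense(activation, weights, biases))


rng = random.Random(0)
sizes = [4, 8, 8, 2]  # hypothetical: 4 inputs, two hidden layers of 8, 2 output classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
probabilities = forward([0.1, 0.5, 0.9, 0.3], layers)
print(probabilities)
```

In a trained network the weights would be fitted to labeled examples; here they are random, so the output probabilities carry no diagnostic meaning.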
`For each of these studies, a relatively large number of labeled
`scans were used for training and subsequent evaluation, with
`AUCs ranging from 0.99 for hip fracture to 0.84 for intracranial bleeding
`and liver masses to 0.56 for acute neurologic case screening. It
`is not possible to compare DNN accuracy from one study to the
`next because of marked differences in methodology. Furthermore,
`ROC and AUC metrics are not necessarily indicative of clini-
`cal utility or even the best way to express accuracy of the model’s
`performance28,29. In addition, many of these reports still only
`exist in preprint form and have not appeared in peer-reviewed pub-
`lications. Validation of the performance of an algorithm in terms of
`its accuracy is not equivalent to demonstrating clinical efficacy. This
`is what Pearse Keane and I have referred to as the ‘AI chasm’—that is,
`an algorithm with an AUC of 0.99 is not worth very much if it is
`not proven to improve clinical outcomes30. Among the studies that
`have gone through peer review (many of which are summarized
`in Table 1), the only prospective validation studies in a real-world
`setting have been for diabetic retinopathy31,32, detection of wrist
`fractures in the emergency room setting33, histologic breast cancer
`metastases34,35, very small colonic polyps36,37, and congenital cata-
`racts in a small group of children38. The field clearly is far from dem-
`onstrating very high and reproducible machine accuracy, let alone
`clinical utility, for most medical scans and images in the real-world
`clinical environment (Table 1).
`Fig. 1 | A deep neural network, simplified, with an input layer, hidden layers, and an output layer. Credit: Debbie Maizels/Springer
`Pathology. Pathologists have been much slower at adopting digitization of scans
`than radiologists39: they are still not routinely converting glass
`slides to digital images and using whole-slide imaging (WSI) to enable
`viewing of an entire tissue sample on a slide. Marked heterogeneity
`and inconsistency among pathologists’ interpretations of slides
`have been amply documented, exemplified by a lack of agreement
`in diagnosis of common types of lung cancer (κ = 0.41–0.46)40.
`Deep learning of digitized pathology slides offers the potential to
`improve accuracy and speed of interpretation, as assessed in a few
`retrospective studies. In a study of WSI of breast cancer, with or
`without lymph node metastases, that compared the performance of
`11 pathologists with that of multiple algorithmic interpretations, the
`results varied and were affected in part by the length of time that the
`pathologists had to review the slides41. Some of the five algorithms
`performed better than the group of pathologists, who had varying
`expertise. The pathologists were given 129 test slides and had less
`than 1 minute for review per slide, which likely does not reflect nor-
`mal workflow. On the other hand, when one expert pathologist had
`no time limits and took 30 hours to review the same slide set, the
`results were comparable with the algorithm for detecting noninva-
`sive ductal carcinoma42.
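The agreement statistic κ quoted above for lung cancer diagnosis corrects raw agreement between two readers for the agreement expected by chance. A minimal sketch of Cohen's kappa follows; the two lists of slide labels are hypothetical and are not data from the cited study, and the function name `cohen_kappa` is an assumption for this example.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Note: undefined (division by zero) if both raters always use one identical label.
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 10 slides by two pathologists.
pathologist_1 = ["adeno"] * 5 + ["squamous"] * 5
pathologist_2 = ["adeno", "adeno", "adeno", "adeno", "squamous",
                 "adeno", "squamous", "squamous", "squamous", "adeno"]
kappa = cohen_kappa(pathologist_1, pathologist_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the readers agree on 7 of 10 slides, yet kappa is only about 0.4, because half of that agreement would be expected by chance alone.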
`Other studies have assessed deep-learning algorithms for classifying
`breast cancer43 and lung cancer40 without direct comparison
`with pathologists. Brain tumors can be challenging to subtype,
`and machine learning using tumor DNA methylation patterns via
`sequencing led to markedly improved classification compared with
`pathologists using traditional histological data44,45. DNA methylation
`generates extensive data and at present is rarely performed
`in the clinic for classification of tumors, but this study suggests
`another potential for AI to provide improved diagnostic accuracy in
`the future. A deep-learning algorithm for lung cancer digital pathology
`slides not only was able to accurately classify tumors, but also
`was trained to detect the pattern of several specific genomic driver
`mutations that would not otherwise be discernible by pathologists33.
`The first prospective study to test the accuracy of an algorithm
`classifying digital pathology slides in a real clinical setting was an
`assessment of the identification of breast cancer micrometastases
`in slides by six pathologists compared with a DNN (that
`had been retrospectively validated34). The combination of pathologists
`and the algorithm led to the best accuracy, and the algorithm markedly
`sped up the review of slides35. This study is particularly notable,
`as the synergy of the combined pathologist and algorithm interpretation
`was emphasized instead of the pervasive clinician-versus-algorithm
`comparison. Apart from classifying tumors more accurately by
`data processing, the use of a deep-learning algorithm to sharpen
`out-of-focus images may also prove useful46. A number of proprietary
`algorithms for image interpretation have been approved by the Food
`and Drug Administration (FDA), and the list is expanding rapidly
`(Table 2), yet there have been few peer-reviewed publications from
`most of these companies. In 2018, the FDA published a fast-track
`approval plan for AI medical algorithms.
`Table 1 | Peer-reviewed publications of AI algorithms compared with doctors
`Images | Publication
`CT head, acute neurological events | Titano et al.27
`CT head for brain | Arbabshirani et al.19
`CT head for trauma | Chilamkurthy et al.20
`CXR for metastatic lung | Nam et al.8
`CXR for multiple findings | Singh et al.7
`Mammography for breast | Lehman et al.26
`Wrist X-ray* | Lindsey et al.9
`Breast cancer | Ehteshami Bejnordi et al.41
`Lung cancer (+ driver | Coudray et al.33
`Brain tumors (+ methylation) | Capper et al.45
`Breast cancer metastases* | Steiner et al.35
`Breast cancer metastases | Liu et al.34
`Skin cancers | Esteva et al.47
` | Haenssle et al.48
`Skin lesions | Han et al.49
`Diabetic retinopathy | Gulshan et al.51
`Diabetic retinopathy* | Abramoff et al.31
`Diabetic retinopathy* | Kanagasingam et al.32
`Congenital cataracts | Long et al.38
`Retinal diseases (OCT) | De Fauw et al.56
`Macular degeneration | Burlina et al.52
`Retinopathy of prematurity | Brown et al.60
`AMD and diabetic | Kermany et al.53
`Gastroenterology
`Polyps at colonoscopy* | Mori et al.36
`Polyps at colonoscopy | Wang et al.37
` | Madani et al.23
` | Zhang et al.24
`Prospective studies are denoted with an asterisk.
`Table 2 | FDA AI approvals are accelerating
`Company | FDA approval | Indication
` | September 2018 | Atrial fibrillation detection
` | August 2018 | CT brain bleed diagnosis
` | August 2018 | Breast density via
`Zebra Medical | July 2018 | Coronary calcium scoring
`Bay Labs | June 2018 | Echocardiogram EF
`Neural Analytics | May 2018 | Device for paramedic stroke
` | April 2018 | Diabetic retinopathy diagnosis
` | April 2018 | MRI brain interpretation
` | March 2018 | X-ray wrist fracture diagnosis
` | February 2018 | CT stroke diagnosis
` | February 2018 | Liver and lung cancer (MRI, CT)
` | January 2018 | CT brain bleed diagnosis
` | November 2017 | Atrial fibrillation detection via Apple Watch
` | January 2017 | MRI heart interpretation
`Dermatology. For algorithms classifying skin cancer by image
`analysis, the accuracy of diagnosis of deep-learning networks has
`been compared with that of dermatologists. In a study using a
`large training dataset of nearly 130,000 photographic and dermascopic
`digitized images, 21 US board-certified dermatologists were
`at least matched in performance by an algorithm, which had an
`AUC of 0.96 for carcinoma47 and of 0.94 for melanoma specifically.
`Subsequently, the accuracy of melanoma skin cancer diagnosis by a
`group of 58 international dermatologists was compared with a
`convolutional neural network; the mean areas under the ROC curve
`were 0.79 versus 0.86, respectively, reflecting an improved performance
`of the algorithm compared with most of the physicians48. A third study
`carried out algorithmic assessment of 12 skin diseases, including basal cell
`carcinoma, squamous cell carcinoma, and melanoma, and compared
`this with 16 dermatologists, with the algorithm achieving an AUC
`of 0.96 for melanoma49. None of these studies were conducted in the
`clinical setting, in which a doctor would perform physical inspection
`and shoulder responsibility for making an accurate diagnosis.
`Notwithstanding these concerns, most skin lesions are diagnosed
`by primary care doctors, and problems with inaccuracy have been
`underscored; if AI can be reliably shown to simulate experienced
`dermatologists, that would represent a significant advance.
`Ophthalmology. There have been a number of studies comparing
`performance between algorithms and ophthalmologists in diagnosing
`different eye conditions. After training with over 128,000 retinal
`fundus photographs labeled by 54 ophthalmologists, a neural net-
`work was used to assess over 10,000 retinal fundus photographs
`from more than 5,000 patients for diabetic retinopathy, and the
`neural network’s grading was compared with seven or eight oph-
`thalmologists for all-cause referable diagnoses (moderate or worse
`retinopathy or macular edema; scale: none, mild, moderate, severe,
`or proliferative). In two separate validation sets, the AUC was 0.99
`(refs. 50,51). In a study in which retinal fundus photographs were
`used for the diagnosis of age-related macular degeneration (AMD),
`the accuracy for DNN algorithms ranged between 88% and 92%,
`nearly as high as for expert ophthalmologists52. Performance of a
`deep-learning algorithm for interpreting retinal optical coher-
`ence tomography (OCT) was compared with ophthalmologists for
`diagnosis of either of the two most common causes of vision loss:
`diabetic retinopathy or AMD. After the algorithm was trained on a
`dataset of over 100,000 OCT images, validation was performed in
`1,000 of these images, and performance was compared with six oph-
`thalmologists. The algorithm’s AUC for OCT-based urgent referral
`was 0.999 (refs. 53–55).
`Another deep-learning OCT retinal study went beyond the diag-
`nosis of diabetic retinopathy or macular degeneration. A group of
`997 patients with a wide range of 50 retinal pathologies was assessed
`for urgent referral by an algorithm (using two different types of
`OCT devices that produce 3-D images) and results were compared
`with those from experts: four retinal specialists and four optometrists.
`The algorithm’s AUC for urgent referral triage was 0.992.
`The algorithm did not miss a single urgent
`referral case. Notably, the eight clinicians agreed on only 65% of
`the referral decisions. Errors on the correct referral decision were
`reduced for both types of clinicians by integrating the fundus
`photograph and notes on the patient, but the algorithm’s error rate
`(without notes or fundus photographs) of 3.5% was as good as or
`better than all eight experts56. One unique aspect of this study was
`the transparency of the two neural networks used, one for mapping
`the eye OCT scans into a tissue schematic and the other for the
`classifier of eye disease. The user (patient) can watch a video that
`shows what portions of his or her scan were used to reach the algo-
`rithm’s conclusions along with the level of confidence it has for the
`diagnosis. This sets a new bar for future efforts to unravel the ‘black
`box’ of neural networks.
`In a prospective trial conducted in primary care clinics, 900
`patients with diabetes but no known retinopathy were assessed by
`a proprietary system (an imaging device combined with an algo-
`rithm) made by IDx (Iowa City, IA) that obtained retinal fundus
`photographs and OCT and by established reading centers with
`expertise in interpreting these images30,31. The algorithm used
`at primary care clinics was autodidactic up until the clinical trial
`and was then locked for testing; it achieved a sensitivity of 87% and
`specificity of 91% for the 819 patients (91% of the enrolled cohort)
`with analyzable images. This trial led to FDA approval of the IDx
`device and algorithm for autonomous detection, that is, without
`the need for a clinician, of ‘more than mild’ diabetic retinopathy.
`The regulatory oversight in dealing with deep-learning algorithms
`is tricky because it does not currently allow continued autodidactic
`functionality but instead necessitates fixing the software to behave
`like a non-AI diagnostic system30. Notwithstanding this point along
`with the unknown extent of uptake of the device, the study repre-
`sents a milestone as the first prospective assessment of AI in the
`clinic. The accuracy results are not as good as the aforementioned
`in silico studies, which should be anticipated. A small prospective
`real-world assessment of a DNN for diabetic retinopathy in primary
`care clinics, with eye exams performed by nurses, led to a high false-
`positive diagnosis rate32.
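The sensitivity and specificity figures reported for the prospective diabetic retinopathy trial are ratios taken from a two-by-two table of algorithm output against the reference reading. A minimal sketch follows. Only the enrolled (900) and analyzable (819) patient counts come from the text; the four cell counts are hypothetical, chosen solely to reproduce the reported percentages, and are not the trial's actual data.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of patients with disease whom the algorithm flags."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of patients without disease whom the algorithm clears."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts per 100 diseased and 100 non-diseased patients.
true_pos, false_neg = 87, 13
true_neg, false_pos = 91, 9

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)

# From the text: 819 of 900 enrolled patients had analyzable images.
analyzable_fraction = 819 / 900

print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}, "
      f"analyzable = {analyzable_fraction:.0%}")
```

Patients whose images could not be analyzed are excluded from both ratios, which is why the analyzable fraction is worth reporting alongside them.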
`While the studies of retinal OCT and fundus images have thus far
`focused on eye conditions, recent work suggests that these images
`can provide a window to the brain for early diagnosis of dementia,
`including Alzheimer’s disease57.
`The potential use of retinal photographs also appears to tran-
`scend eye diseases per se. Images from over 280,000 patients were
`assessed by DNN for cardiovascular risk factors, including age,
`gender, systolic blood pressure, smoking status, hemoglobin A1c,
`and likelihood of having a major adverse cardiac event, with vali-
`dation in two independent datasets. The AUC for gender at 0.97
`was notable, indicating that the algorithm could identify gender
`accurately from the retinal photo, but the others were in the range
`of 0.70, suggesting that there may be a signal that, through further
`pursuit, could be useful for monitoring patients for control of their
`risk factors58,59.
`Other less common eye conditions that have been assessed by
`neural networks include congenital cataracts38 and retinopathy of
`prematurity in newborns60, both with accuracy comparable with
`that of eye specialists.
`Cardiology. The major images that cardiologists use in practice are
`electrocardiograms (ECG) and echocardiograms, both of which
`have been assessed with DNNs. There is a nearly 40-year history
`of machine-read ECGs using rules-based algorithms with notable
`inaccuracy61. When deep learning was used to diagnose heart attack
`in a small retrospective dataset of 549 ECGs, a sensitivity of 93%
`and specificity of 90% were reported, which was comparable with
`cardiologists62. Over 64,000 one-lead ECGs (from over 29,000
`patients) were assessed for arrhythmia by a DNN and six cardiolo-
`gists, with comparable accuracy across 14 different electrical con-
`duction disturbances63. For echocardiography, a small set of 267
`patient studies (consisting of over 830,000 still images) were classi-
`fied into 15 standard views (such as apical 4-chamber or subcostal)
`by a DNN and by cardiologists. The overall accuracy for single still
`images was 92% for the algorithm and 79% for four board-certified
`echocardiographers, but this does not reflect the real-world reading
`of studies, which are in-motion video loops23. An even larger retro-
`spective study of over 8,000 echocardiograms showed high accu-
`racy for classification of hypertrophic cardiomyopathy (AUC, 0.93),
`cardiac amyloid (AUC, 0.87), and pulmonary artery hypertension
`(AUC, 0.85)24.
`Gastroenterology. Finding diminutive (< 5 mm) adenomatous or
`sessile polyps at colonoscopy can be exceedingly difficult for gastro-
`enterologists. The first prospective clinical validation of AI was per-
`formed in 325 patients who collectively had 466 tiny polyps, with an
`accuracy of 94% and negative predictive value of 96% during real-
`time, routine colonoscopy36,64. The speed of AI optical diagnosis was
`35 seconds, and the algorithm worked equally well for both novice
`and expert gastroenterologists, without the need for injecting dyes.
`The findings of enhanced speed and accuracy were replicated in
`another independent study37. Such results are thematic: machine
`vision, at high magnification, can accurately and quickly interpret
`specific medical images as well as or better than humans.
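A negative predictive value such as the one quoted above depends not only on the test itself but also on how common the finding is in the population examined. The sketch below applies Bayes' rule to show that dependence. The sensitivity, specificity, and prevalence values are hypothetical and are not taken from the colonoscopy study.

```python
def negative_predictive_value(sens, spec, prevalence):
    """Probability that a negative result is truly negative (Bayes' rule)."""
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    return true_neg / (true_neg + false_neg)


def positive_predictive_value(sens, spec, prevalence):
    """Probability that a positive result is truly positive (Bayes' rule)."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test with 90% sensitivity and 90% specificity.
npv_common = negative_predictive_value(0.9, 0.9, prevalence=0.5)
npv_rare = negative_predictive_value(0.9, 0.9, prevalence=0.1)
ppv_rare = positive_predictive_value(0.9, 0.9, prevalence=0.1)

print(f"NPV at 50% prevalence: {npv_common:.3f}")
print(f"NPV at 10% prevalence: {npv_rare:.3f}")
print(f"PPV at 10% prevalence: {ppv_rare:.3f}")
```

With the same test, a rarer finding raises the negative predictive value and lowers the positive predictive value, so predictive values from one cohort do not transfer automatically to another.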
`Mental health. The enormous burden of mental health, such as the
`350 million people around the world battling depression74, is espe-
`cially noteworthy, as there is potential here for AI to lend support to
`the affected patients and the vastly insufficient number of clinicians.
`Various tools that are in development include digital tracking of
`depression and mood via keyboard interaction, speech, voice, facial
`recognition, sensors, and use of interactive chatbots75–80. Facebook
`posts have been shown to predict the diagnosis of depression later
`documented in electronic medical records81.
`Machine learning has been explored for predicting success-
`ful antidepressant medication82, characterizing depression83–85,
`predicting suicide83,86–88, and predicting bouts of psychosis in
`Fig. 2 | Examples of AI applications across the human lifespan. dx, diagnosis; IVF, in vitro fertilization; K+, potassium blood level. Credit: Debbie Maizels/Springer Nature
`In addition to data from electronic health records, imaging has
`been integrated to enhance predictive accuracy98. Multiple stud-
`ies have attempted to predict biological age110,111, and this has been
`shown to best be accomplished using DNA methylation–based
`biomarkers112. With respect to the accuracy of algorithms for pre-
`diction of biological age, the incompleteness of data input is note-
`worthy, since a large proportion of unstructured data—the free text
`in clinician notes that cannot be ingested from the medical record—
`has not been incorporated, and neither have many other modalities
`such as socioeconomic, behavioral, biologic ‘-omics’, or physiologic
`sensor data. Further, concerns have been raised about the potential
`The use of AI algorithms has been described in many other clini-
`cal settings, such as facilitating stroke, autism or electroencepha-
`lographic diagnoses for neurologists65,66, helping anesthesiologists
`avoid low oxygenation during surgery67, diagnosis of stroke or heart
`attack for paramedics68, finding suitable clinical trials for oncologists69,
`selecting viable embryos for in vitro fertilization70, helping make
`the diagnosis of a congenital condition via facial recognition71,
`and pre-empting surgery for patients with breast cancer72. Examples
`of the breadth of AI applications across the human lifespan are shown in
`Fig. 2.
`There is considerable effort across many startups and established
`tech companies to develop natural language processing to
`replace the need for keyboards and human scribes for clinic visits73.
`The list of companies active in this space includes Microsoft,
`Google, Suki, Robin Healthcare, DeepScribe, Saykara,
`Sopris Health, Carevoice, Orbita, Notable, Sensely and Augmedix.
`Artificial intelligence and health systems
`Being able to predict key outcomes could, theoretically, make the
`use of hospital palliative care resources more efficient and precise.
`For example, if an algorithm could be used to estimate the risk of a
`patient’s hospital readmission that would otherwise be undetectable
`given the usual clinical criteria for discharge, steps could be taken
`to avert discharge and attune resources to the underlying issues.
`For a critically ill patient, a very high likelihood of short-term sur-
`vival might help this patient and their family and doctor make deci-
`sions regarding resuscitation, insertion of an endotracheal tube for
`mechanical ventilation, and other invasive measures. Similarly, it is
`possible that deciding which patients might benefit from palliative
`care and determining who is at risk of developing sepsis or septic
`shock could be improved by AI predictive tools. Using electronic
`health record data, machine- and deep-learning algorithms have been
`able to predict many important clinical parameters, ranging from
`Alzheimer’s disease to death (Table 3)86,90–107. For example, in a recent
`study, reinforcement learning was retrospectively carried out on two
`large datasets to recommend the use of vasopressors, intravenous
`fluids, and/or medications and the dose of the selected treatment for
`patients with sepsis; the treatment selected by the ‘AI Clinician’ was
`on average reliably more effective than that chosen by humans108.
`Both the size of the cohorts studied and the range of AUC accuracy
`reported have been quite heterogeneous, and all of these reports are
`retrospective and yet to be validated in the real-world clinical setting.
`Nevertheless, there are many companies that are already marketing
`such algorithms, such as Careskore, which is providing health systems
`with estimates of risk of readmission and mortality based on
`EHR data109. Beyond this issue, there are the differences between the
`prediction metric for a cohort and an individual prediction metric.
`If a model’s AUC is 0.95, which most would qualify as very accurate

